The contentious discussion over the validity of Google researchers' claim that machine learning agents could achieve superhuman results in creating plans for computer chips entered a new, more public phase Tuesday (March 28), with a leading researcher in design automation finding the Google technology did not perform as its authors claimed in a paper published nearly two years ago in Nature.
The dispute around the Nature paper's claims has bubbled for nearly a year in prepared public statements and GitHub code repositories and FAQ sections; researchers directly involved in the situation have declined to speak extemporaneously for the public record. Even some subject matter experts have not wished to speak openly, given Google's dominant position in its ability to distribute research resources to academic computer scientists. However, Tuesday's presentation by Andrew Kahng, a prominent University of California, San Diego researcher in the field of electronic design automation (EDA), at the 2023 ACM/IEEE International Symposium on Physical Design, could move the dispute into more open debate among industry and academic experts.
Briefly stated, the authors of the Nature paper claimed their reinforcement learning (RL) agents could revolutionize the labor-intensive task of floorplanning—the architecting of the incredibly intricate network of memory components (called macro blocks) and logic circuitry (standard cells) on a chip. "Our method generates manufacturable chip floorplans in under six hours, compared to the strongest baseline, which requires months of intense effort by human experts," the authors wrote.
Kahng served as a peer reviewer for the paper, and also wrote an encapsulation for the news and views section of the journal, quoting science fiction author Arthur C. Clarke's observation that any sufficiently advanced technology is indistinguishable from magic.
"To long-time practitioners in the fields of chip design and design automation, (lead author Azalia) Mirhoseini and colleagues' results can indeed seem magical," Kahng wrote.
How open is open?
Science is not magic, however, and the Google paper's claims took the research community by storm. At the conclusion of his summation, Kahng wrote, "We can therefore expect the semiconductor industry to redouble its interest in replicating the authors' work, and to pursue a host of similar applications throughout the chip-design process."
For researchers who presumably were interested in trying to replicate those results, the Google team noted at the end of the paper that "the data supporting the findings of this study are available within the paper and the Extended Data," and that "the code used to generate these data is available from the corresponding authors upon reasonable request."
Aye, there's the rub. What is "reasonable" when the imperatives of proprietary intellectual property and legitimate wider research interests collide? Google researchers committed what they said was an open source framework that reproduces the Nature paper's methodology, called Circuit Training, to GitHub in January 2022.
In the paper Kahng presented Tuesday ("Assessment of Reinforcement Learning for Macro Placement"), however, he noted that more than a year after the Circuit Training GitHub repository was launched, Google had not open-sourced all the data or code necessary to confirm its stated results. This necessitated a lengthy, painstakingly documented reverse-engineering process, which included consultation with Google engineers.
"To date, the bulk of data used by Nature authors has not been released, and key portions of source code remain hidden behind APIs. This has motivated our efforts toward open, transparent implementation and assessment of Nature and CT (Circuit Training)," Kahng and his colleagues wrote. Specifically, in the slide deck for the conference presentation, Kahng noted the Google release omitted a format translator and simulated annealing code (simulated annealing is an optimization method that mimics the physical process of annealing), which left outside researchers without a direct way to examine the Google paper's claims.
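For readers unfamiliar with simulated annealing, the idea can be illustrated with a toy macro placer. The sketch below is an assumption-laden illustration, not the code used by Kahng's group or by Google: it places macros on a small grid, uses total half-perimeter wirelength (HPWL) as a stand-in cost, and applies a geometric cooling schedule. The function names, grid size, move types, and temperature parameters are all hypothetical.

```python
import math
import random


def hpwl(positions, nets):
    """Total half-perimeter wirelength: sum of each net's bounding-box half-perimeter."""
    total = 0
    for net in nets:
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(num_macros, nets, grid=8, steps=20000, t_start=5.0, t_end=0.01, seed=0):
    """Toy simulated annealing placer (illustrative only, all parameters assumed)."""
    rng = random.Random(seed)
    # Start from a random legal placement: at most one macro per grid cell.
    cells = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(cells)
    positions = cells[:num_macros]
    free = cells[num_macros:]

    cost = hpwl(positions, nets)
    initial_cost = cost
    best_cost, best = cost, list(positions)

    # Geometric cooling: temperature decays from t_start to t_end over `steps` moves.
    alpha = (t_end / t_start) ** (1.0 / steps)
    temp = t_start

    for _ in range(steps):
        i = rng.randrange(num_macros)
        if free and rng.random() < 0.5:
            # Move type 1: relocate macro i to a free cell.
            j = rng.randrange(len(free))
            positions[i], free[j] = free[j], positions[i]
            move = ("relocate", j)
        else:
            # Move type 2: swap the locations of two macros.
            k = rng.randrange(num_macros)
            positions[i], positions[k] = positions[k], positions[i]
            move = ("swap", k)

        delta = hpwl(positions, nets) - cost
        # Always accept improvements; accept worse moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost += delta
            if cost < best_cost:
                best_cost, best = cost, list(positions)
        else:
            # Rejected: undo the move.
            if move[0] == "relocate":
                j = move[1]
                positions[i], free[j] = free[j], positions[i]
            else:
                k = move[1]
                positions[i], positions[k] = positions[k], positions[i]
        temp *= alpha

    return best, best_cost, initial_cost


if __name__ == "__main__":
    # Hypothetical netlist: each net lists the macro indices it connects.
    example_nets = [[0, 1], [1, 2, 3], [0, 3], [2, 4], [4, 5, 0]]
    placement, final_cost, start_cost = anneal(6, example_nets)
    print("initial HPWL:", start_cost, "final HPWL:", final_cost)
```

Accepting some cost-increasing moves early, while the temperature is high, is what lets the search escape poor local arrangements; as the temperature falls, the procedure behaves more and more like plain greedy improvement. Production placers use far richer cost functions (congestion, density, timing) and move sets than this sketch.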
Ultimately, Kahng and his colleagues found the RL approach outlined in the Google paper did not vastly outperform or even match traditional methods: "The solutions typically produced by human experts and SA (simulated annealing) are superior to those generated by the RL framework in the majority of cases we tested," they concluded.
Yet the paper's lead authors are still saying the comparisons are not quite apples-to-apples.
In a March 24 statement published on the home page of Anna Goldie, who was co-lead author of the Nature paper, she and Mirhoseini (both of whom, according to personal web pages, have since left Google) say they believe Kahng's paper "mischaracterizes" their work, and offer both a high-level technical defense and contextual information about the rarity of open-sourcing code in commercial electronic design automation. They contend that one aspect of the Kahng team's paper compared CT to Nvidia's AutoDMP and "(presumably) the latest version of CMP, a black-box, closed-source commercial autoplacer. Neither of these methods were available when we released our paper in 2020."
They also contended that Kahng's group did not pre-train the RL agent: "A learning-based method will of course take longer to learn and perform worse if it has never seen a chip before!" they wrote.
However, in an updated entry on the Kahng group's GitHub FAQ, they wrote, "We did not use pre-trained models in our study. Note that it is impossible to replicate the pre-training described in the Nature paper, for two reasons: (1) the data set used for pre-training consists of 20 TPU blocks which are not open-sourced, and (2) the code for pre-training is not released either."
Patrick Madden, associate professor of computer science at Binghamton University, and Moshe Vardi, University Professor and Karen Ostrum George Distinguished Service Professor in Computational Engineering at Rice University (and former editor-in-chief of Communications), each addressed the imbroglio from their respective expert viewpoints, and each questioned the logic behind publishing the Nature paper.
Madden, for instance, wrote a paper about benchmarking standard cell placement in 2001 that served as a sort of clarion call to bring what was then a jumble toward some sort of recognizable norm: "Not everyone was measuring the same things in the same way," he wrote in an email accompanying a link to the paper he sent Communications. "This was before widespread Internet, with a lot of stuff having to be snail-mailed on CDs, tapes, and floppies. In many ways, it's not surprising we had some confusion."
There is no dearth of recognized benchmarks in design now, though, and Google's reluctance to use those benchmarks in the paper troubles Madden.
"Everybody has a secret sauce—everybody—so we have open public benchmarks and I can run whatever I want to privately, and everybody can do the same thing, and then we show each other these artifacts," said Madden, a former co-chair of ACM SIGDA and a former member of the ACM Publications Board. "I have been doing benchmarking for a long time. There are things we can't share, don't want to reveal, but I can run an experiment and everybody else will say, 'yeah, I see what you did'. That is the heartburn I have with this Google paper.
"Google is a very large company. I do not want to be in a fight with Google. But I also sort of feel an obligation to not look the other way."
Vardi said the editors of Nature made a mistake in publishing the paper, citing astronomer Carl Sagan's maxim that "extraordinary claims need extraordinary proof."
"It was a huge claim," Vardi said. "The paper made quite a splash, but I look at it as an editor and I would not have published this. Not because the claim is not justified—but where is the evidence?
"In my opinion, the onus is on the editors of Nature to either explain their decision or retract the paper. In my opinion, they made a mistake in the first place in publishing it."
Vardi noted Kahng has been meticulous and non-judgmental in his efforts, and Google has yet to fully open-source the data and code it used to make its claims. "We are now approaching two years since the paper was published. Now the merit has been examined and Andrew has done very careful work. And, I am paraphrasing his work here, the claims are not warranted."
Nature declined to comment about the status of the Google paper specifically, citing confidentiality. In general, a spokesperson said, "When concerns are raised about any paper published in the journal, we look into them carefully following an established process. This process involves consultation with the authors and, where appropriate, seeking advice from peer reviewers and other external experts. Once we have enough information to make a decision, we follow up with the response that is most appropriate and that provides clarity for our readers as to the outcome."
Additionally, the timing of any action the journal might take on the paper could be influenced by a wrongful termination suit filed by former Google AI researcher Satrajit Chatterjee.
In his amended complaint filed Feb. 21, Chatterjee outlines in detail charges that research he and colleagues conducted while he was still at Google showed the Nature results were not true; essentially, that methodological flaws in the project tilted the scales considerably in the favor of the RL technology and, when examined on a level playing field, that the results of the experiment were "decidedly mixed." The case is continuing in Santa Clara County Superior Court; Google subsequently requested the amended complaint be conditionally sealed, saying it contained confidential material, but it was still available at the time this story was reported.
Vardi said Google may be holding out on resolving the paper's status because it is defending itself in a high-stakes whistleblower action, and settling out of court with Chatterjee prior to taking action to withdraw the paper could be a step Google might advocate. "But I have to give them the other option, that they still believe in the merit of the paper but they still want the case to go away first before they do anything about it," he said. "But I would say right now the ball is in Nature's court; they have to explain why they accepted the paper."
When asked directly at the ISPD presentation if he would like to see the Nature paper amended or retracted, Kahng said, "One principle we have tried to stick to throughout is to not make value judgments – to provide an example of clear, transparent, open assessment, to hopefully resolve and calm and put into the rearview mirror, a very fraught controversy.
"I would say that amended or retracted is for Nature and Google authors to determine."
Glimmers of hope?
As the fate of the Google paper remains unknown, the dispute may also lead to wider ramifications for the publication process—for instance, promises of providing code at some undetermined point may be a non-starter for future "blockbuster" claims. In the conclusion of their paper, Kahng and his colleagues wrote that the difficulty of reproducing methods and results of the Nature paper, and the effort spent on their own evaluation project, "highlight potential benefits of a 'papers with code' culture change in the academic EDA field."
Additionally, open source EDA collaborations such as the DARPA-funded Project OpenRoad, which aims to develop tools for 24-hour, no-human-in-the-loop hardware layout generation, may get more notice.
Already, small changes have made some tasks for evaluating claims easier. In their assessment paper, Kahng and his colleagues wrote that "policy changes" at EDA tool vendors Cadence and Synopsys "permit our methods and results to be reproducible and shareable in the open, toward advancement of research in the field." Neither company wished to expand on what those changes are.
In the last of four "For The Record" commentaries about the progress his team made on their project, posted March 26, Kahng expressed hope the situation is closer to resolution.
"This has been a long journey, starting with service as Reviewer #3 of the Nature paper beginning in November 2020," he wrote. "I hope that our community will be able to close this chapter soon."
Gregory Goth is an Oakville, CT-based writer who specializes in science and technology.
Kahng et al didn't pretrain!! This is a really, REALLY big flaw, and completely invalidates his study.
I cannot emphasize this enough. This is a learning-based method -- it obviously needs data! Without pretraining, the only chip the AI has ever seen is the one it's being tested on.
Kahng whines that he doesn't have access to TPUs. Of course you don't need to pre-train on TPU specifically, but you do need to pre-train on *something*.
Kahng also claims that the code for pre-training has not been released. Does he even know what pre-training is? It's the same code for training on a single chip - you just reuse the model's weights afterward as a starting point for the next chip.
There are other major problems with this paper -- see the Nature authors' response (via annagoldie dot com): https://drive.google.com/file/d/1jWUw6rUDcc7fuHu_iGeVDUkBxNJjhHdd/view
And in any case, other authors have already built on the work, and haven't stumbled -- you can see a subset listed at the bottom of the authors' response. The Nature method doesn't even seem to be state-of-the-art RL for chip design any more -- others have built on it and improved upon it, which is how science works!
It is understandable that members of the chip design community are confused by Kahng's paper -- ML methods are very new to this field, and a lot of practitioners are still getting up to speed. ML is a discipline in its own right, and it's OK to make mistakes, but to be honest some of Kahng's mistakes are ML 101.
(Incidentally, I do appreciate Kahng's commitment to transparency, even showing the tensorboards. Unfortunately, the tensorboards show that he didn't actually train the model to convergence even for the one-block-at-a-time it gets to see. He shows in the paper a convergence plot for one block, where he trains for 1 million steps, but if you look at the plots for the other chips in his supporting documents, you can see that the other ones definitely were not trained to convergence, and were trained for 150k-350k steps for no clear reason.)
I'm neither an ML nor an EDA expert, but Addendum 2C of [https://arxiv.org/pdf/2302.11014.pdf] (posted before the above article appeared) points out that a Google engineer wrote in [https://github.com/google-research/circuit_training/blob/main/docs/ARIANE.md#results]: "Our results training from scratch are comparable or better than the reported results in the [Nature] paper (on page 22) which used fine-tuning from a pre-trained model." The addendum also points out that Table 1 of the Nature paper [https://www.nature.com/articles/s41586-021-03544-w] does not show benefits in metrics such as area, power, or wire length from pre-training. Hence, Prof. Kahng's conclusions stand.