Large language models help decipher clinical notes

Electronic health records (EHRs) need a new public relations manager. Ten years ago, the U.S. government passed a law that required hospitals to digitize their health records with the intent of improving and streamlining care. The enormous amount of information in these now-digital records could be used to answer very specific questions beyond the scope of clinical trials: What’s the right dose of this medication for patients with this height and weight? What about patients with a specific genomic profile?

Unfortunately, most of the data that could answer these questions is trapped in doctors’ notes, full of jargon and abbreviations. These notes are hard for computers to understand using current techniques: extracting information requires training multiple machine learning models. What’s more, models trained for one hospital don’t work well at others, and training each model requires domain experts to label lots of data, a time-consuming and expensive process.

An ideal system would use a single model that can extract many types of information, work well at multiple hospitals, and learn from a small amount of labeled data. But how? Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) believed that to disentangle the data, they needed to call on something bigger: large language models. To pull that important medical information, they used a very big, GPT-3 style model to do tasks like expand overloaded jargon and acronyms and extract medication regimens. 

For example, the system takes an input, which in this case is a clinical note, and “prompts” the model with a question about the note, such as “expand this abbreviation, C-T-A.” The system returns an output such as “clear to auscultation,” as opposed to, say, “CT angiography.” The objective of extracting this clean data, the team says, is to eventually enable more personalized clinical recommendations.
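The prompting step described above can be sketched in a few lines. This is a minimal illustration, not the team’s actual prompt: the function name `build_expansion_prompt`, the prompt wording, and the sample note are all invented here, and no model is called.

```python
def build_expansion_prompt(note: str, abbreviation: str) -> str:
    """Compose a zero-shot prompt asking a model to expand one abbreviation in context.

    The wording is a hypothetical template, not the one used in the paper.
    """
    return (
        f"Clinical note: {note}\n"
        f'Question: In this note, what does the abbreviation "{abbreviation}" stand for?\n'
        "Answer:"
    )


# Invented example note; a real system would read this from the record.
note = "Lungs CTA bilaterally, no wheezes."
prompt = build_expansion_prompt(note, "CTA")
print(prompt)
```

Placing the note ahead of the question gives the model the surrounding context it needs to choose between competing expansions of the same abbreviation.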

Medical data is, understandably, a pretty tricky resource to navigate freely. There’s plenty of red tape around using public resources for testing the performance of large models because of data use restrictions, so the team decided to scrape together their own. Using a set of short, publicly available clinical snippets, they cobbled together a small dataset to enable evaluation of the extraction performance of large language models. 

“It’s challenging to develop a single general-purpose clinical natural language processing system that will solve everyone’s needs and be robust to the huge variation seen across health datasets. As a result, until today, most clinical notes are not used in downstream analyses or for live decision support in electronic health records. These large language model approaches could potentially transform clinical natural language processing,” says David Sontag, MIT professor of electrical engineering and computer science, principal investigator in CSAIL and the Institute for Medical Engineering and Science, and supervising author on a paper about the work, which will be presented at the Conference on Empirical Methods in Natural Language Processing. “The research team’s advances in zero-shot clinical information extraction makes scaling possible. Even if you have hundreds of different use cases, no problem — you can build each model with a few minutes of work, versus having to label a ton of data for that particular task.”

For example, without any labels at all, the researchers found these models could achieve 86 percent accuracy at expanding overloaded acronyms, and the team developed additional methods to boost this further to 90 percent accuracy, with still no labels required.

Imprisoned in an EHR 

Experts have been steadily building up large language models (LLMs) for quite some time, but they burst into the mainstream with GPT-3’s widely covered ability to complete sentences. These LLMs are trained on a huge amount of text from the internet to finish sentences and predict the next most likely word.

While previous, smaller models like earlier GPT iterations or BERT have performed well at extracting medical data, they still require substantial manual data-labeling effort.

For example, the note “pt will dc vanco due to n/v” means that this patient (pt) was taking the antibiotic vancomycin (vanco) but experienced nausea and vomiting (n/v) severe enough for the care team to discontinue (dc) the medication. The team’s research avoids the status quo of training separate machine learning models for each task (extracting medications, extracting side effects from the record, disambiguating common abbreviations, etc.). In addition to expanding abbreviations, they investigated four other tasks, including whether the models could parse clinical trials and extract detail-rich medication regimens.

“Prior work has shown that these models are sensitive to the prompt’s precise phrasing. Part of our technical contribution is a way to format the prompt so that the model gives you outputs in the correct format,” says Hunter Lang, CSAIL PhD student and author on the paper. “For these extraction problems, there are structured output spaces. The output space is not just a string. It can be a list. It can be a quote from the original input. So there’s more structure than just free text. Part of our research contribution is encouraging the model to give you an output with the correct structure. That significantly cuts down on post-processing time.”

The approach can’t be applied to out-of-the-box health data at a hospital: that requires sending private patient information across the open internet to an LLM provider like OpenAI. The authors showed that it’s possible to work around this by distilling the model into a smaller one that could be used on-site.

The model — sometimes just like humans — is not always beholden to the truth. Here’s what a potential problem might look like: Let’s say you’re asking the reason why someone took medication. Without proper guardrails and checks, the model might just output the most common reason for that medication, if nothing is explicitly mentioned in the note. This led to the team’s efforts to force the model to extract more quotes from data and less free text.
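One simple way to picture the preference for quotes over free text is a check that an extracted answer really appears in the note. The sketch below is an assumption about what such a guardrail could look like, not the method from the paper; `normalize` and `is_grounded` are hypothetical helper names.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_grounded(answer: str, note: str) -> bool:
    """Return True if the answer is a verbatim span of the note (ignoring case and spacing)."""
    candidate = normalize(answer)
    return bool(candidate) and candidate in normalize(note)


note = "pt will dc vanco due to n/v"

# A reason quoted from the note passes the check.
print(is_grounded("n/v", note))
# A plausible but unstated reason is rejected.
print(is_grounded("infection", note))
```

An answer that fails the check can be discarded or flagged for review instead of being stored as if the note had said it.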

Future work for the team includes extending to languages other than English, creating additional methods for quantifying uncertainty in the model, and pulling off similar results with open-sourced models. 

“Clinical information buried in unstructured clinical notes has unique challenges compared to general domain text mostly due to large use of acronyms, and inconsistent textual patterns used across different health care facilities,” says Sadid Hasan, AI lead at Microsoft and former executive director of AI at CVS Health, who was not involved in the research. “To this end, this work sets forth an interesting paradigm of leveraging the power of general domain large language models for several important zero-/few-shot clinical NLP tasks. Specifically, the proposed guided prompt design of LLMs to generate more structured outputs could lead to further developing smaller deployable models by iteratively utilizing the model generated pseudo-labels.”

“AI has accelerated in the last five years to the point at which these large models can predict contextualized recommendations with benefits rippling out across a variety of domains such as suggesting novel drug formulations, understanding unstructured text, code recommendations or create works of art inspired by any number of human artists or styles,” says Parminder Bhatia, who was formerly Head of Machine Learning at AWS Health AI and is currently Head of ML for low-code applications leveraging large language models at AWS AI Labs. “One of the applications of these large models [the team has] recently launched is Amazon CodeWhisperer, which is [an] ML-powered coding companion that helps developers in building applications.”

As part of the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, Agrawal, Sontag, and Lang wrote the paper alongside Yoon Kim, MIT assistant professor and CSAIL principal investigator, and Stefan Hegselmann, a visiting PhD student from the University of Muenster. First-author Agrawal’s research was supported by a Takeda Fellowship, the MIT Deshpande Center for Technological Innovation, and the MLA@CSAIL Initiatives.

MIT researchers use quantum computing to observe entanglement

For the first time, researchers at MIT, Caltech, Harvard University, and elsewhere sent quantum information across a quantum system in what could be understood as traversing a wormhole. Though this experiment didn’t create a disruption of physical space and time in the way we might understand the term “wormhole” from science fiction, calculations from the experiment showed that qubits traveled from one system of entangled particles to another in a model of gravity. This experiment performed on the Sycamore quantum processor device at Google opens the doors to future experiments with quantum computers to probe ideas from string theory and gravitational physics. 

“Simulating strongly-interacting quantum systems, such as those that arise in quantum gravity, is one of the most exciting applications of quantum computers,” says Daniel Harlow, the Jerrold R. Zacharias Career Development Associate Professor of Physics and a researcher at the MIT Laboratory for Nuclear Science (LNS) who works with David Kolchmeyer, one of the lead authors of the work. “This is a promising initial step.”

In a new paper in Nature, a team of physicists, including MIT Center for Theoretical Physics (CTP) and LNS researchers Kolchmeyer and Alexander Zlokapa, presents results on a pair of quantum systems that behave analogously to a traversable wormhole.

A wormhole is a bridge between two remote spacetime regions. In the classical general theory of relativity, nothing is allowed to pass through the wormhole. In 2019, Harvard University’s Daniel Jafferis and his collaborators suggested a wormhole could be traversable when created by entangled black holes. Kolchmeyer, a postdoc working with CTP and LNS researchers Harlow and Assistant Professor Netta Engelhardt, was advised by Jafferis for his PhD. 

“These physicists discovered a quantum mechanism to make a wormhole traversable by introducing a direct interaction between the distant spacetime regions, using a simple quantum dynamical system of fermions,” says Kolchmeyer. “In our work, we also used these entangled quantum systems to produce this kind of ‘wormhole teleportation’ using quantum computing and were able to confirm the results with classical computers.”  

Caltech’s Professor Maria Spiropulu and Jafferis are the senior authors on the new study, which appeared on Dec. 1 in Nature. Lead authors include Kolchmeyer and Zlokapa from MIT, as well as Joseph D. Lykken from the Fermilab Quantum Institute and Theoretical Physics Department, and Hartmut Neven from Google Quantum AI. Other Caltech and Alliance for Quantum Technologies (AQT) researchers on the paper include Samantha I. Davis and Nikolai Lauk. 

Spooky action at a distance

In this experiment, researchers sent a signal “through the wormhole” by teleporting a quantum state from one quantum system to another on the Sycamore 53-qubit quantum processor. To do so, the research team needed to determine entangled quantum systems that behaved with the properties predicted by quantum gravity — but that were also small enough to run on today’s quantum computers.

“A central challenge for this work was to find a simple enough many-body quantum system that preserves gravitational properties,” says Zlokapa, a second-year graduate student in physics at MIT who began this research as an undergraduate in Spiropulu’s lab.

To achieve this, the team used techniques from machine learning, taking highly interacting quantum systems and gradually reducing their connectivity. The output of this learning process produced many examples of systems with behavior consistent with quantum gravity, but each instance only required around 10 qubits — a perfect size for the Sycamore processor. 

“The complex quantum circuits required would have made larger systems with hundreds of qubits impossible to run on quantum platforms available today, so it was important to find such small examples,” says Zlokapa.

Confirmed by classical computers 

Once Zlokapa and the researchers identified these 10-qubit systems, the team inserted a qubit into one system, applied an energy shockwave across the processor, and then observed this same information on the other quantum system on the processor. The team measured how much quantum information passed from one quantum system to the other depending on the type of shockwave applied, negative or positive. 

“We showed that if the wormhole is propped open for long enough time by the negative energy shockwaves, a causal path is established between the two quantum systems. The qubit inserted into one system is indeed the same that appears on the other system,” says Spiropulu.

The team then verified these and other properties with classical computer calculations. “This is different from running a simulation on a classical computer,” Spiropulu says. “Although one could simulate the system on a classical computer — and this was done as reported in this paper — no physical system is created in a conventional simulation, which is the manipulation of classical bits, zeros and ones. Here, we saw the information travel through the wormhole.” 

This new work opens up the possibility of future quantum gravity experiments with larger quantum computers and more complicated entangled systems. This work doesn’t replace direct observations of quantum gravity, for example from detections of gravitational waves using the Laser Interferometer Gravitational-Wave Observatory (LIGO), adds Spiropulu.

Both Zlokapa and Kolchmeyer are keen on understanding how such experiments can help advance quantum gravity. “I’m very curious to see how much further we can probe quantum gravity on today’s quantum computers. We have some concrete ideas for follow-up work that I’m very excited about,” says Zlokapa.

This work is supported by a Department of Energy Office of High Energy Physics QuantISED program grant on “Quantum Communication Channels for Fundamental Physics.”

Ushering in a new era of computing

As a graduate student doing his master’s thesis on speech recognition at the MIT AI Lab (now the MIT Computer Science and Artificial Intelligence Laboratory), Dan Huttenlocher worked closely with Professor Victor Zue. Well known for pioneering the development of systems that enable a user to interact with computers using spoken language, Zue traveled frequently to Asia, where much of the early research in speech recognition happened during the 1980s. Huttenlocher occasionally accompanied his professor on these trips, many of which involved interactions with members of the MIT Industrial Liaison Program, as he recalls. “It was a tremendous opportunity,” according to Huttenlocher, “and it was a large part of what built my interest in engaging with companies and industry in addition to the academic side of research.”

Huttenlocher went on to earn his PhD in computer vision at the Institute and has since embarked on a career that encompasses academia, industry, and the philanthropic sector. In addition to solidifying his status as an esteemed researcher in the academic realm, he spent 12 years as a scientist at Xerox’s Palo Alto Research Center before leaving to co-found a financial technology company. He served on the board of the John D. and Catherine T. MacArthur Foundation from 2010-22 (including as chair starting in 2018), and serves on the boards of directors at Amazon.com and Corning, Inc. He also helped found Cornell Tech, the technology, business, law, and design campus in New York City built by Cornell University. There, he was the school’s first dean and vice provost, guiding its efforts to tie together industry and computing to enhance New York’s tech ecosystem.  

Today, Huttenlocher serves as the inaugural dean at MIT Schwarzman College of Computing. To highlight the significance of this moment in time, and the need for an interdisciplinary computing hub like the college of computing, he references the oft-cited prediction that software would gobble up and disrupt traditional industry structures. Huttenlocher believes that while this insight was right, what we’re experiencing now is something different, greater, with vast implications for humanity. Computing on the whole — not only software but also hardware, algorithms, and machine learning — has evolved to the point where it is redefining our approach to problem-solving in nearly every industry sector, discipline, and area of research. This, he suggests, is also redefining reality as we experience it.  

With Huttenlocher steering, the college is both a recognition of and a response to a new era of computing. It explores ways to support, but also to lead, the technological changes that are reshaping the world. A bidirectional, interdisciplinary approach is key to the agenda, according to Huttenlocher. “We want to harness the forefront of results in computing and infuse them with the other disciplines,” he says. “This means helping departments outside of computing stretch toward computing, but we also want to help the computing fields to stretch toward the other disciplines.” To accomplish this, Huttenlocher and the college aim to forge strong ties and collaborations in education and research between computing and a broad range of disciplines at MIT, across all five schools, departments, and programs at the graduate and the undergraduate levels.

From an operations standpoint, the college is not yet three years old, but Huttenlocher has already overseen the rollout of several programs and initiatives that build toward the infusion of computing with other disciplines. MIT committed to the creation of 50 new faculty positions for the college: 25 in computer science and artificial intelligence, and 25 shared positions rooted in other academic departments not primarily focused on computing. Thus far, it has hired 25 new faculty members with a half-dozen in shared positions.    

He has also overseen the development of Common Ground for Computing Education, a platform that unites experts from departments across the Institute to develop and teach new courses and launch programs that blend computing with the other disciplines. It aims to capitalize on the ubiquity of computing through a coordinated approach to computing education at the Institute. Current common ground subject offerings include “Interactive data visualization and society,” “Solving real-world problems with optimization and computational imaging: Physics to algorithms,” and “Julia: Solving real-world problems with computation.” 

The Social and Ethical Responsibilities of Computing (SERC), meanwhile, is a cross-cutting initiative that encourages responsible technology development and deployment by incorporating insights and methods from the humanities and social sciences with an emphasis on social responsibility. “SERC brings together multiple viewpoints — social scientists and humanists, engineers and computer scientists — because so much of understanding the societal and ethical challenges of computing is about combining expertise across these disciplines,” says Huttenlocher. The initiative relies on a clearly defined teaching, research, and engagement framework designed to assess the broad challenges and opportunities associated with computing while fostering what it refers to as “responsible habits of mind and action” in MIT students who create and deploy computing technologies. Proving demand and impact, in 2021 more than 2,100 students were enrolled in subjects in which SERC worked with instructors to incorporate social and ethical issues into the syllabus. 

In his book, “The Age of AI: And Our Human Future” (Little, Brown, 2021), co-authored with Henry Kissinger and Eric Schmidt, Huttenlocher explores the ways in which artificial intelligence is fundamentally changing how we view ourselves as human beings, our role in society, how we perceive the world around us, and the need for collaboration across disciplines to define the future. Reflecting on what he and his colleagues have been able to accomplish at the college in such a short time frame, Huttenlocher says he is impressed with and proud of what so many at MIT have already contributed to, but that the work is far from finished: “I believe we are now getting to the point where we are starting to have impacts in parts of MIT, but we’re working toward broad impact, an infusion between computing and the disciplines across the Institute; that is the aspiration of MIT Schwarzman College of Computing,” he says.

Busy GPUs: Sampling and pipelining method speeds up deep learning on large graphs

Graphs, potentially extensive webs of nodes connected by edges, can be used to express and interrogate relationships between data, like social connections, financial transactions, traffic, energy grids, and molecular interactions. As researchers collect more data and build out these graphical pictures, they will need faster and more efficient methods, as well as more computational power, to conduct deep learning on them by way of graph neural networks (GNNs).

Now, a new method, called SALIENT (SAmpling, sLIcing, and data movemeNT), developed by researchers at MIT and IBM Research, improves training and inference performance by addressing three key bottlenecks in computation. This dramatically cuts down on the runtime of GNNs on large datasets, which, for example, contain on the scale of 100 million nodes and 1 billion edges. Further, the team found that the technique scales well when computational power is added from one to 16 graphics processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.

“We started to look at the challenges current systems experienced when scaling state-of-the-art machine learning techniques for graphs to really big datasets. It turned out there was a lot of work to be done, because a lot of the existing systems were achieving good performance primarily on smaller datasets that fit into GPU memory,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

By vast datasets, experts mean scales like the entire Bitcoin network, where certain patterns and data relationships could spell out trends or foul play. “There are nearly a billion Bitcoin transactions on the blockchain, and if we want to identify illicit activities inside such a joint network, then we are facing a graph of such a scale,” says co-author Jie Chen, senior research scientist and manager of IBM Research and the MIT-IBM Watson AI Lab. “We want to build a system that is able to handle that kind of graph and allows processing to be as efficient as possible, because every day we want to keep up with the pace of the new data that are generated.”

Kaler and Chen’s co-authors include Nickolas Stathas MEng ’21 of Jump Trading, who developed SALIENT as part of his graduate work; former MIT-IBM Watson AI Lab intern and MIT graduate student Anne Ouyang; MIT CSAIL postdoc Alexandros-Stavros Iliopoulos; MIT CSAIL Research Scientist Tao B. Schardl; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher with the MIT-IBM Watson AI Lab.     

For this problem, the team took a systems-oriented approach in developing SALIENT, says Kaler. To do this, the researchers implemented what they saw as important, basic optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building a machine-learning model. Stathas says the process is like swapping out engines to build a faster car. Their method was designed to fit into existing GNN architectures, so that domain experts could easily apply this work to their specified fields to expedite model training and tease out insights during inference faster. The trick, the team determined, was to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data that will then be transferred through the data link, the more critical GPU is working to train the machine-learning model or conduct inference.

The researchers began by analyzing the performance of a commonly used machine-learning library for GNNs (PyTorch Geometric), which showed a startlingly low utilization of available GPU resources. Applying simple optimizations, the researchers improved GPU utilization from 10 to 30 percent, resulting in a 1.4 to two times performance improvement relative to public benchmark codes. This fast baseline code could execute one complete pass over a large training dataset through the algorithm (an epoch) in 50.4 seconds.                          

Seeking further performance improvements, the researchers set out to examine the bottlenecks that occur at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike other neural networks, GNNs perform a neighborhood aggregation operation, which computes information about a node using information present in other nearby nodes in the graph (for example, in a social network graph, information from friends of friends of a user). As the number of layers in the GNN increases, the number of nodes the network has to reach out to for information can explode, exceeding the limits of a computer. Neighborhood sampling algorithms help by selecting a smaller random subset of nodes to gather; however, the researchers found that current implementations of this were too slow to keep up with the processing speed of modern GPUs. In response, they identified a mix of data structures, algorithmic optimizations, and so forth that improved sampling speed, ultimately improving the sampling operation alone by about three times, taking the per-epoch runtime from 50.4 to 34.6 seconds. They also found that sampling, at an appropriate rate, can be done during inference, improving overall energy efficiency and performance, a point that had been overlooked in the literature, the team notes.
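To make the idea of neighborhood sampling concrete, here is a minimal sketch in plain Python. It is a generic illustration of fanout-limited sampling, not SALIENT’s optimized implementation; the function name `sample_neighborhood`, the adjacency-list format, and the toy graph are assumptions made for this example.

```python
import random


def sample_neighborhood(adj, seeds, fanouts, rng):
    """Collect the nodes reached from `seeds` when each hop keeps at most
    `fanouts[hop]` randomly chosen neighbors per node.

    `adj` maps a node id to a list of neighbor ids (hypothetical format).
    """
    visited = set(seeds)
    frontier = set(seeds)
    for fanout in fanouts:
        next_frontier = set()
        for node in sorted(frontier):  # sorted for reproducible sampling
            neighbors = adj.get(node, [])
            k = min(fanout, len(neighbors))
            next_frontier.update(rng.sample(neighbors, k))
        visited |= next_frontier
        frontier = next_frontier
    return visited


# Toy graph: 20 nodes, every node connected to every other node.
adj = {i: [j for j in range(20) if j != i] for i in range(20)}
rng = random.Random(0)

sampled = sample_neighborhood(adj, [0], [3, 3], rng)    # two hops, fanout 3
full = sample_neighborhood(adj, [0], [19, 19], rng)     # no effective limit
print(len(sampled), "nodes sampled versus", len(full), "without a fanout limit")
```

With a fanout of 3 over two hops, a single seed can reach at most 1 + 3 + 9 = 13 nodes, however many neighbors each node has. That bound is what keeps the per-layer growth from exploding.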

In previous systems, this sampling step was a multi-process approach, creating extra data and unnecessary data movement between the processes. The researchers made their SALIENT method more nimble by creating a single process with lightweight threads that kept the data on the CPU in shared memory. Further, SALIENT takes advantage of the cache of modern processors, says Stathas, parallelizing feature slicing, which extracts relevant information from nodes of interest and their surrounding neighbors and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime from 34.6 to 27.8 seconds.
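The single-process, shared-memory arrangement can be pictured with a small thread pool in which every worker reads from the same in-memory feature table. This is only a structural sketch: the names `slice_features` and `slice_batches` and the toy data are hypothetical, and in CPython the global interpreter lock limits how much real parallelism threads provide, which need not apply to the researchers’ implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_features(features, node_ids):
    """Gather the feature rows for the requested nodes from the shared table."""
    return [features[n] for n in node_ids]


def slice_batches(features, batches, workers=4):
    """Slice several mini-batches with threads that all read the same
    `features` object, so nothing is copied between processes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda ids: slice_features(features, ids), batches))


# Toy feature table: node n has features [n, 2n].
features = [[float(n), float(n) * 2.0] for n in range(10)]
batches = [[0, 3, 5], [9, 1], [2, 2, 7]]

sliced = slice_batches(features, batches)
print(sliced[0])
```

Because threads share one address space, each worker indexes into the same table directly; a multi-process design would instead have to serialize and ship the sliced rows back to the parent.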

The last bottleneck the researchers addressed was to pipeline mini-batch data transfers between the CPU and GPU using a prefetching step, which would prepare data just before it’s needed. The team calculated that this would maximize bandwidth usage in the data link and bring the method up to perfect utilization; however, they only saw around 90 percent. They identified and fixed a performance bug in a popular PyTorch library that caused unnecessary round-trip communications between the CPU and GPU. With this bug fixed, the team achieved a 16.5 second per-epoch runtime with SALIENT.
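The prefetching idea, preparing the next mini-batch while the current one is being consumed, can be sketched with a background thread and a bounded queue. This is a generic producer-consumer illustration under stated assumptions, not SALIENT’s code: `prefetch`, `prepare_batches`, and the `depth` parameter are names invented here, and the list arithmetic stands in for CPU-side sampling and GPU-side training.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batch_iter, depth=2):
    """Yield items from `batch_iter` while a background thread prepares
    up to `depth` items ahead of the consumer."""
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for batch in batch_iter:
            buffer.put(batch)  # blocks when the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


def prepare_batches(n):
    """Stand-in for CPU-side sampling and slicing of n mini-batches."""
    for i in range(n):
        yield [i, i + 1]


# The consumer loop stands in for GPU work on each prepared batch.
results = [sum(batch) for batch in prefetch(prepare_batches(5))]
print(results)
```

The bounded queue is the design choice that matters: it lets preparation run ahead of consumption without letting prepared batches pile up in memory. For brevity the sketch does not forward exceptions raised in the producer thread.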

“Our work showed, I think, that the devil is in the details,” says Kaler. “When you pay close attention to the details that impact performance when training a graph neural network, you can resolve a huge number of performance issues. With our solutions, we ended up being completely bottlenecked by GPU computation, which is the ideal goal of such a system.”

SALIENT’s speed was evaluated on three standard datasets (ogbn-arxiv, ogbn-products, and ogbn-papers100M), as well as in multi-machine settings, with different levels of fanout (amount of data that the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In each setting, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, containing 100 million nodes and over a billion edges. Here, it was three times faster, running on one GPU, than the optimized baseline that was originally created for this work; with 16 GPUs, SALIENT was an additional eight times faster.

While other systems had slightly different hardware and experimental setups, so it wasn’t always a direct comparison, SALIENT still outperformed them. Among systems that achieved similar accuracy, representative performance numbers include 99 seconds using one GPU and 32 CPUs, and 13 seconds using 1,536 CPUs. In contrast, SALIENT’s runtime using one GPU and 20 CPUs was 16.5 seconds and was just two seconds with 16 GPUs and 320 CPUs. “If you look at the bottom-line numbers that prior work reports, our 16 GPU runtime (two seconds) is an order of magnitude faster than other numbers that have been reported previously on this dataset,” says Kaler. The researchers attributed their performance improvements, in part, to their approach of optimizing their code for a single machine before moving to the distributed setting. Stathas says that the lesson here is that for your money, “it makes more sense to use the hardware you have efficiently, and to its extreme, before you start scaling up to multiple computers,” which can provide significant savings on cost and carbon emissions that can come with model training.
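The per-epoch runtimes quoted in this article can be tallied to see how the individual optimizations add up. The short script below uses only the figures reported above; the stage labels are paraphrases chosen for this example.

```python
# Per-epoch runtimes in seconds on one GPU, as quoted in the article.
stages = [
    ("optimized baseline", 50.4),
    ("faster neighborhood sampling", 34.6),
    ("shared-memory slicing", 27.8),
    ("pipelined transfers and bug fix", 16.5),
]
multi_gpu_runtime = 2.0  # reported runtime with 16 GPUs

baseline = stages[0][1]
cumulative_speedup = {name: baseline / secs for name, secs in stages}
multi_gpu_gain = stages[-1][1] / multi_gpu_runtime

for name, secs in stages:
    print(f"{name:32s} {secs:5.1f} s  {cumulative_speedup[name]:.2f}x vs. baseline")
print(f"16 GPUs: {multi_gpu_runtime:.1f} s, {multi_gpu_gain:.2f}x over one GPU")
```

The final single-GPU figure works out to roughly 3x over the optimized baseline, and the 16-GPU run to roughly 8x over that, consistent with the “three times” and “eight times” figures reported for ogbn-papers100M.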

This new capacity will now allow researchers to tackle and dig deeper into bigger and bigger graphs. For example, the Bitcoin network that was mentioned earlier contained 100,000 nodes; the SALIENT system can capably handle a graph 1,000 times (or three orders of magnitude) larger.

“In the future, we would be looking at not just running this graph neural network training system on the existing algorithms that we implemented for classifying or predicting the properties of each node, but we also want to do more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] may be actually interesting for indicating financial crimes,” says Chen. “We also want to identify nodes in a graph that are similar in a sense that they possibly would be corresponding to the same bad actor in a financial crime. These tasks would require developing additional algorithms, and possibly also neural network architectures.”

This research was supported by the MIT-IBM Watson AI Lab and in part by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator.

Breaking the scaling limits of analog computing

As machine-learning models become larger and more complex, they require faster and more energy-efficient hardware to perform computations. Conventional digital computers are struggling to keep up.

An analog optical neural network could perform the same tasks as a digital one, such as image classification or speech recognition, but because computations are performed using light instead of electrical signals, optical neural networks can run many times faster while consuming less energy.

However, these analog devices are prone to hardware errors that can make computations less precise. Microscopic imperfections in hardware components are one cause of these errors. In an optical neural network that has many connected components, errors can quickly accumulate.

Even with error-correction techniques, due to fundamental properties of the devices that make up an optical neural network, some amount of error is unavoidable. A network that is large enough to be implemented in the real world would be far too imprecise to be effective.

MIT researchers have overcome this hurdle and found a way to effectively scale an optical neural network. By adding a tiny hardware component to the optical switches that form the network’s architecture, they can reduce even the uncorrectable errors that would otherwise accumulate in the device.

Their work could enable a super-fast, energy-efficient, analog neural network that can function with the same accuracy as a digital one. With this technique, as an optical circuit becomes larger, the amount of error in its computations actually decreases.  

“This is remarkable, as it runs counter to the intuition of analog systems, where larger circuits are supposed to have higher errors, so that errors set a limit on scalability. This present paper allows us to address the scalability question of these systems with an unambiguous ‘yes,’” says lead author Ryan Hamerly, a visiting scientist in the MIT Research Laboratory for Electronics (RLE) and Quantum Photonics Laboratory and senior scientist at NTT Research.

Hamerly’s co-authors are graduate student Saumil Bandyopadhyay and senior author Dirk Englund, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), leader of the Quantum Photonics Laboratory, and member of the RLE. The research is published today in Nature Communications.

Multiplying with light

An optical neural network is composed of many connected components that function like reprogrammable, tunable mirrors. These tunable mirrors are called Mach-Zehnder interferometers (MZIs). Neural network data are encoded into light, which is fired into the optical neural network from a laser.

A typical MZI contains two mirrors and two beam splitters. Light enters the top of an MZI, where it is split into two parts which interfere with each other before being recombined by the second beam splitter and then reflected out the bottom to the next MZI in the array. Researchers can leverage the interference of these optical signals to perform complex linear algebra operations, known as matrix multiplication, which is how neural networks process data.
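
To make the matrix picture concrete, here is a minimal sketch of a single MZI as a 2x2 matrix acting on the light in its two ports. It assumes a common textbook model, not necessarily the one used in the paper: each beam splitter is a symmetric 2x2 matrix, the tunable element is a phase shift on the top arm, and the mirrors (which only steer the beams) are left out. The function names are illustrative.

```python
import cmath
import math


def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def beam_splitter(alpha=math.pi / 4):
    """2x2 beam-splitter matrix; alpha = pi/4 is an ideal 50:50 splitter."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, 1j * s], [1j * s, c]]


def phase_shifter(theta):
    """Delay the light in the top arm by a phase theta."""
    return [[cmath.exp(1j * theta), 0], [0, 1]]


def mzi(theta, alpha1=math.pi / 4, alpha2=math.pi / 4):
    """Splitter, tunable phase, splitter: the transfer matrix of one MZI."""
    inner = matmul2(phase_shifter(theta), beam_splitter(alpha1))
    return matmul2(beam_splitter(alpha2), inner)


def output_powers(matrix, amplitudes):
    """Apply the matrix to input amplitudes and return the power at each port."""
    out = [sum(matrix[i][j] * amplitudes[j] for j in range(2)) for i in range(2)]
    return [abs(x) ** 2 for x in out]


# Light enters the top port only; sweep the tunable phase.
for theta in (0.0, math.pi / 2, math.pi):
    top, bottom = output_powers(mzi(theta), [1, 0])
    print(f"theta={theta:.2f}  top={top:.3f}  bottom={bottom:.3f}")
```

In this model, tuning the phase moves the light continuously between the two output ports, and chaining many such 2x2 blocks is what builds up a larger matrix multiplication.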

But errors that can occur in each MZI quickly accumulate as light moves from one device to the next. One can avoid some errors by identifying them in advance and tuning the MZIs so earlier errors are cancelled out by later devices in the array.

“It is a very simple algorithm if you know what the errors are. But these errors are notoriously difficult to ascertain because you only have access to the inputs and outputs of your chip,” says Hamerly. “This motivated us to look at whether it is possible to create calibration-free error correction.”

Hamerly and his collaborators previously demonstrated a mathematical technique that went a step further. They could successfully infer the errors and correctly tune the MZIs accordingly, but even this didn’t remove all the error.

Due to the fundamental nature of an MZI, there are instances where it is impossible to tune a device so all light flows out the bottom port to the next MZI. If the device loses a fraction of light at each step and the array is very large, by the end there will only be a tiny bit of power left.
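
The limit described above can be illustrated with a rough sketch. It assumes a simple two-splitter model in which each beam splitter deviates from the ideal 50:50 ratio by a small angle; the 0.05-radian error is a hypothetical value chosen for illustration, not a figure from the study. In this toy model the light is not absorbed but leaves through the wrong port, which has the same effect on the power reaching the next device.

```python
import cmath
import math


def bottom_power(theta, err1, err2):
    """Fraction of top-port input power leaving the bottom port of a two-splitter MZI.

    err1 and err2 are deviations of each splitter angle from the ideal pi/4.
    """
    a1 = math.pi / 4 + err1
    a2 = math.pi / 4 + err2
    amplitude = (cmath.exp(1j * theta) * math.sin(a2) * math.cos(a1)
                 + math.cos(a2) * math.sin(a1))
    return abs(amplitude) ** 2


def best_bottom_power(err1, err2, steps=10_000):
    """Scan the tunable phase and return the largest reachable bottom-port power."""
    return max(bottom_power(2 * math.pi * k / steps, err1, err2) for k in range(steps))


ideal = best_bottom_power(0.0, 0.0)
flawed = best_bottom_power(0.05, 0.05)  # hypothetical 0.05 rad splitter errors
print(f"ideal splitters:  {ideal:.4f}")
print(f"flawed splitters: {flawed:.4f}")

# Compound the per-device shortfall over a chain of devices.
for n in (10, 100, 1000):
    print(f"{n:5d} devices in series -> {flawed ** n:.6f} of the input power")
```

No setting of the phase recovers the missing fraction in the flawed device, and a shortfall of about 1 percent per device compounds to almost nothing after a thousand devices.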

“Even with error correction, there is a fundamental limit to how good a chip can be. MZIs are physically unable to realize certain settings they need to be configured to,” he says.

So, the team developed a new type of MZI. The researchers added another beam splitter to the end of the device, calling it a 3-MZI because it has three beam splitters instead of two. Due to the way this additional beam splitter mixes the light, it becomes much easier for an MZI to reach the setting it needs to send all light out through its bottom port.

Importantly, the additional beam splitter is only a few micrometers in size and is a passive component, so it doesn’t require any extra wiring. Adding additional beam splitters doesn’t significantly change the size of the chip.

Bigger chip, fewer errors

When the researchers conducted simulations to test their architecture, they found that it can eliminate much of the uncorrectable error that hampers accuracy. And as the optical neural network becomes larger, the amount of error in the device actually drops — the opposite of what happens in a device with standard MZIs.

Using 3-MZIs, they could potentially create a device big enough for commercial uses with error that has been reduced by a factor of 20, Hamerly says.

The researchers also developed a variant of the MZI design specifically for correlated errors. These occur due to manufacturing imperfections — if the thickness of a chip is slightly wrong, the MZIs may all be off by about the same amount, so the errors are all about the same. They found a way to change the configuration of an MZI to make it robust to these types of errors. This technique also increased the bandwidth of the optical neural network so it can run three times faster.

Now that they have showcased these techniques using simulations, Hamerly and his collaborators plan to test these approaches on physical hardware and continue driving toward an optical neural network they can effectively deploy in the real world.

This research is funded, in part, by a National Science Foundation graduate research fellowship and the U.S. Air Force Office of Scientific Research.

The task of magnetic classification suddenly looks easier

Knowing the magnetic structure of crystalline materials is critical to many applications, including data storage, high-resolution imaging, spintronics, superconductivity, and quantum computing. Information of this sort, however, is difficult to come by. Although magnetic structures can be obtained from neutron diffraction and scattering studies, the number of machines that can support these analyses — and the time available at these facilities — is severely limited.

As a result, the magnetic structures of only about 1,500 materials worked out experimentally have been tabulated to date. Researchers have also predicted magnetic structures by numerical means, but lengthy calculations are required, even on large, state-of-the-art supercomputers. These calculations, moreover, become increasingly expensive, with power demands growing exponentially, as the size of the crystal structures under consideration goes up.

Now, researchers at MIT, Harvard University, and Clemson University — led by Mingda Li, MIT assistant professor of nuclear science and engineering, and Tess Smidt, MIT assistant professor of electrical engineering and computer science — have found a way to streamline this process by employing the tools of machine learning. “This might be a quicker and cheaper approach,” Smidt says.

The team’s results were recently published in the journal iScience. One unusual feature of this paper, apart from its novel findings, is that its first authors are three MIT undergraduates — Helena Merker, Harry Heiberger, and Linh Nguyen — plus one PhD student, Tongtong Liu.

Merker, Heiberger, and Nguyen joined the project as first-years in fall 2020, and they were given a sizable challenge: to design a neural network that can predict the magnetic structure of crystalline materials. They did not start from scratch, however, making use of “equivariant Euclidean neural networks” that were co-invented by Smidt in 2018. The advantage of this kind of network, Smidt explains, “is that we won’t get a different prediction for the magnetic order if a crystal is rotated or translated, which we know should not affect the magnetic properties.” That feature is especially helpful for examining 3D materials.
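
The property Smidt describes can be illustrated with a much simpler stand-in. The sketch below is not an equivariant Euclidean neural network; it only shows that a description built from interatomic distances stays the same when a structure is rotated or translated, so any prediction computed from it would stay the same too. The atom positions are hypothetical.

```python
import math


def pairwise_distances(points):
    """Sorted list of all interatomic distances: unchanged by rotation or translation."""
    dists = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dists.append(math.dist(points[i], points[j]))
    return sorted(dists)


def rotate_z(points, angle):
    """Rotate every point about the z-axis by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, shift):
    """Shift every point by the same vector."""
    return [tuple(p + d for p, d in zip(point, shift)) for point in points]


# Hypothetical atom positions, for illustration only.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.5, 1.5)]
moved = translate(rotate_z(atoms, 0.7), (3.0, -1.0, 2.0))

before = pairwise_distances(atoms)
after = pairwise_distances(moved)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)))  # True
```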

The elements of structure

The MIT group drew upon a database of nearly 150,000 substances compiled by the Materials Project at the Lawrence Berkeley National Laboratory, which provided information concerning the arrangement of atoms in the crystal lattice. The team used this input to assess two key properties of a given material: magnetic order and magnetic propagation.

Figuring out the magnetic order involves classifying materials into three categories: ferromagnetic, antiferromagnetic, and nonmagnetic. The atoms in a ferromagnetic material act like little magnets with their own north and south poles. Each atom has a magnetic moment, which points from its south to north pole. In a ferromagnetic material, Liu explains, “all the atoms are lined up in the same direction — the direction of the combined magnetic field produced by all of them.” In an antiferromagnetic material, the magnetic moments of the atoms point in a direction opposite to that of their neighbors — canceling each other out in an orderly pattern that yields zero magnetization overall. In a nonmagnetic material, all the atoms could be nonmagnetic, having no magnetic moments whatsoever. Or the material could contain magnetic atoms, but their magnetic moments would point in random directions so that the net result, again, is zero magnetism.
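
As a toy illustration of the three categories, consider a one-dimensional chain of atoms whose moments point up (+1), down (-1), or are absent (0). The sketch below is an assumption-laden simplification, not the team's method: it calls a chain ferromagnetic if the moments mostly add up, antiferromagnetic if they mostly alternate, and nonmagnetic otherwise. The threshold is arbitrary.

```python
def classify_chain(moments, threshold=0.5):
    """Toy classifier for a 1D chain of moments (+1 up, -1 down, 0 no moment)."""
    n = len(moments)
    # Net magnetization: large when neighbors point the same way.
    net = abs(sum(moments)) / n
    # Staggered magnetization: large when neighbors alternate in an orderly pattern.
    staggered = abs(sum(m if i % 2 == 0 else -m for i, m in enumerate(moments))) / n
    if net > threshold:
        return "ferromagnetic"
    if staggered > threshold:
        return "antiferromagnetic"
    return "nonmagnetic"


print(classify_chain([1, 1, 1, 1, 1, 1, 1, 1]))        # all aligned
print(classify_chain([1, -1, 1, -1, 1, -1, 1, -1]))    # orderly alternation
print(classify_chain([0, 0, 0, 0, 0, 0, 0, 0]))        # no magnetic atoms
print(classify_chain([1, -1, -1, 1, 1, 1, -1, -1]))    # magnetic atoms, no order
```

The last two chains both come out nonmagnetic, matching the two cases in the text: no moments at all, or moments that cancel without any orderly pattern.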

The concept of magnetic propagation relates to the periodicity of a material’s magnetic structure. If you think of a crystal as a 3D arrangement of bricks, a unit cell is the smallest possible building block — the smallest number, and configuration, of atoms that can make up an individual “brick.” If the magnetic moments of every unit cell are aligned, the MIT researchers accorded the material a propagation value of zero. However, if the magnetic moment changes direction, and hence “propagates,” in moving from one cell to the next, the material is given a non-zero propagation value.

A network solution

So much for the goals. How can machine learning tools help achieve them? The students’ first step was to take a portion of the Materials Project database to train the neural network to find correlations between a material’s crystalline structure and its magnetic structure. The students also learned — through educated guesses and trial-and-error — that they achieved the best results when they included not just information about the atoms’ lattice positions, but also the atomic weight, atomic radius, electronegativity (which reflects an atom’s tendency to attract an electron), and dipole polarizability (which indicates how far the electron is from the atom’s nucleus). During the training process, a large number of so-called “weights” are repeatedly fine-tuned.

“A weight is like the coefficient m in the equation y = mx + b,” Heiberger explains. “Of course, the actual equation, or algorithm, we use is a lot messier, with not just one coefficient but perhaps a hundred; x, in this case, is the input data, and you choose m so that y is predicted most accurately. And sometimes you have to change the equation itself to get a better fit.”

Next comes the testing phase. “The weights are kept as-is,” Heiberger says, “and you compare the predictions you get to previously established values [also found in the Materials Project database].”
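
Heiberger's one-weight analogy can be written out directly. The following is a minimal sketch, assuming plain gradient descent on squared error and made-up data drawn from y = 2x + 1; the real network has many more weights and a far messier equation. Training tunes m and b, and testing then keeps them fixed and checks predictions on data the model has not seen.

```python
def fit_line(xs, ys, lr=0.01, epochs=5000):
    """Tune the weights m and b by gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b


def mean_squared_error(m, b, xs, ys):
    """Average squared gap between predictions and known values."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated from y = 2x + 1, split into training and test sets.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]
test_x = [5.0, 6.0]
test_y = [11.0, 13.0]

# Training phase: the weights are adjusted repeatedly.
m, b = fit_line(train_x, train_y)
print(f"m={m:.3f}, b={b:.3f}")

# Testing phase: the weights are kept as-is and only evaluated.
print(f"test error: {mean_squared_error(m, b, test_x, test_y):.6f}")
```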

As reported in iScience, the model had an average accuracy of about 78 percent and 74 percent, respectively, for predicting magnetic order and propagation. The accuracy for predicting the order of nonmagnetic materials was 91 percent, even if the material contained magnetic atoms.

Charting the road ahead

The MIT investigators believe this approach could be applied to large molecules whose atomic structures are hard to discern and even to alloys, which lack crystalline structures. “The strategy there is to take as big a unit cell — as big a sample — as possible and try to approximate it as a somewhat disordered crystal,” Smidt says.

The current work, the authors wrote, represents one step toward “solving the grand challenge of full magnetic structure determination.” The “full structure” in this case means determining “the specific magnetic moments of every atom, rather than the overall pattern of the magnetic order,” Smidt explains.

“We have the math in place to take this on,” Smidt adds, “though there are some tricky details to be worked out. It’s a project for the future, but one that appears to be within reach.”

The undergraduates won’t participate in that effort, having already completed their work in this venture. Nevertheless, they all appreciated the research experience. “It was great to pursue a project outside the classroom that gave us the chance to create something exciting that didn’t exist before,” Merker says.

“This research, entirely led by undergraduates, started in 2020 when they were first-years. With Institute support from the ELO [Experiential Learning Opportunities] program and later guidance from PhD student Tongtong Liu, we were able to bring them together even while physically remote from each other. This work demonstrates how we can expand the first-year learning experience to include a real research product,” Li adds. “Being able to support this kind of collaboration and learning experience is what every educator strives for. It is wonderful to see their hard work and commitment result in a contribution to the field.”

“This really was a life-changing experience,” Nguyen agrees. “I thought it would be fun to combine computer science with the material world. That turned out to be a pretty good choice.”

Teresa Gao named 2024 Mitchell Scholar

MIT senior Teresa Gao has been named one of the 12 winners of the George J. Mitchell Scholarship’s Class of 2024. After graduating next spring with a double major in computer science and engineering as well as brain and cognitive sciences, she will study augmented and virtual reality at Trinity College Dublin. Gao is the fifth MIT student to be named a Mitchell Scholar.

Mitchell Scholars are selected on the basis of academic achievement, leadership, and dedication to public service. The scholarship is named in honor of U.S. Senator George Mitchell’s contributions to the Northern Ireland peace process. This year, over 300 American students were endorsed to apply for the prestigious fellowship, which is sponsored by the U.S.-Ireland Alliance and funds a year of graduate studies in Ireland.

“Teresa’s excellent work at the intersections of engineering, music, and science communication make the Mitchell Scholarship in Ireland a perfect fit for her next step,” says Kim Benard, associate dean of distinguished fellowships in Career Advising and Professional Development. “We are proud that she will be representing MIT there, as she exemplifies the mind and hand ethos of our education.”

Gao, a resident of Provo, Utah, is interested in artificial intelligence and the development of autonomous agents. She has conducted research in a range of fields, including psycholinguistics in the Department of Brain and Cognitive Sciences, social robots for mental health in the Media Lab, and machine learning architectures for biological images at the Broad Institute. Currently, she is working to establish cognitive benchmarks for AI with the MIT Quest for Intelligence.

Gao’s love for science is only equaled by her passion for creativity and the arts. She hosts an educational radio show, “Psycholochat: Where Neuroscience Meets Philosophy,” on the MIT campus radio station WMBR 88.1 FM, where she investigates topics in psychology, neuroscience, and philosophy.

Completely self-taught on the viola, Gao earned a highly competitive seat in the MIT Chamber Music Society. She also serves as co-president of Ribotones, a student group that plays music in service to hospital patients and nursing home residents throughout the Greater Boston community, and she performs with the competitive MIT Bhangra dance team.

Outside of the arts, Gao tutors fellow MIT students through the IEEE-Eta Kappa Nu Honor Society, manages logistics for the annual Battlecode programming competition run by MIT’s computer science department, and volunteers with the peer support anonymous campus textline, Lean On Me.

A simpler path to better computer vision

Before a machine-learning model can complete a task, such as identifying cancer in medical images, the model must be trained. Training image classification models typically involves showing the model millions of example images gathered into a massive dataset.

However, using real image data can raise practical and ethical concerns: The images could run afoul of copyright laws, violate people’s privacy, or be biased against a certain racial or ethnic group. To avoid these pitfalls, researchers can use image generation programs to create synthetic data for model training. But these techniques are limited because expert knowledge is often needed to hand-design an image generation program that can create effective training data. 

Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere took a different approach. Instead of designing customized image generation programs for a particular training task, they gathered a dataset of 21,000 publicly available programs from the internet. Then they used this large collection of basic image generation programs to train a computer vision model.

These programs produce diverse images that display simple colors and textures. The researchers didn’t curate or alter the programs, which each comprised just a few lines of code.

The models they trained with this large dataset of programs classified images more accurately than other synthetically trained models. And, while their models underperformed those trained with real data, the researchers showed that increasing the number of image programs in the dataset also increased model performance, revealing a path to attaining higher accuracy.

“It turns out that using lots of programs that are uncurated is actually better than using a small set of programs that people need to manipulate. Data are important, but we have shown that you can go pretty far without real data,” says Manel Baradad, an electrical engineering and computer science (EECS) graduate student working in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper describing this technique.

Co-authors include Tongzhou Wang, an EECS grad student in CSAIL; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL; and senior author Phillip Isola, an associate professor in EECS and CSAIL; along with others at JPMorgan Chase Bank and Xyla, Inc. The research will be presented at the Conference on Neural Information Processing Systems. 

Rethinking pretraining

Machine-learning models are typically pretrained, which means they are trained on one dataset first to help them build parameters that can be used to tackle a different task. A model for classifying X-rays might be pretrained using a huge dataset of synthetically generated images before it is trained for its actual task using a much smaller dataset of real X-rays.

These researchers previously showed that they could use a handful of image generation programs to create synthetic data for model pretraining, but the programs needed to be carefully designed so the synthetic images matched up with certain properties of real images. This made the technique difficult to scale up.

In the new work, they used an enormous dataset of uncurated image generation programs instead.

They began by gathering a collection of 21,000 image generation programs from the internet. All the programs are written in a simple programming language and comprise just a few snippets of code, so they generate images rapidly.

“These programs have been designed by developers all over the world to produce images that have some of the properties we are interested in. They produce images that look kind of like abstract art,” Baradad explains.

These simple programs can run so quickly that the researchers didn’t need to produce images in advance to train the model. The researchers found they could generate images and train the model simultaneously, which streamlines the process.
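
The idea of generating images while training, rather than storing them first, can be sketched with a Python generator. The two tiny "programs" below are hypothetical stand-ins written for this illustration; they are not taken from the researchers' collection, and the article does not say which language those programs use.

```python
import math
import random


def stripes(rng, size):
    """A tiny procedural 'program': sinusoidal stripes at a random angle."""
    angle = rng.uniform(0, math.pi)
    freq = rng.uniform(0.2, 1.0)
    return [
        [0.5 + 0.5 * math.sin(freq * (x * math.cos(angle) + y * math.sin(angle)))
         for x in range(size)]
        for y in range(size)
    ]


def checkerboard(rng, size):
    """Another tiny 'program': a checkerboard with a random cell width."""
    cell = rng.randint(1, 4)
    return [[float(((x // cell) + (y // cell)) % 2) for x in range(size)]
            for y in range(size)]


PROGRAMS = [stripes, checkerboard]  # stand-ins for the 21,000 collected programs


def image_stream(seed, size=8):
    """Yield (image, program_index) pairs forever, generating each image on demand."""
    rng = random.Random(seed)
    while True:
        index = rng.randrange(len(PROGRAMS))
        yield PROGRAMS[index](rng, size), index


# A training loop would consume the stream directly instead of reading a stored dataset.
stream = image_stream(seed=0)
for step in range(3):
    image, index = next(stream)
    mean_pixel = sum(map(sum, image)) / 64
    print(step, PROGRAMS[index].__name__, f"mean pixel = {mean_pixel:.3f}")
```

Because each image is produced only when the training loop asks for it, nothing has to be rendered or stored in advance.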

They used their massive dataset of image generation programs to pretrain computer vision models for both supervised and unsupervised image classification tasks. In supervised learning, the image data are labeled, while in unsupervised learning the model learns to categorize images without labels.

Improving accuracy

When they compared their pretrained models to state-of-the-art computer vision models that had been pretrained using synthetic data, their models were more accurate, meaning they put images into the correct categories more often. While the accuracy levels were still less than models trained on real data, their technique narrowed the performance gap between models trained on real data and those trained on synthetic data by 38 percent.

“Importantly, we show that for the number of programs you collect, performance scales logarithmically. We do not saturate performance, so if we collect more programs, the model would perform even better. So, there is a way to extend our approach,” Baradad says.
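
Logarithmic scaling has a simple consequence that a few lines of code can show: every doubling of the number of programs adds the same amount of accuracy, so gains slow down but never stop. The coefficients below are hypothetical placeholders, not values reported by the researchers.

```python
import math


def accuracy(num_programs, a=10.0, b=5.0):
    """Hypothetical log-scaling law: accuracy = a + b * ln(number of programs)."""
    return a + b * math.log(num_programs)


previous = None
for n in (1_000, 2_000, 4_000, 8_000, 16_000):
    current = accuracy(n)
    gain = "" if previous is None else f"  gain from doubling: {current - previous:.3f}"
    print(f"{n:6d} programs -> {current:.3f}{gain}")
    previous = current
```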

The researchers also used each individual image generation program for pretraining, in an effort to uncover factors that contribute to model accuracy. They found that when a program generates a more diverse set of images, the model performs better. They also found that colorful images with scenes that fill the entire canvas tend to improve model performance the most.

Now that they have demonstrated the success of this pretraining approach, the researchers want to extend their technique to other types of data, such as multimodal data that include text and images. They also want to continue exploring ways to improve image classification performance.

“There is still a gap to close with models trained on real data. This gives our research a direction that we hope others will follow,” he says.

A far-sighted approach to machine learning

Picture two teams squaring off on a football field. The players can cooperate to achieve an objective, and compete against other players with conflicting interests. That’s how the game works.

Creating artificial intelligence agents that can learn to compete and cooperate as effectively as humans remains a thorny problem. A key challenge is enabling AI agents to anticipate future behaviors of other agents when they are all learning simultaneously.

Because of the complexity of this problem, current approaches tend to be myopic; the agents can only guess the next few moves of their teammates or competitors, which leads to poor performance in the long run. 

Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere have developed a new approach that gives AI agents a farsighted perspective. Their machine-learning framework enables cooperative or competitive AI agents to consider what other agents will do as time approaches infinity, not just over a few next steps. The agents then adapt their behaviors accordingly to influence other agents’ future behaviors and arrive at an optimal, long-term solution.

This framework could be used by a group of autonomous drones working together to find a lost hiker in a thick forest, or by self-driving cars that strive to keep passengers safe by anticipating future moves of other vehicles driving on a busy highway.

“When AI agents are cooperating or competing, what matters most is when their behaviors converge at some point in the future. There are a lot of transient behaviors along the way that don’t matter very much in the long run. Reaching this converged behavior is what we really care about, and we now have a mathematical way to enable that,” says Dong-Ki Kim, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS) and lead author of a paper describing this framework.

The senior author is Jonathan P. How, the Richard C. Maclaurin Professor of Aeronautics and Astronautics and a member of the MIT-IBM Watson AI Lab. Co-authors include others at the MIT-IBM Watson AI Lab, IBM Research, Mila-Quebec Artificial Intelligence Institute, and Oxford University. The research will be presented at the Conference on Neural Information Processing Systems.

More agents, more problems

The researchers focused on a problem known as multiagent reinforcement learning. Reinforcement learning is a form of machine learning in which an AI agent learns by trial and error. Researchers give the agent a reward for “good” behaviors that help it achieve a goal. The agent adapts its behavior to maximize that reward until it eventually becomes an expert at a task.
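
Trial-and-error learning of this kind can be sketched with one of the simplest reinforcement learning setups, an epsilon-greedy agent choosing among a few actions. This is a generic single-agent example, not the researchers' method, and the reward probabilities are hypothetical.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running estimate of each action's reward
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore a random action
        else:
            action = max(range(len(reward_probs)), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # incremental mean
    return values, counts


# Hypothetical task: three actions that pay off with different probabilities.
values, counts = run_bandit([0.2, 0.5, 0.8])
print("estimated rewards:", [round(v, 2) for v in values])
print("times each action was tried:", counts)
```

Over time the agent spends most of its tries on the action that pays off most often, which is the sense in which it adapts its behavior to maximize reward.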

But when many cooperative or competing agents are simultaneously learning, things become increasingly complex. As agents consider more future steps of their fellow agents, and how their own behavior influences others, the problem soon requires far too much computational power to solve efficiently. This is why other approaches only focus on the short term.

“The AIs really want to think about the end of the game, but they don’t know when the game will end. They need to think about how to keep adapting their behavior into infinity so they can win at some far time in the future. Our paper essentially proposes a new objective that enables an AI to think about infinity,” says Kim.

But since it is impossible to plug infinity into an algorithm, the researchers designed their system so agents focus on a future point where their behavior will converge with that of other agents, known as equilibrium. An equilibrium point determines the long-term performance of agents, and multiple equilibria can exist in a multiagent scenario. Therefore, an effective agent actively influences the future behaviors of other agents in such a way that they reach a desirable equilibrium from the agent’s perspective. If all agents influence each other, they converge to a general concept that the researchers call an “active equilibrium.”
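
A small example shows how several equilibria can coexist and why an agent might care which one is reached. The sketch below uses a hypothetical two-player coordination game and checks for ordinary pure-strategy Nash equilibria, a standard textbook notion that is simpler than the “active equilibrium” the researchers define.

```python
from itertools import product

# Hypothetical game: payoffs[(a, b)] = (reward to player 1, reward to player 2).
payoffs = {
    ("left", "left"): (2, 1),
    ("left", "right"): (0, 0),
    ("right", "left"): (0, 0),
    ("right", "right"): (1, 2),
}
actions = ["left", "right"]


def is_equilibrium(a, b):
    """True if neither player can gain by unilaterally switching actions."""
    best_for_1 = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in actions)
    best_for_2 = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in actions)
    return best_for_1 and best_for_2


equilibria = [(a, b) for a, b in product(actions, actions) if is_equilibrium(a, b)]
print(equilibria)  # [('left', 'left'), ('right', 'right')]
```

Both outcomes are stable, but player 1 earns more at (left, left) and player 2 earns more at (right, right), so each has a reason to steer the other's learning toward its preferred equilibrium.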

The machine-learning framework they developed, known as FURTHER (which stands for FUlly Reinforcing acTive influence witH averagE Reward), enables agents to learn how to adapt their behaviors as they interact with other agents to achieve this active equilibrium.

FURTHER does this using two machine-learning modules. The first, an inference module, enables an agent to guess the future behaviors of other agents and the learning algorithms they use, based solely on their prior actions.

This information is fed into the reinforcement learning module, which the agent uses to adapt its behavior and influence other agents in a way that maximizes its reward.
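
The division of labor between the two modules can be sketched in a heavily simplified form. The code below is not FURTHER: its inference step only counts how often the other agent took each action, and its decision step picks a one-shot best response instead of learning with average-reward reinforcement learning. The reward table and action history are hypothetical.

```python
from collections import Counter

# Hypothetical rewards for our agent: REWARD[my_action][their_action].
REWARD = {
    "left": {"left": 2, "right": 0},
    "right": {"left": 0, "right": 1},
}


def infer_policy(observed_actions):
    """Inference step: estimate the other agent's action probabilities from its history."""
    counts = Counter(observed_actions)
    total = len(observed_actions)
    return {action: counts[action] / total for action in ("left", "right")}


def best_response(policy):
    """Decision step: pick the action with the highest expected reward under the estimate."""
    expected = {
        mine: sum(policy[theirs] * REWARD[mine][theirs] for theirs in policy)
        for mine in REWARD
    }
    return max(expected, key=expected.get), expected


history = ["right", "right", "left", "right", "right", "right"]
policy = infer_policy(history)
choice, expected = best_response(policy)
print(policy)
print(expected)
print(choice)
```

A purely reactive agent like this one simply goes along with what the other agent tends to do. The framework described here goes further by letting the agent act in ways that change what the other agent will do later.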

“The challenge was thinking about infinity. We had to use a lot of different mathematical tools to enable that, and make some assumptions to get it to work in practice,” Kim says.

Winning in the long run

They tested their approach against other multiagent reinforcement learning frameworks in several different scenarios, including a pair of robots fighting sumo-style and a battle pitting two 25-agent teams against one another. In both instances, the AI agents using FURTHER won the games more often.

Since their approach is decentralized, which means the agents learn to win the games independently, it is also more scalable than other methods that require a central computer to control the agents, Kim explains.

The researchers used games to test their approach, but FURTHER could be used to tackle any kind of multiagent problem. For instance, it could be applied by economists seeking to develop sound policy in situations where many interacting entities have behaviors and interests that change over time.

Economics is one application Kim is particularly excited about studying. He also wants to dig deeper into the concept of an active equilibrium and continue enhancing the FURTHER framework.

This research is funded, in part, by the MIT-IBM Watson AI Lab.

From LGO to PhD

Many students in MIT’s Leaders for Global Operations (LGO) program return to the workplace primed to tackle complex operational problems. But sometimes their research sparks deep scholarly interest, and they bring their LGO toolkit into an academic career instead.

That was the case for Jimmy Smith SM ’18, MBA ’18, who’s currently pursuing a PhD in computational mathematics at Stanford University. He specializes in machine learning models for sequence data.

Smith was ready to accelerate his career as a drilling engineer at BP Exploration Alaska, so he enrolled in the LGO program for experience in engineering management. There, he worked with Goodyear to develop machine learning algorithms to automate a tire-inspection process. He realized that he wanted to explore machine learning even more deeply. Instead of polishing his resume, he began preparing applications to PhD programs, all with the support of MIT mentors.

“LGO opened up the world for me. Getting exposure to so many students and faculty with different interests helped me to gain a better understanding of what I wanted out of my career,” he says.

Smith’s advisor, mechanical engineering professor David Hardt, praises LGO’s natural link between industry and academia. Having worked closely with Smith in his manufacturing statistics class and applauding his curiosity, he wrote Smith a glowing recommendation letter for Stanford.

With LGO, “You get a holistic perspective,” Hardt says. “While LGO students are not research students — they’re professionals doing a project in an industry that ends up as a thesis — Jimmy was asking the probing questions that you’d want in a PhD student.” 

While the LGO program isn’t a traditional training ground for PhD candidates, it’s a highly useful one, Smith says. The work he pursued, initially for a professional edge, ended up blossoming into an intellectual passion.

“LGO exposed me to the machine-learning, AI-type things that I’m interested in now. The master’s thesis component gave me an opportunity to do meaningful research, and working with faculty advisors at MIT gave me a better sense of what doing research full time as a PhD student would be like,” he explains. “I realized it was something I was really interested in and excited about.” 

Smith’s revelatory experience isn’t unusual for LGO students, says MIT LGO executive director Thomas Roemer.

“Students come to us because they want to change direction in life in some way. And some, while at MIT, discover how much they love learning and how much they love being at a university. They may get inspired by professors and say, ‘Hey, this is what I would like to do: become a professor myself,’” he says.

Like Smith, Audrey Bazerghi SM ’20, MBA ’20, a former management consultant, didn’t enter MIT with the desire to pursue a PhD. Before enrolling, she worked for Oliver Wyman, focusing on the manufacturing, transportation, and energy space. She was at a career crossroads and wanted to refine her math and modeling skills. She graduated with a newfound passion for research.

“I focused a lot of my coursework at MIT on supply chain and on questions regarding procurement or logistics that I ran into in my time as a consultant. I discovered through my LGO internship and thesis requirements that I really enjoyed research,” she recalls. “LGO allowed me to discover that I liked it enough to do it full time.”

Now she’s a second-year PhD student at Northwestern University’s Kellogg School of Management, focusing on operations management. Bazerghi hopes to teach master’s of business administration students, ideally helping them to apply cutting-edge operations knowledge to their respective industries. It’s a logical extension of the hands-on education she received at MIT.

“That’s what LGO really is about: How do we organize work so that it serves its purpose? And I think it’s more relevant than ever that people understand that,” says Deishin Lee ’90, SM ’92.

To that end, she’s now an associate professor of operations management and sustainability at Ivey Business School in London, Ontario. The LGO program — at the time called Leaders for Manufacturing — imparted an appreciation for the connection between academics and practicality, which she now shares with her students. In fact, Lee worked at Motorola for seven years before obtaining her PhD. It was a useful strategy. The real-world experience she got on the job helps her teach organizational pain points from a lived perspective.

“The problem is that sometimes students don’t have an appreciation for the problems organizations have. It’s difficult to evaluate the effectiveness of various solutions, if you don’t understand the problem — that comes from an understanding of how organizations work,” she says. “LGO was enormously helpful because we saw so many different organizations, and we had so many managers come and talk to us.”

While the vast majority of LGO alumni reenter the workforce, Roemer hopes that prospective students enter MIT with an open mind.

“[The LGO program] is a life-changing opportunity that will really have a huge impact on their future lives, not so much in terms of careers — of course they’ll have great careers — but in terms of how they look at the world,” he says. “And that transition, in those two years, may go in all sorts of directions.”
