Strengthening trust in machine-learning models

Probabilistic machine learning methods are becoming increasingly powerful tools in data analysis, informing a range of critical decisions across disciplines and applications, from forecasting election results to predicting the impact of microloans on poverty.

This class of methods uses sophisticated concepts from probability theory to handle uncertainty in decision-making. But the math is only one piece of the puzzle in determining their accuracy and effectiveness. In a typical data analysis, researchers make many subjective choices, or potentially introduce human error, that must also be assessed in order to cultivate users’ trust in the quality of decisions based on these methods.

To address this issue, MIT computer scientist Tamara Broderick, associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Laboratory for Information and Decision Systems (LIDS), and a team of researchers have developed a classification system — a “taxonomy of trust” — that defines where trust might break down in a data analysis and identifies strategies to strengthen trust at each step. The other researchers on the project are Professor Anna Smith at the University of Kentucky, professors Tian Zheng and Andrew Gelman at Columbia University, and Professor Rachael Meager at the London School of Economics. The team’s hope is to highlight concerns that are already well-studied and those that need more attention.

In their paper, published in February in Science Advances, the researchers begin by detailing the steps in the data analysis process where trust might break down: Analysts make choices about what data to collect and which models, or mathematical representations, most closely mirror the real-life problem or question they are aiming to answer. They select algorithms to fit the model and use code to run those algorithms. Each of these steps poses unique challenges around building trust. Some components can be checked for accuracy in measurable ways. “Does my code have bugs?”, for example, is a question that can be tested against objective criteria. Other times, problems are more subjective, with no clear-cut answers; analysts are confronted with numerous strategies to gather data and decide whether a model reflects the real world.

“What I think is nice about making this taxonomy, is that it really highlights where people are focusing. I think a lot of research naturally focuses on this level of ‘are my algorithms solving a particular mathematical problem?’ in part because it’s very objective, even if it’s a hard problem,” Broderick says.

“I think it’s really hard to answer ‘is it reasonable to mathematize an important applied problem in a certain way?’ because it’s somehow getting into a harder space, it’s not just a mathematical problem anymore.”

Capturing real life in a model

The researchers’ work in categorizing where trust breaks down, though it may seem abstract, is rooted in real-world application.

Meager, a co-author on the paper, analyzed whether microfinance programs can have a positive effect in a community. The project became a case study for where trust could break down, and for ways to reduce this risk.

At first look, measuring the impact of microfinancing might seem like a straightforward endeavor. But like any analysis, researchers meet challenges at each step in the process that can affect trust in the outcome. Microfinancing — in which individuals or small businesses receive small loans and other financial services in lieu of conventional banking — can offer different services, depending on the program. For the analysis, Meager gathered datasets from microfinance programs in countries across the globe, including in Mexico, Mongolia, Bosnia, and the Philippines.

When combining conspicuously distinct datasets, in this case from multiple countries and across different cultures and geographies, researchers must evaluate whether specific case studies can reflect broader trends. It is also important to contextualize the data on hand. For example, in rural Mexico, owning goats may be counted as an investment.

“It’s hard to measure the quality of life of an individual. People measure things like, ‘What’s the business profit of the small business?’ Or ‘What’s the consumption level of a household?’ There’s this potential for mismatch between what you ultimately really care about, and what you’re measuring,” Broderick says. “Before we get to the mathematical level, what data and what assumptions are we leaning on?”

With data on hand, analysts must define the real-world questions they seek to answer. In the case of evaluating the benefits of microfinancing, analysts must define what they consider a positive outcome. It is standard in economics, for example, to measure the average financial gain per business in communities where a microfinance program is introduced. But reporting an average might suggest a net positive effect even if only a few (or even one) person benefited, instead of the community as a whole.

“What you really wanted was that a lot of people are benefiting,” Broderick says. “It sounds simple. Why didn’t we measure the thing that we cared about? But I think it’s really common that practitioners use standard machine learning tools, for a lot of reasons. And these tools might report a proxy that doesn’t always agree with the quantity of interest.”
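
To make that concern concrete, here is a tiny, purely hypothetical illustration (the profit figures below are invented, not drawn from Meager’s data): a single large gain can make the community-wide average look healthy even when almost nobody benefits.

```python
# Hypothetical profits (in dollars) for 10 small businesses after a
# microfinance program is introduced; every number here is invented.
profits = [0, 0, 0, 0, 0, 0, 0, 0, 0, 500]

average_gain = sum(profits) / len(profits)                      # 50.0
share_benefiting = sum(p > 0 for p in profits) / len(profits)   # 0.1

print(f"average gain per business: {average_gain:.1f}")
print(f"fraction of businesses that benefited: {share_benefiting:.0%}")
# The average suggests a broad positive effect, yet only one business in ten
# gained anything: the proxy (mean gain) and the quantity of interest
# (how many people benefit) disagree.
```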

Analysts may consciously or subconsciously favor models they are familiar with, especially after investing a great deal of time learning their ins and outs. “Someone might be hesitant to try a nonstandard method because they might be less certain they will use it correctly. Or peer review might favor certain familiar methods, even if a researcher might like to use nonstandard methods,” Broderick says. “There are a lot of reasons, sociologically. But this can be a concern for trust.”

Final step, checking the code 

While distilling a real-life problem into a model can be a big-picture, amorphous problem, checking the code that runs an algorithm can feel “prosaic,” Broderick says. But it is another potentially overlooked area where trust can be strengthened.

In some cases, checking a coding pipeline that executes an algorithm might be considered outside the purview of an analyst’s job, especially when there is the option to use standard software packages.

One way to catch bugs is to test whether code is reproducible. Depending on the field, however, sharing code alongside published work is not always a requirement or the norm. As models increase in complexity over time, it becomes harder to recreate code from scratch. Reproducing a model becomes difficult or even impossible.

“Let’s just start with every journal requiring you to release your code. Maybe it doesn’t get totally double-checked, and everything isn’t absolutely perfect, but let’s start there,” Broderick says, as one step toward building trust.

Paper co-author Gelman worked on an analysis that forecast the 2020 U.S. presidential election using state and national polls in real time. The team published daily updates in The Economist magazine, while also publishing their code online for anyone to download and run themselves. Throughout the season, outsiders pointed out both bugs and conceptual problems in the model, ultimately contributing to a stronger analysis.

The researchers acknowledge that while there is no single solution to create a perfect model, analysts and scientists have the opportunity to reinforce trust at nearly every turn.

“I don’t think we expect any of these things to be perfect,” Broderick says, “but I think we can expect them to be better or to be as good as possible.”

Learning to grow machine-learning models

It’s no secret that OpenAI’s ChatGPT has some incredible capabilities — for instance, the chatbot can write poetry that resembles Shakespearean sonnets or debug code for a computer program. These abilities are made possible by the massive machine-learning model that ChatGPT is built upon. Researchers have found that when these types of models become large enough, extraordinary capabilities emerge.

But bigger models also require more time and money to train. The training process involves showing hundreds of billions of examples to a model. Gathering so much data is an involved process in itself. Then come the monetary and environmental costs of running many powerful computers for days or weeks to train a model that may have billions of parameters. 

“It’s been estimated that training models at the scale of what ChatGPT is hypothesized to run on could take millions of dollars, just for a single training run. Can we improve the efficiency of these training methods, so we can still get good models in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained,” says Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

Rather than discarding a previous version of a model, Kim and his collaborators use it as the building blocks for a new model. Using machine learning, their method learns to “grow” a larger model from a smaller model in a way that encodes knowledge the smaller model has already gained. This enables faster training of the larger model.

Their technique saves about 50 percent of the computational cost required to train a large model, compared to methods that train a new model from scratch. Plus, the models trained using the MIT method performed as well as, or better than, models trained with other techniques that also use smaller models to enable faster training of larger models.

Reducing the time it takes to train huge models could help researchers make advancements faster with less expense, while also reducing the carbon emissions generated during the training process. It could also enable smaller research groups to work with these massive models, potentially opening the door to many new advances.

“As we look to democratize these types of technologies, making training faster and less expensive will become more important,” says Kim, senior author of a paper on this technique.

Kim and his graduate student Lucas Torroba Hennigen wrote the paper with lead author Peihao Wang, a graduate student at the University of Texas at Austin, as well as others at the MIT-IBM Watson AI Lab and Columbia University. The research will be presented at the International Conference on Learning Representations.

The bigger the better

Large language models like GPT-3, which is at the core of ChatGPT, are built using a neural network architecture called a transformer. A neural network, loosely based on the human brain, is composed of layers of interconnected nodes, or “neurons.” Each neuron contains parameters, which are variables learned during the training process that the neuron uses to process data.

Transformer architectures are unique because, as these types of neural network models get bigger, they achieve much better results.

“This has led to an arms race of companies trying to train larger and larger transformers on larger and larger datasets. More so than other architectures, it seems that transformer networks get much better with scaling. We’re just not exactly sure why this is the case,” Kim says.

These models often have hundreds of millions or billions of learnable parameters. Training all these parameters from scratch is expensive, so researchers seek to accelerate the process.

One effective technique is known as model growth. Using the model growth method, researchers can increase the size of a transformer by copying neurons, or even entire layers of a previous version of the network, then stacking them on top. They can make a network wider by adding new neurons to a layer or make it deeper by adding additional layers of neurons.

In contrast to previous approaches for model growth, parameters associated with the new neurons in the expanded transformer are not just copies of the smaller network’s parameters, Kim explains. Rather, they are learned combinations of the parameters of the smaller model.

Learning to grow

Kim and his collaborators use machine learning to learn a linear mapping of the parameters of the smaller model. This linear map is a mathematical operation that transforms a set of input values, in this case the smaller model’s parameters, to a set of output values, in this case the parameters of the larger model.

Their method, which they call a learned Linear Growth Operator (LiGO), learns to expand the width and depth of a larger network from the parameters of a smaller network in a data-driven way.

But the smaller model may actually be quite large — perhaps it has a hundred million parameters — and researchers might want to make a model with a billion parameters. So the LiGO technique breaks the linear map into smaller pieces that a machine-learning algorithm can handle.

LiGO also expands width and depth simultaneously, which makes it more efficient than other methods. A user can tune how wide and deep they want the larger model to be when they input the smaller model and its parameters, Kim explains.
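
The published LiGO operator is more elaborate than any short snippet, but a minimal numpy sketch, with arbitrary dimensions and initialization, conveys the core idea described above: the larger model’s weights are written as a learned linear function of the smaller model’s weights, factorized so that only a comparatively small number of new parameters has to be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single weight matrix taken from a (hypothetical) small pretrained model.
d_in, d_out = 64, 64
W_small = rng.normal(size=(d_in, d_out))

# Target width of the corresponding layer in the larger model.
D_in, D_out = 256, 256

# Core idea (sketch): the big layer's weights are a learned linear function of
# the small layer's weights, factorized as W_big = A @ W_small @ B, so the only
# new trainable parameters are the much smaller expansion matrices A and B.
A = rng.normal(scale=0.01, size=(D_in, d_in))
B = rng.normal(scale=0.01, size=(d_out, D_out))

W_big = A @ W_small @ B            # initialization of the grown layer

print("entries in the grown weight matrix:", W_big.size)       # 65,536
print("entries actually learned (A and B):", A.size + B.size)  # 32,768
# In the real method, A and B would then be trained on data so that the grown
# network inherits what the small network already knows; depth growth combines
# existing layers in an analogous, learned way.
```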

When they compared their technique to the process of training a new model from scratch, as well as to model-growth methods, it was faster than all the baselines. Their method saves about 50 percent of the computational costs required to train both vision and language models, while often improving performance.

The researchers also found they could use LiGO to accelerate transformer training even when they didn’t have access to a smaller, pretrained model.

“I was surprised by how much better all the methods, including ours, did compared to the random initialization, train-from-scratch baselines,” Kim says.

In the future, Kim and his collaborators are looking forward to applying LiGO to even larger models.

The work was funded, in part, by the MIT-IBM Watson AI Lab, Amazon, the IBM Research AI Hardware Center, Center for Computational Innovation at Rensselaer Polytechnic Institute, and the U.S. Army Research Office.

Detailed images from space offer clearer picture of drought effects on plants

“MIT is a place where dreams come true,” says César Terrer, an assistant professor in the Department of Civil and Environmental Engineering. Here at MIT, Terrer says he’s given the resources needed to explore ideas he finds most exciting, and at the top of his list is climate science. In particular, he is interested in plant-soil interactions, and how the two can mitigate impacts of climate change. In 2022, Terrer received seed grant funding from the Abdul Latif Jameel Water and Food Systems Lab (J-WAFS) to produce drought monitoring systems for farmers. The project is leveraging a new generation of remote sensing devices to provide high-resolution estimates of plant water stress at regional to global scales.

Growing up in Granada, Spain, Terrer always had an aptitude and passion for science. He studied environmental science at the University of Murcia, where he interned in the Department of Ecology. Using computational analysis tools, he worked on modeling species distribution in response to human development. Early on in his undergraduate experience, Terrer says he regarded his professors as “superheroes” with a kind of scholarly prowess. He knew he wanted to follow in their footsteps by one day working as a faculty member in academia. Of course, there would be many steps along the way before achieving that dream. 

Upon completing his undergraduate studies, Terrer set his sights on exciting and adventurous research roles. He thought perhaps he would conduct field work in the Amazon, engaging with native communities. But when the opportunity arose to work in Australia on a state-of-the-art climate change experiment that simulates future levels of carbon dioxide, he headed south to study how plants react to CO2 in a biome of native Australian eucalyptus trees. It was during this experience that Terrer started to take a keen interest in the carbon cycle and the capacity of ecosystems to buffer rising levels of CO2 caused by human activity.

Around 2014, he delved deeper into the carbon cycle as he began his doctoral studies at Imperial College London. The primary question Terrer sought to answer during his PhD was “will plants be able to absorb predicted future levels of CO2 in the atmosphere?” To answer the question, Terrer became an early adopter of artificial intelligence, machine learning, and remote sensing to analyze data from real-life, global climate change experiments. His findings from these “ground truth” values and observations resulted in a paper in the journal Science. In it, he claimed that climate models most likely overestimated how much carbon plants will be able to absorb by the end of the century, by a factor of three.

After postdoctoral positions at Stanford University and the Universitat Autonoma de Barcelona, followed by a prestigious Lawrence Fellowship, Terrer says he had “too many ideas and not enough time to accomplish all those ideas.” He knew it was time to lead his own group. Not long after applying for faculty positions, he landed at MIT. 

New ways to monitor drought

Terrer is employing similar methods to those he used during his PhD to analyze data from all over the world for his J-WAFS project. He and postdoc Wenzhe Jiao collect data from remote sensing satellites and field experiments and use machine learning to come up with new ways to monitor drought. Terrer says Jiao is a “remote sensing wizard,” who fuses data from different satellite products to understand the water cycle. With Jiao’s hydrology expertise and Terrer’s knowledge of plants, soil, and the carbon cycle, the duo is a formidable team to tackle this project.

According to the U.N. World Meteorological Organization, the number and duration of droughts have increased by 29 percent since 2000, as compared to the two previous decades. From the Horn of Africa to the Western United States, drought is devastating vegetation and severely stressing water supplies, compromising food production and spiking food insecurity. Drought monitoring can offer fundamental information on drought location, frequency, and severity, but assessing the impact of drought on vegetation is extremely challenging. This is because plants’ sensitivity to water deficits varies across species and ecosystems.

Terrer and Jiao are able to obtain a clearer picture of how drought is affecting plants by employing the latest generation of remote sensing observations, which offer images of the planet with incredible spatial and temporal resolution. Satellite products such as Sentinel, Landsat, and Planet can provide daily images from space with such high resolution that individual trees can be discerned. Along with the images and datasets from satellites, the team is using ground-based observations from meteorological data. They are also using the MIT SuperCloud at MIT Lincoln Laboratory to process and analyze all of the data sets. The J-WAFS project is among the first to leverage high-resolution data to quantitatively measure plant drought impacts in the United States, with the hopes of expanding to a global assessment in the future.

Assisting farmers and resource managers 

Every week, the U.S. Drought Monitor provides a map of drought conditions in the United States. The map is coarse in resolution and serves more as a drought recap or summary; it cannot predict future drought scenarios. The lack of a comprehensive spatiotemporal evaluation of historic and future drought impacts on global vegetation productivity is detrimental to farmers both in the United States and worldwide.

Terrer and Jiao plan to generate metrics for plant water stress at an unprecedented resolution of 10-30 meters. This means that they will be able to provide drought monitoring maps at the scale of a typical U.S. farm, giving farmers more precise, useful data every one to two days. The team will use the information from the satellites to monitor plant growth and soil moisture, as well as the time lag of plant growth response to soil moisture. In this way, Terrer and Jiao say they will eventually be able to create a kind of “plant water stress forecast” that may be able to predict adverse impacts of drought four weeks in advance. “According to the current soil moisture and lagged response time, we hope to predict plant water stress in the future,” says Jiao. 
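
As a rough, entirely synthetic sketch of the kind of lagged relationship the team describes (toy numbers, not their satellite pipeline), one could regress a stress index on soil moisture observed two weeks earlier and then use today’s moisture to project stress ahead:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily series (values invented purely for illustration): soil
# moisture drives a plant water-stress index with a 14-day lag.
days, lag = 365, 14
soil_moisture = 0.3 + 0.1 * np.sin(np.arange(days) / 20) + 0.02 * rng.normal(size=days)
stress_index = np.empty(days)
stress_index[lag:] = 1.0 - 2.0 * soil_moisture[:-lag] + 0.05 * rng.normal(size=days - lag)
stress_index[:lag] = stress_index[lag]

# Fit a simple lagged linear model: stress(t) ~ a * moisture(t - lag) + b.
X = np.column_stack([soil_moisture[:-lag], np.ones(days - lag)])
y = stress_index[lag:]
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

# "Forecast" the stress index two weeks ahead from today's soil moisture.
forecast = a * soil_moisture[-1] + b
print(f"estimated lag coefficient: {a:.2f}")
print(f"predicted stress index {lag} days ahead: {forecast:.2f}")
```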

The expected outcomes of this project will give farmers, land and water resource managers, and decision-makers more accurate data at the farm-specific level, allowing for better drought preparation, mitigation, and adaptation. “We expect to make our data open-access online, after we finish the project, so that farmers and other stakeholders can use the maps as tools,” says Jiao. 

Terrer adds that the project “has the potential to help us better understand the future states of climate systems, and also identify the regional hot spots more likely to experience water crises at the national, state, local, and tribal government scales.” He also expects the project will enhance our understanding of global carbon-water-energy cycle responses to drought, with applications in determining climate change impacts on natural ecosystems as a whole.

Peter Baddoo, Department of Mathematics instructor, dies at 29

Peter Baddoo, an instructor in the Department of Mathematics, passed away suddenly on Feb. 15 while playing basketball on campus.

Baddoo joined the MIT Department of Mathematics in January 2021. Prior to this, he was an EPSRC Doctoral Prize Fellow at Imperial College London. He studied mathematics as an undergraduate at the University of Oxford and received his PhD from Cambridge University.

An accomplished applied mathematician, Baddoo had broad research interests and activities spanning complex function theory, fluid dynamics, and machine learning and data-driven methods. His book, “Analytic Solutions for Flows Through Cascades” (Springer, 2020) received praise for its “exceptionally clear presentation with beautiful figures.”

“Peter was an outstanding, self-propelling researcher, a master of complex function theory with a burgeoning interest in machine learning, and had several collaborations within the U.S. and farther afield. He had an exceptionally promising future in academia. He was a deeply respected and valued member of my research group and the broader applied math community. He will be sorely missed,” says Professor John Bush, his faculty mentor.

In addition to his research, Baddoo was an exemplary teacher who gave generously of his time in assisting colleagues, graduate students, and undergraduates. 

“Peter was an excellent lecturer — clear, composed, thoughtful, and kind. He was extremely popular among his students,” says Michel Goemans, the RSA Professor of Mathematics and Department of Mathematics head. One of Baddoo’s students in class 18.04 (Complex Variables with Applications) says that “I took Peter’s class, and I walked out of that class actually liking math. I was assured that I want to study more of math and pursue a minor.”

Aside from his work as a scholar and teacher, Baddoo brought the department together by organizing social events for postdocs and instructors; for these and other efforts he received a Math Community Service Award. His interests extended well beyond mathematics and included music and sports such as basketball and lacrosse — which he played at Oxford and Cambridge universities, and as a member of the Senior England Men’s training squad. He was also a devoted and active member of Park Street Church.

In his honor, the Department of Mathematics will be endowing a Peter Baddoo Prize to recognize outstanding contributions to community-building within the department.

Peter Baddoo is survived by his parents, Jim and Nancy; his sisters, Kate and Harriet; and his fiancée, Yuna Kim.  

Mining the right transition metals in a vast chemical space

Swift and significant gains against climate change require the creation of novel, environmentally benign, and energy-efficient materials. One of the richest veins researchers hope to tap in creating such useful compounds is a vast chemical space where molecular combinations that offer remarkable optical, conductive, magnetic, and heat transfer properties await discovery.

But finding these new materials has been slow going.

“While computational modeling has enabled us to discover and predict properties of new materials much faster than experimentation, these models aren’t always trustworthy,” says Heather J. Kulik PhD ’09, associate professor in the departments of Chemical Engineering and Chemistry. “In order to accelerate computational discovery of materials, we need better methods for removing uncertainty and making our predictions more accurate.”

A team from Kulik’s lab, including Chenru Duan PhD ’22, set out to address these challenges.

A tool for building trust

Kulik and her group focus on transition metal complexes, molecules composed of metals from the middle of the periodic table surrounded by organic ligands. These complexes can be extremely reactive, which gives them a central role in catalyzing natural and industrial processes. By altering the organic and metal components in these molecules, scientists can generate materials with properties that can improve such applications as artificial photosynthesis, solar energy absorption and storage, higher-efficiency OLEDs (organic light-emitting diodes), and device miniaturization.

“Characterizing these complexes and discovering new materials currently happens slowly, often driven by a researcher’s intuition,” says Kulik. “And the process involves trade-offs: You might find a material that has good light-emitting properties, but the metal at the center may be something like iridium, which is exceedingly rare and toxic.”

Researchers attempting to identify nontoxic, earth-abundant transition metal complexes with useful properties tend to pursue a limited set of features, with only modest assurance that they are on the right track. “People continue to iterate on a particular ligand, and get stuck in local areas of opportunity, rather than conduct large-scale discovery,” says Kulik.

To address these screening inefficiencies, Kulik’s team developed a new approach — a machine-learning based “recommender” that lets researchers know the optimal model for pursuing their search. Their description of this tool was the subject of a paper in Nature Computational Science in December.

“This method outperforms all prior approaches and can tell people when to use methods and when they’ll be trustworthy,” says Kulik.

The team, led by Duan, began by investigating ways to improve the conventional screening approach, density functional theory (DFT), which is based on computational quantum mechanics. Duan built a machine learning platform to determine how accurately density functional models predict the structure and behavior of transition metal molecules.

“This tool learned which density functionals were the most reliable for specific material complexes,” says Kulik. “We verified this by testing the tool against materials it had never encountered before, where it in fact chose the most accurate density functionals for predicting the material’s property.”

A critical breakthrough for the team was its decision to use the electron density — a fundamental quantum mechanical property of atoms — as a machine learning input. This unique identifier, as well as the use of a neural network model to carry out the mapping, creates a powerful and efficient aid for researchers who want to determine whether they are using the appropriate density functional for characterizing their target transition metal complex. “A calculation that would take days or weeks, which makes computational screening nearly infeasible, can instead take only hours to produce a trustworthy result.”
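
The group’s recommender operates on the electron density with a neural network; the short scikit-learn sketch below only illustrates the general framing, in which each complex is featurized, labeled with the functional that proved most accurate for it, and used to train a classifier that recommends a functional for unseen complexes. The features, labels, and functional names here are placeholders, not real chemistry.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Toy stand-in for the workflow: each transition metal complex is represented
# by a feature vector (here random numbers standing in for density-derived
# features) and labeled with whichever functional gave the smallest error
# against a reference calculation. Features, labels, and names are synthetic.
functionals = ["B3LYP", "PBE0", "M06"]
X_train = rng.normal(size=(200, 16))
y_train = rng.integers(0, len(functionals), size=200)   # "best" functional per complex

recommender = RandomForestClassifier(n_estimators=100, random_state=0)
recommender.fit(X_train, y_train)

# For a new, unseen complex, recommend the functional predicted to be most reliable.
x_new = rng.normal(size=(1, 16))
print("recommended functional:", functionals[recommender.predict(x_new)[0]])
```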

Kulik has incorporated this tool into molSimplify, an open source code on the lab’s website, enabling researchers anywhere in the world to predict properties and model transition metal complexes.

Optimizing for multiple properties

In a related research thrust, which they showcased in a recent publication in JACS Au, Kulik’s group demonstrated an approach for quickly homing in on transition metal complexes with specific properties in a large chemical space.

Their work springboarded off a 2021 paper showing that agreement about the properties of a target molecule among a group of different density functionals significantly reduced the uncertainty of a model’s predictions.

Kulik’s team exploited this insight by demonstrating, for the first time, multi-objective optimization. In their study, they successfully identified molecules that were easy to synthesize and had significant light-absorbing properties, using earth-abundant metals. They searched 32 million candidate materials, one of the largest spaces ever searched for this application. “We took apart complexes that are already in known, experimentally synthesized materials, and we recombined them in new ways, which allowed us to maintain some synthetic realism,” says Kulik.

After collecting DFT results on 100 compounds in this giant chemical domain, the group trained machine learning models to make predictions on the entire 32 million-compound space, with an eye to achieving their specific design goals. They repeated this process generation after generation to winnow out compounds with the explicit properties they wanted.
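
A rough sketch of such a generation-by-generation loop, with a synthetic candidate pool, a stand-in “oracle” in place of DFT, and a single objective rather than the paper’s multi-objective search, might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# A synthetic candidate pool standing in for the 32-million-complex library;
# the "oracle" stands in for an expensive DFT calculation of the target
# property. Everything here is invented for illustration.
pool = rng.normal(size=(5000, 8))

def expensive_oracle(x):
    # Pretend this costs hours of compute per candidate.
    return -np.sum((x - 0.5) ** 2, axis=1)

# Generation 0: label a small random batch with the oracle.
labeled = rng.choice(len(pool), size=100, replace=False)
X, y = pool[labeled], expensive_oracle(pool[labeled])

for generation in range(5):
    surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    scores = surrogate.predict(pool)            # cheap predictions over the whole pool
    best = np.argsort(scores)[-20:]             # most promising candidates this round
    X = np.vstack([X, pool[best]])
    y = np.concatenate([y, expensive_oracle(pool[best])])
    print(f"generation {generation}: best property value so far = {y.max():.3f}")
```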

“In the end we found nine of the most promising compounds, and discovered that the specific compounds we picked through machine learning contained pieces (ligands) that had been experimentally synthesized for other applications requiring optical properties, ones with favorable light absorption spectra,” says Kulik.

Applications with impact

While Kulik’s overarching goal involves overcoming limitations in computational modeling, her lab is taking full advantage of its own tools to streamline the discovery and design of new, potentially impactful materials.

In one notable example, “We are actively working on the optimization of metal–organic frameworks for the direct conversion of methane to methanol,” says Kulik. “This is a holy grail reaction that folks have wanted to catalyze for decades, but have been unable to do efficiently.” 

The possibility of a fast path for transforming a very potent greenhouse gas into a liquid that is easily transported and could be used as a fuel or a value-added chemical holds great appeal for Kulik. “It represents one of those needle-in-a-haystack challenges that multi-objective optimization and screening of millions of candidate catalysts is well-positioned to solve, an outstanding challenge that’s been around for so long.”

A new method to boost the speed of online databases

Hashing is a core operation in most online databases, like a library catalogue or an e-commerce website. A hash function generates codes that replace data inputs. Since these codes are shorter than the actual data, and usually a fixed length, this makes it easier to find and retrieve the original information.

However, because traditional hash functions generate codes randomly, sometimes two pieces of data can be hashed with the same value. This causes collisions — when searching for one item points a user to many pieces of data with the same hash value. It takes much longer to find the right one, resulting in slower searches and reduced performance.

Certain types of hash functions, known as perfect hash functions, are designed to sort data in a way that prevents collisions. But they must be specially constructed for each dataset and take more time to compute than traditional hash functions.

Since hashing is used in so many applications, from database indexing to data compression to cryptography, fast and efficient hash functions are critical. So, researchers from MIT and elsewhere set out to see if they could use machine learning to build better hash functions.

They found that, in certain situations, using learned models instead of traditional hash functions could result in half as many collisions. Learned models are those that have been created by running a machine-learning algorithm on a dataset. Their experiments also showed that learned models were often more computationally efficient than perfect hash functions.

“What we found in this work is that in some situations we can come up with a better tradeoff between the computation of the hash function and the collisions we will face. We can increase the computational time for the hash function a bit, but at the same time we can reduce collisions very significantly in certain situations,” says Ibrahim Sabek, a postdoc in the MIT Data Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

Their research, which will be presented at the International Conference on Very Large Databases, demonstrates how a hash function can be designed to significantly speed up searches in a huge database. For instance, their technique could accelerate computational systems that scientists use to store and analyze DNA, amino acid sequences, or other biological information.

Sabek is co-lead author of the paper with electrical engineering and computer science (EECS) graduate student Kapil Vaidya. They are joined by co-authors Dominick Horn, a graduate student at the Technical University of Munich; Andreas Kipf, an MIT postdoc; Michael Mitzenmacher, professor of computer science at the Harvard John A. Paulson School of Engineering and Applied Sciences; and senior author Tim Kraska, associate professor of EECS at MIT and co-director of the Data Systems and AI Lab.

Hashing it out

Given a data input, or key, a traditional hash function generates a random number, or code, that corresponds to the slot where that key will be stored. To use a simple example, if there are 10 keys to be put into 10 slots, the function would generate a random integer between 1 and 10 for each input. It is highly probable that two keys will end up in the same slot, causing collisions.

Perfect hash functions provide a collision-free alternative. Researchers give the function some extra knowledge, such as the number of slots the data are to be placed into. Then it can perform additional computations to figure out where to put each key to avoid collisions. However, these added computations make the function harder to create and less efficient.

“We were wondering, if we know more about the data — that it will come from a particular distribution — can we use learned models to build a hash function that can actually reduce collisions?” Vaidya says.

A data distribution shows all possible values in a dataset, and how often each value occurs. The distribution can be used to calculate the probability that a particular value is in a data sample.

The researchers took a small sample from a dataset and used machine learning to approximate the shape of the data’s distribution, or how the data are spread out. The learned model then uses the approximation to predict the location of a key in the dataset.
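
A minimal sketch of that idea, assuming keys that are predictably distributed and using a simple linear fit to a sampled empirical CDF (an illustration of the concept, not the paper’s implementation), looks like this:

```python
import numpy as np

rng = np.random.default_rng(4)

# Keys with a predictable structure: roughly evenly spaced values with a
# little jitter (the kind of "predictably distributed" data the article
# mentions). All values are synthetic.
num_keys = 10_000
keys = 100.0 * np.arange(num_keys) + rng.normal(scale=5.0, size=num_keys)
num_slots = num_keys

# Traditional-style hash: scramble each key and take it modulo the table size.
def traditional_hash(k):
    return hash(str(k)) % num_slots

# Learned hash (sketch): fit a simple linear model of the empirical CDF on a
# small sample, then map each key to slot = int(CDF(key) * num_slots).
sample = np.sort(rng.choice(keys, size=500, replace=False))
sample_cdf = np.arange(1, len(sample) + 1) / len(sample)
slope, intercept = np.polyfit(sample, sample_cdf, deg=1)

def learned_hash(k):
    cdf = min(max(slope * k + intercept, 0.0), 1.0)
    return min(int(cdf * num_slots), num_slots - 1)

def count_collisions(hash_fn):
    slots = np.zeros(num_slots, dtype=int)
    for k in keys:
        slots[hash_fn(k)] += 1
    return int(np.maximum(slots - 1, 0).sum())   # keys beyond the first per slot

print("collisions, traditional hash:", count_collisions(traditional_hash))
print("collisions, learned hash:    ", count_collisions(learned_hash))
```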

They found that learned models were easier to build and faster to run than perfect hash functions and that they led to fewer collisions than traditional hash functions if data are distributed in a predictable way. But if the data are not predictably distributed, because gaps between data points vary too widely, using learned models might cause more collisions.

“We may have a huge number of data inputs, and each one has a different gap between it and the next one, so learning that is quite difficult,” Sabek explains.

Fewer collisions, faster results

When data were predictably distributed, learned models could reduce the ratio of colliding keys in a dataset from 30 percent to 15 percent, compared with traditional hash functions. They were also able to achieve better throughput than perfect hash functions. In the best cases, learned models reduced the runtime by nearly 30 percent.

As they explored the use of learned models for hashing, the researchers also found that throughput was impacted most by the number of sub-models. Each learned model is composed of smaller linear models that approximate the data distribution. With more sub-models, the learned model produces a more accurate approximation, but it takes more time.

“At a certain threshold of sub-models, you get enough information to build the approximation that you need for the hash function. But after that, it won’t lead to more improvement in collision reduction,” Sabek says.

Building off this analysis, the researchers want to use learned models to design hash functions for other types of data. They also plan to explore learned hashing for databases in which data can be inserted or deleted. When data are updated in this way, the model needs to change accordingly, but changing the model while maintaining accuracy is a difficult problem.

“We want to encourage the community to use machine learning inside more fundamental data structures and operations. Any kind of core data structure presents us with an opportunity to use machine learning to capture data properties and get better performance. There is still a lot we can explore,” Sabek says.

This work was supported, in part, by Google, Intel, Microsoft, the National Science Foundation, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.

MIT professor to Congress: “We are at an inflection point” with AI

Government should not “abdicate” its responsibilities and leave the future path of artificial intelligence solely to Big Tech, Aleksander Mądry, the Cadence Design Systems Professor of Computing at MIT and director of the MIT Center for Deployable Machine Learning, told a Congressional panel on Wednesday. 

Rather, Mądry said, government should be asking questions about the purpose and explainability of the algorithms corporations are using, as a precursor to regulation, which he described as “an important tool” in ensuring that AI is consistent with society’s goals. If the government doesn’t start asking questions, then “I am extremely worried” about the future of AI, Mądry said in response to a question from Rep. Gerald Connolly.

Mądry, a leading expert on explainability and AI, was testifying at a hearing titled “Advances in AI: Are We Ready for a Tech Revolution?” before the House Subcommittee on Cybersecurity, Information Technology, and Government Innovation, a panel of the House Committee on Oversight and Accountability. The other witnesses at the hearing were former Google CEO Eric Schmidt, IBM Vice President Scott Crowder, and Center for AI and Digital Policy Senior Research Director Merve Hickok.

In her opening remarks, Subcommittee Chair Rep. Nancy Mace cited the book “The Age of AI: And Our Human Future” by Schmidt, Henry Kissinger, and Dan Huttenlocher, the dean of the MIT Schwarzman College of Computing. She also called attention to a March 3 op-ed in The Wall Street Journal by the three authors that summarized the book while discussing ChatGPT. Mace said her formal opening remarks had been entirely written by ChatGPT.

In his prepared remarks, Mądry raised three overarching points. First, he noted that AI is “no longer a matter of science fiction” or confined to research labs. It is out in the world, where it can bring enormous benefits but also poses risks.

Second, he said AI exposes us to “interactions that go against our intuition.” He said because AI tools like ChatGPT mimic human communication, people are too likely to unquestioningly believe what such large language models produce. In the worst case, Mądry warned, human analytical skills will atrophy. He also said it would be a mistake to regulate AI as if it were human — for example, by asking AI to explain its reasoning and assuming that the resulting answers are credible.

Finally, he said too little attention has been paid to problems that will result from the nature of the AI “supply chain” — the way AI systems are built on top of each other. At the base are general systems like ChatGPT, which can be developed by only a few companies because they are so expensive and complex to build. Layered on top of such systems are many AI systems designed to handle a particular task, like figuring out whom a company should hire. 

Mądry said this layering raised several “policy-relevant” concerns. First, the entire system of AI is subject to whatever vulnerabilities or biases are in the large system at its base, and is dependent on the work of a few, large companies. Second, the interaction of AI systems is not well-understood from a technical standpoint, making the results of AI even more difficult to predict or explain, and making the tools difficult to “audit.” Finally, the mix of AI tools makes it difficult to know whom to hold responsible when a problem results — who should be legally liable and who should address the concern.

In the written material submitted to the subcommittee, Mądry concluded, “AI technology is not particularly well-suited for deployment through complex supply chains,” even though that is exactly how it is being deployed.

Mądry ended his testimony by calling on Congress to probe AI issues and to be prepared to act. “We are at an inflection point in terms of what future AI will bring. Seizing this opportunity means discussing the role of AI, what exactly we want it to do for us, and how to ensure it benefits us all. This will be a difficult conversation but we do need to have it, and have it now,” he told the subcommittee.

The testimony of all the hearing witnesses and a video of the hearing, which lasted about two hours, are available at https://oversight.house.gov/hearing/advances-in-ai-are-we-ready-for-a-tech-revolution/.

Creating a versatile vaccine to take on Covid-19 in its many guises

One of the 12 labors of Hercules, according to ancient lore, was to destroy a nine-headed monster called the Hydra. The challenge was that when Hercules used his sword to chop off one of the monster’s heads, two would grow back in its place. He therefore needed an additional weapon, a torch, to vanquish his foe.

There are parallels between this legend and our three-years-and-counting battle with SARS-CoV-2, the virus that causes Covid-19. Every time scientists have thought they’d subdued one strain of the virus — be it alpha, beta, delta, or omicron — another variant or subvariant emerged a short while later.

For this reason, researchers at MIT and other institutions are preparing a new strategy against the virus — a novel vaccine that, unlike those in use today, could potentially counteract all variants of the virus, thanks to a property called “pan-variance” that could circumvent the need for a different booster shot every time a new strain comes into circulation. In a paper published today in the journal Frontiers in Immunology, the team reports on experiments with mice that demonstrate the vaccine’s effectiveness in preventing death from Covid-19 infection.

Viral vaccines typically work by exposing the immune system to a small piece of the virus. That can create learned responses that protect people later when they’re exposed to the actual virus. The premise of standard Covid-19 vaccines, such as those produced by Moderna and Pfizer, is to activate the part of the immune system that releases neutralizing antibodies. They do this by providing cells with instructions (in the form of mRNA molecules) for making the spike protein — a protein found on the surface of the Covid-19 virus whose presence can trigger an immune reaction. “The problem with that approach is that the target keeps changing” — the spike protein itself can vary among different viral strains — “and that can make the vaccine ineffective,” says David Gifford, an MIT professor in electrical engineering and computer science and biological engineering, as well as a coauthor of the Frontiers paper.

He and his colleagues, accordingly, have taken a different approach, selecting a different target for their vaccine: activating the part of the immune system that unleashes “killer” T cells, which attack cells infected with the virus. A vaccine of this sort will not keep people from getting Covid-19, but it could keep them from getting very sick or dying.

A key innovation made by this group — which included researchers from MIT, the University of Texas, Boston University, Tufts University, Massachusetts General Hospital, and Acuitas Therapeutics — was to bring machine learning techniques into the vaccine design process. A critical aspect of that process involves determining which parts of SARS-CoV-2, which peptides (chains of amino acids that are the building blocks of proteins), should go into the vaccine. That entails sifting through thousands of peptides in the virus and picking out just 30 or so that should be incorporated.

But that decision has to take into account so-called HLA molecules — proteins on the surface of cells that serve as “billboards,” telling immune cells (which lack X-ray vision) what is going on inside other cells. The display of specific protein fragments can indicate, for instance, that a certain cell is infected by SARS-CoV-2 and should be gotten rid of.

Machine learning algorithms were used to solve a complicated set of “optimization problems,” notes Brandon Carter, a PhD student in MIT’s Department of Electrical Engineering and Computer Science, an affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and a lead author of the new paper. The overriding goal is to select peptides that are present, or “conserved,” in all variants of the virus. But those peptides also need to be associated with HLA molecules that have a high likelihood of being displayed so they can alert the immune system. “You want this to happen in as many people as possible to get maximum population coverage from your vaccine,” Carter says. Furthermore, you want each individual to be covered multiple times by the vaccine, he adds. “This means that more than one peptide in the vaccine is predicted to be displayed by some HLA in each person.” Achieving these various objectives is a task that can be significantly expedited by machine learning tools.
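
One simplified way to picture this kind of selection problem is a greedy coverage heuristic. In the sketch below, the peptide-HLA display matrix, allele frequencies, and budget are all invented, and the objective is reduced to covering each allele once, whereas the real design also rewards hitting each person with multiple peptides.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-ins: 200 conserved candidate peptides, 20 HLA alleles with
# invented frequencies, and a random peptide-by-HLA "is displayed" matrix.
# (The real design uses predicted binding/display scores.)
n_peptides, n_hlas = 200, 20
displays = rng.random((n_peptides, n_hlas)) < 0.05     # peptide i displayed by HLA j?
hla_freq = rng.dirichlet(np.ones(n_hlas))              # rough allele frequencies

def coverage(selected):
    """Rough expected population coverage: total frequency of HLA alleles that
    display at least one selected peptide (each person simplified to one allele)."""
    covered = displays[selected].any(axis=0) if selected else np.zeros(n_hlas, bool)
    return float(hla_freq[covered].sum())

# Greedy selection: repeatedly add the peptide that raises coverage the most.
selected, budget = [], 30
for _ in range(budget):
    candidates = [i for i in range(n_peptides) if i not in selected]
    gains = [coverage(selected + [i]) for i in candidates]
    selected.append(candidates[int(np.argmax(gains))])

print(f"{len(selected)} peptides selected, estimated coverage: {coverage(selected):.1%}")
```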

While that touches on the theoretical end of this project, the latest results came from experiments carried out by collaborators at the University of Texas Medical Branch in Galveston, which showed a strong immune response in mice given the vaccine. The mice in this experiment did not die; they were “humanized” mice, meaning that they had an HLA molecule found in human cells. “This study,” Carter says, “offers proof in a living system, an actual mouse, that the vaccines we devised using machine learning can afford protection from the Covid virus.” Gifford characterizes their work as “the first experimental evidence that a vaccine formulated in this fashion would be effective.”

Paul Offit, a professor of pediatrics in the Division of Infectious Diseases at Children’s Hospital of Philadelphia, finds the results encouraging. “A lot of people wonder about what approaches will be used to make Covid-19 vaccines in the future,” Offit says. “Given that T cells are critical in protection against severe Covid-19, future vaccines that focus on inducing the broadest T cell responses will be an important step forward in the next generation of vaccines.”

More animal studies — and eventual human studies — would have to be done before this work can usher in the “next generation of vaccines.” The fact that 24 percent of the lung cells in vaccinated mice were T cells, Gifford says, “showed that their immune systems were poised to fight viral infection.” But one has to be careful to avoid too strong of an immune response, he cautions, so as not to cause lung damage.

Other questions abound. Should T-cell vaccines be used instead of, or in combination with, standard spike protein vaccines? While it might be possible to enhance existing vaccines by including a T-cell component, Gifford says, “putting two things together may not be strictly additive, as one part of the vaccine could mask the other.”

Nevertheless, he and his colleagues believe their T-cell vaccine has the potential to help immunocompromised individuals who cannot produce neutralizing antibodies and thus may not benefit from traditional Covid vaccines. Their vaccine may also alleviate suffering from “long Covid” in people who continue to harbor reservoirs of the virus well after their initial infection.

The mechanism behind current flu vaccines, like current Covid-19 vaccines, is to induce neutralizing antibodies, but those vaccines don’t always work for different influenza strains. Carter sees potential for flu vaccines based on a T-cell response, “which may prove to be more effective, providing broader coverage, because of their pan-variance.”

Nor are the methods they are developing limited to Covid-19 or the flu, he maintains, as they might someday be applied to cancer. Gifford agrees, saying that a T-cell vaccine — designed to maximize immune protection both within an individual and among the greatest number of individuals — could become a key asset in the fight against cancer. “That’s not within the scope of our present study,” he says, “but it could be the subject of future work.”

Other MIT contributors to the work were Ge Liu and Alexander Dimitrakakis. The work was supported, in part, by Schmidt Futures and a C3.ai grant to David Gifford.

New insights into training dynamics of deep classifiers

A new study from researchers at MIT and Brown University characterizes several properties that emerge during the training of deep classifiers, a type of artificial neural network commonly used for classification tasks such as image classification, speech recognition, and natural language processing.

The paper, “Dynamics in Deep Classifiers trained with the Square Loss: Normalization, Low Rank, Neural Collapse and Generalization Bounds,” published today in the journal Research, is the first of its kind to theoretically explore the dynamics of training deep classifiers with the square loss and how properties such as rank minimization, neural collapse, and dualities between the activation of neurons and the weights of the layers are intertwined.

In the study, the authors focused on two types of deep classifiers: fully connected deep networks and convolutional neural networks (CNNs).

A previous study examined the structural properties that develop in large neural networks at the final stages of training. That study focused on the last layer of the network and found that deep networks trained to fit a training dataset will eventually reach a state known as “neural collapse.” When neural collapse occurs, the network maps multiple examples of a particular class (such as images of cats) to a single template of that class. Ideally, the templates for each class should be as far apart from each other as possible, allowing the network to accurately classify new examples.
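
One common way to quantify this effect, sketched here on synthetic last-layer features rather than on a trained network, is to compare within-class variability to the spread of the class means; a ratio near zero signals collapse onto class templates.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic last-layer features for a 4-class problem: each class clusters
# around a "template" mean. Shrinking `spread` mimics what happens as
# training drives a network toward neural collapse.
n_classes, per_class, dim, spread = 4, 250, 32, 0.05
templates = rng.normal(size=(n_classes, dim))
features = np.concatenate(
    [templates[c] + spread * rng.normal(size=(per_class, dim)) for c in range(n_classes)]
)
labels = np.repeat(np.arange(n_classes), per_class)

# A simple collapse diagnostic: within-class variability relative to the
# spread of the class means themselves.
class_means = np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])
global_mean = features.mean(axis=0)
within = np.mean([np.var(features[labels == c] - class_means[c]) for c in range(n_classes)])
between = np.var(class_means - global_mean)

print(f"within-class / between-class variability: {within / between:.4f}")
# Values near zero indicate that examples of each class have collapsed onto
# their class template, the signature described in the article.
```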

An MIT group based at the MIT Center for Brains, Minds and Machines studied the conditions under which networks can achieve neural collapse. Deep networks that have the three ingredients of stochastic gradient descent (SGD), weight decay regularization (WD), and weight normalization (WN) will display neural collapse if they are trained to fit their training data. The MIT group has taken a theoretical approach — as compared to the empirical approach of the earlier study — proving that neural collapse emerges from the minimization of the square loss using SGD, WD, and WN.

Co-author and MIT McGovern Institute postdoc Akshay Rangamani states, “Our analysis shows that neural collapse emerges from the minimization of the square loss with highly expressive deep neural networks. It also highlights the key roles played by weight decay regularization and stochastic gradient descent in driving solutions towards neural collapse.”

Weight decay is a regularization technique that prevents the network from over-fitting the training data by reducing the magnitude of the weights. Weight normalization scales the weight matrices of a network so that they have a similar scale. Low rank refers to a property of a matrix where it has a small number of non-zero singular values. Generalization bounds offer guarantees about the ability of a network to accurately predict new examples that it has not seen during training.

The authors found that the same theoretical observation that predicts a low-rank bias also predicts the existence of an intrinsic SGD noise in the weight matrices and in the output of the network. This noise is not generated by the randomness of the SGD algorithm but by an interesting dynamic trade-off between rank minimization and fitting of the data, which provides an intrinsic source of noise similar to what happens in dynamic systems in the chaotic regime. Such a random-like search may be beneficial for generalization because it may prevent over-fitting.

“Interestingly, this result validates the classical theory of generalization showing that traditional bounds are meaningful. It also provides a theoretical explanation for the superior performance in many tasks of sparse networks, such as CNNs, with respect to dense networks,” comments co-author and MIT McGovern Institute postdoc Tomer Galanti. In fact, the authors prove new norm-based generalization bounds for CNNs with localized kernels, that is, networks with sparse connectivity in their weight matrices.

In this case, generalization can be orders of magnitude better than in densely connected networks. This result validates the classical theory of generalization, showing that its bounds are meaningful, and goes against a number of recent papers expressing doubts about past approaches to generalization. It also provides a theoretical explanation for the superior performance of sparse networks, such as CNNs, with respect to dense networks. Thus far, the fact that CNNs, and not dense networks, represent the success story of deep networks has been almost completely ignored by machine learning theory. Instead, the theory presented here suggests that this is an important insight into why deep networks work as well as they do.

“This study provides one of the first theoretical analyses covering optimization, generalization, and approximation in deep networks and offers new insights into the properties that emerge during training,” says co-author Tomaso Poggio, the Eugene McDermott Professor at the Department of Brain and Cognitive Sciences at MIT and co-director of the Center for Brains, Minds and Machines. “Our results have the potential to advance our understanding of why deep learning works as well as it does.”

Large language models are biased. Can logic help save them?

Turns out, even language models “think” they’re biased. When prompted in ChatGPT, the response was as follows: “Yes, language models can have biases, because the training data reflects the biases present in society from which that data was collected. For example, gender and racial biases are prevalent in many real-world datasets, and if a language model is trained on that, it can perpetuate and amplify these biases in its predictions.” A well-known but dangerous problem. 

Humans (typically) can dabble with both logical and stereotypical reasoning when learning. Still, language models mainly mimic the latter, an unfortunate narrative we’ve seen play out ad nauseam when the ability to employ reasoning and critical thinking is absent. So would injecting logic into the fray be enough to mitigate such behavior? 

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) had an inkling that it might, so they set off to examine if logic-aware language models could significantly avoid more harmful stereotypes. They trained a language model to predict the relationship between two sentences, based on context and semantic meaning, using a dataset with labels for text snippets detailing if a second phrase “entails,” “contradicts,” or is neutral with respect to the first one. Using this dataset — natural language inference — they found that the newly trained models were significantly less biased than other baselines, without any extra data, data editing, or additional training algorithms.

For example, with the premise “the person is a doctor” and the hypothesis “the person is masculine,” using these logic-trained models, the relationship would be classified as “neutral,” since there’s no logic that says the person is a man. With more common language models, two sentences might seem to be correlated due to some bias in training data, like “doctor” might be pinged with “masculine,” even when there’s no evidence that the statement is true. 
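
To see what such a three-way entailment judgment looks like in practice, here is a short sketch using an off-the-shelf NLI model from the Hugging Face hub (roberta-large-mnli, not the authors’ 350-million-parameter model); the article’s point is that a logic-trained, debiased model should return “neutral” for this pair.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An off-the-shelf NLI model is enough to illustrate the entailment /
# contradiction / neutral labeling the article describes.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The person is a doctor."
hypothesis = "The person is masculine."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted = model.config.id2label[int(logits.argmax(dim=-1))]
print(predicted)   # a debiased, logic-trained model should land on NEUTRAL here
```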

At this point, the omnipresent nature of language models is well-known: Applications in natural language processing, speech recognition, conversational AI, and generative tasks abound. While not a nascent field of research, growing pains can take a front seat as they increase in complexity and capability. 

“Current language models suffer from issues with fairness, computational resources, and privacy,” says MIT CSAIL postdoc Hongyin Luo, the lead author of a new paper about the work. “Many estimates say that the CO2 emission of training a language model can be higher than the lifelong emission of a car. Running these large language models is also very expensive because of the amount of parameters and the computational resources they need. With privacy, state-of-the-art language models developed by places like ChatGPT or GPT-3 have their APIs where you must upload your language, but there’s no place for sensitive information regarding things like health care or finance. To solve these challenges, we proposed a logical language model that we qualitatively measured as fair, is 500 times smaller than the state-of-the-art models, can be deployed locally, and with no human-annotated training samples for downstream tasks. Our model uses 1/400 the parameters compared with the largest language models, has better performance on some tasks, and significantly saves computation resources.” 

This model, which has 350 million parameters, outperformed some very large-scale language models with 100 billion parameters on logic-language understanding tasks. The team compared, for example, popular BERT pretrained language models against their “textual entailment” ones on stereotype, profession, and emotion bias tests. The latter outperformed other models with significantly lower bias, while preserving the language modeling ability. The “fairness” was evaluated with something called ideal context association (iCAT) tests, where higher iCAT scores mean fewer stereotypes. The model had higher than 90 percent iCAT scores, while other strong language understanding models ranged between 40 and 80.

Luo wrote the paper alongside MIT Senior Research Scientist James Glass. They will present the work at the Conference of the European Chapter of the Association for Computational Linguistics in Croatia. 

Unsurprisingly, the original pretrained language models the team examined were teeming with bias, confirmed by a slew of reasoning tests demonstrating how professional and emotion terms are significantly biased to the feminine or masculine words in the gender vocabulary. 

With professions, a language model (which is biased) thinks that “flight attendant,” “secretary,” and “physician’s assistant” are feminine jobs, while “fisherman,” “lawyer,” and “judge” are masculine. Concerning emotions, a language model thinks that “anxious,” “depressed,” and “devastated” are feminine.

While we may still be far away from a neutral language model utopia, this research is ongoing in that pursuit. Currently, the model is just for language understanding, so it’s based on reasoning among existing sentences. Unfortunately, it can’t generate sentences for now, so the next step for the researchers would be targeting the uber-popular generative models built with logical learning to ensure more fairness with computational efficiency. 

“Although stereotypical reasoning is a natural part of human recognition, fairness-aware people conduct reasoning with logic rather than stereotypes when necessary,” says Luo. “We show that language models have similar properties. A language model without explicit logic learning makes plenty of biased reasoning, but adding logic learning can significantly mitigate such behavior. Furthermore, with demonstrated robust zero-shot adaptation ability, the model can be directly deployed to different tasks with more fairness, privacy, and better speed.”