A diagnostic insight in healthcare. A character’s dialogue in an interactive game. An autonomous resolution from a customer service agent. Each of these AI-powered interactions is built on the same unit of intelligence: a token.
Scaling these AI interactions requires businesses to consider whether they can afford more tokens. The answer lies in better tokenomics — which at its core is about driving down the cost of each token. This downward trend is unfolding across industries. Recent MIT research found that infrastructure and algorithmic efficiencies are reducing inference costs for frontier-level performance by up to 10x annually.
To understand how infrastructure efficiency improves tokenomics, consider the analogy of a high-speed printing press. If the press produces 10x output with incremental investment in ink, energy and the machine itself, the cost to print each individual page drops. In the same way, investments in AI infrastructure can lead to far greater token output compared with the increase in cost — causing a meaningful reduction in the cost per token.
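To make that arithmetic concrete, here is a minimal back-of-envelope sketch in Python. All figures are illustrative assumptions rather than measured numbers from any provider; the point is only that when token throughput grows roughly 10x while cost grows modestly, cost per token falls nearly in proportion.

```python
# Illustrative tokenomics arithmetic; every input here is an assumption.
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_sec: float) -> float:
    tokens_per_month = tokens_per_sec * 60 * 60 * 24 * 30
    return monthly_cost_usd / tokens_per_month * 1e6

baseline = cost_per_million_tokens(10_000, 2_000)    # hypothetical older system
upgraded = cost_per_million_tokens(13_000, 20_000)   # 10x output for 30% more cost
print(f"${baseline:.2f} -> ${upgraded:.2f} per million tokens "
      f"({baseline / upgraded:.1f}x cheaper)")       # roughly 7.7x cheaper
```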

That’s why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform.
These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry.
In healthcare, tedious, time-consuming tasks like medical coding, documentation and managing insurance forms cut into the time doctors can spend with patients.
Sully.ai helps solve this problem by developing “AI employees” that can handle routine tasks like medical coding and note-taking. As the company’s platform scaled, its proprietary, closed source models created three bottlenecks: unpredictable latency in real-time clinical workflows, inference costs that scaled faster than revenue and insufficient control over model quality and updates.

To overcome these bottlenecks, Sully.ai uses Baseten’s Model API, which deploys open source models such as gpt-oss-120b on NVIDIA Blackwell GPUs. Baseten used the low-precision NVFP4 data format, the NVIDIA TensorRT-LLM library and the NVIDIA Dynamo inference framework to deliver optimized inference. The company chose NVIDIA Blackwell to run its Model API after seeing up to 2.5x better throughput per dollar compared with the NVIDIA Hopper platform.
As a result, Sully.ai’s inference costs dropped by 90%, representing a 10x reduction compared with the prior closed source implementation, while response times improved by 65% for critical workflows like generating medical notes. The company has now returned over 30 million minutes to physicians, time previously lost to data entry and other manual tasks.
Latitude is building the future of AI-native gaming with its AI Dungeon adventure-story game and upcoming AI-powered role-playing gaming platform, Voyage, where players can create or play worlds with the freedom to choose any action and make their own story.
The company’s platform uses large language models to respond to players’ actions — but this comes with scaling challenges, as every player action triggers an inference request. Costs scale with engagement, and response times must stay fast enough to keep the experience seamless.

Latitude runs large open source models on DeepInfra’s inference platform, powered by NVIDIA Blackwell GPUs and TensorRT-LLM. For a large-scale mixture-of-experts (MoE) model, DeepInfra reduced the cost per million tokens from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell. Moving to Blackwell’s native low-precision NVFP4 format further cut that cost to just 5 cents — for a total 4x improvement in cost per token — while maintaining the accuracy that customers expect.
Running these large-scale MoE models on DeepInfra’s Blackwell-powered platform allows Latitude to deliver fast, reliable responses cost-effectively. DeepInfra’s inference platform delivers this performance while reliably handling traffic spikes, letting Latitude deploy more capable models without compromising player experience.
Sentient Labs is focused on bringing AI developers together to build powerful reasoning AI systems that are all open source. The goal is to accelerate AI toward solving harder reasoning problems through research in secure autonomy, agentic architecture and continual learning.
Its first app, Sentient Chat, orchestrates complex multi-agent workflows and integrates more than a dozen specialized AI agents from the community. As a result, Sentient Chat has massive compute demands: a single user query can trigger a cascade of autonomous interactions, which typically leads to costly infrastructure overhead.
To manage this scale and complexity, Sentient uses Fireworks AI’s inference platform running on NVIDIA Blackwell. With Fireworks’ Blackwell-optimized inference stack, Sentient achieved 25-50% better cost efficiency compared with its previous Hopper-based deployment.

This higher throughput per GPU allowed the company to serve significantly more concurrent users for the same cost. The platform’s scalability supported a viral launch that drew 1.8 million waitlisted users in 24 hours and processed 5.6 million queries in a single week while delivering consistent low latency.
Customer service calls with voice AI often end in frustration because even a slight delay can lead users to talk over the agent, hang up or lose trust.
Decagon builds AI agents for enterprise customer support, with AI-powered voice being its most demanding channel. Decagon needed infrastructure that could deliver sub-second responses under unpredictable traffic loads with tokenomics that supported 24/7 voice deployments.

Together AI runs production inference for Decagon’s multimodel voice stack on NVIDIA Blackwell GPUs. The companies collaborated on several key optimizations: speculative decoding that trains smaller models to generate faster responses while a larger model verifies accuracy in the background, caching repeated conversation elements to speed up responses and building automatic scaling that handles traffic surges without degrading performance.
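The sketch below illustrates the general speculative-decoding idea described above in miniature. It is a toy, not Together AI’s or Decagon’s implementation: the “draft” and “target” models are simple stand-in functions, and a production system would verify all drafted tokens in one batched forward pass of the large model and sample probabilistically rather than greedily.

```python
def draft_model(prefix, k):
    """Toy draft model: cheaply guesses the next k tokens (counts upward)."""
    return [prefix[-1] + 1 + i for i in range(k)]

def target_model(prefix):
    """Toy verifier: the 'correct' next token counts by 1 but skips
    multiples of 7, so some drafted tokens get rejected."""
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 7 == 0 else nxt

def speculative_decode(prompt, n_new, k=4):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        draft = draft_model(seq, k)      # cheap model proposes k tokens at once
        accepted = []
        for tok in draft:                # verify the draft left to right
            if target_model(seq + accepted) == tok:
                accepted.append(tok)     # match: keep the drafted token
            else:
                break                    # first mismatch: discard the rest
        # Emit one verified token so progress is guaranteed even when the
        # whole draft is rejected. (A real system scores all k draft
        # positions in a single batched forward pass of the large model.)
        accepted.append(target_model(seq + accepted))
        seq += accepted
    return seq[:len(prompt) + n_new]

print(speculative_decode([1], 12))  # [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15]
```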
Decagon saw response times under 400 milliseconds even when processing thousands of tokens per query. Cost per query, which is the total cost to complete one voice interaction, dropped by 6x compared with using closed source proprietary models. This was achieved through the combination of Decagon’s multimodel approach (some models open source, some trained in-house on NVIDIA GPUs), NVIDIA Blackwell’s extreme codesign and Together’s optimized inference stack.
The dramatic cost savings seen across healthcare, gaming and customer service are driven by the efficiency of NVIDIA Blackwell. The NVIDIA GB200 NVL72 system further scales this impact by delivering a breakthrough 10x reduction in cost per token for reasoning MoE models compared with NVIDIA Hopper.
NVIDIA’s extreme codesign across every layer of the stack — spanning compute, networking and software — and its partner ecosystem are unlocking massive reductions in cost per token at scale.
This momentum continues with the NVIDIA Rubin platform — integrating six new chips into a single AI supercomputer to deliver 10x performance and 10x lower token cost over Blackwell.
Explore NVIDIA’s full-stack inference platform to learn more about how it delivers better tokenomics for AI inference.
At leading institutions across the globe, the NVIDIA DGX Spark desktop supercomputer is bringing data‑center‑class AI to lab benches, faculty offices and students’ systems. There’s even a DGX Spark hard at work at the South Pole, at the IceCube Neutrino Observatory run by the University of Wisconsin-Madison.
The compact supercomputer’s petaflop‑class performance enables local deployment of large AI applications, from clinical report evaluators to robotics perception systems, all while keeping sensitive data on site and shortening iteration loops for researchers and learners.
Powered by the NVIDIA GB10 superchip and the NVIDIA DGX operating system, each DGX Spark unit supports AI models of up to 200 billion parameters and integrates seamlessly with the NVIDIA NeMo, Metropolis, Holoscan and Isaac platforms, giving students access to the same professional-grade tools used across the DGX ecosystem.
Read more below on how DGX Spark powers groundbreaking AI work at leading institutions worldwide.
At the University of Wisconsin-Madison’s IceCube Neutrino Observatory in Antarctica, researchers are using DGX Spark to run AI models for its experiments studying the universe’s most cataclysmic events, using subatomic particles called neutrinos.
Traditional astronomy methods, based on detecting light waves, enable observing about 80% of the known universe, according to Benedikt Riedel, computing director at the Wisconsin IceCube Particle Astrophysics Center. A new way to explore the universe — using gravitational waves and particles like neutrinos — makes it possible to examine the most extreme cosmic environments, including those involving supernovas and dark matter.

“There’s no hardware store in the South Pole, which is technically a desert, with relative humidity under 5% and an elevation of 10,000 feet, meaning very limited power,” Riedel said. “DGX Spark allows us to deploy AI in a compartmentalized and easy fashion, at low cost and in such an extremely remote environment, to run AI analyses locally on our neutrino observation data.”
At NYU’s Global AI Frontier Lab, the ICARE (Interpretable and Clinically‑Grounded Agent‑Based Report Evaluation) project runs end-to-end on a DGX Spark in the lab. ICARE uses collaborating AI agents and multiple‑choice question generation to evaluate how closely AI‑generated radiology reports align with expert sources, enabling real‑time clinical evaluation and continuous monitoring without sending medical imaging data to the cloud.
“Being able to run powerful LLMs locally on the DGX Spark has completely changed my workflow,” said Lucius Bynum, data science assistant professor and a faculty fellow at the NYU Center for Data Science. “I have been able to focus my efforts on quickly iterating and improving the research tool I’m developing.”
NYU researchers also use DGX Spark to run LLMs locally as part of interactive causal modeling tools that generate and refine semantic causal models — structured, machine‑readable maps of cause‑and‑effect relationships between clinical variables, imaging findings and potential diagnoses. This setup lets teams rapidly design, test and iterate on advanced models without waiting for cluster resources, including for privacy- and security‑sensitive applications such as in healthcare, where data must stay on premises.
At Harvard’s Kempner Institute for the Study of Natural and Artificial Intelligence, neuroscientists are using DGX Spark as a compact desktop supercomputer to probe how genetic mutations in the brain drive epilepsy. The system lets researchers run complex analyses in real time without needing to wait for access to large institutional clusters.

The team, led by Kempner Institute Co-Director Bernardo Sabatini, is studying about 6,000 mutations in excitatory and inhibitory neurons, building protein-structure and neuronal-function prediction maps that guide which variants to test next in the lab.
DGX Spark acts as a bridge between benchtop and cluster‑scale computing at Harvard. Researchers first validate workflows and timing on a single DGX Spark, then scale successful pipelines to large GPU clusters for massive protein screens.
Arizona State University was among the first universities to receive multiple DGX Spark systems, which now support AI research across the campus, spanning initiatives for memory care, transportation safety and sustainable energy.

One ASU team led by Yezhou “YZ” Yang, associate professor in the School of Computing and Augmented Intelligence, is using DGX Spark to power advanced perception and robotics research, including for applications such as AI‑enabled, search-and-rescue robotic dogs and assistance tools for visually impaired users.
In the computer science and engineering department at Mississippi State University, DGX Spark serves as a hands‑on learning platform for the next generation of AI engineers.
The enthusiasm around DGX Spark at Mississippi State is captured through lab‑driven outreach, including an unboxing video created by a lab working to advance applied AI, foster AI workforce development and drive real-world AI experimentation across the state.
When ASUS delivered the school’s first Ascent GX10 — powered by DGX Spark — Sunita Chandrasekaran, professor of computer and information sciences and director of the First State AI Institute, called it “transformative for research,” enabling teams across disciplines like sports analytics and coastal science to run large AI models directly on campus instead of relying on costly cloud resources. Through the ASUS Virtual Lab program, schools can test GX10 performance remotely before deployment.
At the Institute of Science and Technology Austria, researchers are using an HP ZGX Nano AI Station — a compact system based on NVIDIA DGX Spark — to train and fine‑tune LLMs right on a desktop. The team’s open source LLMQ software enables working with models of up to 7 billion parameters, making advanced LLM training accessible to more students and researchers.
Because the ZGX Nano includes 128GB of unified memory, the entire LLM and its training data can remain on the system, avoiding the complex memory juggling usually required on consumer GPUs. This helps teams move faster and keep sensitive data on premises. Read this research paper on ISTA’s LLMQ software.
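A rough memory budget shows why a model in the 7-billion-parameter range fits comfortably. The numbers below assume a textbook full-precision-optimizer setup (fp16 weights and gradients, fp32 master weights and Adam moments); quantized training of the kind LLMQ targets would shrink the footprint further.

```python
# Back-of-envelope training memory for a 7B-parameter model.
# Assumptions: fp16 weights/gradients, fp32 master weights, fp32 Adam
# first and second moments; activations and training data are extra.
params = 7e9
bytes_per_param = 2 + 2 + 4 + 8          # weights + grads + master + Adam moments
total_gib = params * bytes_per_param / 2**30
print(f"~{total_gib:.0f} GiB of weight and optimizer state")  # ~104 GiB, under 128GB
```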
At Stanford University, researchers are using DGX Spark to prototype complete training and evaluation pipelines, running their Biomni biological agent workflows locally before scaling to large GPU clusters. This enables a tight, iterative loop for model development and benchmarking, and automates complex analysis and experimental planning directly in the lab environment.
The Stanford research team reported that DGX Spark provides performance similar to big cloud GPU instances — about 80 tokens per second on a 120 billion‑parameter gpt‑oss model at MXFP4 via Ollama — while keeping the entire workload on a desktop.
College students from across the globe are invited to participate in Treehacks, a massive student hackathon running Feb. 13-15 at Stanford, which will feature DGX Spark units from ASUS.
See how DGX Spark is transforming higher education and student innovation at Stanford by joining this livestream on Friday, Feb. 13, at 9 a.m. PT.
Get started with DGX Spark and find purchase options on this webpage.
The GeForce NOW sixth-anniversary festivities roll on this February, continuing a monthlong celebration of NVIDIA’s cloud gaming service.
This week brings even more reasons to join the party, as GeForce NOW launches on a new platform with support for Amazon Fire TV devices, and eight new games to keep the streaming going strong.
The new app brings GeForce NOW directly to select Amazon Fire TV devices, so members can jump into their PC game libraries on the biggest screen in the house using a compatible controller.
GeForce NOW continues to expand access across devices, with Linux support among its latest additions. From PCs and Macs to smartphones, browsers and smart TVs, the cloud delivers high-performance gaming to nearly any screen, on more devices than ever.
Members tap into GeForce RTX power on the screens they already own, extending the value of their membership without extra hardware or complexity. That reach expands again with the launch of the GeForce NOW app on Amazon Fire TV.
GeForce NOW is powering up the living room with the launch of the GeForce NOW app on Amazon Fire TV streaming sticks, leveling up big-screen gaming with PC-quality performance.

The GeForce NOW app on Fire TV is available now, with initial support for the Fire TV Stick 4K Plus (2nd Gen) and Fire TV Stick 4K Max (2nd Gen) running Fire OS 8.1.6.0 and later, as well as the Fire TV Stick 4K Max (1st Gen) running Fire OS 7.7.1.1 and later. Streams run at up to 1080p at 60 frames per second with standard-dynamic-range graphics, H.264 video encoding and stereo audio. Fire TV owners now have an even more powerful way to play on the big screen.
Connect a controller, open the app and stream favorite titles instantly with RTX power behind every pixel — no console required. Cloud gaming has never felt more at home.

Torment: Tides of Numenera, inXile Entertainment’s story-focused role-playing game, puts emphasis on choices, characters and dialogue, mixing science fiction and fantasy into a strange, colorful world where every conversation and decision matters.
The adventure takes place in the “Ninth World,” a far-future version of Earth built on the ruins of incredibly advanced civilizations. Step into the role of the Last Castoff, a person connected to a powerful, mysterious figure, and get pulled into a mystery that asks one big question: What does a single life truly mean? Expect memorable companions, tough moral choices and plenty of odd, fascinating situations rather than a typical hero-saves-the-world story.
On GeForce NOW, this story-heavy experience is easy to jump into, no high-end rig required. Gaming sessions can pick up right where they left off across supported devices, making it simple for gamers to sink back into the Ninth World whenever the story calls.
Arcade icons and blue‑bomber brawls light up GeForce NOW this week with a retro‑powered surge from Capcom, featuring Mega Man 11, Street Fighter 30th Anniversary Collection, Capcom Fighting Collection and the Capcom Beat ’Em Up Bundle. Members can catch them in the cloud, reliving the arcade’s glory days and timeless showdowns — now streaming anywhere to members’ devices.

Mega Man is back — Mega Man 11 is the latest entry in the iconic series that blends classic, challenging 2D platforming action with a fresh visual style.

Celebrate Street Fighter’s historic legacy with the Street Fighter 30th Anniversary Collection. Of the dozen Street Fighter titles in this collection, four groundbreaking entries let gamers play with friends and relive the arcade experience through the online Arcade Mode.

Capcom Fighting Collection includes 10 of Capcom’s most popular arcade games in one bumper collection, from series such as Street Fighter and Darkstalkers to Cyberbots and the first home console port of Red Earth. It’s the perfect collection for arcade veterans and newcomers alike.

Relive the glory days of cooperative arcade games with the Capcom Beat ‘Em Up Bundle. This comprehensive collection includes seven classic titles, each with various multiplayer options and online capabilities.

Reanimal is a twisted little nightmare that leans into creepy-cute horror with a co-op twist. Guide a scrappy brother-sister duo through a haunting, off-kilter world where every shadow hides something worse, every puzzle feels like a dare and “sticking together” isn’t just smart — it’s survival.
In addition, members can look for the following:
What are you planning to play this weekend? Let us know on X or in the comments below.
Our Reddit community is hosting a giveaway!
Here’s your chance to win:
HOTAS Thrustmaster One
Amazon Fire TV Sticks
G515 Keyboard
G522 Headset
G502X Mouse
Racing Wheels (G920/G29)

How to enter: https://t.co/fZ9mKx9p61 #6YearsofGFN
— NVIDIA GeForce NOW (@NVIDIAGFN) February 11, 2026
For more than a decade, MIT Associate Professor Rafael Gómez-Bombarelli has used artificial intelligence to create new materials. As the technology has expanded, so have his ambitions.
Now, the newly tenured professor in materials science and engineering believes AI is poised to transform science in ways never before possible. His work at MIT and beyond is devoted to accelerating that future.
“We’re at a second inflection point,” Gómez-Bombarelli says. “The first one was around 2015 with the first wave of representation learning, generative AI, and high-throughput data in some areas of science. Those are some of the techniques I first brought into my lab at MIT. Now I think we’re at a second inflection point, mixing language and merging multiple modalities into general scientific intelligence. We’re going to have all the model classes and scaling laws needed to reason about language, reason over material structures, and reason over synthesis recipes.”
Gómez-Bombarelli’s research combines physics-based simulations with approaches like machine learning and generative AI to discover new materials with promising real-world applications. His work has led to new materials for batteries, catalysts, plastics, and organic light-emitting diodes (OLEDs). He has also co-founded multiple companies and served on scientific advisory boards for startups applying AI to drug discovery, robotics, and more. His latest company, Lila Sciences, is working to build a scientific superintelligence platform for the life sciences, chemical, and materials science industries.
All of that work is designed to ensure the future of scientific research is more seamless and productive than research today.
“AI for science is one of the most exciting and aspirational uses of AI,” Gómez-Bombarelli says. “Other applications for AI have more downsides and ambiguity. AI for science is about bringing a better future forward in time.”
From experiments to simulations
Gómez-Bombarelli grew up in Spain and gravitated toward the physical sciences from an early age. In 2001, he won a Chemistry Olympiad competition, setting him on an academic track in chemistry, which he studied as an undergraduate at his hometown college, the University of Salamanca. Gómez-Bombarelli stuck around for his PhD, where he investigated the function of DNA-damaging chemicals.
“My PhD started out experimental, and then I got bitten by the bug of simulation and computer science about halfway through,” he says. “I started simulating the same chemical reactions I was measuring in the lab. I like the way programming organizes your brain; it felt like a natural way to organize one’s thinking. Programming is also a lot less limited by what you can do with your hands or with scientific instruments.”
Next, Gómez-Bombarelli went to Scotland for a postdoctoral position, where he studied quantum effects in biology. Through that work, he connected with Alán Aspuru-Guzik, a chemistry professor at Harvard University, whom he joined for his next postdoc in 2014.
“I was one of the first people to use generative AI for chemistry in 2016, and I was on the first team to use neural networks to understand molecules in 2015,” Gómez-Bombarelli says. “It was the early, early days of deep learning for science.”
Gómez-Bombarelli also began working to eliminate manual parts of molecular simulations to run more high-throughput experiments. He and his collaborators ended up running hundreds of thousands of calculations across materials, discovering hundreds of promising materials for testing.
After two years in the lab, Gómez-Bombarelli and Aspuru-Guzik started a general-purpose materials computation company, which eventually pivoted to focus on producing organic light-emitting diodes. Gómez-Bombarelli joined the company full-time and calls it the hardest thing he’s ever done in his career.
“It was amazing to make something tangible,” he says. “Also, after seeing Aspuru-Guzik run a lab, I didn’t want to become a professor. My dad was a professor in linguistics, and I thought it was a mellow job. Then I saw Aspuru-Guzik with a 40-person group, and he was on the road 120 days a year. It was insane. I didn’t think I had that type of energy and creativity in me.”
In 2018, Aspuru-Guzik suggested Gómez-Bombarelli apply for a new position in MIT’s Department of Materials Science and Engineering. But, with his trepidation about a faculty job, Gómez-Bombarelli let the deadline pass. Aspuru-Guzik confronted him in his office, slammed his hands on the table, and told him, “You need to apply for this.” It was enough to get Gómez-Bombarelli to put together a formal application.
Fortunately at his startup, Gómez-Bombarelli had spent a lot of time thinking about how to create value from computational materials discovery. During the interview process, he says, he was attracted to the energy and collaborative spirit at MIT. He also began to appreciate the research possibilities.
“Everything I had been doing as a postdoc and at the company was going to be a subset of what I could do at MIT,” he says. “I was making products, and I still get to do that. Suddenly, my universe of work was a subset of this new universe of things I could explore and do.”
It’s been nine years since Gómez-Bombarelli joined MIT. Today his lab focuses on how the composition, structure, and reactivity of atoms impact material performance. He has also used high-throughput simulations to create new materials and helped develop tools for merging deep learning with physics-based modeling.
“Physics-based simulations make data, and AI algorithms get better the more data you give them,” Gómez-Bombarelli says. “There are all sorts of virtuous cycles between AI and simulations.”
The research group he has built is solely computational — they don’t run physical experiments.
“It’s a blessing because we can have a huge amount of breadth and do lots of things at once,” he says. “We love working with experimentalists and try to be good partners with them. We also love to create computational tools that help experimentalists triage the ideas coming from AI.”
Gómez-Bombarelli is also still focused on the real-world applications of the materials he invents. His lab works closely with companies and organizations like MIT’s Industrial Liaison Program to understand the material needs of the private sector and the practical hurdles of commercial development.
Accelerating science
As excitement around artificial intelligence has exploded, Gómez-Bombarelli has seen the field mature. Companies like Meta, Microsoft, and Google’s DeepMind now regularly conduct physics-based simulations reminiscent of what he was working on back in 2016. In November, the U.S. Department of Energy launched the Genesis Mission to accelerate scientific discovery, national security, and energy dominance using AI.
“AI for simulations has gone from something that maybe could work to a consensus scientific view,” Gómez-Bombarelli says. “We’re at an inflection point. Humans think in natural language, we write papers in natural language, and it turns out these large language models that have mastered natural language have opened up the ability to accelerate science. We’ve seen that scaling works for simulations. We’ve seen that scaling works for language. Now we’re going to see how scaling works for science.”
When he first came to MIT, Gómez-Bombarelli says he was blown away by how non-competitive things were between researchers. He tries to bring that same positive-sum thinking to his research group, which is made up of about 25 graduate students and postdocs.
“We’ve naturally grown into a really diverse group, with a diverse set of mentalities,” Gómez-Bombarelli says. “Everyone has their own career aspirations and strengths and weaknesses. Figuring out how to help people be the best versions of themselves is fun. Now I’ve become the one insisting that people apply to faculty positions after the deadline. I guess I’ve passed that baton.”
A firm that wants to use a large language model (LLM) to summarize sales reports or triage customer inquiries can choose between hundreds of unique LLMs with dozens of model variations, each with slightly different performance.
To narrow down the choice, companies often rely on LLM ranking platforms, which gather user feedback on model interactions to rank the latest LLMs based on how they perform on certain tasks.
But MIT researchers found that a handful of user interactions can skew the results, leading someone to mistakenly believe one LLM is the ideal choice for a particular use case. Their study reveals that removing a tiny fraction of crowdsourced data can change which models are top-ranked.
They developed a fast method to test ranking platforms and determine whether they are susceptible to this problem. The evaluation technique identifies the individual votes most responsible for skewing the results so users can inspect these influential votes.
The researchers say this work underscores the need for more rigorous strategies to evaluate model rankings. While they didn’t focus on mitigation in this study, they provide suggestions that may improve the robustness of these platforms, such as gathering more detailed feedback to create the rankings.
The study also offers a word of warning to users who may rely on rankings when making decisions about LLMs that could have far-reaching and costly impacts on a business or organization.
“We were surprised that these ranking platforms were so sensitive to this problem. If it turns out the top-ranked LLM depends on only two or three pieces of user feedback out of tens of thousands, then one can’t assume the top-ranked LLM is going to be consistently outperforming all the other LLMs when it is deployed,” says Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS); a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society; an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author of this study.
She is joined on the paper by lead authors and EECS graduate students Jenny Huang and Yunyi Shen as well as Dennis Wei, a senior research scientist at IBM Research. The study will be presented at the International Conference on Learning Representations.
Dropping data
While there are many types of LLM ranking platforms, the most popular variations ask users to submit a query to two models and pick which LLM provides the better response.
The platforms aggregate the results of these matchups to produce rankings that show which LLM performed best on certain tasks, such as coding or visual understanding.
By choosing a top-performing LLM, a user likely expects that model’s top ranking to generalize, meaning it should outperform other models on their similar, but not identical, application with a set of new data.
The MIT researchers previously studied generalization in areas like statistics and economics. That work revealed certain cases where dropping a small percentage of data can change a model’s results, indicating that those studies’ conclusions might not hold beyond their narrow setting.
The researchers wanted to see if the same analysis could be applied to LLM ranking platforms.
“At the end of the day, a user wants to know whether they are choosing the best LLM. If only a few prompts are driving this ranking, that suggests the ranking might not be the end-all-be-all,” Broderick says.
But testing the data-dropping phenomenon manually would be impossible. For instance, one ranking the researchers evaluated had more than 57,000 votes. Testing a data drop of 0.1 percent would mean removing every possible subset of 57 votes out of the 57,000 (there are more than 10^194 such subsets) and recalculating the ranking each time.
Instead, the researchers developed an efficient approximation method, based on their prior work, and adapted it to fit LLM ranking systems.
“While we have theory to prove the approximation works under certain assumptions, the user doesn’t need to trust that. Our method tells the user the problematic data points at the end, so they can just drop those data points, re-run the analysis, and check to see if they get a change in the rankings,” she says.
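For intuition, the toy sketch below reproduces the phenomenon on a miniature Bradley-Terry leaderboard, the standard model behind pairwise LLM rankings. It brute-forces single-vote drops by refitting, which is only feasible at toy scale; the researchers’ contribution is an efficient approximation that avoids exactly this exhaustive refitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 3
true_skill = np.array([0.0, 0.05, -0.4])   # models 0 and 1 are nearly tied

# Simulate crowdsourced pairwise votes: (model A, model B) and who won.
pairs = rng.integers(0, n_models, size=(200, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
p_first = 1 / (1 + np.exp(-(true_skill[pairs[:, 0]] - true_skill[pairs[:, 1]])))
wins = rng.random(len(pairs)) < p_first     # True: first model won the matchup

def fit_scores(pairs, wins, iters=500, lr=1.0):
    """Fit Bradley-Terry scores by gradient ascent on the log-likelihood."""
    s = np.zeros(n_models)
    for _ in range(iters):
        resid = wins - 1 / (1 + np.exp(-(s[pairs[:, 0]] - s[pairs[:, 1]])))
        grad = np.zeros(n_models)
        np.add.at(grad, pairs[:, 0], resid)
        np.add.at(grad, pairs[:, 1], -resid)
        s += lr * grad / len(pairs)
        s -= s.mean()                        # scores are defined only up to a shift
    return s

top = int(np.argmax(fit_scores(pairs, wins)))
flips = [i for i in range(len(pairs))
         if int(np.argmax(fit_scores(np.delete(pairs, i, axis=0),
                                     np.delete(wins, i)))) != top]
print(f"top-ranked model: {top}; single votes whose removal flips it: {len(flips)}")
```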
Surprisingly sensitive
When the researchers applied their technique to popular ranking platforms, they were surprised to see how few data points they needed to drop to cause significant changes in the top LLMs. In one instance, removing just two votes out of more than 57,000, which is 0.0035 percent, changed which model is top-ranked.
A different ranking platform, which uses expert annotators and higher quality prompts, was more robust. Here, removing 83 out of 2,575 evaluations (about 3 percent) flipped the top models.
Their examination revealed that many influential votes may have been a result of user error. In some cases, it appeared there was a clear answer as to which LLM performed better, but the user chose the other model instead, Broderick says.
“We can never know what was in the user’s mind at that time, but maybe they mis-clicked or weren’t paying attention, or they honestly didn’t know which one was better. The big takeaway here is that you don’t want noise, user error, or some outlier determining which is the top-ranked LLM,” she adds.
The researchers suggest that gathering additional feedback from users, such as confidence levels in each vote, would provide richer information that could help mitigate this problem. Ranking platforms could also use human mediators to assess crowdsourced responses.
For the researchers’ part, they want to continue exploring generalization in other contexts while also developing better approximation methods that can capture more examples of non-robustness.
“Broderick and her students’ work shows how you can get valid estimates of the influence of specific data on downstream processes, despite the intractability of exhaustive calculations given the size of modern machine-learning models and datasets,” says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved with this work. “The recent work provides a glimpse into the strong data dependencies in routinely applied — but also very fragile — methods for aggregating human preferences and using them to update a model. Seeing how few preferences could really change the behavior of a fine-tuned model could inspire more thoughtful methods for collecting these data.”
This research is funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the National Science Foundation, Amazon, and a CSAIL seed award.
Whether you’re a scientist brainstorming research ideas or a CEO hoping to automate a task in human resources or finance, you’ll find that artificial intelligence tools are becoming the assistants you didn’t know you needed. In particular, many professionals are tapping into the talents of semi-autonomous software systems called AI agents, which can call on AI at specific points to solve problems and complete tasks.
AI agents are particularly effective when they use large language models (LLMs) because those systems are powerful, efficient, and adaptable. One way to program such technology is by describing in code what you want your system to do (the “workflow”), including when it should use an LLM. If you were a software company trying to revamp your old codebase to use a more modern programming language for better optimizations and safety, you might build a system that uses an LLM to translate the codebase one file at a time, testing each file as you go.
But what happens when LLMs make mistakes? You’ll want the agent to backtrack to make another attempt, incorporating lessons it learned from previous mistakes. Coding this up can take as much effort as implementing the original agent; if your system for translating a codebase contained thousands of lines of code, then you’d be making thousands of lines of code changes or additions to support the logic for backtracking when LLMs make mistakes.
To save programmers time and effort, researchers with MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Asari AI have developed a framework called “EnCompass.”
With EnCompass, you no longer have to make these changes yourself. Instead, when EnCompass runs your program, it automatically backtracks if LLMs make mistakes. EnCompass can also make clones of the program runtime to make multiple attempts in parallel in search of the best solution. In full generality, EnCompass searches over the different possible paths your agent could take as a result of the different possible outputs of all the LLM calls, looking for the path where the LLM finds the best solution.
Then, all you have to do is to annotate the locations where you may want to backtrack or clone the program runtime, as well as record any information that may be useful to the strategy used to search over the different possible execution paths of your agent (the search strategy). You can then separately specify the search strategy — you could either use one that EnCompass provides out of the box or, if desired, implement your own custom search strategy.
“With EnCompass, we’ve separated the search strategy from the underlying workflow of an AI agent,” says lead author Zhening Li ’25, MEng ’25, who is an MIT electrical engineering and computer science (EECS) PhD student, CSAIL researcher, and research consultant at Asari AI. “Our framework lets programmers easily experiment with different search strategies to find the one that makes the AI agent perform the best.”
EnCompass was used for agents implemented as Python programs that call LLMs, where it demonstrated noticeable code savings. EnCompass reduced the coding effort of implementing search by up to 80 percent across agents, such as one for translating code repositories and another for discovering transformation rules of digital grids. In the future, EnCompass could enable agents to tackle large-scale tasks, including managing massive code libraries, designing and carrying out science experiments, and creating blueprints for rockets and other hardware.
Branching out
When programming your agent, you mark particular operations — such as calls to an LLM — where results may vary. These annotations are called “branchpoints.” If you imagine your agent program as generating a single plot line of a story, then adding branchpoints turns the story into a choose-your-own-adventure story game, where branchpoints are locations where the plot branches into multiple future plot lines.
You can then specify the strategy that EnCompass uses to navigate that story game, in search of the best possible ending to the story. This can include launching parallel threads of execution or backtracking to a previous branchpoint when you get stuck in a dead end.
Users can also plug-and-play a few common search strategies provided by EnCompass out of the box, or define their own custom strategy. For example, you could opt for Monte Carlo tree search, which builds a search tree by balancing exploration and exploitation, or beam search, which keeps the best few outputs from every step. EnCompass makes it easy to experiment with different approaches to find the best strategy to maximize the likelihood of successfully completing your task.
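The sketch below shows the shape of that separation. It is hypothetical: the names and interface are illustrative assumptions rather than EnCompass’s actual API. One function plays the role of a workflow step with a branchpoint, and a beam-search strategy, defined separately, explores the resulting tree.

```python
# Hypothetical illustration of separating an agent's workflow from its
# search strategy (names are assumptions, not EnCompass's real API).

def workflow_step(state, step):
    """Stand-in for one workflow step that calls an LLM: a branchpoint
    returning several candidate continuations, each with a quality score."""
    return [(state + [f"step{step}-{c}"], score)
            for c, score in (("a", 0.9), ("b", 0.6), ("c", 0.3))]

def beam_search(n_steps, beam_width=2):
    """Search strategy, specified independently of the workflow above."""
    beam = [(0.0, [])]                                  # (cumulative score, state)
    for step in range(n_steps):
        candidates = [(total + score, nxt)
                      for total, state in beam
                      for nxt, score in workflow_step(state, step)]
        # Keep only the best few partial executions; a backtracking or
        # Monte Carlo tree search strategy could be swapped in here instead.
        beam = sorted(candidates, key=lambda t: t[0], reverse=True)[:beam_width]
    return max(beam, key=lambda t: t[0])

score, path = beam_search(3)
print(score, path)   # 2.7 ['step0-a', 'step1-a', 'step2-a']
```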
The coding efficiency of EnCompass
So just how code-efficient is EnCompass for adding search to agent programs? According to the researchers’ findings, the framework drastically cut down how much code programmers needed to add to their agent programs to support search, helping them experiment with different strategies to find the one that performs best.
For example, the researchers applied EnCompass to an agent that translates a repository of code from the Java programming language, which is commonly used to program apps and enterprise software, to Python. They found that implementing search with EnCompass — mainly involving adding branchpoint annotations and annotations that record how well each step did — required 348 fewer lines of code (about 82 percent) than implementing it by hand. They also demonstrated how EnCompass enabled them to easily try out different search strategies, identifying the best strategy to be a two-level beam search algorithm, achieving an accuracy boost of 15 to 40 percent across five different repositories at a search budget of 16 times the LLM calls made by the agent without search.
“As LLMs become a more integral part of everyday software, it becomes more important to understand how to efficiently build software that leverages their strengths and works around their limitations,” says co-author Armando Solar-Lezama, who is an MIT professor of EECS and CSAIL principal investigator. “EnCompass is an important step in that direction.”
The researchers add that EnCompass targets agents where a program specifies the steps of the high-level workflow; the current iteration of their framework is less applicable to agents that are entirely controlled by an LLM. “In those agents, instead of having a program that specifies the steps and then using an LLM to carry out those steps, the LLM itself decides everything,” says Li. “There is no underlying programmatic workflow, so you can execute inference-time search on whatever the LLM invents on the fly. In this case, there’s less need for a tool like EnCompass that modifies how a program executes with search and backtracking.”
Li and his colleagues plan to extend EnCompass to more general search frameworks for AI agents. They also plan to test their system on more complex tasks to refine it for real-world uses, including at companies. What’s more, they’re evaluating how well EnCompass helps agents work with humans on tasks like brainstorming hardware designs or translating much larger code libraries. For now, EnCompass is a powerful building block that enables humans to tinker with AI agents more easily, improving their performance.
“EnCompass arrives at a timely moment, as AI-driven agents and search-based techniques are beginning to reshape workflows in software engineering,” says Carnegie Mellon University Professor Yiming Yang, who wasn’t involved in the research. “By cleanly separating an agent’s programming logic from its inference-time search strategy, the framework offers a principled way to explore how structured search can enhance code generation, translation, and analysis. This abstraction provides a solid foundation for more systematic and reliable search-driven approaches to software development.”
Li and Solar-Lezama wrote the paper with two Asari AI researchers: Caltech Professor Yisong Yue, an advisor at the company; and senior author Stephan Zheng, who is the founder and CEO. Their work was supported by Asari AI.
The team’s work was presented at the Conference on Neural Information Processing Systems (NeurIPS) in December.
Break out the cake and green sprinkles — GeForce NOW is turning six.
Since launch, members have streamed over 1 billion hours, and the party’s just getting started.
Throughout February, members can look forward to new games, fresh ways to play across more devices and even more ways to bring RTX power to every screen in the house.
There’s plenty to celebrate: the February games list kicks off with 24 new games. Start with the 10 new games in the cloud this week, including the launch of Team Jade’s Delta Force and the newest title in the PUBG universe, PUBG: BLINDSPOT.

Delta Force, now boots on the ground and fully deployed on GeForce NOW, brings the tactical first‑person shooter from Team Jade (TiMi Studio Group) to the cloud. The game features high-stakes extraction with an all-out warfare mode, giving players a playground of open environments, vehicles and gadgets to pull off coordinated assaults.
Players join elite units tasked with tackling high‑risk missions across sprawling maps, from tight urban incursions to rugged open‑terrain operations. Expect strategic objectives, combined‑arms combat with land, air and sea vehicles, and tense firefights where teamwork and planning are just as important as quick reflexes.
On GeForce NOW, Delta Force leans into its high‑octane personality: fast drops, big maps and cinematic engagements that look sharp and feel responsive across devices. Members can squad up from almost anywhere, enjoy high‑resolution streaming and smooth performance, and stay ready for every op without waiting on downloads or big updates before jumping into the next mission.

PUBG: BLINDSPOT, a new spin-off set in the PUBG universe from Krafton, expands the franchise with a standalone 5v5 top-down tactical shooter.
Set across tightly designed maps, matches focus on information, positioning and coordinated team play, with squads clearing angles, locking down objectives and outmaneuvering opponents in fast, round-based firefights. Every callout and ability use matters, turning each round into a layered tactical puzzle rather than a simple test of reflexes.
On GeForce NOW, responsive streaming and sharp RTX-powered visuals keep every angle, rotation and clutch play feeling precise, even on lower-powered devices. With support across a wide range of screens, the cloud makes it easy for the gaming squad to jump right into the action, without lengthy downloads or updates getting in the way.
Kick off GeForce NOW’s anniversary month in style. Here’s what’s in store to start the celebration, with this week’s 10 new additions:
In addition to this week’s additions of Menace, PUBG: BLINDSPOT and Carmageddon: Rogue Shift, this game will also be GeForce RTX 5080-ready this week:
And look forward to the games coming throughout the rest of the month:
In addition to the 14 games announced last month, 21 more joined the GeForce NOW library:
Nova Roma is now set to launch in March and will arrive in the cloud when it debuts. Stay tuned to GFN Thursday for more details.
What are you planning to play this weekend? Let us know on X or in the comments below.
Hopping into this convo.
2016 vs. 2026
For @CandraHastings, it looks like growth – share yours.
pic.twitter.com/o9p3oBBQYD
— NVIDIA GeForce NOW (@NVIDIAGFN) February 3, 2026
Editor’s note: This post is part of the Nemotron Labs blog series, which explores how the latest open models, datasets and training techniques help businesses build specialized AI systems and applications on NVIDIA platforms. Each post highlights practical ways to use an open stack to deliver value in production — from transparent research copilots to scalable AI agents.
Businesses today face the challenge of uncovering valuable insights buried within a wide variety of documents — including reports, presentations, PDFs, web pages and spreadsheets.
Often, teams piece together insights by manually reviewing files, copying data into spreadsheets, building dashboards and using basic search or template-based optical character recognition (OCR) tools that miss important details in complex media.
Intelligent document processing is an AI-powered workflow that automatically reads, understands and extracts insights from documents. It interprets rich formats inside those documents — including tables, charts, images and text — using AI agents and techniques like retrieval-augmented generation (RAG) to turn the multimodal content into insights that other multi-agent systems and people can easily use.
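At the heart of such a RAG workflow is a retrieval step: chunks of parsed documents are embedded as vectors, and a query is matched against them by similarity. The sketch below is a minimal, self-contained illustration; the hashed bag-of-words “embedder” is a toy stand-in for a real embedding model such as the Nemotron RAG retrievers.

```python
import numpy as np

rng = np.random.default_rng(1)
PROJ = rng.normal(size=(64, 128))   # fixed random projection for the toy embedder

def embed(text: str) -> np.ndarray:
    """Toy embedding: hashed bag-of-words through a fixed random projection.
    (Stable within one process run; a real system would call an embedding model.)"""
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    e = v @ PROJ
    return e / (np.linalg.norm(e) + 1e-9)

docs = [
    "Q3 revenue grew 12% driven by subscription renewals.",
    "The warranty covers parts and labor for 24 months.",
    "Table 4 lists energy consumption per production line.",
]
doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 2):
    sims = doc_vecs @ embed(query)          # cosine similarity of unit vectors
    return [(docs[i], float(sims[i])) for i in np.argsort(-sims)[:k]]

# The retrieved passages would then be handed to an LLM as grounding
# context, with citations pointing back to the source chunks.
print(retrieve("how much did revenue grow"))
```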
With NVIDIA Nemotron open models and GPU-accelerated libraries, organizations can build AI-powered document intelligence systems for research, financial services, legal workflows and more.
These open models, datasets and training recipes have powered strong results on leaderboards such as MTEB, MMTEB and ViDoRe V3, benchmarks for evaluating multilingual and multimodal retrieval models. Teams can choose from among the best models for tasks like search and question answering.
Document intelligence systems that can pull meaning from complex layouts, scale to huge file libraries and show exactly where an answer came from are incredibly useful in high-stakes environments. These systems:

The result is a shift from static document archives to living knowledge systems that directly power business intelligence, customer experiences and operational workflows.
Intelligent document processing systems built on NVIDIA Nemotron RAG models, Nemotron Parse and accelerated computing are already reshaping how organizations across industries gain insights from their documents.
Justt: AI-Native Chargeback Management and Dispute Optimization
In financial services, payment disputes create significant revenue loss and operational complexity for merchants, largely because the evidence needed to handle them lives in unstructured formats. Transaction logs, customer communications and policy documents are often fragmented across systems and difficult to process at scale, making dispute handling slow, manual and costly.
Justt.ai provides an AI-driven platform that automates the full chargeback lifecycle at scale. The platform connects directly to payment service providers and merchant data sources to ingest transaction data, customer interactions and policies, then automatically assembles dispute-specific evidence that aligns with card network and issuer requirements.
The platform’s AI-powered dispute optimization, powered by Nemotron Parse, applies predictive analytics to determine which chargebacks to fight or accept, and how to optimize each response for maximum net recovery. Leading hospitality operators like HEI Hotels & Resorts use the platform to automate dispute handling across their properties, recapturing revenue while maintaining guest relationships.
By pairing document-centric intelligence with decision automation, merchants can recapture a significant portion of revenue lost to illegitimate chargebacks while reducing manual review effort.
Docusign: Scaling Agreement Intelligence
Docusign is the global leader in Intelligent Agreement Management, handling millions of transactions every day for more than 1.8 million customers and over 1 billion users.
Agreements are the foundation of every business, but the critical information they contain is often buried inside pages of documents. To surface the information, Docusign needed high-fidelity extraction of tables, text and metadata from complex documents like PDFs so organizations could understand and act on obligations, risks and opportunities faster.
Docusign is evaluating Nemotron Parse for deeper contract understanding at scale. Running on NVIDIA GPUs, the model combines advanced AI with layout detection and OCR. The system can reliably interpret complex tables and reconstruct them with the required information. This reduces the need for manual corrections and helps ensure that even the most complex contracts are processed with the speed and accuracy that customers expect.
With this foundation, Docusign will transform agreement repositories into structured data that powers contract search, analysis and AI-driven workflows — turning agreements into business assets that help organizations and their teams improve visibility, reduce risk and make faster decisions.
Edison Scientific: Research Across Massive Literature Scale
Edison Scientific’s Kosmos AI Scientist helps researchers navigate complex scientific landscapes to synthesize literature, identify connections and surface evidence.
Edison needed a way to rapidly and accurately extract structured information from large volumes of PDFs, including equations, tables and figures that traditional information parsing methods often mishandle.
By integrating the NVIDIA Nemotron Parse model into its PaperQA2 pipeline, Edison can decompose research papers, index key concepts and ground responses in specific passages, improving both throughput and answer quality for scientists. This approach turns a sprawling research corpus into an interactive, queryable knowledge engine that accelerates hypothesis generation and literature review.
The high efficiency of Nemotron Parse enables cost-efficient serving at scale, allowing Edison’s team to unlock the whole multimodal pipeline.
A robust, domain-specific document intelligence pipeline requires technologies that can handle data extraction, embedding and reranking, while keeping the data secure and compliant with regulations.
These capabilities are packaged as NVIDIA NIM microservices and foundation models that run efficiently on NVIDIA GPUs, allowing teams to scale from proof of concept to production while keeping sensitive data within their chosen cloud or data center environment.
The most effective AI systems use a mix of frontier models and open source models like NVIDIA Nemotron, with an LLM router analyzing each task and automatically selecting the model best suited for it. This approach keeps performance strong while managing computing costs and improving efficiency.
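In its simplest form, such a router is a policy that maps each incoming task to a model tier. The sketch below is an illustrative assumption of what that policy might look like, not NVIDIA’s router; the model names, rules and prices are placeholders.

```python
# Hypothetical sketch of the LLM-router pattern described above.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    cost_per_mtok: float   # illustrative $ per million tokens

ROUTES = {
    "extraction": Route("open-nemotron-style-model", 0.20),
    "reasoning":  Route("frontier-model", 5.00),
}

def route(task: str) -> Route:
    """Toy policy: a cheap open model handles routine extraction; the
    frontier model is reserved for tasks that look like multi-step reasoning."""
    needs_reasoning = any(w in task.lower()
                          for w in ("why", "compare", "plan", "derive"))
    return ROUTES["reasoning" if needs_reasoning else "extraction"]

print(route("extract the totals from this invoice"))              # cheap open model
print(route("compare these two contracts and plan next steps"))   # frontier model
```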
Access a step-by-step tutorial on how to build a document processing pipeline with RAG capabilities. Explore how Nemotron RAG can power specialized agents tailored for different industries.
Plus, experiment with Nemotron RAG models and the NVIDIA NeMo Retriever open library, available on GitHub and Hugging Face, as well as Nemotron Parse on Hugging Face.
Join the community of developers building with the NVIDIA Blueprint for Enterprise RAG — trusted by a dozen industry-leading AI Data Platform providers and available now on build.nvidia.com, GitHub and the NGC catalog.
Stay up to date on agentic AI, NVIDIA Nemotron and more by subscribing to NVIDIA AI news, joining the community and following NVIDIA AI on LinkedIn, Instagram, X and Facebook.
At 3DEXPERIENCE World in Houston, NVIDIA founder and CEO Jensen Huang and Dassault Systèmes CEO Pascal Daloz laid out a blueprint for industrial AI rooted in physics-based “world models” — systems designed to simulate products, factories and even biological systems before they’re built.
“Artificial intelligence will be infrastructure,” Huang told the crowd, comparing it to water, electricity and the internet, and playfully referring to the engineering-heavy audience as “Solid Workers,” a nod to Dassault Systèmes’ SolidWorks platform.
The announcement continues a collaboration spanning more than a quarter century between NVIDIA and Dassault Systèmes.
“This is the largest collaboration our two companies have ever had in over a quarter century,” Huang said. “We’re going to fuse these technologies so engineers can work at a scale that’s 100 times, 1,000 times — and eventually a million times greater than before.”
The new partnership brings NVIDIA accelerated computing and AI libraries together with Dassault Systèmes’ Virtual Twin platforms to move more engineering work into real-time digital workflows, powered by AI companions that help teams explore, validate, prototype and iterate faster.
Huang framed the shift as a reinvention of the computing stack: moving from hand-specified, structured digital designs to systems that can generate, simulate and optimize in software — at industrial scale.
Virtual twins are not applications, “they are knowledge factories,” Daloz said.
The partnership aims to establish industry world models — science-validated AI systems grounded in physics that can serve as mission-critical platforms across biology, materials science, engineering and manufacturing.
In Daloz’s framing, the value moves upstream: virtual twins become the place where knowledge is created, tested, and trusted — before anything is built in the physical world.
Dassault Systèmes, whose 3DEXPERIENCE platform serves more than 45 million users and 400,000 customers globally, has long been a leader in virtual twin technology — digital replicas that let engineers simulate products and processes before building them physically.
The collaboration brings together accelerated computing, AI and digital twin technologies so engineers can design not only geometry, but behavior — and explore radically larger design spaces earlier in development.
Together, the companies outlined how this shared architecture will show up across science, engineering and manufacturing workflows:
Huang said that in domains like biology and materials, the frontier is learning the underlying “language” of complex systems and then generating new options that can be evaluated and validated in simulation.
A central theme of the discussion was how factories themselves are changing — from static physical assets to living systems that are designed, simulated and operated as virtual twins.
As part of the partnership, Dassault Systèmes is deploying NVIDIA-powered AI factories on three continents through its OUTSCALE sovereign cloud, enabling customers to run AI workloads while maintaining data residency and security requirements.
Both executives emphasized that the goal isn’t to replace engineers — it’s to amplify them. As AI agent companions take on more exploratory and repetitive tasks, designers and engineers gain leverage and creativity, not redundancy.
Every designer will have a “team of companions,” Huang said — a shift he described as fundamentally positive for engineers, software platforms and the broader ecosystem built on them.
For the tens of millions of engineers who use Dassault Systèmes tools to design everything from aircraft to consumer packaged goods, the shift isn’t about replacing human creativity — it’s about expanding it.
“Success is not about automation,” Daloz said. “[Engineers] don’t want to automate the past — they want to invent the future.”
Looking ahead, Daloz framed the partnership as about more than performance gains – it’s an effort to open new possibilities, help companies eliminate bad choices before they become expensive mistakes, and create entirely new categories of products.
“Virtual twins and the 3D Universes are not applications,” Daloz said. “They are knowledge factories.”
The fireside conversation between Huang and Daloz was broadcast live from 3DEXPERIENCE World.