A language model that thinks before speaking
AI technologies based on large language models (LLMs) have empowered users to automate workflows, improve their written communications, summarize long documents, and much more, in both their personal and professional lives. When it comes to scientific endeavors, however, these models still have a long way to go before they can be trusted to provide key insights into experimental design, understand the significance of project results, or begin to appreciate the real-world scientific phenomena involved.
Lawrence Livermore computer scientist Bhavya Kailkhura explains, “Researchers are excited to leverage AI tools like LLMs for solving scientific challenges. However, existing approaches rely heavily on verbally articulated intermediate steps and often fall short in capturing the complex, non-verbalized scientific patterns commonly encountered in the applications pursued at Livermore and across the DOE. Take protein–protein interactions (PPI) as an example. Understanding PPIs, which are critical to cellular function, demands modeling multi-scale, context-dependent dynamics that words alone cannot sufficiently convey. Relying solely on verbal reasoning can introduce risks such as hallucination errors and omission of critical information, leading to incomplete or inaccurate scientific understanding.”
For Kailkhura and fellow Livermore researcher Brian Bartoldson, one key to making LLMs’ scientific reasoning more reliable and grounded in physics lies in reworking the models’ internal architecture. Created in partnership with researchers from the University of Tübingen and the University of Maryland, the team’s new proof-of-concept LLM, Huginn, represents a new breed of language models that emphasizes careful introspection over immediate yet often incomplete answers.
The “intelligence” of a machine learning (ML) model is said to scale roughly with the logarithm of the computing resources used to train it; that is, doubling the compute and data used to train a system will not make it twice as capable, only a fraction more so. Continuing to train new AI models with larger and larger amounts of data is therefore neither an efficient nor an effective approach to improving mainstream models’ abilities going forward.
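To make the logarithmic relationship concrete, the toy Python snippet below shows how each doubling of training compute adds only a fixed increment to a hypothetical capability score. The constants and the formula are made up for illustration; real scaling-law fits are more nuanced.

```python
import math

# Toy illustration of logarithmic scaling: capability grows with
# log(compute), so each doubling of compute adds only a constant bump.
# The constants a and b are hypothetical, chosen purely for illustration.
a, b = 10.0, 2.0

def toy_capability(compute_flops):
    return a + b * math.log2(compute_flops)

for compute in [1e21, 2e21, 4e21, 8e21]:  # hypothetical training budgets
    print(f"{compute:.0e} FLOPs -> capability score {toy_capability(compute):.1f}")

# Each doubling adds the same +2.0 to the score; the score never doubles.
```

The takeaway is that brute-force scaling faces diminishing returns, which motivates the team's focus on using a fixed budget of data and compute more intelligently.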
Instead, the team investigated how a language model can make better use of the data provided to it. A fundamental innovation of Huginn is to incentivize continuous reasoning in latent space. Traditionally, language models first dice up natural language input (such as a user’s text prompt) into small chunks known as tokens. They then consult a table of word embeddings to translate each token into numerical information that can be mathematically manipulated by an ML architecture called a “transformer”; the intermediate states of these calculations exist in “latent space,” or “hidden space.” As a conventional model computes, it constantly translates from hidden space back into natural language text that shows the user its thinking process. Huginn’s creators reasoned that allowing the model to perform more calculations in hidden space before producing natural language output would give it the opportunity to think critically and without interruption.
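As a rough sketch of that standard pipeline, the following Python code (using PyTorch, with hypothetical sizes and variable names, not Huginn's actual code) shows tokens entering latent space through an embedding table, being refined by transformer computation, and only then being decoded back into token predictions. The extra latent iterations stand in for the uninterrupted "thinking" that Huginn encourages.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for a minimal sketch of the tokens -> latent -> tokens loop.
vocab_size, d_model = 50_000, 512

embed = nn.Embedding(vocab_size, d_model)             # token id -> latent vector
block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
unembed = nn.Linear(d_model, vocab_size)              # latent vector -> token logits

token_ids = torch.randint(0, vocab_size, (1, 16))     # stand-in for a tokenized prompt
hidden = embed(token_ids)                             # enter latent (hidden) space

# A conventional LLM decodes back to text after a fixed stack of layers;
# the idea behind latent reasoning is to keep refining `hidden` here
# instead of verbalizing every intermediate step.
for _ in range(4):                                    # extra latent "thinking" steps
    hidden = block(hidden)

logits = unembed(hidden)                              # leave latent space
next_token = logits[0, -1].argmax()                   # predict the next token
```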
“There is not necessarily training data available for reasoning about nuanced scientific phenomena, for example, the biochemical complexities of molecular interactions. Depth-recurrent AI models can perform iterative reasoning within latent space rather than relying on emulating verbal reasoning steps, presenting a potential way forward for LLM reasoning and advances in science. Moreover, by performing logic steps solely inside the embedding space, Huginn can reason more efficiently,” says Bartoldson.
To facilitate this process, the team incorporated a recurrent element into Huginn’s transformer architecture. Unlike most LLMs, which have a pre-defined number of neurons and computational layers, Huginn more closely mimics human neurophysiology: its recurrent block can be applied repeatedly, allowing the neural network’s effective depth to grow as needed. This design lets Huginn continue searching for the best output by devoting more computation to learning and deploying logical strategies rather than merely memorizing training data. In short, it thinks longer before speaking.
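The snippet below is a minimal sketch of that depth-recurrence idea under simplifying assumptions, not Huginn's actual implementation: a single transformer block with one set of weights is re-applied a variable number of times, so a harder prompt can receive more computation without adding any new parameters.

```python
import torch
import torch.nn as nn

class DepthRecurrentCore(nn.Module):
    """Sketch of a depth-recurrent block: one set of weights, applied
    repeatedly, so effective depth (and compute) can grow at test time.
    A hypothetical simplification of the idea, not Huginn's exact code."""

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, hidden, num_steps):
        # Re-apply the same block num_steps times; harder prompts can be
        # given more iterations with zero additional parameters.
        for _ in range(num_steps):
            hidden = self.block(hidden)
        return hidden

core = DepthRecurrentCore()
x = torch.randn(1, 16, 512)     # latent states for a 16-token prompt
easy = core(x, num_steps=2)     # light "thinking"
hard = core(x, num_steps=32)    # same weights, much deeper reasoning
```

Because the iteration count is chosen at inference time rather than baked into the architecture, the model can trade time for answer quality on a per-prompt basis.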
By prioritizing thinking before responding, the research team considers Huginn a crucial step towards a more comprehensive scientific reasoning LLM. While Huginn is currently a prototype model, the team aims to incorporate its reasoning capabilities into national laboratories’ mission-specific applications. Kailkhura outlines a comprehensive path forward for such work:
"Our vision is to build COGENT, a next-generation scientific reasoning LLM that stands as the gold standard for trustworthiness and efficiency — two critical pillars for scientific breakthroughs in data-scarce, high-stakes environments. COGENT will navigate the unique challenges of DOE’s scientific mission: where data is often scarce, compute is precious, and trust in the model’s reasoning is non-negotiable. With COGENT, we are not just advancing AI, we are redefining how AI and science converge to unlock the next era of discovery."