April 30, 2024
Welcome New DSI Team Members
When DSI director Brian Giera and deputy director Cindy Gonzales began planning activities for fiscal year 2024 and beyond, they immediately realized that LLNL’s growth in data science and AI/ML research requires corresponding growth in the DSI’s efforts. “Our field is booming,” Giera states. “The Lab has a stake in the larger AI/ML community, especially when it comes to national security, and is investing accordingly in research and supporting programs. The future is wide open.”
For the DSI, a brighter spotlight on data science and AI/ML translates into staff opportunities. Gonzales points out, “The demand signal is clear: We need more opportunities for staff to engage, contribute ideas, and ultimately strengthen the DSI as the field continues to flourish. It has been awesome to see staff at all career levels who are eager to bring new ideas to streamline, maintain, and even grow the DSI like never before.”
Over the past few months, the DSI Council reviewed applications and invited staff to join the DSI team to focus on student programs, the consulting service (see story below), internal and external outreach, and other advisory roles. “We’ve made an investment in our organization, expanding existing focus areas as well as creating new roles,” says Giera.
Aligning LLM Behavior with Human Intent
Large language models (LLMs), exemplified by OpenAI’s ChatGPT and Meta’s LLaMA, often struggle with understanding and following nuances in human instructions, leading to challenges in adopting LLMs for mission-critical applications with complex requirements. LLMs are typically trained on massive raw datasets, then honed with small, curated instruction datasets to improve their usability. However, standard training strategies prioritize the model’s optimization stability over its generalization (adaptation to unfamiliar data)—a trend that may not scale well.
Developed by LLNL researchers Brian Bartoldson and Bhavya Kailkhura with colleagues from the University of Maryland and New York University, a new approach called NEFTune (Noisy Embedding Instruction Fine Tuning) improves the instruction-following ability of LLMs without additional compute or data overhead. It may seem counterintuitive, but NEFTune adds noise to vector embeddings—numerical representations of words and text—of the instruction dataset. NEFTune uses random noise, as other perturbation schemes were found to be inferior. With NEFTune, LLM behavior better aligns with human preferences and instruction following improves.
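The core trick can be sketched in a few lines. The sketch below follows the noise-scaling rule described in the NEFTune paper (uniform noise bounded by a tuning parameter alpha divided by the square root of sequence length times embedding dimension); the function name, the default alpha, and the NumPy setup are illustrative stand-ins, not the team's released code.

```python
import numpy as np

def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Add uniform noise to token embeddings, NEFTune-style.

    Noise is drawn from Uniform(-s, s) with s = alpha / sqrt(L * d),
    where L is sequence length and d is embedding dimension, so the
    perturbation shrinks for longer sequences and wider embeddings.
    """
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility here
    seq_len, dim = embeddings.shape[-2], embeddings.shape[-1]
    scale = alpha / np.sqrt(seq_len * dim)
    noise = rng.uniform(-scale, scale, size=embeddings.shape)
    return embeddings + noise
```

In actual fine-tuning this perturbation would be applied only during training (never at inference), which is why the method adds no compute or data overhead at deployment.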
In one test, the LLaMA-2-7B model performed 35% better on conversational tasks when the team applied NEFTune to the fine-tuning phase. In another test, Stanford University’s Alpaca-2-7B model provided a more detailed description of quantum computing principles with NEFTune than without it. The team hypothesizes that standard fine-tuning recipes “overfit” the model to the instruction dataset, in many cases exactly reproducing instruction inputs as answers. With NEFTune, models do not “lock in” the exact wording of the instruction data. (Image at left: Performance of the LLaMA-2-7B model on various datasets with [striped bars] and without [blue bars] NEFTune, measured as the “win rate” percentage of an industry-standard accuracy benchmark.)
“It’s not completely clear why this is the case, but the LLMs we tested are capable of providing longer, more coherent answers when random noisy embeddings are added to the mix,” says Bartoldson. “Enhancing the instruction-following ability of LLMs not only empowers them to fulfill complex tasks accurately but also fosters safety and reliability, paving the way for trustworthy AI for mission-critical applications.” The team’s paper has been accepted to the International Conference on Learning Representations (ICLR) coming up in May, and their approach has been adopted by the Hugging Face ML platform.
Why Not Both? ICF Design Optimization with Low- and High-Fidelity Models
Inertial confinement fusion (ICF) experiments, like those at LLNL’s National Ignition Facility, depend on numerical simulations to guide design. Traditionally, these experimental designs use low-fidelity (and therefore computationally inexpensive) modeling to identify potentially viable design regions, which are then explored with selected high-fidelity (computationally expensive) modeling. A team of LLNL researchers has pointed out the inefficiency of this two-step optimization approach: It can steer the high-fidelity search toward incorrect regions and waste computational resources on parameter regimes far from the true optimal solution. Instead, they came up with a better way to refine ICF designs.
In a new Physics of Plasmas paper, Jingyi Wang, Nai-Yuan Chiang, Andrew Gillette, and J. Luc Peterson describe an iterative multifidelity Bayesian optimization method based on Gaussian process regression, leveraging surrogate models that rely on simulation results from more than one fidelity. In other words, this method uses low- and high-fidelity models simultaneously. The team collected simulation data from HYDRA, an LLNL-developed multiphysics simulation code, then demonstrated their algorithm’s effectiveness on 2D and 8D test problems. (Image at left: The yields attained over 100 iterations of the multifidelity Bayesian optimization algorithm for three 8D samples [black, blue, green lines] find similar designs to that for high-fidelity samples only [yellow line], but can get to good designs in fewer iterations, at a fraction of the cost of high-fidelity alone.)
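The interplay between cheap and expensive models can be illustrated with a toy optimization loop. This is a simplified sketch of the general multifidelity idea (a Gaussian process learns the correction from a cheap low-fidelity function to an expensive high-fidelity one, and an acquisition rule picks where to spend the next expensive evaluation), not the authors' algorithm; the test functions, kernel lengthscale, and upper-confidence-bound acquisition are all assumptions for illustration.

```python
import numpy as np

def rbf(X1, X2, ls=0.3):
    """Squared-exponential kernel between two point sets."""
    d = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d / (2 * ls**2))

def gp_predict(Xtr, ytr, Xte, ls=0.3, noise=1e-4):
    """Gaussian process posterior mean and variance at test points."""
    K = rbf(Xtr, Xtr, ls) + noise * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr, ls)
    mu = Ks @ np.linalg.solve(K, ytr)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 1e-12)

# Hypothetical stand-ins: a cheap approximate model and the expensive truth.
def f_lo(x): return 0.8 * np.sin(3 * x[:, 0])
def f_hi(x): return np.sin(3 * x[:, 0]) + 0.1 * x[:, 0]

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 200)[:, None]   # candidate designs
X_hi = rng.uniform(0, 1, (3, 1))         # only a few expensive evaluations
for _ in range(10):
    resid = f_hi(X_hi) - f_lo(X_hi)      # learn the lo-to-hi correction
    mu_r, var = gp_predict(X_hi, resid, grid)
    mu = f_lo(grid) + mu_r               # multifidelity surrogate
    ucb = mu + 2.0 * np.sqrt(var)        # acquisition: exploit + explore
    X_hi = np.vstack([X_hi, grid[[np.argmax(ucb)]]])

# Recommend the design the final surrogate believes is best.
mu_final, _ = gp_predict(X_hi, f_hi(X_hi) - f_lo(X_hi), grid)
best_x = float(grid[np.argmax(f_lo(grid) + mu_final), 0])
```

Because the low-fidelity model carries most of the structure, the Gaussian process only has to learn a small correction, so far fewer expensive evaluations are needed than if the high-fidelity model were optimized alone.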
Funded by the Laboratory Directed Research and Development program, the research shows a statistically significant improvement over single-fidelity strategies and could be applied to experimental or mixed simulation–experiment campaigns. “How can we use cheap but fast models in concert with accurate but expensive ones? These data science techniques let us pull humans, and their biases, out of the loop and automatically and rigorously choose the right combinations to efficiently design complex systems,” states Peterson. “While we developed this for ICF, we see applications to a wide variety of science and engineering problems—from manufacturing advanced parts to discovering new materials to finding the next drug.”
Consulting Service Success Story: Millions of Molecules
LLNL’s renowned Forensic Science Center (FSC) supports chemical, nuclear, explosive, and biological counterterrorism. Its activities include analysis of interdicted samples; 24/7 radiological assistance; and the critical R&D needs of the intelligence community, law enforcement, homeland security, and health professionals. FSC deputy director Brian Mayer (pictured at left) recently contacted the DSI’s Consulting Service (DSICS) for help with a high-impact project that required a quick turnaround and data science expertise.
FSC scientists needed to calculate the atomic masses of different molecular compounds based on a large number of various chemical substructures, enumerating over several million possible candidates. The goal was generation of the most comprehensive chemical screening library ever created to help support the FSC’s work with the Organisation for the Prohibition of Chemical Weapons based in The Hague, Netherlands.
LLNL postdoctoral researcher and DSICS consultant Mike Boyle wrote a Python script to iterate through all the relevant combinations of chemical functional groups to determine their resulting chemical formulae and masses. “Manually calculating the mass of millions of combinations of chemical groups is a time-consuming, repetitive problem—an ideal task to automate with Python. We provided the FSC scientists with a script that does their calculations in minutes and that they can continue to use whenever this task comes up again,” he explains.
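The enumeration approach reduces to a short, reusable pattern. The sketch below is a hypothetical stand-in for the FSC script: the scaffold mass, group names, and monoisotopic masses are illustrative assumptions, but the core idea of iterating over every combination of substituent groups and summing masses matches the task described above.

```python
from itertools import product

# Hypothetical monoisotopic masses (Da) for a few example substituent groups.
GROUPS = {
    "H": 1.00783,
    "CH3": 15.02348,
    "OH": 17.00274,
    "Cl": 34.96885,
}

SCAFFOLD_MASS = 76.03130  # hypothetical core structure with open sites

def enumerate_masses(n_sites=3):
    """Yield (substituent combination, total mass) for every assignment
    of a group to each open site on the scaffold."""
    for combo in product(GROUPS, repeat=n_sites):
        mass = SCAFFOLD_MASS + sum(GROUPS[g] for g in combo)
        yield combo, round(mass, 5)

# A screening library: combination -> mass (4^3 = 64 entries here; with
# more groups and sites this scales into the millions).
library = dict(enumerate_masses())
```

The exponential growth of `itertools.product` is exactly why a few dozen functional groups across several substitution sites enumerate to millions of candidates, and why automating the calculation turns hundreds of hours of manual curation into minutes of compute.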
Mayer states, “Mike’s technical work saved the FSC time and resources that would have otherwise taken hundreds of hours of manual data curation. Mike’s enthusiasm and responsiveness made this process that much more impactful, and we hope we get the chance to work with him in the future.”
With interest in applying data science techniques to problems in the physical sciences, Boyle jumped at the chance to contribute to the DSICS. “It’s fun to apply data science skills in new, impactful domains at the Lab. I look forward to working as a DSICS consultant on new projects in the future,” he says. Boyle was recognized for his contribution to the FSC’s project with an LLNL Global Security Bronze Award.
Video: Igniting Scientific Discovery with AI and Supercomputing
LLNL’s fusion ignition breakthrough, more than 60 years in the making, was enabled by a combination of traditional fusion target design methods, high-performance computing (HPC), and AI techniques. The success of ignition marks a significant milestone in fusion energy research, and was facilitated in part by the precision simulations and rapid experimental data analysis only possible through HPC and AI-driven tools. As described in a new video, this merger of HPC and AI is already having a transformative impact on science—particularly in fusion energy—and is enhancing national security efforts and opening doors to further scientific advancements at the Department of Energy. Transitioning from ignition to a viable fusion energy ecosystem remains a formidable challenge, necessitating continued investments and a broad commitment to HPC, fusion R&D, and AI initiatives. Despite the hurdles, harnessing the power of supercomputing and AI has the potential to lead the world towards a brighter, cleaner, and more sustainable future.
Meet a Data Scientist in Nuclear Fission
Nicolas Schunck, a staff scientist with the nuclear data and theory group in LLNL’s Physical and Life Sciences Principal Directorate, researches computational nuclear theory with a particular focus on nuclear fission. Schunck develops models and simulations that predict the properties of short-lived radioactive species in nucleosynthesis mechanisms. He’s also working on a predictive theory of nuclear fission for stockpile stewardship and nuclear forensics programs. Schunck was educated at the University of Strasbourg and worked as a postdoc in the UK, Spain, and at Oak Ridge National Lab before coming to Livermore 13 years ago. A prolific communicator, he lists 66 papers as author or coauthor and edited the textbook Energy Density Functional Methods in Atomic Nuclei. He’s helping mentor five postdocs and always looks forward to working with postdocs and summer interns. “Teaching forces me to question what I know and what I do on a regular basis, which is refreshing,” says Schunck. “Constant reassessment is critical to staying at the cutting edge of research.” Schunck finds his data science work provides valuable perspectives on problems in nuclear science, stating, “By looking at the same old problem in a completely different light, we can come up with innovative solutions.”
Meet a Data Scientist in Manufacturing Processes
Computational engineer and data scientist Vic Castillo helps domestic manufacturers improve their processes—a challenge he finds rewarding. As a 30-year LLNL employee, Castillo has worn many hats, especially in support of Strategic Deterrence and Global Security programs. Castillo first came to the Lab as a graduate student through UC Davis’s Applied Science program (a.k.a. Teller Tech), having previously worked as a scientist at the Clorox Research Center. Castillo is now assisting the High-Performance Computing for Energy Innovation (HPC4EI) Program and has received 12 grants to collaborate with domestic manufacturing companies. He says that analyzing real-world data can be laborious, but HPC simulation and ML tools have vast potential to improve manufacturing processes and reduce carbon emissions. “My work feels like a small consulting business with many clients,” he states. “There are many different types of problems, and the Lab has broad capabilities to meet these challenges.” Castillo frequently speaks to manufacturing and university audiences to publicize the Lab’s capabilities and publishes journal articles with his team. He has mentored 25 students and postdocs during his career and helped to create an LLNL Teacher Research Academy for local STEM teachers.