Volume 18

June 10, 2022


Our mission at the Data Science Institute (DSI) is to enable excellence in data science research and applications across LLNL. Our newsletter is a compendium of breaking news, the latest research, outreach efforts, and more. Past volumes of our newsletter are available online.

three sponge-like simulated shapes resulting from data reduction, with the middle shape representing the original dataset and the left and right shapes for comparison

LLNL Wins PacificVis Best Paper Award

Three LLNL computer scientists and University of Utah colleagues have won the 2022 PacificVis Best Paper award. Harsh Bhatia, Peer-Timo Bremer, and Peter Lindstrom co-authored “AMM: Adaptive Multilinear Meshes” (see the PDF and GitHub repository). AMM provides users with a resolution-precision-adaptive representation technique that reduces mesh sizes, thereby reducing the memory and storage footprints of large scientific datasets. The approach combines two solutions into one—reducing data precision and adapting data resolution—to improve the performance and efficiency of data processing and visualization. A key novelty in AMM is that it lets users compare different data-reduction strategies to determine which would work best for a specific dataset.
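The two strategies AMM unifies can be illustrated at toy scale. The sketch below is not AMM itself, and its function names are invented: precision reduction truncates low-order mantissa bits of each value, while resolution adaptation drops samples wherever a field varies slowly.

```python
import numpy as np

def reduce_precision(data, keep_bits):
    """Truncate float32 mantissas to keep_bits bits (coarser precision)."""
    raw = data.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (raw & mask).view(np.float32)

def reduce_resolution(data, tol):
    """Drop every other sample where local curvature is below tol (coarser resolution)."""
    curvature = np.abs(np.diff(data, n=2, prepend=data[0], append=data[-1]))
    keep = (curvature > tol) | (np.arange(data.size) % 2 == 0)
    return data[keep]

x = np.linspace(0.0, 1.0, 64)
field = np.sin(8 * np.pi * x) * np.exp(-x)   # toy 1D "dataset"

low_precision = reduce_precision(field, keep_bits=8)
low_resolution = reduce_resolution(field, tol=0.05)
print("max precision error:", float(np.abs(field - low_precision).max()))
print("samples kept:", low_resolution.size, "of", field.size)
```

A tool like AMM goes much further, applying both reductions adaptively on multilinear meshes, but the trade-off is the same: each knob sacrifices a controlled amount of fidelity for a smaller footprint.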

PacificVis is among the leading international conferences in visualization research. Bhatia, the paper’s lead author, notes, “AMM is the result of a lot of hard work. Publishing our work and releasing our code as open source are ways to engage with others in this field, and it’s an honor to be acknowledged.” (Image at left: An original dataset [center] is flanked by the results of two data-reduction strategies [left and right].)

two similar versions of a dome-shaped manufactured part displayed side by side

Smarter Manufacturing with Digital Twins

LLNL scientists and engineers are constantly seeking new ways to improve manufactured parts and materials by optimizing production, predicting performance, and ensuring reliability. A new capability has emerged from a Laboratory Directed Research and Development project focused on improving manufacturing process performance and component quality: digital twins (DTs), virtual representations of physical systems that combine physics-based and data-driven models. DTs can simulate the advanced manufacturing pipeline, speeding up development time and reducing the costs and risks associated with using real equipment and materials.

A team led by research engineer Brian Giera aims to develop DTs for both process- and part-level manufacturing—in other words, replicating real manufacturing systems and physical parts, respectively. For example, a DT of a manufacturing process will use the same inputs as its physical counterpart, simulate machine motions and states, and output a DT of each part made. Giera points out that DTs can benefit any manufacturing technique: “The digital twin concept and framework are bigger than any single manufacturing method. We envision broader use among the Department of Energy community, industry, and academia.” (Images at left: The DT strategy includes leveraging data to improve manufacturing processes and parts. A digital twin [right] can be compared to its physical counterpart [left]; the resulting data can be fed back into the manufacturing pipeline.)
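In code terms, the process-twin idea can be caricatured as follows. Everything here (class names, the toolpath format) is invented for illustration and is not drawn from the LLNL project:

```python
from dataclasses import dataclass, field

@dataclass
class PartTwin:
    part_id: int
    toolpath: list                      # commanded motions replayed by the twin

@dataclass
class ProcessTwin:
    parts: list = field(default_factory=list)

    def run(self, build_inputs):
        """Consume the same inputs as the physical machine; emit one part twin each.

        A real process twin would simulate machine motions and states here,
        combining physics-based and data-driven models.
        """
        for i, toolpath in enumerate(build_inputs):
            self.parts.append(PartTwin(part_id=i, toolpath=list(toolpath)))
        return self.parts

twin = ProcessTwin()
parts = twin.run([["move", "extrude"], ["move", "cure"]])
print(len(parts), "part twins produced")
```

The point of the structure is the pairing: one twin per process, which in turn emits one twin per part, so that measurements from physical parts can be compared against their digital counterparts and fed back into the pipeline.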

U.S. map with 13 states highlighted in various gradients of orange according to the number of students in each

DSSI Welcomes the Class of 2022

The Data Science Summer Institute (DSSI) welcomed new student interns on May 23. This is the fifth year of the summer program, which is directed by Goran Konjevod and co-directed by Nisha Mulakken. The group consists of 32 students from 23 schools, including 7 of the 10 University of California (UC) campuses. (The map at left highlights the states where these schools are located, with darker orange corresponding to a higher number of interns.) They will work with mentors in LLNL’s Computing, Engineering, and Physical and Life Sciences directorates.

Among the many projects awaiting student contribution are bioinformatics modeling in genetics, deep learning for atmospheric and seismic applications, machine learning (ML) surrogates for plasma detachment control, data cleaning and visualization of a chemistry database, ML for viral protein predictions, combining nearest neighbor and reinforcement learning methods, and developing techniques to train ML models.

diagram showing acetaminophen as an orange circle connected to many other circles in different colors

Fellowship Fuels Student Research

Livermore Lab Foundation (LLF) fellowships allow students selected from a competitive applicant field to focus on their studies and work with LLNL mentors throughout their senior year of college. Randy Posada, a former DSSI intern and newly graduated UC Merced student, has capped his 2021–2022 LLF fellowship with an impressive accomplishment: publication. Posada’s paper, “Graph-Based Featurization Methods for Classifying Small Molecule Compounds,” was published in the UC Merced Undergraduate Research Journal, which is one of 80 academic journals in the UC-affiliated, open-access eScholarship publishing program. LLNL co-authors are Mary Silva, Marisa Torres, Jonathan Allen, Jeff Drocco, Sarah Sandholtz, and Adam Zemla. The work was done under the Target ID project, which is developing a computational drug target identification tool through the Laboratory Directed Research and Development (LDRD) program.

Using random forest, neural network, and logistic regression classifiers, Posada and the team proposed a method for learning features of drugs that cause drug-induced liver injury (DILI), a major reason drugs are withdrawn from the market. The new algorithm was tested on toxicology datasets and classified compounds according to whether they are associated with liver injury as a side effect or disease. (Image at left: Graphical output for acetaminophen, an active ingredient in commercial pain relievers. The visualization shows relationships among nodes representing genes, proteins, side effects, and diseases.)
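At miniature scale, that classification setup reads like the scikit-learn sketch below. The features and labels are synthetic stand-ins, not the paper's graph-based features or its toxicology data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for graph-derived compound features:
# each row is one compound, each column a numeric descriptor.
n_compounds, n_features = 400, 16
X = rng.normal(size=(n_compounds, n_features))
# Synthetic "DILI-associated" label driven by two of the descriptors.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_compounds) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

accuracies = {}
for model in (RandomForestClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    accuracies[type(model).__name__] = model.score(X_test, y_test)
print(accuracies)
```

In the real study, the interesting work is upstream of `fit`: turning a compound's graph of gene, protein, side-effect, and disease relationships into the feature vectors these classifiers consume.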

“Hepatotoxicity from drugs is a large and pertinent problem in the field of medicine,” Posada explains. “Working closely with such talented scientists on the Target ID team was exciting. I acquired many skills during this project and am glad we were able to publish this novel research.” Additionally, he expresses gratitude to the LLF for providing the fellowship opportunity, and thanks Nisha Mulakken, Jennifer Bellig, and Goran Konjevod for a seamless experience through the DSSI.

Silva, a data scientist who mentored Posada, notes, “This paper was a huge undertaking for an undergraduate in his senior year. Randy was great to work with, and I look forward to tracking his progress. He has a bright future ahead.” This summer Posada will attend a University of Washington workshop in statistical genetics, then head to the University of Arizona to pursue a Doctor of Veterinary Medicine degree.

supercomputer rack on a dramatic black and gray background

Next-Generation High-Performance Networking Collaboration

The National Nuclear Security Administration (NNSA) has awarded an $18 million contract to Cornelis Networks for collaborative research and development in next-generation networking for supercomputing systems at the NNSA laboratories. The Next-Generation High Performance Computing Network project for the NNSA’s Advanced Simulation and Computing (ASC) program will enable NNSA to co-design and partner with Cornelis on the development and productization of next-generation interconnect technologies for high-performance computing (HPC). The project is led by LLNL for the NNSA Tri-Labs (LLNL, Los Alamos, and Sandia). The resulting networking technologies are expected to support mission-critical applications for HPC, artificial intelligence (AI) and ML, and high-performance data analytics. Applications could include stockpile stewardship, fusion research, advanced manufacturing, climate research, and other open science on future ASC HPC systems.

cartoon of four people working at computers next to a large screen that depicts data presentations such as charts and graphs

What’s New with the Open Data Initiative?

The DSI’s Open Data Initiative (ODI) provides unique datasets, representing LLNL mission areas, to the data science community. Since its inception in 2019, the ODI has grown steadily under the leadership of research scientist Rushil Anirudh. The ODI currently shares 12 datasets from a wide variety of scientific domains such as medical diagnostics, astrophysics, additive manufacturing, atmospheric science, and inertial confinement fusion. Thanks to a partnership launched in 2020, the UC San Diego library catalog includes 8 of the datasets.

Researchers can use these datasets to explore applications of algorithms, models, techniques, and methodologies, as well as to test hardware configurations for ML platforms. In addition to expanding the dataset collection, Anirudh states, “We have begun developing ‘zero configuration’ Google Colab notebooks. We hope these will make it easy for students, researchers, and practitioners to quickly run code and models on our data.”

Additionally, the National AI Initiative (NAII) website lists the ODI as a key data resource for AI research and development. The NAII was established in response to the National Artificial Intelligence Initiative Act of 2020. One of its strategic goals is to increase access to data resources by experts and researchers who advance AI, thus strengthening the nation’s competitiveness in this field.

LLNL researchers are welcome to contribute datasets that are (1) associated with a well-defined data science problem; (2) less than 500 gigabytes; and (3) cleaned up (e.g., have labels or responses). Contact datascience@llnl.gov (subject: ODI) to contribute your dataset.

Ming’s portrait next to the seminar series icon

Virtual Seminar Discusses Best Practices for Apache Spark

The DSI’s May 16 seminar featured LLNL computer scientist Dr. Ming Jiang, who presented “Exploiting Spark for HPC Simulation Data: Taming the Ephemeral Data Explosion.” He discussed challenges with using Big Data frameworks, such as Apache Spark, on HPC systems at LLNL. His team developed a set of best practices for using Apache Spark to perform ML and analytics on massive HPC simulation data. Their investigation focused on the real-world application of scaling ML algorithms to predict and analyze failures in multiphysics simulations, and it culminated with the demonstration of training a random forest regression model on 76 terabytes of data with over one trillion training examples.
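Spark itself is too heavyweight for a snippet, but the core modeling step (fitting a random forest regressor on tabular features derived from simulation state) can be miniaturized with scikit-learn in place of Spark MLlib. The data below is invented, and at LLNL scale the same step runs distributed across an HPC cluster:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Invented stand-in for per-zone simulation features (e.g., timestep, zone state)
# and a target such as a failure-proximity score.
X = rng.uniform(size=(2000, 5))
y = X[:, 0] * 3.0 + np.sin(4 * X[:, 1]) + rng.normal(scale=0.1, size=2000)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
r2 = model.score(X, y)
print("training R^2:", round(r2, 3))
```

The hard part the talk addressed is not this fit but everything around it: moving ephemeral simulation output into a Big Data framework efficiently enough that training on terabytes of examples becomes practical.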

Dr. Jiang’s current research focuses on applying ML to automate simulation workflows and exploiting Big Data analytics for HPC simulation data. He joined LLNL’s Center for Applied Scientific Computing in 2005 after receiving his PhD in Computer Science and Engineering from The Ohio State University. He has been a principal investigator and project/task lead on several large-scale collaborative research projects, including ML for arbitrary Lagrangian-Eulerian simulations, data-centric architectures, and real-time space situational awareness. A recording of the talk will be posted to the seminar series’ YouTube playlist.

WPD and WSC logos

WCI Hosts Machine Learning Workshop

The Weapon Physics and Design (WPD) and Weapon Simulation and Computing (WSC) Programs, under LLNL’s Weapons and Complex Integration (WCI) Directorate, jointly held a Machine Learning and Cognitive Simulation Workshop on March 15–16. These programs are key components in the Lab’s mission to ensure the safety, security, and effectiveness of the nation’s nuclear stockpile.

The workshop consisted of unclassified and classified presentations, covered the Lab’s ML strategy as well as LDRD and programmatic initiatives, and featured updates on ongoing projects. Keynote speakers discussed the WCI Virtual Design Assistant (ViDA) and several LDRD Strategic Initiatives including DarkStar, which explores ML-aided design. Additionally, the workshop included discussions on strengthening WCI’s ML community and ensuring that ML research and development are advancing programmatic deliverables.

Kevin’s portrait next to the highlights icon

Meet an LLNL Data Scientist

Kevin McLoughlin has always been fascinated by the intersection of computing and biology. As a graduate student in the 1980s, he worked on early neural network simulations to understand the human brain. He recalls, “Computational biology as a field didn’t exist then, but that changed when the Human Genome Project launched.” After a stint with a biotech startup, McLoughlin joined the Lab in 2004 to work on pathogen bioinformatics. Since 2017, he has participated in the Accelerating Therapeutics for Opportunities in Medicine (ATOM) consortium, which combines HPC and data science techniques to design drugs for cancer, pathogens, and other diseases. “ATOM is enormously important, draws on my full skill set, and demands that I constantly learn new things. I work with extremely smart people from LLNL and our partners,” he says. McLoughlin helped develop a COVID-19 antiviral drug design pipeline that combines a computational autoencoder framework with ML algorithms to propose molecular structures, identify those with desirable properties, and suggest new molecules based on the best results—research that earned one of the Lab’s 2021 Excellence in Publication Awards. He holds a PhD in Biostatistics from UC Berkeley.
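The propose-score-refine loop described above can be caricatured in a few lines. Here the autoencoder decoder and property scorer are dummy stand-ins, and the whole sketch is illustrative rather than the ATOM pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

def decode(z):
    """Stand-in for a trained molecular autoencoder's decoder."""
    return z

def score(molecule):
    """Stand-in for learned property predictors (higher is more desirable)."""
    return -float(np.sum((molecule - 0.5) ** 2))

population = rng.normal(size=(32, 8))               # latent-space candidates
init_best = max(score(decode(z)) for z in population)

for _ in range(20):
    ranked = sorted(population, key=lambda z: score(decode(z)), reverse=True)
    elites = np.array(ranked[:8])                   # keep the best candidates
    offspring = np.repeat(elites, 3, axis=0)        # perturb them in latent space
    offspring = offspring + rng.normal(scale=0.1, size=offspring.shape)
    population = np.vstack([elites, offspring])

final_best = max(score(decode(z)) for z in population)
print("best score improved from", round(init_best, 2), "to", round(final_best, 2))
```

Because the best candidates are always carried forward, each round's top score never regresses; in a real pipeline, the decoder turns latent vectors into molecular structures and the scorers are ML models trained on chemical and biological data.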