Collaboration Drives Data Science Workshop

Lawrence Livermore National Laboratory’s (LLNL’s) Data Science Institute (DSI) hosted its second annual workshop on July 23–24, 2019. Co-sponsored by the University of California (UC) system, the event drew more than 200 participants to Garré Winery in Livermore. A common theme ran throughout both days: Collaboration is always welcome.

Indeed, feedback from last year’s workshop inspired a collaborative twist on this year’s call for abstracts. Registrants were invited to propose session topics with related presentation lineups. During the event, moderators introduced their research areas before speakers presented, and the sessions concluded with a question-and-answer period that spilled over into birds-of-a-feather discussions between sessions.

“The extended birds-of-a-feather breaks helped provide more opportunity for networking and community building,” stated DSI director Michael Goldman. “This format helped with the cohesiveness of the collected talks, highlighting particular projects and research problems.”

Collectively, session topics emphasized key challenges with machine learning (ML) and artificial intelligence (AI) in scientific research. For example, one session focused on applying ML techniques at scale to traditional scientific computing simulations, while others described data-driven breakthroughs in domains such as biomedicine and atmospheric science. Goldman noted, “We saw a very diverse range of topics, several of which many attendees had no idea Livermore Lab was involved in.”

Laboratories Meet Academia

LLNL’s Kevin McLoughlin discussed computational approaches to predicting drug-induced liver injury. Liver toxicity is an important factor in the development of cancer drugs.

The workshop represented another successful collaboration in LLNL’s long partnership with UC campuses and UC National Laboratories (UCNL), a division of the UC Office of the President. UCNL works closely with Lawrence Livermore, Lawrence Berkeley, and Los Alamos national laboratories to share expertise and technology, provide research opportunities, and develop the Labs’ workforce pipeline. Goldman said, “We are thrilled to be part of such a great community of data scientists who are passionate about driving frontier science using ML and AI.”

“We were very pleased to co-sponsor the DSI workshop. The talents and capabilities showcased were dynamite. The work being carried out at the individual Labs and campuses [is] excellent, but history has shown that when the individual efforts can work as a system, it is often better than the sum of the parts,” explained June Yu, executive director of UCNL’s National Laboratories Programs. “The workshop brought together the far-flung and siloed data scientists, with the goal of creating larger impacts in strategic fields. It is exciting to see how the workshop has evolved in just two years and the incredible interest in this workshop from the Labs and UC campuses’ communities.”

Keynote: Data and Ethics

UC Santa Cruz professor Abel Rodriguez kicked off the DSI workshop with a keynote address on emerging ethical issues in mainstream data analytics.

The workshop kicked off with a keynote address by Abel Rodriguez, professor of statistics and associate dean for graduate affairs at UC Santa Cruz. He spoke about ethical challenges wrought by the world’s increasing data-interconnectedness, including considerations of privacy, confidentiality, data collection, and informed consent.

“We are at the intersection of big data, powerful algorithms, and a lot of computing power while trying to use these tools in a societal context,” Rodriguez explained. “This confluence could help alleviate poverty, save the planet, or fight crime. But a lot of risks come with these tools, and we have to balance the benefits and harms.”

Rodriguez outlined several case studies that illustrate these challenges. For example, the popularity of facial recognition technology is due, in part, to the rise of publicly available datasets. This scenario prompts questions about meaningful informed consent. “Open science is not a synonym for ethical science,” he cautioned, noting this distinction’s relevance to the Labs. “You still want to allow for reproducibility and sharing of your information, but maintaining security of the data is also crucial.”

In another case study, Rodriguez described bias in a hiring algorithm that discriminated against women applicants, stating, “An algorithm is only as good as the data fed into it.” He provided other similar examples and emphasized the importance of training an algorithm on a truly representative dataset—one that includes enough variety. Rodriguez concluded his talk by encouraging data scientists to aim for transparency, trust, and reproducibility in their work.

Sessions Probe ML and AI

Following the keynote address, the workshop continued with four sessions on the first day and five on the second. One group of presenters covered several challenges unique to scientific ML, such as dataset complexity and scale. For example, LLNL physicist Jim Gaffney spoke about using ML to compare simulation and experimental data. “Generating meaningful uncertainties and inferring known physics parameters are difficult,” he stated. Gaffney’s team is developing a new metric that provides insight into the quality of uncertainty models. Their process involves transfer learning to discover important features in a dataset and preserve them during model calibration.

Workflows and scalable learning in predictive science made up another session. As LLNL design physicist Luc Peterson explained, existing high-performance computing systems are not designed for the massive scale of some scientific ML workloads. Next-generation workflows should be asynchronous and able to run independent tasks on different machines. “Running 100 simulations is easy. Running 100 million is not,” Peterson said. At the other end of the spectrum—when experimental datasets are expensive and, therefore, small—LLNL inertial confinement fusion researcher Kelli Humbird showed how transfer learning, in which neural network (NN) training on a specific task is applied to a related task, can enable more predictive ML models.

Left to right: LLNL researchers Frank Di Natale, Peter Robinson, Luc Peterson, Sam Jacobs, and Kelli Humbird answer audience questions after their session on workflows and scalable learning solutions for predictive science applications.

One session focused on emerging analytical approaches in predictive biology, specifically targeting small molecule perturbations. Applications of such predictive capabilities include drug development (using high-throughput simulations and ML to detect molecular behavior) and liver toxicity prediction (using manifold learning and dimensionality reduction of a dataset). As another example, Denise Wolf, a computational biologist at UC San Francisco, highlighted her team’s work in testing biomarkers for response prediction. They aim to target the hallmarks of high-risk, early-stage breast cancer and match a patient’s vulnerability with appropriate treatment.

Alongside predictive biology, AI in acute healthcare settings was another hot topic. Researchers from several UC campuses presented advanced analytics for electronic health record systems via NNs and Bayesian models. “Healthcare data contains a wealth of information,” noted Ira Hofer, assistant professor and anesthesiologist at UC Los Angeles. “All patients are not the same, so we need to account for those differences in our models.”

Two sessions featured ML insights in physical sciences: Earth, atmospheric, chemical, and material. Qingkai Kong, assistant data scientist at UC Berkeley, described NN technology that aggregates mobile seismic sensor data to record earthquake activity. LLNL staff scientist Gemma Anderson and Lawrence Berkeley project scientist Karthik Kashinath outlined deep learning approaches that improve weather and climate predictions. “Extreme weather is changing in ways that are hard to predict. The challenges include multivariate data, multiscale data, and inconsistent labeling,” said Kashinath.

Another group of presenters applied ML to nondestructive characterization methods such as x‐ray computed tomography and optical visualization. Among the latest developments were high-throughput workflows for computing molecular bond energies as well as active learning of molecular properties to determine a training set’s characteristics. A subsequent session showcased optimized ML algorithms for better control and performance in design and manufacturing processes—a major area of collaboration between the Department of Energy and industry. For example, LLNL computational engineer Victor Castillo demonstrated an NN that quickly replicates simulation results for a molten glass tank where raw materials are melted.

Researchers from LLNL, Los Alamos, and UC Merced joined forces in a session focused on energy. LLNL engineer Jonathan Donadee described efforts to improve electric vehicle performance by using reinforcement learning to inform vehicle driving and charging tasks. His team proposed a decision-support tool that addresses vehicle service demand, traffic conditions, electric grid emissions, charger utilization, and energy prices. Other presenters explored the potential of physics-informed NNs to enhance data analytics and neural vector field estimation in power grids.

Summarizing all the sessions, Brian Giera, research engineer at LLNL, said, “Scientific applications have high stakes and important problems for data scientists. The growth is great, but there’s more work than we have people to do it. Workshops like this help current and potential collaborators interact.”

Student Perspective

Two DSSI interns confer during one of the workshop’s networking breaks.

Students from LLNL’s Data Science Summer Institute (DSSI) attended the workshop as part of their internships. Many interns enjoyed learning about new data science applications. McKell Woodland, a computer science PhD candidate at Rice University, stated, “The workshop exposed me to real-world examples I hadn’t thought about before.” UC Los Angeles statistics major Kayla Schroeder agreed, “In school, we learn more about theory and methods than about how to apply our degrees outside the classroom.”

Other DSSI students found Rodriguez’s keynote presentation inspiring. “I have taken ethics training before, but it’s an area we don’t talk about enough,” noted Majerle Reeves, who is pursuing a PhD in applied mathematics at UC Merced. Joanna Held, a University of Iowa mathematics and music major, added, “I’ve never had a technical class that presented ethics that way.”

Acknowledgments

Along with Goldman, the DSI’s Data Science Council includes LLNL staff Peer-Timo Bremer, Barry Chen, Dan Faissol, Ana Kupresanin, and Michael Schneider. Event organizers were Emily Brannan, Kate Burnett, Mary Gines, Cindy Gonzales, and Mia Niklewicz from LLNL. The workshop was moderated by LLNL’s Katie Schmidt, and individual sessions were led by Jose Cadena Pico, Clara Druzgalski, Don Lucas, Harry Martz, Brian Spears, and Brian Van Essen of LLNL; Amrita Basu and Xiao Hu of UC San Francisco; and Yu-Hang Tang of Lawrence Berkeley. June Yu from UC and former LLNL scientist Kassie Fronczyk provided valuable support.

Workshop presentations are posted here.

—Article by Holly Auten/LLNL
—Photos by Ian Fabre/LLNL
—Posted September 12, 2019