Our seminar series features talks from innovators from academia, industry, and national labs. These talks provide a forum for thought leaders to share their work, discuss trends, and stimulate collaboration. These monthly seminars are held onsite and virtually. Recordings are posted to a YouTube playlist.

### Join Us at WiDS Livermore on March 13

Instead of hosting a standalone March DSI seminar, we invite you to attend our regional Women in Data Science (WiDS) conference and hear from the lineup of keynote speakers, technical talks, and career-focused panel discussions. See the WiDS page for registration links, speaker information, and other details.

### GeoAI: Past, Present, and Future

This talk will focus on GeoAI, which is the application of artificial intelligence (AI) to geographic data. First, I will briefly describe some of my work in this area over the last 25 years, which has been driven largely by two themes. One theme is that spatial data is special in that space (and time) provides a rich context in which to analyze it. The challenge is how to incorporate spatial context into AI methods when adapting or developing them for geographic data, that is, how to make them spatially explicit. A second theme is that location is a powerful key (in the database sense) that allows us to associate large amounts of different kinds of data. This can be especially useful, for example, for generating large collections of weakly labelled data when training machine learning models. In the second part of my talk, I’ll discuss near-term opportunities in GeoAI related to foundation models, particularly for multi-modal data. Finally, I’ll point out some anticipated challenges in GeoAI as generative models like OpenAI’s generative pre-trained transformer (GPT) become pervasive.

Dr. Shawn Newsam is a Professor of Computer Science and Engineering and Founding Faculty at the University of California, Merced. He has degrees from UC Berkeley, UC Davis, and UC Santa Barbara, and did a postdoc in the Sapphire Scientific Data Mining group in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory from 2003 to 2005. (So, UC Merced is his fifth UC institution!) Dr. Newsam is the recipient of a U.S. Department of Energy Early Career Scientist and Engineer Award, a U.S. National Science Foundation Faculty Early Career Development (CAREER) Award, and a U.S. Office of Science and Technology Policy Presidential Early Career Award for Scientists and Engineers (PECASE). He has held leadership positions in SIGSPATIAL, the ACM special interest group on the acquisition, management, and processing of spatially-related information, including serving as the general and program chair of its flagship conference and as the chair of the SIG. His research interests include computer vision and machine learning, particularly as applied to geographic data.

### Using AI to Expand What Is Possible in Cardiovascular Medicine

Machine learning and artificial intelligence (ML/AI) methods have shown great promise across various industries, including in medicine. Medicine has unique characteristics, however, that can make medical data more complex and in some respects harder to analyze compared to data outside of medicine. These issues include the complicated clinical workflow and the many human stakeholders and decision makers that all contribute at various time-points to any given patient’s medical data record. In this talk, Dr. Tison will discuss the application of ML/AI approaches in medicine, focusing on his prior work spanning several cardiovascular diagnostic modalities including electrocardiograms, echocardiograms, photoplethysmography, and angiography.

Dr. Geoffrey H. Tison, MD, MPH, is an Associate Professor of Medicine and Cardiology, and faculty in the Bakar Computational Health Sciences Institute at the University of California, San Francisco (UCSF). He is a practicing cardiologist who also leads a computational research lab at UCSF (tison.ucsf.edu) focused on improving cardiovascular disease prediction and prevention by applying artificial intelligence and epidemiologic and statistical methods to large-scale medical data. He received the DP2 New Innovator Award from the National Institutes of Health Office of the Director, and his work has been supported by the National Institutes of Health and the Patient-Centered Outcomes Research Institute, among others.

### Operational Cybersecurity of Distributed Energy Resources using Optimization, Control Theory, and Machine Learning

The adoption of Distributed Energy Resources (DER), such as rooftop solar systems, behind-the-meter batteries, and electric vehicles, presents many challenges for system operators who are tasked with maintaining the safety and efficiency of the power grid. IoT connectivity of these devices, coupled with emerging control paradigms being put forth in DER standards, makes it possible for these devices to be remotely accessed and utilized to disrupt the operation of the power system. In this talk, I will highlight past, recent, and new research being led by Lawrence Berkeley National Laboratory and funded by the U.S. Department of Energy looking at this issue. Specifically, I will present work showing how techniques from control theory, optimization, and machine learning can be used to detect and mitigate certain kinds of cyber attacks on DER control systems. I will close with an overview of cybersecurity-related challenges in power systems and some thoughts on how data science techniques can be used to address those challenges.

Dr. Daniel Arnold is a Research Scientist at Lawrence Berkeley National Laboratory and an Adjunct Professor of Civil and Environmental Engineering at UC Berkeley. He graduated from UC Berkeley with a PhD in Mechanical Engineering in 2015 and was an ITRI-Rosenfeld Postdoctoral Fellow at Lawrence Berkeley National Laboratory from 2016 to 2017. His interests are in the fields of control theory, optimization, and machine learning. His recent work focuses on the use of these techniques for cybersecurity of the electric power system and other critical infrastructure.

### Higher-Order Multi-Variate Statistics for Scientific Data Analysis

Scientific phenomena are often associated with multi-variate non-Gaussian statistical processes. However, analysis techniques applied to scientific data have rarely ventured beyond correlations and covariance, despite important information being present in higher-order statistics. While they have been applied for financial modelling, higher-order multi-variate statistics (e.g., coskewness, cokurtosis) have not been widely used for analysis of scientific data. In this talk I will present a motivation for use of higher-order multi-variate statistics (e.g., joint moments, cumulants) for scientific data analyses, both observational and computational. I will present two topics where we had some recent success with the use of cokurtosis: rare (anomalous) event detection, and dimensionality reduction for stiff dynamical systems. This will include a brief discussion on connections to Independent Component Analysis, which seeks a transformation to a latent space that is statistically independent (as opposed to merely uncorrelated), as well as connections to symmetric tensor decomposition. Applying the cokurtosis-based techniques in situ with large simulations requires low overhead, scalable, and parallelizable algorithms for computing and factorizing the cokurtosis tensor. I will present algorithm development motivated by this imperative, and the software development that achieved a target performance milestone under the Exascale Computing Project (ECP).

Hemanth Kolla is a Principal Member of Technical Staff in the Scalable Modelling & Analysis department at Sandia National Laboratories. His interests lie at the intersection of high-performance scientific computing and statistical learning. He is currently working on projects involving tensor decompositions for various analyses, efficient forward propagation of parametric uncertainty in computational mechanics, and algorithm-based fault tolerance for HPC. In the past he has worked on modelling and direct numerical simulation of turbulent combustion, in situ scientific data compression, and asynchronous many-task programming models for HPC. He obtained a Bachelors in Aerospace (2003) from Indian Institute of Technology (IIT) Madras, a Masters in Aerospace (2005) from Indian Institute of Science (IISc) Bangalore, and a PhD in Engineering from The University of Cambridge (2010).

### Quantifying the Benefits of Immersion in Virtual Reality

Virtual reality (VR) technology has become mainstream, affordable, and powerful in recent years, but there is still skepticism about the usefulness of VR for serious applications. Although VR provides a compelling and unique experience, is there anything beyond this “wow factor,” or is it simply a flashy demo? How can VR be used effectively for real-world applications beyond gaming and entertainment? In this talk, I will review decades of research on the benefits of immersion in VR. Starting with an objective definition of immersion, we will discuss hypothesized benefits, and then numerous examples of empirical studies that provide quantitative evidence for these hypotheses. Finally, case studies of successful real-world VR applications will demonstrate how these results can be applied in areas such as scientific visual data analysis.

Doug A. Bowman is the Frank J. Maher Professor of Computer Science and Director of the Center for Human-Computer Interaction at Virginia Tech. He is the principal investigator of the 3D Interaction Group, focusing on the topics of 3D user interfaces, VR/AR user experience, and the benefits of immersion in virtual environments. Dr. Bowman is one of the co-authors of 3D User Interfaces: Theory and Practice. He has served in many roles for the IEEE Virtual Reality Conference, including program chair, general chair, and steering committee chair. He also co-founded the IEEE Symposium on 3D User Interfaces (now part of IEEE VR). He received a CAREER award from the National Science Foundation for his work on 3D Interaction and has been named an ACM Distinguished Scientist. He received the Technical Achievement award from the IEEE Visualization and Graphics Technical Committee in 2014, and the Career Impact Award from IEEE ISMAR in 2021. His undergraduate degree in mathematics and computer science is from Emory University, and he received his M.S. and Ph.D. in computer science from Georgia Tech.

### Tensor Factorization for Biomedical Representation Learning

Biomedical datasets are often noisy, irregularly sampled, sparse, and high-dimensional. One key question is how to produce appropriate representations that are amenable to a variety of downstream tasks. Tensors, generalizations of matrices to multiway data, are natural structures for capturing higher-order interactions. Factorization of these tensors can provide a powerful, data-driven framework for learning representations useful across a variety of downstream prediction tasks. In this talk, I will introduce how tensors can succinctly capture patient representations from electronic health records to deal with missing and time-varying measurements while providing better predictive power than deep learning models. I will also discuss how tensor factorization can be used for learning node embeddings for both dynamic and heterogeneous graphs, and illustrate their use for automating systematic reviews.

Joyce Ho is an Associate Professor in the Computer Science Department at Emory University. She received her PhD in Electrical and Computer Engineering from the University of Texas at Austin, and an MA and BS in Electrical Engineering and Computer Science from Massachusetts Institute of Technology. Her research focuses on the development of novel machine learning algorithms to address problems in healthcare such as identifying patient subgroups or phenotypes, integration of new streams of data, fusing different modalities of data (e.g., structured medical codes and unstructured text), and dealing with conflicting expert annotations. Her work has been supported by the National Science Foundation (including a CAREER award), National Institutes of Health, Robert Wood Johnson Foundation, and Johnson and Johnson.

### Photorealistic Reconstruction from First Principles

In computational imaging, inverse problems describe the general process of turning measurements into images using algorithms: images from sound waves in sonar, spin orientations in magnetic resonance imaging, or X-ray absorption in computed tomography. Today, the two dominant algorithmic approaches for solving inverse problems are compressed sensing and deep learning. Compressed sensing leverages convex optimization and comes with strong theoretical guarantees of correct reconstruction, but requires linear measurements and substantial processor memory, both of which limit its applicability to many imaging modalities. In contrast, deep learning methods leverage nonconvex optimization and neural networks, allowing them to use nonlinear measurements, data-driven priors, and limited memory. However, they can be unreliable, and it is difficult to inspect, analyze, and predict when they will produce correct reconstructions. In this talk, I focus on an inverse problem central to computer vision and graphics: given calibrated photographs of a scene, recover the optical density and view-dependent color of every point in the scene. For this problem, we take steps to bridge the best aspects of compressed sensing and deep learning: (i) combining an explicit, non-neural scene representation with optimization through a nonlinear forward model, (ii) reducing memory requirements through a compressed representation that retains aspects of interpretability, and extends to dynamic scenes, and (iii) presenting a preliminary convergence analysis that suggests faithful reconstruction under our modeling.

Sara Fridovich-Keil is a postdoctoral scholar at Stanford University, where she works on foundations and applications of machine learning and signal processing in computational imaging. She is currently supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship. Sara received her PhD in electrical engineering and computer sciences in 2023 from UC Berkeley and BSE in electrical engineering from Princeton University in 2018. During her time at UC Berkeley, Sara worked as a student researcher at Google Brain and collaborated with researchers at LLNL, the University of Southern California, and UC San Diego.

### Leveraging Latent Representations for Predictive Physics-Based Modeling and Uncertainty Quantification

Nonlinear partial differential equations (PDEs) provide a quantitative description for a vast array of phenomena in physics involving reaction, diffusion, convection, shocks, equilibrium, and more. Commonly, physical and engineering systems are associated with stochastic and epistemic uncertainties, which can be characterized, quantified, and propagated through models by utilizing tools from uncertainty quantification (UQ). UQ becomes prohibitively expensive when considering complex PDEs, and to tackle this limitation, surrogate models have been developed for approximating expensive numerical solvers while maintaining solution accuracy. Yet, the performance of surrogates in terms of predictive accuracy, robustness, and generalizability deteriorates in cases of high-fidelity simulations, highly nonlinear PDE mappings, and high-dimensional uncertainty sources. This presentation showcases a set of approaches, based on dimension reduction principles, that leverage latent representations of high-dimensional data to improve the performance of surrogate models and enable UQ for complex PDE applications. The first part of the talk focuses on inverse problems and the development of a manifold-based approach for the probabilistic parameterization of nonlinear PDEs based on atomistic simulation data. The proposed approach is applied for modeling plastic deformation in a bulk metallic glass (amorphous solid) system based on available observations from molecular dynamics simulations. The second part of the talk focuses on the Latent Deep Operator Network (L-DeepONet) for training neural operators on latent spaces, which significantly improves predictive accuracy for time-dependent PDEs of varying degrees of complexity. The final component of this talk focuses on transfer learning (TL) for conditional shift in PDE regression using DeepONet. We propose a TL framework based on Hilbert space embeddings of conditional distributions and construct task-specific models by leveraging domain-invariant features and fine-tuning pre-trained neural operators. Our approach provides a powerful tool in complex physics and engineering applications as it enables generalizability and mitigates the need for big data and large-scale computational resources.

Katiana Kontolati is a data scientist at Bayer R&D with a focus on machine learning and genome modeling for designing high-performing crops. She received her PhD from the Department of Civil and Systems Engineering at Johns Hopkins University in 2023. Her doctoral research revolved around physics-informed machine learning with a focus on high-dimensional surrogate modeling and uncertainty quantification in physics-based and engineering problems involving nonlinear partial differential equations under uncertainty. In parallel to her research activities, Kontolati is contributing to the development of the open-source Python software UQpy for modeling uncertainty in physical and mathematical systems. Her work has been published in top journals, including Acta Materialia and Nature Machine Intelligence, and she has received multiple awards and recognitions, including the Joseph Meyerhof Fellowship from Johns Hopkins, the Applied Machine Learning Research Fellowship from Los Alamos National Lab, and the Gerondelis Foundation Graduate Scholarship. She was also recently selected as a Rising Star in Computational and Data Sciences. A native of Athens, Greece, Kontolati received a BSc in Structural Engineering from the University of Thessaly and an MSc in Applied Mechanics from the National Technical University of Athens.

### Adversarial Machine Learning: Categories, Concepts, and Current Landscape

Machine learning depends critically on data: on the data that trains a machine learning model, and on the data that exercises it. Tight dependence on the data means that machine learning can be subverted by an adversary who does nothing more than manipulate some of that data. That is, most adversarial computer attacks are attacks on an *implementation*, and depend on corruption of the hardware, software, or network that runs some program. Machine learning, on the other hand, has *algorithmic* vulnerabilities, and can be subverted even when its hardware, software, and network environment is pristine. In some cases, these vulnerabilities can be triggered by simply querying the model in a fashion nearly indistinguishable from normal, non-adversarial use. This talk will provide an overview of the three main categories of these vulnerabilities, speaking to how an adversary might: *subvert* the original training data to manipulate the resulting model, change the test data in order to *evade* the correct outcome from the model, or cause the model to *reveal* details of its training data or its structure that it did not intend to reveal. The intent is to define and illustrate these attacks in just enough detail to usefully alarm anyone who might be building or using machine learning models. A secondary goal is to motivate thinking carefully about who your adversary might be. That is, what distinguishes counter adversarial machine learning from other aspects of machine learning (e.g., reliability, accuracy, or quantification of its uncertainties) is indeed the presence of an *adversary*. If you wish to do or use adversarial machine learning research, it is important to build a model of the adversary you are considering: their goals, capabilities, success measures, costs, observables, and so on. Much academic work in “adversarial machine learning” has greatly limited its utility due to the lack of a reasonable adversarial model. Still, a great deal of academic work is being published in adversarial machine learning, much of it entertainingly or worrisomely creative. So, the tail end of this talk will be a brief survey of recent work, focusing on edge cases that don’t smoothly fit into the subvert/evade/reveal categorization.

Philip Kegelmeyer (E.E. PhD, Stanford) is a Senior Scientist at SNL Livermore. His current interests are machine learning and graph algorithms, especially as applied to ugly, obdurate, real-world data which is actively resistant to analysis. Since 2013 Dr. Kegelmeyer has been leading research efforts in “Counter Adversarial Data Analytics,” starting with adversarial machine learning. The core idea is to take a vulnerability assessment approach to quantitatively assessing, and perhaps countering, the result of an adversary knowing and adapting to exactly the specific data analysis method in use. Dr. Kegelmeyer has 30 years’ experience inventing, tinkering with, quantitatively improving, and now, subverting supervised machine learning algorithms (particularly ensemble methods), including investigations into how to accurately and statistically significantly compare such algorithms. His work has resulted in over 80 refereed publications, 2 patents, and commercial software licenses.

### Using Data Science to Advance the Impact of Vascular Digital Twins in Medicine

The recognition of the role hemodynamic forces have in the localization and development of disease has motivated large-scale efforts to enable patient-specific simulations. When combined with computational approaches that can extend the models to include physiologically accurate hematocrit levels in large regions of the circulatory system, these image-based models yield insight into the underlying mechanisms driving disease progression and inform surgical planning or the design of next-generation drug delivery systems. Building a detailed, realistic model of human blood flow, however, is a formidable mathematical and computational challenge. The models must incorporate the motion of fluid, intricate geometry of the blood vessels, continual pulse-driven changes in flow and pressure, and the behavior of suspended bodies such as red blood cells. Combining physics-based modeling with data science approaches is critical to addressing open questions in personalized medicine. In this talk, I will discuss how we’re building and using high-resolution digital twins of patients’ vascular anatomy to inform the treatment of a range of human diseases. I will present the data challenges we run into and identify key areas where data science can play a role in advancing the work.

Dr. Amanda Randles is the Alfred Winborne Mordecai and Victoria Stover Mordecai Assistant Professor of Biomedical Sciences and Biomedical Engineering at Duke University. Focusing on the intersection of HPC, ML, and personalized modeling, her group is developing new methods to aid in the diagnosis and treatment of diseases ranging from cardiovascular disease to cancer. Among other recognitions, she has received the NIH Pioneer Award, the NSF CAREER Award, and the ACM Grace Hopper Award. She was named to the World Economic Forum Young Scientist List and the MIT Technology Review World’s Top 35 Innovators under the Age of 35 list and is a Fellow of the National Academy of Inventors. Randles received her PhD in Applied Physics from Harvard University as a DOE Computational Graduate Fellow and NSF Fellow.

### Calling the Shot: How AI Predicted Fusion Ignition Before It Happened

At 1:03am on December 5, 2022, 192 laser beams at the National Ignition Facility focused 2.05 megajoules of energy onto a peppercorn-sized capsule of frozen hydrogen fuel. In less time than it takes light to travel 10 feet, the laser crushed the capsule to smaller than the width of a human hair, vaulting the fuel to temperatures and densities exceeding those found in the sun. Under these extreme conditions, the fuel ignited and produced 3.15 megajoules of energy, making it the first experiment to ever achieve net energy gain from nuclear fusion. Nuclear fusion is the universe’s ultimate power source. It drives our sun and all the stars in the night sky. Harnessing it would mean a future of limitless carbon-free, safe, clean energy. After several decades of research, fusion breakeven at NIF brings humanity one step closer to that dream. Yet, the shot that finally ushered in the Fusion Age was not actually that surprising. A few hours before the experiment, our physics team used an artificial intelligence model to predict the outcome of the experiment. Our model, which blends supercomputer simulations with experimental data, indicated that ignition was the most likely outcome for this shot. As such, hopes were high that something big was about to occur. In this talk, we discuss the breakthrough experiment, nuclear fusion, and how we used machine learning to call the shot heard around the world.

Dr. Kelli Humbird’s work focuses on machine learning (ML) discovery and design for inertial confinement fusion and integrated hohlraum design. During her time at the Lab, she has worked in stockpile certification, technical nuclear forensics, ML accelerators for multiphysics codes, and ML analysis for the spread of COVID-19 during the first year of the pandemic. The common thread throughout much of her work is the application of ML to scientific problems with sparse data.

Dr. Jayson “Luc” Peterson is the Associate Program Leader for Data Science within LLNL’s Space Science and Security Program, where he is responsible for the leadership and development of a broad portfolio of projects at the intersection of data science and outer space. He also leads the ICECap and Driving Design with Cognitive Simulation projects, which aim to bring ML-enhanced digital design to exascale supercomputers.

### Physics-Based Machine Learning with Differentiable Solvers

Modeling the differential equations governing physical systems is a central aspect of many science and engineering tasks. These governing equations are derived from first principles and correspond to the laws of physics. Recently, neural network methods have shown some promise in solving differential equations by adding the underlying governing equations as a soft constraint to the loss function. However, such approaches only approximately enforce the constraints on a physical system. I will first discuss the challenges associated with such an approach. I will then discuss how we overcome these challenges by developing a neural network architecture that incorporates differential equation constrained optimization and outputs solutions that satisfy the desired physical constraints exactly over a given spatial and/or temporal domain. I will show that this architecture allows us to accurately and efficiently fit solutions to new problems, and demonstrate this on fluid flow and transport phenomena problems.

Dr. Aditi Krishnapriyan is an Assistant Professor at UC Berkeley, where she is a member of Berkeley AI Research (BAIR), the AI+Science group in Electrical Engineering and Computer Sciences (EECS), and the theory group in Chemical Engineering. Her research interests include physics-inspired machine learning methods; geometric deep learning; inverse problems; and development of machine learning methods informed by physical sciences applications including molecular dynamics, fluid mechanics, and climate science. A former DOE Computational Science Graduate Fellow, Dr. Krishnapriyan holds a PhD from Stanford University and in 2020–2022 was the Luis W. Alvarez Fellow in Computing Sciences at Lawrence Berkeley National Laboratory.

### Networks that Adapt to Intrinsic Dimensionality Beyond the Domain

A central question in deep learning is the minimum size of the network needed to approximate a certain class of functions, and how the dimensionality of the data affects the number of points needed to learn such a network. In this talk, we’ll focus on how ReLU (rectified linear unit) networks can approximate a particular but general class of functions, f(x)=g(φ(x)), where φ is a dimensionality-reducing feature map. We’ll focus on two intuitive and practically relevant choices for φ: the projection onto a low-dimensional embedded submanifold and a distance to a collection of low-dimensional sets. Since φ encapsulates a set of features that are invariant to the function f, we’ll show that deep nets are faithful to an intrinsic dimension governed by f rather than the complexity of the domain/data on which f is defined. In particular, the prevalent model of approximating functions on low-dimensional manifolds can be relaxed to include significant off-manifold noise by using functions of this type, with φ representing an orthogonal projection onto the same manifold. We’ll also discuss connections of this work to two-sample testing, manifold autoencoders, and data generation.

Alex Cloninger is an Associate Professor in the Department of Mathematical Sciences and the Halıcıoğlu Data Science Institute at the University of California, San Diego. He received his PhD in Applied Mathematics and Scientific Computation from the University of Maryland in 2014, and was then a National Science Foundation Postdoc and Gibbs Assistant Professor of Mathematics at Yale University until 2017, when he joined UCSD. Dr. Cloninger researches problems in the area of geometric data analysis and applied harmonic analysis. He focuses on approaches that model the data as being locally lower dimensional, including data concentrated near manifolds or subspaces. These types of problems arise in a number of scientific disciplines, including imaging, medicine, and artificial intelligence, and the techniques developed relate to many machine learning and statistical algorithms, including deep learning, network analysis, and measuring distances between probability distributions.

### Leakage and the Reproducibility Crisis in ML-Based Science

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this talk, I will present results from our investigation of reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as logistic regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don’t perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.

Sayash Kapoor is a PhD candidate at Princeton University's Center for Information Technology Policy. His research critically investigates ML methods and their use in science and has been featured in WIRED and Nature among other media outlets. He has received a best paper award from ACM FAccT. His work has been published in conferences and journals such as ACM FAccT, CSCW, AIES, IJCAI, Machine Learning, and AI Communications. At Princeton University, he organized a workshop titled *The Reproducibility Crisis in ML-Based Science*, which saw more than 1,700 registrations. He has worked on ML at several institutions in industry and academia, including Facebook, Columbia University, and EPFL Switzerland.

### Simultaneous Feature Selection and Outlier Detection Using Mixed-Integer Programming with Optimality Guarantees

Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, i.e., a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between early infant weight gain and the human microbiome.

Ana is a postdoctoral researcher in the Department of Statistics at UC Berkeley working primarily at the interface of computational statistics/machine learning and optimization applied to biomedical sciences. In Fall 2021, she earned her PhD in Statistics and Operations Research at Penn State where she was a Sloan Scholar and Biomedical Big Data to Knowledge Fellow.

### Just Machine Learning

In this talk, I will address some concerns about the use of machine learning in situations where the stakes are high (such as criminal justice, law enforcement, employment decisions, credit scoring, health care, public eligibility assessment, and school assignments). First, I will discuss the popular task of risk assessment and impossibility results for group fairness, where one cannot simultaneously satisfy desirable probabilistic measures of fairness. Second, I will present how machine learning can be used to generate aspirational data (i.e., data that are free of biases present in real data). Such data are useful for recognizing sources of unfairness in machine learning models besides biased data. Third, I will describe how information access equality in complex networks is an interplay between the network structure and the spreading process, leading to a tradeoff between equality and efficiency in certain circumstances. If time permits, I will discuss the steps needed to measure our algorithmically infused societies and present our findings from a 2022 qualitative study examining responsibility and deliberation in AI impact statements and ethics reviews.

Tina Eliassi-Rad is a Professor of Computer Science at Northeastern University. She is also a core faculty member at Northeastern's Network Science Institute and the Institute for Experiential AI. In addition, she is an external faculty member at the Santa Fe Institute and the Vermont Complex Systems Center. Prior to joining Northeastern, Tina was an Associate Professor of Computer Science at Rutgers University; and before that she was a Member of Technical Staff and Principal Investigator at Lawrence Livermore National Laboratory. Tina earned her Ph.D. in Computer Sciences (with a minor in Mathematical Statistics) at the University of Wisconsin-Madison. Her research is at the intersection of data mining, machine learning, and network science. She has over 100 peer-reviewed publications (including a few best paper and best paper runner-up awards); and has given over 250 invited talks and 14 tutorials. Tina's work has been applied to personalized search on the World-Wide Web, statistical indices of large-scale scientific simulation data, fraud detection, mobile ad targeting, cyber situational awareness, drug discovery, democracy and online discourse, and ethics in machine learning. Her algorithms have been incorporated into systems used by governments and industry (e.g., IBM System G Graph Analytics), as well as open-source software (e.g., Stanford Network Analysis Project). In 2017, Tina served as the program co-chair for the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (a.k.a. KDD, which is the premier conference on data mining) and as the program co-chair for the International Conference on Network Science (a.k.a. NetSci, which is the premier conference on network science). In 2020, she served as the program co-chair for the International Conference on Computational Social Science (a.k.a. IC2S2, which is the premier conference on computational social science). Tina received an Outstanding Mentor Award from the U.S. Department of Energy's Office of Science in 2010, became an ISI Foundation Fellow in 2019, was named one of the 100 Brilliant Women in AI Ethics in 2021, and received Northeastern University's Excellence in Research and Creative Activity Award in 2022.

### Interpretability in Deep Learning Models for Atomic-Scale Simulations

In the field of computational materials science, model interpretability has historically been enforced through the use of strictly defined functional forms based on assumptions regarding the underlying physics governing the distribution of data. However, with the adoption of deep learning models these assumptions have been largely discarded in exchange for greater flexibility and higher training accuracy. In this talk I will discuss our research exploring the necessary complexity of models for describing atomic interactions (“interatomic potentials”), as well as techniques for leveraging generative models for designing more interpretable and transferable potentials. Finally, I will end with a description of our efforts to build a massive, open-source database of interatomic potential training data, and discuss how it can be used in combination with our techniques to begin making fundamental physical insights using a data-driven approach.

Joshua Vita is a Ph.D. student in materials science and engineering at the University of Illinois Urbana-Champaign advised by Professor Dallas Trinkle. Josh’s research focuses on the development of efficient, interpretable models for describing atomic interactions, and the design of algorithms and databases for fitting those models. Josh graduated with bachelor's degrees in materials science and mathematics from the University of Arizona in 2017, interned at Sandia National Laboratories developing image analysis software (2016), worked with the OpenKIM/ColabFit team designing a framework for archiving materials data (2021), and developed multiple software packages for fitting interatomic potentials during his graduate research. He is currently a member of the “DIGI-MAT” program, an NSF-funded graduate fellowship for integrating machine learning and materials science.

### Diagrammatic Differential Equations in Physics Modeling and Simulation

I’ll discuss some results from a recent paper on applying categories of diagrams for specifying multiphysics models for PDE-based simulations. We developed a graphical formalism inspired by the graphical approach to physics pioneered by the late Enzo Tonti. We will discuss the graphical formalism based on category theoretic diagrams and some applications to heat transfer, electromagnetism, and fluid dynamics. This formalism supports automatic construction of physics simulations based on the Discrete Exterior Calculus (DEC) and I will show some results with the DECAPODES.jl software system.

James Fairbanks is an Assistant Professor in Computer and Information Science and Engineering at the University of Florida. He studies mathematical modeling and scientific computing through the lens of abstract algebra and combinatorics, and leads the AlgebraicJulia.org project. He has won both the DARPA Young Faculty and Director’s awards supporting his work on applied category theory and scientific computing. In 2014 he was a CASC summer student under the supervision of Van Henson and Geoff Sanders, studying approximate numerical linear algebra with applications to complex networks and social network analysis. Prior to joining UF, Dr. Fairbanks was a Senior Research Engineer at the Georgia Tech Research Institute, where he ran a portfolio of DARPA- and ONR-sponsored research programs.

### Exploiting Spark for HPC Simulation Data: Taming the Ephemeral Data Explosion

Managing and analyzing simulation data has always been challenging. Large-scale simulations have the capacity to produce massive amounts of data that HPC architectures are not designed to analyze effectively. Furthermore, simulation data must be analyzed in place since movement between systems is prohibitively expensive. The introduction of MapReduce and Big Data analytics spurred interest in adapting these tools to HPC environments, especially with the rise in popularity of using machine learning (ML) to analyze simulation data. In this talk, I will present some of the challenges we faced with using Big Data frameworks, such as Apache Spark, on HPC systems at LLNL. To address these challenges, we developed a set of best practices for using Apache Spark to perform ML and analytics on massive HPC simulation data. Our investigation focused on the real-world application of scaling ML algorithms to predict and analyze failures in multi-physics simulations, and it culminated with the demonstration of training a Random Forest regression model on 76TB of data with over one trillion training examples.

Ming Jiang is a computer scientist in LLNL's Center for Applied Scientific Computing (CASC). His current research focuses on applying machine learning to automate simulation workflows and exploiting Big Data analytics for HPC simulation data. Jiang joined LLNL in 2005 as a CASC postdoctoral researcher after receiving his PhD in Computer Science and Engineering from The Ohio State University. He has been a principal investigator and project/task lead on several large-scale collaborative research projects, including ML for arbitrary Lagrangian-Eulerian simulations, data-centric architectures, and real-time space situational awareness. His research interests include scientific ML, data-intensive computing, multiresolution analysis, and flow visualization.

### A Universal Law of Robustness via Isoperimetry

Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparameterization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial-size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
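For readers who want the "d times more parameters" claim in symbols, a rough informal reading is sketched here. The symbols n (number of training points), p (number of parameters), and Lip(f) (the Lipschitz constant of the fitted function f) are notation assumed for this sketch and do not appear in the abstract; constants, logarithmic factors, and the precise assumptions are omitted.

$$
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{n\,d}{p}}
\qquad\Longrightarrow\qquad
\mathrm{Lip}(f) = O(1) \ \text{ requires } \ p \gtrsim n\,d ,
$$

whereas mere interpolation classically needs only about p ≳ n parameters. The gap between the two requirements is the factor of d referred to in the abstract.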

Mark Sellke is a PhD student in mathematics at Stanford advised by Andrea Montanari and Sébastien Bubeck. He graduated from MIT in 2017 and received a Master of Advanced Study with distinction from the University of Cambridge in 2018, both in mathematics. Mark received the best paper and best student paper awards at SODA 2020, and the outstanding paper award at NeurIPS 2021. He has broad research interests in probability, statistics, optimization, and machine learning. Mark's research is supported by a National Science Foundation graduate research fellowship and the William R. and Sara Hart Kimball endowed Stanford Graduate Fellowship.

### Neural Representations for Volume Visualization

In this talk, I will describe two projects, both joint work with collaborators at Vanderbilt University. The first project studies how generative neural models can be used to model the process of volume rendering scalar fields. We construct a generative adversarial network that learns the mapping from volume rendering parameters, such as viewpoint and transfer function, to the rendered image. In doing so, we can analyze the volume itself and provide new mechanisms for guiding the user in transfer function editing and exploring the space of possible images that can be volume rendered. Both our training process and applications are available on the web at https://github.com/matthewberger/tfgan.

In the second part of my talk, I will explore a recent neural modeling approach for building compressive representations of volume data. This approach represents volumetric scalar fields as learned implicit functions wherein a neural network maps a point in the domain to an output scalar value. By setting the number of weights of the neural network to be smaller than the input size, we achieve compressive function approximation. Combined with carefully quantizing network weights, we show that this approach yields highly compact representations that outperform state-of-the-art volume compression approaches. We study the impact of network design choices on compression performance, highlighting how conceptually simple network architectures are beneficial for a broad range of volumes. Our compression approach is hosted at https://github.com/matthewberger/neurcomp

Joshua A. Levine is an associate professor in the Department of Computer Science at University of Arizona. Prior to starting at Arizona in 2016, he was an assistant professor at Clemson University from 2012 to 2016, and before that a postdoctoral research associate at the University of Utah’s SCI Institute from 2009 to 2012. He is a recipient of the 2018 DOE Early Career award. He received his PhD in Computer Science from The Ohio State University in 2009 after completing BS degrees in Computer Engineering and Mathematics in 2003 and an MS in Computer Science in 2004 from Case Western Reserve University. His research interests include visualization, geometric modeling, topological analysis, mesh generation, vector fields, performance analysis, and computer graphics.

### Data-Driven Mechanistic Models – Design Inference

Mechanistic models provide a flexible framework for modeling heterogeneous and dynamic systems in ways that enable prediction and control. In this talk, we focus on the application of mechanistic models for investigating dynamic biological systems. We show that by embedding these models in a hierarchical Bayesian framework, we can account for the underlying structure and stochasticity of the systems. Further, we discuss how to use Bayesian utility theory to find the optimal experimental design for studying biological systems. While our proposed approach could be quite flexible and powerful, its computational complexity could hinder its feasibility. To alleviate this issue, we propose a class of scalable Bayesian inference methods that utilize deep learning algorithms for fast approximation of the likelihood function and its gradient.

Babak Shahbaba is Chancellor's Fellow and Professor of Statistics with a joint appointment in Computer Science at UC Irvine. His independent research focuses on Bayesian methods and their applications in data-intensive biomedical problems. His research experience spans a broad spectrum of areas including statistical methodologies (Bayesian nonparametrics and hierarchical Bayesian models), computational techniques (efficient sampling algorithms), and a wide range of applied and collaborative projects (statistical methods in neuroscience, genomics, and health sciences). Currently, Shahbaba is the PI on three grants: 1) NSF-HDR-DSC: Data Science Training and Practices: Preparing a Diverse Workforce via Academic and Industrial Partnership, 2) NSF-MODULUS: Data-Driven Mechanistic Modeling of Hierarchical Tissues, 3) NIH-NIMH-R01: Scalable Bayesian Stochastic Process Models for Neural Data Analysis. Before joining UC Irvine, he was a Postdoctoral Fellow at Stanford University under the supervision of Rob Tibshirani and Sylvia Plevritis. Shahbaba received his PhD at the University of Toronto under Radford Neal’s supervision.

### Harnessing the Digital Revolution to Assess Water Use Dynamics Under Climatic Stressors and Policy Regimes

Understanding water demand patterns and demand dynamics is vital to achieving long-term water resiliency and reliability, especially as traditional water supply solutions are increasingly under stress due to climate change. In this seminar, using change point detection methodology, I will closely examine various drivers that affect customer-level water demand based on some of the emerging data sources. I will further assess the extent to which environmental and climatic stressors, such as droughts, and policy regimes cause transitory behavioral modifications or structural changes in water demand and rebound patterns, and how such dynamics are key in informing water supply reliability and infrastructure planning.

Newsha K. Ajami is the director of Urban Water Policy with Stanford University’s Water in the West program. A leading expert in sustainable water resource management, smart cities, and the water-energy-food nexus, she uses data science principles to study the human and policy dimensions of urban water and hydrologic systems. Her research throughout the years has been interdisciplinary and impact focused. Dr. Ajami served as a gubernatorial appointee to the Bay Area Regional Water Quality Control Board for two terms and is currently a mayoral appointee to the San Francisco Public Utilities Commission. She is a member of the National Academies Board on Water Science and Technology. Dr. Ajami also serves on a number of state-level and national advisory boards. Before joining Stanford, she worked as a senior research scholar at the Pacific Institute and served as a Science and Technology fellow at the California State Senate’s Natural Resources and Water Committee, where she worked on various water- and energy-related legislation. She has published many highly cited peer-reviewed articles, coauthored two books, and contributed opinion pieces to the New York Times, San Jose Mercury, and Sacramento Bee. Dr. Ajami received her Ph.D. in Civil and Environmental Engineering from UC Irvine, an M.S. in Hydrology and Water Resources from the University of Arizona, and a B.S. in Civil Engineering from Amir Kabir University of Technology in Tehran.

### Julia, The Power of Language

The Julia language has become well known for its combination of performance and ease of use. In this talk we will assume little or no familiarity with the Julia language and describe why Julia is not just another language for everyday and high-performance computing. We argue that the real power of a language is the ability to collaborate and have impact. We will discuss the application of Julia to domains like climate science, materials design, simulations that require optimization, differential equations, machine learning, and uncertainty quantification and highlight why we say, "humans compose when software composes."

Professor Edelman considers himself to be a pure mathematician and an applied computer scientist. He works in the areas of numerical linear algebra, random matrix theory, high performance computing systems, networks, software, and algorithms. He has won many prizes for his work including the prestigious Gordon Bell Prize, the Householder Prize, the Sidney Fernbach Award, and the Babbage Prize. He was the founder of Interactive Supercomputing, a company that employed nearly 50 people and was acquired by Microsoft in its fifth year, and he is a co-creator of Julia. He is an elected fellow of ACM, AMS, IEEE, and SIAM. He believes above all that math and computing go together and both should be fun.

### AI-Enabled Innovations in Validation of Sanitation and Detection of Pathogens

Food safety is one of the leading public health issues and continues to be a significant challenge for the food industry and consumers. These issues are especially critical for minimally processed food products, such as fresh produce. Sanitation is a critical control step for the safety of the food supply. However, the current approaches for verification and validation are limited. Similarly, the current sanitation processes use conventional chemical sanitizers and copious amounts of water and energy. Thus, there is an unmet need to develop and validate novel technologies for the sanitation of food contact surfaces. Complementary to sanitation, food safety testing is a fundamental approach for detecting pathogens in food, water, and environmental samples. This presentation will focus on advances in verification and validation of sanitation of food contact surfaces, including the inactivation of biofilms using chemical sanitizers and non-thermal atmospheric plasma technologies, and on detection of target bacteria in water and food samples. The presentation focuses on the role of AI methods in enabling the validation of sanitation and the detection of pathogens. For the verification of sanitation, the research will illustrate the application of AI for the analysis of spectroscopy data sets acquired using engineered surrogates for bacteria and their biofilms. To detect bacteria, I will present applications of AI methods for both imaging and spectroscopy data sets. The results will illustrate the significant potential of AI technologies in addressing critical needs to improve food safety.

N. Nitin is a faculty member in the departments of food science and technology and biological and agricultural engineering. His research is at the interface of biomaterial science, biosensors, mathematical modeling, and data analytics. With these approaches, his research aims to enhance the quality, safety, and sustainability of food systems. In collaboration with his students, postdoctoral fellows, and faculty colleagues, he has co-authored over 145 peer-reviewed publications and is a co-inventor for ten patents and eight patent applications. Prof. Nitin also teaches courses in food processing, food safety engineering, and heat transfer in biological systems in both departments. His research has also enabled co-founding of two early-stage companies.

### Hypergraphs and Topology for Data Science

Data scientists and applied mathematicians must grapple with complex data when analyzing complex systems. Analytical methods almost always represent phenomena at a much simpler level than the complex structure or dynamics inherent in systems, through either simpler measured or sampled data, or simpler models, or both. As just one example, collaboration data from publications databases are often modeled as graphs of authors, in which pairs of authors (vertices) are connected if they published a paper together, perhaps weighted by the number of such papers. This graph view is also commonly found when analyzing many other kinds of data including biological, cyber, and social. But to better represent inherent complexity, researchers are striving to adopt hypergraphs, representing connections not only as pairwise, but as multi-way or higher order. In bibliometrics, where papers have multiple authors, and authors write multiple papers, hypergraphs can natively capture the complex ways that groups of authors form into collaborations as sets of authors on papers, where traditional collaboration networks can only do so via complex coding schemes. Our recent work has focused on first developing and implementing methods that extend common graph methods (e.g., distance, diameter, centrality) to hypergraphs, and then using such methods to study real data sets from biology to cyber security. Moreover, the complexity of hypergraphs imbues them with significant topological properties, and we have been active in developing a theory and interpretation of hypergraph homology, through abstract simplicial complexes and other topological representations. Additionally, graphs and hypergraphs both arise in data systems with more than two dimensions, for example adding keywords or institutions to papers and authors. These four dimensions (authors, papers, keywords, and institutions) can now form a combinatorial number of hypergraphs (e.g., author vs. papers, papers vs. keywords, institutions vs. authors). But what mathematical structure can be formed when we consider all these dimensions simultaneously? Tensors may be one such structure, but even they may be too restrictive since tensors represent a multi-relation among all dimensions, and data may only be available on certain projections. In this talk I will provide an overview of our work on hypergraphs and topology for data science, including both theory and practice of the methods we have been developing, and provide some thoughts on going beyond hypergraphs.
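The point about co-authorship graphs losing group structure can be made concrete with a small sketch. The example below is illustrative only and is not the speaker's software; the paper and author identifiers and the helper `coauthor_graph` are hypothetical. It stores a hypergraph as a mapping from papers to author sets and projects it onto pairwise co-author edges.

```python
from itertools import combinations

def coauthor_graph(papers):
    """Project a hypergraph (paper -> set of authors) onto pairwise co-author edges."""
    edges = set()
    for authors in papers.values():
        for a, b in combinations(sorted(authors), 2):
            edges.add((a, b))
    return edges

# Hypothetical data: one three-author paper versus three two-author papers.
one_joint_paper = {"p1": {"ana", "bo", "cy"}}
three_pair_papers = {"p1": {"ana", "bo"}, "p2": {"bo", "cy"}, "p3": {"ana", "cy"}}

# The pairwise projections coincide...
same_graph = coauthor_graph(one_joint_paper) == coauthor_graph(three_pair_papers)

# ...but the hyperedges (author sets per paper) do not.
hyperedges_a = {frozenset(authors) for authors in one_joint_paper.values()}
hyperedges_b = {frozenset(authors) for authors in three_pair_papers.values()}
same_hypergraph = hyperedges_a == hyperedges_b
```

Here `same_graph` is true while `same_hypergraph` is false. Weighting edges by the number of shared papers would not separate the two cases either, since every pair shares exactly one paper in both.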

Dr. Emilie Purvine is a Senior Data Scientist at Pacific Northwest National Laboratory. Although her academic background is in pure mathematics, with a BS from the University of Wisconsin-Madison and a PhD from Rutgers University, her research since joining PNNL in 2011 has focused on applications of combinatorics and computational topology together with theoretical advances needed to support the applications. Over her time at PNNL Emilie has been both PI and technical staff on a number of projects in applications ranging from computational chemistry and biology to cyber security and power grid modeling. She has authored over 40 technical publications and is currently an associate editor for the Notices of the American Mathematical Society. Emilie also coordinates PNNL’s Postgraduate Organization, which plans career development seminars and an annual research symposium, and promotes networking and mentorship for PNNL’s post bachelors, post masters, and post doctorate research associates.

### A Biased Tour of the Uncertainty Visualization Zoo

Uncertain predictions permeate our daily lives (“will it rain today?”, “how long until my bus shows up?”, “who is most likely to win the next election?”). Fully understanding the uncertainty in such predictions would allow people to make better decisions, yet predictive systems usually communicate uncertainty poorly—or not at all. Based on my (and others') research and my own practice, I will discuss ways to combine knowledge of visualization perception, uncertainty cognition, and task requirements to design visualizations that more effectively communicate uncertainty. I will also discuss ongoing work in systematically characterizing the space of uncertainty visualization designs and in developing ways to communicate (difficult- or impossible-to-quantify) uncertainty in the data analysis process itself. As we push more predictive systems into people’s everyday lives, we must consider carefully how to communicate uncertainty in ways that people can actually use to make informed decisions.

Matthew Kay is an Assistant Professor jointly appointed in Computer Science and Communications Studies at Northwestern University. He works in human-computer interaction and information visualization; more specifically, his research areas include uncertainty visualization, personal health informatics, and the design of human-centered tools for data analysis. His current research is funded by multiple NSF awards, and he has received multiple best paper awards across human-computer interaction and information visualization venues (including ACM CHI and IEEE VIS). He co-directs the Midwest Uncertainty Collective and is the author of the tidybayes and ggdist R packages for visualizing Bayesian model output and uncertainty.

### MuyGPs: Scalable Gaussian Process Hyperparameter Estimation Using Local Cross-Validation

The utilization of large and complex data by machine learning in support of decision-making is of increasing importance in many scientific and national security domains. However, the need for uncertainty estimates or similar confidence indicators inhibits the integration of many popular machine learning pipelines, such as those that rely upon deep learning. In contrast, Gaussian process (GP) models are popular for their principled uncertainty quantification but require quadratic memory to store the covariance matrix and cubic computation to perform inference or evaluate the likelihood function. In this talk, we present MuyGPs, a novel computationally efficient GP hyperparameter estimation method for large data that has recently been released for open-source use in the Python package MuyGPyS. MuyGPs builds upon prior methods that take advantage of nearest neighbor structure for sparsification and uses leave-one-out cross-validation to optimize covariance (kernel) hyperparameters without realizing the expensive multivariate normal likelihood. We describe our approximate methods and compare our implementations against state-of-the-art competitors in approximate GP regression on a benchmark dataset and against several competitors, including convolutional neural networks, in a space-based image classification problem. We give code examples for fitting data of this kind, and finally, we discuss future directions of MuyGPs.
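The general idea of choosing a kernel hyperparameter by leave-one-out cross-validation over nearest-neighbor predictions can be sketched in plain NumPy. This is a toy illustration under stated assumptions, not the MuyGPyS API or the authors' implementation: the RBF kernel, the brute-force neighbor search, the grid search over a single length scale, the fixed nugget, and every function name here are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    """Squared-exponential kernel between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def loo_local_mse(X, y, length_scale, k=10, nugget=1e-2):
    """Mean squared leave-one-out error of nearest-neighbor GP mean predictions."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dists, np.inf)  # a point never counts as its own neighbor
    neighbors = np.argsort(sq_dists, axis=1)[:, :k]
    errors = []
    for i, idx in enumerate(neighbors):
        # Small k-by-k system per point instead of one n-by-n covariance matrix.
        K_nn = rbf_kernel(X[idx], X[idx], length_scale) + nugget * np.eye(k)
        k_star = rbf_kernel(X[i:i + 1], X[idx], length_scale)  # shape (1, k)
        mean = k_star @ np.linalg.solve(K_nn, y[idx])
        errors.append((mean[0] - y[i]) ** 2)
    return float(np.mean(errors))

# Synthetic 1-D regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)

# Pick the length scale with the lowest leave-one-out error on a small grid.
grid = [0.1, 0.3, 1.0, 3.0]
scores = {ls: loo_local_mse(X, y, ls) for ls in grid}
best_length_scale = min(scores, key=scores.get)
```

Each prediction solves only a k-by-k linear system, which is what avoids the cubic cost of the full covariance matrix; the brute-force distance matrix used here for neighbor search is itself quadratic and would be replaced by a proper nearest-neighbor index at scale.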

Dr. Amanda Muyskens is a staff member in the Applied Statistics Group (ASG) within the Computational Engineering Division (CED) here at LLNL. She received bachelor’s degrees in both mathematics and music performance from the University of Cincinnati in 2013 and an MS and PhD in statistics from NC State University in 2015 and 2019, respectively. She began her postdoc at LLNL in 2019 in her current group. Her research interests include Gaussian processes, computationally efficient statistical methods, uncertainty quantification, and statistical consulting.

### Artificial Intelligence in Support of Biomedical Data Privacy

Privacy is a social construct that is realized in different ways under varying situations in healthcare and biomedical research. In this respect, context is king, such that the manner by which privacy can be injected into a system is dependent on a variety of factors that influence the environment. This is particularly the case when considering privacy in the big data age, or what one might call big data privacy. As computing becomes increasingly cheap and ever more ubiquitous, it seems as though upholding privacy is an impossible task. This notion is supported by the development and demonstration of a growing array of attacks on certain types of protections biomedical data managers aim to inject into clinical and genomic data shared for various purposes, such as the obfuscation of a patient’s identity or the suppression of sensitive facts about a research participant or academic medical center. At the same time, these methodologies make strong assumptions about the extent to which an adversary functions in the world, such as operating under no (or limited) constraints with respect to resources at their disposal and motivation for mounting an attack. In this brief presentation, I will review several attacks on biomedical data as they have evolved over the past several decades, but then posit a new approach to assessing data privacy risk in the real world that builds on computational economic perspectives of risk assessment and artificial data generation methods. To illustrate the potential for this approach, I will draw upon several examples of how we have applied it with respect to sharing demographic, clinical, and genomic data, at both the individual and summary levels, for several U.S.-based consortia and multinational clinical trials.

Bradley Malin, Ph.D., is the Accenture Professor of Biomedical Informatics, Biostatistics, and Computer Science at Vanderbilt University. He is the Co-Founder and Co-Director of two centers. The first is the Center for Genetic Privacy and Identity in Community Settings (GetPreCiSe), an NIH Center of Excellence in Ethical, Legal, and Social Implications Research. The second is the Health Data Science Center, which integrates over ten laboratories at Vanderbilt working on data science applications in healthcare. His research draws upon methodologies in computer science, biomedical science, and public policy to develop novel computational techniques. In addition to running a vibrant scientific research program, he has led a data privacy consultation service for the Electronic Medical Records and Genomics (eMERGE) network, an NIH consortium, since 2007.

### Adaptive Contraction Rates and Model Selection Consistency of Variational Posteriors

This talk discusses adaptive inference based on variational Bayes. We propose a novel variational Bayes framework called adaptive variational Bayes, which can operate on a collection of model spaces with varying structures. The proposed framework averages variational posteriors over individual models with certain weights to obtain the variational posterior over the entire model space. It turns out that this averaged variational posterior minimizes the Kullback-Leibler divergence to the regular posterior distribution. We show that the proposed variational posterior can achieve optimal contraction rates adaptively in very general situations, as well as attain model selection consistency when the “true” model structure exists. We apply the adaptive variational Bayes to several classes of deep learning models and derive some new and adaptive inference results. Moreover, we propose a particle-based approach for the construction of a prior distribution and variational family, which automatically satisfies some of the theoretical conditions imposed in our general framework and provides an optimization algorithm applicable to general problems. Lastly, we consider the use of quasi-likelihood in our adaptive variational framework. We formulate conditions on quasi-likelihood to ensure the contraction rate remains the same. The proposed framework can be applied to a large class of problems including sparse linear regression, estimation of finite mixtures, and graphon estimation in network analysis.

Dr. Lizhen Lin is the Sara and Robert Lumpkins Associate Professor at the University of Notre Dame. Her areas of expertise are Bayesian nonparametrics, Bayesian asymptotics, statistics on manifolds, and geometric deep learning. She is also interested in statistical network analysis.

### Deep Networks from First Principles

In this talk, we offer an entirely “white box” interpretation of deep (convolution) networks from the perspective of data compression (and group invariance). We show how modern deep layered architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived from the principle of maximizing rate reduction (with group invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of the so-obtained network, called ReduNet, have precise optimization, geometric, and statistical interpretations. There are also several nice surprises from this principled approach: it reveals a fundamental tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between deep networks and the Fourier transform for group invariance, namely the computational advantage in the spectral domain (why spiking neurons?); and it clarifies the mathematical role of forward propagation (optimization) and backward propagation (variation). In particular, the so-obtained ReduNet is amenable to fine-tuning via both forward and backward (stochastic) propagation, both for optimizing the same objective.

*This is joint work with students Yaodong Yu, Ryan Chan, and Haozhi Qi of Berkeley; Dr. Chong You (now at Google Research); and Professor John Wright of Columbia University.*

Yi Ma is a Professor in residence at the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received his Bachelor’s degree from Tsinghua University in 1995 and MS and PhD degrees from UC Berkeley in 2000. His research interests are in computer vision, high-dimensional data analysis, and intelligent systems. He was on the faculty of UIUC ECE from 2000 to 2011, the manager of the Visual Computing group of Microsoft Research Asia from 2009 to 2014, and the Dean of the School of Information Science and Technology of ShanghaiTech University from 2014 to 2017. He has published over 160 papers and three textbooks in computer vision, statistical learning, and data science. He received the NSF CAREER Award in 2004 and the ONR Young Investigator Award in 2005. He also received the David Marr Prize in computer vision in 1999 and has served as Program Chair and General Chair of ICCV 2013 and 2015, respectively. He is a Fellow of IEEE, SIAM, and ACM.

### Deep Symbolic Regression: Recovering Mathematical Expressions from Data via Risk-Seeking Policy Gradients

Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of symbolic regression. Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that leverages deep learning for symbolic regression via a simple idea: use a large model to search the space of small models. Specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions and employ a novel risk-seeking policy gradient to train the network to generate better-fitting expressions. Our algorithm outperforms several baseline methods (including Eureqa, the gold standard for symbolic regression) in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise. More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate constraints in situ, and a risk-seeking policy gradient formulation that optimizes for best-case performance instead of expected performance.

A team of LLNL scientists collaborated on this effort: Brenden Petersen, Mikel Landajuela Larma, Nathan Mundhenk, Claudio Santiago, Soo Kim, and Joanne Kim. See the ICLR 2021 publication.

Brenden Petersen is the group leader of the Operations Research and Systems Analysis group at Lawrence Livermore National Laboratory. He received his PhD in 2016 through a joint program at the University of California, Berkeley, and the University of California, San Francisco. His PhD background is in biological modeling and simulation. Since he joined the Lab almost 5 years ago, his research has explored the intersection of simulation and machine learning. His current research interests include deep reinforcement learning for simulation control and discrete optimization.

### Replication or Exploration? Sequential Design for Stochastic Simulation Experiments

We investigate the merits of replication and provide methods that search for optimal designs (including replicates), in the context of noisy computer simulation experiments. We first show that replication offers the potential to be beneficial from both design and computational perspectives, in the context of Gaussian process surrogate modeling. We then develop a look-ahead based sequential design scheme that can determine if a new run should be at an existing input location (i.e., replicate) or at a new one (explore). When paired with a newly developed heteroskedastic Gaussian process model, our dynamic design scheme facilitates learning of signal and noise relationships which can vary throughout the input space. We show that it does so efficiently, on both computational and statistical grounds. In addition to illustrative synthetic examples, we demonstrate performance on two challenging real-data simulation experiments, from inventory management and epidemiology.

Dr. Gramacy is a Professor of Statistics in the College of Science at Virginia Polytechnic Institute and State University (Virginia Tech/VT) and affiliate faculty in VT's Computational Modeling and Data Analytics program. Previously he was an Associate Professor of Econometrics and Statistics at the Booth School of Business, and a fellow of the Computation Institute at The University of Chicago. His research interests include Bayesian modeling methodology, statistical computing, Monte Carlo inference, nonparametric regression, sequential design, and optimization under uncertainty. Dr. Gramacy recently published a book on surrogate modeling of computer experiments. Watch Gramacy's talk on YouTube.

### Data Sketching as a Tool for High Performance Computing

High-throughput and high-volume data pipelines are prevalent throughout data science. Additionally, many data problems consider structured data that is representable as graphs, matrices, or tensors. Although modern high performance software solutions are sufficient to solve many important problems, the highly non-uniform structure of many realistic data sets, such as scale-free graphs, can result in high latency and poor resource utilization in distributed memory codes. In this talk we will introduce distributed data sketching (the deployment of composable, fixed-size data summaries) as a mechanism for approximately querying distributed structured data while minimizing memory and communication overhead. We will describe several specific sketch data structures, including cardinality sketches and subspace embeddings, while providing concrete examples of their application to HPC-scale computations, including local k-neighborhood estimation and vertex embedding for clustering. We will also introduce a broad cross-section of sketches and applications from the theory of computing literature, and outline their potential future applications to high performance numerical linear algebra and graph analysis codes.

Benjamin Priest is a staff member in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. They received their PhD in 2019 from the Thayer School of Engineering at Dartmouth College. Their areas of interest include streaming and sketching algorithms, high performance computing, graph analysis, numerical linear algebra, and machine learning. Their recent research foci are the development of high performance algorithms and codes for the sub-linear analysis of graphs and for the scalable approximation of Gaussian processes.

### Deep Generative Modeling in Network Science with Applications to Public Policy Research

Network data is increasingly being used in quantitative, data-driven public policy research. These are typically very rich datasets that contain complex correlations and inter-dependencies. This richness promises to be quite useful for policy research, but at the same time it makes extracting useful information from these datasets difficult, a challenge that calls for new data analysis methods. We formulate a research agenda of key methodological problems whose solutions would enable progress across many areas of policy research. We then review recent advances in applying deep learning to network data and show how these methods may be used to address many of the identified methodological problems. We particularly emphasize deep generative methods, which can be used to generate realistic synthetic networks useful for microsimulation and agent-based models capable of informing key public policy questions. We extend these recent advances by developing a new generative framework that applies to large social contact networks commonly used in epidemiological modeling. For context, we also compare these recent neural network–based approaches with the more traditional Exponential Random Graph Models. Lastly, we discuss some open problems where more progress is needed. This talk will be mainly based on our recent report. See the project's GitHub repository.

Gavin Hartnett is an Information Scientist at the RAND Corporation and a professor at the Pardee RAND Graduate School, where he serves as the Tech and Narrative Lab AI Co-Lead. As a theoretical physicist turned machine learning (ML) researcher, his research centers around the application of ML to a diverse range of public policy areas. Hartnett's recent work includes investigations into COVID-19 vaccination strategies, applications of graph neural networks to agent-based modeling, applications of natural language processing to official U.S. government policy documents, and the implications of adversarial examples in defense scenarios. He has also worked on applications of AI/ML in the physical sciences, with a particular emphasis on spin-glass systems in theoretical physics and computer science. Prior to joining RAND, Hartnett studied black holes in string theory as a postdoc at the Southampton Theory Astrophysics and Gravitation Research Centre in the UK, and before that he was a PhD student at UCSB. His research focused on the existence and stability of black holes, and on using properties of black holes to understand phenomena in strongly coupled gauge theories through the gauge/gravity duality. As an undergraduate at Syracuse University, he researched gravitational waves as part of the LIGO collaboration, the expansion of the early universe, and topological defects in liquid crystals.

### Targeted Use of Neural Networks on Physics and Engineering

A major challenge in the study of dynamical systems is that of model discovery: turning data into models that are not just predictive but provide insight into the nature of the underlying dynamical system that generated the data. This problem is made more difficult by the fact that many systems of interest exhibit diverse behaviors across multiple time scales. We introduce a number of data-driven strategies for discovering nonlinear multiscale dynamical systems and their embeddings from data. We consider two canonical cases: (i) systems for which we have full measurements of the governing variables, and (ii) systems for which we have incomplete measurements. For systems with full state measurements, we show that the recent sparse identification of nonlinear dynamical systems (SINDy) method can discover governing equations with relatively little data and introduce a sampling method that allows SINDy to scale efficiently to problems with multiple time scales. Specifically, we can discover distinct governing equations at slow and fast scales. For systems with incomplete observations, we show that modern neural networks can discover appropriate coordinate embeddings on which to model the dynamics. Together, our approaches provide a suite of mathematical strategies for reducing the data required to discover and model nonlinear dynamical systems.

Nathan Kutz is the Yasuko Endo and Robert Bolles Professor of Applied Mathematics at the University of Washington, having served as chair of the department from 2007 to 2015. He has a wide range of interests, from neuroscience to fluid dynamics, in which he integrates machine learning with dynamical systems and control.

### Learning Particle Physics from Machines

Recent advances in artificial intelligence offer opportunities to disrupt the traditional strategies for discovery of new particles in high-energy collisions. Dr. Whiteson will describe new machine learning techniques, explain why they are particularly well suited for particle physics, present selected results that demonstrate their new capabilities, and present a strategy for translating their learned strategies into human understanding.

Daniel Whiteson is a professor of experimental particle physics at the University of California, Irvine, and a fellow of the American Physical Society. He is part of the collaboration that built, maintains, and collects data from the ATLAS experiment at the Large Hadron Collider. His research has appeared widely in popular media outlets including The New Yorker, Ars Technica, VICE, and many others. Along with his colleagues he created popular comics including “What’s in the data? The Higgs Boson Explained” and “True Tales of Dark Matters,” which were all featured on PBS. Dr. Whiteson is the co-host of the Daniel & Jorge Explain the Universe podcast and holds a PhD in Physics from UC Berkeley.

### Low-Dimensional Modeling for Data-Driven Learning

Data-driven methods such as deep learning have achieved phenomenal success in a broad range of tasks. A key to the superior performance of data-driven methods is the availability of large-scale data that is carefully collected, cleaned, organized, and annotated. However, practical data often possess many nuances such as corruption, lack of annotations, and heavy-tailed distribution, which significantly compromise the performance of data-driven methods. This talk aims to demonstrate that the intrinsic low-dimensional structure of high-dimensional data can be leveraged to address the challenges in a principled and effective manner. First, I will show that by modeling a mixture of data by a union of low-dimensional manifolds, we can develop unsupervised clustering algorithms that not only are provably correct, but also can be made scalable without a performance loss and robust to data nuances with provable guarantees. Our methods obtain state-of-the-art performance for clustering MNIST (with 98.3% accuracy) and CIFAR10 (with 68.4% accuracy) datasets. Second, I will present a double over-parameterization method that addresses the overfitting issue in over-parameterized models by exploiting the implicit algorithmic bias of discrepant learning rates. We establish the theoretical correctness of the method for low-rank matrix recovery problems and demonstrate the practical effectiveness of the method for natural image recovery tasks. I will conclude the talk with the broader implication of low-dimensional modeling for deep learning, using generalization and architectural design as two illustrative examples.

Chong You is a postdoctoral scholar in the Department of EECS at the University of California, Berkeley. He received his PhD in 2018 from the Electrical and Computer Engineering Department at Johns Hopkins University. His research areas broadly include machine learning, computer vision, optimization, and signal processing. He is interested in the development of mathematical principles and practical numerical algorithms for analyzing and interpreting modern data, with the goal of addressing real-world challenges. He is the recipient of the Doctoral Dissertation Award from MINDS at Johns Hopkins University.

### Discovering Symbolic Models in Physical Systems Using Deep Learning

We develop a general approach to distill symbolic representations of a learned deep model by introducing strong inductive biases. We focus on graph neural networks (GNNs). The technique works as follows: We first encourage sparse latent representations when we train a GNN in a supervised setting, then we apply symbolic regression to components of the learned model to extract explicit physical relations. We find that the correct known equations, including force laws and Hamiltonians, can be extracted from the neural networks. We then apply our method to a non-trivial cosmology example (a detailed dark matter simulation) and discover a new analytic formula that can predict the concentration of dark matter from the mass distribution of nearby cosmic structures. The symbolic expressions extracted from the GNN using our technique also generalize to out-of-distribution data better than the GNN itself. Our approach offers alternative directions for interpreting neural networks and discovering novel physical principles from the representations they learn.

Dr. Shirley Ho’s research interests have ranged from using machine learning and statistics to tackle fundamental challenges in cosmology to finding new structures in the Milky Way. She has broad expertise in theoretical astrophysics, observational astronomy, and data science. Ho’s recent interest has been in understanding and developing novel machine learning tools and applying them to astrophysical challenges. Her goal is to understand the universe’s beginning, evolution, and ultimate fate. Ho works with international collaborators within the Cosmology X Data Science Group at the Flatiron Institute, at the Department of Astrophysical Sciences at Princeton University, and beyond. She holds a Ph.D. in Astrophysical Sciences from Princeton University.

### Why Do Machine Learning Models Fail?

Our current machine learning (ML) models achieve impressive performance on many benchmark tasks. Yet these models remain remarkably brittle and susceptible to manipulation. Why is this the case? In this talk, Dr. Madry will take a closer look at this question and pinpoint some of the roots of this observed brittleness. Specifically, the seminar will discuss how the way current ML models “learn” and are evaluated gives rise to widespread vulnerabilities, and then outline possible approaches to alleviate these deficiencies.

Dr. Aleksander Madry is a Professor of Computer Science in the EECS Department at the Massachusetts Institute of Technology (MIT) and a Principal Investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his Ph.D. from MIT in 2011; prior to joining the MIT faculty, he spent some time at Microsoft Research New England and on the faculty of EPFL. Madry’s research interests span algorithms, continuous optimization, science of deep learning, and understanding machine learning from robustness and deployability perspectives. His work has been recognized with a number of awards, including a National Science Foundation CAREER Award, an Alfred P. Sloan Research Fellowship, an ACM Doctoral Dissertation Award Honorable Mention, and a Presburger Award.

### Marrying AI and Physics Towards Accelerated Discovery

Scientific discovery is one of the primary factors underlying the advancement of the human race. However, the traditional discovery process is slow compared to the growing need for new inventions, for example in antibiotic discovery or the design of next-generation energy materials. In recent years, data-driven approaches such as machine learning and especially deep learning have achieved remarkable performance in many domains including computer vision, speech recognition, audio synthesis, and natural language processing and generation. These methods have also infiltrated other scientific fields including physics, chemistry, and medicine. Despite these successes and the potential for huge societal impact, machine learning models are still in their infancy in terms of driving and transforming scientific discovery. This talk will introduce a closed-loop paradigm to accelerate scientific discovery, which can seamlessly integrate machine learning, physics-based simulations, and wet-lab experiments and enable new hypothesis and/or artefact generation and validation thereof. Development and use of deep generative models and reinforcement learning–based methods for designing novel peptides and materials with desired functionality will be discussed. Das will also examine the importance of adding creativity, robustness, and interpretability to machine learning models to enable and add value to artificial intelligence–driven discovery.

Dr. Payel Das is a research staff member and manager in the AI Science Department of IBM Thomas J. Watson Research Center in Yorktown Heights, NY. She is also an adjunct associate professor in the Department of Applied Physics and Applied Mathematics at Columbia University. At IBM, she leads and manages research projects related to artificial intelligence (AI) for creativity and discovery, with inspirations from and applications in materials science, chemistry, physics, biology, and neuroscience. Many of these projects lie at the intersection of data-driven and physics-based modeling. A major focus of her work is developing novel deep generative models for heterogeneous data, which is abundant in real-world applications. Das holds a PhD in Theoretical Physical Chemistry from Rice University and has won numerous awards including IBM’s highest award for Outstanding Technical Achievement. She has co-authored over 30 peer-reviewed publications and serves on the editorial advisory board of the ACS Central Science journal.

### Data and Ethics: Old Issues and New Challenges

The increasing availability of data and raw computational power, along with recent developments in models and algorithms, are changing the way businesses, academics, and governments operate. However, this revolution has both created new ethical challenges and changed the nature of many familiar ones. For example, notions of informed consent, which were originally developed in the context of biomedical research after the atrocities of the Second World War, are a poor fit for an environment in which individuals are constantly monitored by scores of agents with vague (and often unenforceable) consent disclosures.

Similarly, notions of confidentiality and privacy—originally devised for a world in which governments were the only agents with detailed information about large numbers of individuals—are not necessarily appropriate for an environment in which this kind of data is in the hands of a myriad of private entities. This talk uses a number of recent case studies to explore these and other issues related to the ethics of data collection, management, and analysis, in an attempt to highlight issues that would appear to be relevant to the kinds of activities carried out by the national laboratories.

Dr. Abel Rodriguez is Professor of Statistics at the Baskin School of Engineering at the University of California, Santa Cruz (UCSC). He is also the Associate Director of the Center for Data, Discovery and Decisions (D3) and one of the PIs of the NSF-supported TRIPODS Center at UCSC. A recipient of the DARPA Young Faculty Award in 2010, he was also awarded the prestigious Donald D. Harrington Faculty Fellowship by the University of Texas at Austin in 2012. Dr. Rodriguez came to UCSC in 2007 after completing an M.A. in Economics and a Ph.D. in Statistics and Decision Sciences from Duke University. Before that, he received a B.A. in Law and B.S. in Industrial Engineering in his native Venezuela. Dr. Rodriguez is an expert in Bayesian statistical methods and their applications in the biomedical and social sciences. His interests range widely and include nonparametric methods, spatiotemporal modeling, relational data, and extreme value theory. Starting September 1, he will be joining the University of Washington as Professor and Chair of the Statistics Department.

### Dungeons and Discourse: Using Computational Storytelling to Look at Natural Language Use

Although we are currently riding a technological wave of personal assistants, many of these agents still struggle to communicate appropriately. Humans are natural storytellers, so it would be fitting if artificial intelligence (AI) could tell stories as well. Automated story generation is an area of AI research that aims to create agents that tell “good” stories. Previous story-generation systems use planning to create new stories, but these systems require a vast amount of knowledge engineering. The stories created by these systems are coherent, but only a finite set of stories can be generated. In contrast, very large language models have recently made the headlines in the natural language processing community. Though impressive on the surface, these models begin to lose coherence over time. Lara Martin’s research looks at various techniques of automated story generation, focusing on the perceived creativity of the generated stories. In this talk, she will define a creative product as one that is both novel and useful, and show how, in an improvisational storytelling system, a jointly probabilistic and causal model can give readers more creative stories than solely probabilistic or solely causal models.

Lara J. Martin is a Human-Centered Computing Ph.D. Candidate in the College of Computing at Georgia Tech. Her work resides in human-centered AI with a focus on natural language applications. Lara has worked in the areas of automated story generation, speech processing, and affective computing, publishing in top-tier conferences such as AAAI and IJCAI. She earned a Master of Language Technologies from Carnegie Mellon University in 2015 and a B.S. in Computer Science and Linguistics from Rutgers University–New Brunswick in 2013. In 2019, she received Georgia Tech’s prestigious Foley Scholar Award for her innovative research and the Best Doctoral Consortium Presentation award at the 2019 ACM Richard Tapia Celebration of Diversity in Computing Conference. She has also been featured in Wired.

### Blending Noisy Organic Signals with Traditional Movement Variables to Predict Forced Migration in Iraq

Worldwide displacement due to war and conflict is at an all-time high. Unfortunately, determining if, when, and where people will move is a complex problem. This talk will describe a multi-university project that develops methods for blending variables constructed from publicly available organic data (social media and newspapers) with more traditional indicators of forced migration to better understand when and where people will move.

Dr. Singh will demonstrate an approach that uses a case study involving displacement in Iraq, and show that incorporating open-source generated conversation and event variables maintains or improves predictive accuracy over traditional variables alone. She will conclude with a discussion on strengths and limitations of leveraging organic big data for societal-scale problems.

Dr. Lisa Singh is a professor in the Department of Computer Science and a research professor in the Massive Data Institute at Georgetown University. She has co-authored over 70 peer-reviewed publications and book chapters related to data-centric computing. Current projects include studying privacy on the Web; identifying noise and poor-quality information on social media; developing methods and tools to better understand forced movement due to conflict; and learning from public, open-source big data to advance social science research of human behavior/opinion. Her research has been supported by the National Science Foundation, the Office of Naval Research, the Social Science and Humanities Research Council, the National Collaborative on Gun Violence Research, the Department of Defense, and the Department of State. Dr. Singh recently organized three workshops involving future directions of big data research and is currently involved in different organizations working on increasing participation of women in computing and integrating computational thinking into K-12 curricula. Dr. Singh received a BSE from Duke University and MS and PhD from Northwestern University.

### Discovering and Normalizing Part Names in Noisy Text Data

Part identification plays a key role in vehicle prognostics and health management. Part identifiers are often expressed as nomenclature and buried in noisy free text data found in maintenance reports, supply chain management records, service and support communication logs, and manufacturing quality data. There is little consistency in how part names are actually described in noisy free text, with variations spawned by typos, ad hoc abbreviations, acronyms, and incomplete names. This makes search and analysis of the parts mentioned in this data extremely challenging. In this talk, Kao will discuss Boeing’s tool PANDA (PArt Name Discovery Analytics), which combines statistical, linguistic, and machine learning techniques in a unique way to discover part names in noisy free text. Normalization of such terms is also crucial for many applications. Part names pose an additional major challenge because they tend to be in the form of multi-word terms. Kao’s team also developed a novel normalization method called UNAMER (Unification and Normalization Analysis, Misspelling Evaluation and Recognition) for identifying term variants, including variants of multi-word terms, and normalizing them under a canonical name. PANDA and UNAMER have been deployed in practical applications to extract and normalize part names in the aerospace domain.

Dr. Anne Kao is an internationally recognized expert in text analytics and natural language processing. As a Senior Technical Fellow at Boeing Research & Technology, she is responsible for coordinating R&D in data analytics and artificial intelligence, creating an intellectual property strategy with respect to these, leveraging data analytics and artificial intelligence as key Boeing technology differentiators for government programs, collaborating with national and international universities and laboratories, and building Boeing’s depth and breadth in the field. Dr. Kao has more than 25 years of success in analytics methods including artificial intelligence, data analytics, visual analytics, and social network analysis. She holds 17 U.S. patents, has published dozens of papers in peer-reviewed journals and conference proceedings, and is active in professional societies. Dr. Kao won the BEYA Senior Technology Fellow Award and the Asian American Engineer of the Year Award in 2015 as well as the National Women of Color in Technology Research Leadership Award in 2006. She holds a bachelor’s in philosophy from the National Chengchi University (Taiwan), a master’s and PhD in philosophy from the Chinese Culture University (Taiwan), and a master’s in computer science from San Diego State University.

### Engineering Data Science Objective Functions for Social Network Analysis

David Gleich is the Jyoti and Aditya Mathur Associate Professor in the Computer Science Department at Purdue University whose research is on novel models and fast, large-scale algorithms for data-driven scientific computing including scientific data analysis, bioinformatics, and network analysis. He presented a November 6, 2019, DSI seminar titled “Engineering Data Science Objective Functions for Social Network Analysis.”

A common setting in many data science applications, from social network analysis to bioinformatics, is to be given a dataset in the form of a graph along with a small number of interesting sets in that graph. In social networks, these are often called communities. In protein interaction networks, these could be pathways or functional groups. Given these examples, the problem is then to find more like them. Gleich presented a technique to engineer an objective function that captures characteristic features of these examples, demonstrated a framework in the context of community-detection algorithms for graphs that will determine an objective function from a single example, and discussed how this can result in interesting findings about the structure of college social networks on Facebook. His presentation also touched on ongoing work using the same ideas in drug discovery and chemistry.

Gleich is committed to making software available based on this research and has written software packages such as MatlabBGL with thousands of users worldwide. He has received numerous awards for his research including a Society for Industrial and Applied Mathematics (SIAM) Outstanding Publication prize (2018), a Sloan Research Fellowship (2016), a National Science Foundation (NSF) CAREER Award (2011), and the John von Neumann postdoctoral fellowship at Sandia National Laboratories in Livermore (2009). His research is funded by the NSF, DOE, DARPA, and NASA.

### Machine Learning Applied Research and Challenges at Ubisoft

Francois Nadeau is an analytics and business intelligence veteran who has worked more than a decade in various roles from analyst to business intelligence developer to data scientist in the telecommunications, manufacturing, and entertainment industries. His October 29, 2019, DSI seminar—titled “Machine Learning Applied Research and Challenges at Ubisoft”—reviewed some of the applied research conducted at Ubisoft and associated challenges.

The projects discussed include how Ubisoft is solving metadata standardization by recognizing 3D models, helping art managers find 3D models based on photos, and assisting the browsing of internal text documents by tagging relevant abstract concepts. The presentation also showed how Ubisoft has tackled some of the challenges associated with those projects, such as risk-averse stakeholders, automation fear, and cold starts.

Nadeau has been studying artificial intelligence and machine learning since 2009 and cofounded an applied research group within Ubisoft specializing in machine learning. Since then, he has researched, developed, and put in production many learning systems covering computer vision, natural language understanding, and predictive analysis.

### Decentralized Autonomous Networks for Cooperative Estimation

Dr. Ryan Goldhahn, an LLNL computation engineer, presented a September 18, 2019, DSI seminar titled “Decentralized Autonomous Networks for Cooperative Estimation.” Collaborative autonomous networks have recently been used in national security, critical infrastructure, and commercial applications such as the Internet of Things. Decentralized approaches in particular offer scalable, low-cost solutions that are robust to failures in multiple individual agents. However, such networks face challenges related to latency, bandwidth, scalability, and adversarial attacks, and new decentralized approaches are needed for distributed data processing and optimization. Effective solutions push as much of the data processing and intelligence as possible to the individual agents and efficiently communicate information, fuse data while allowing for the possibility of unreliable information from neighboring agents, and achieve scalable network behaviors from only local coordination of actions between agents. This talk summarized recent work on signal processing and network intelligence algorithms for decentralized sensor networks, results of simulations in large (~10K agents) networks, and current efforts toward the implementation of these algorithms in low size, weight, and power embedded systems.

Dr. Goldhahn has a BE in engineering from Dartmouth College and a PhD in electrical and computer engineering from Duke University. Before joining LLNL, he led a project at the NATO Centre for Maritime Research and Experimentation (CMRE) using multiple unmanned underwater vehicles (UUVs) to detect and track submarines. This work developed collaborative autonomous behaviors to collectively detect targets and optimally reposition UUVs to improve tracking performance without human intervention, and tested these autonomous sensor networks at sea with submarines from multiple NATO nations. At LLNL, Dr. Goldhahn has continued to work in collaborative autonomy and model-based and statistical signal processing in various applications. He has specifically focused on decentralized detection/estimation/tracking and optimization algorithms for autonomous sensor networks.

### Deep Learning Scaling is Predictable

Dr. Joel Hestness is a senior research scientist at Cerebras Systems, an artificial intelligence (AI)–focused hardware startup. His August 22, 2019, DSI seminar—titled “Deep Learning Scaling is Predictable, Your Data is (Probably) Hierarchical”—focused on deep learning (DL) scaling. DL creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. A common belief in DL is that growing training sets and models should improve accuracy. Dr. Hestness described Baidu’s large-scale empirical studies: As training set size increases, DL model generalization error and model sizes scale as particular power-law relationships (not entirely consistent with theoretical results). As model size grows, training time remains roughly constant—larger models require fewer steps to converge to the same accuracy. With these scaling relationships, the expected accuracy and training time can be accurately predicted for models trained on larger data sets. In the second part of his talk, Dr. Hestness touched on more recent studies in model architecture search: DL models are overparameterized but can still generalize well. Most DL models are inductively biased, designed to capture hierarchy or fractal structures in data, indicating that most real-world data must be hierarchical.
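The power-law relationship between training set size and generalization error mentioned above can be illustrated with a small fitting sketch. This is not code from the talk or from Baidu's studies: the function names, the functional form `error ≈ alpha * size**beta`, and the synthetic numbers are all assumptions chosen for illustration.

```python
import numpy as np


def fit_power_law(train_sizes, errors):
    """Fit errors ~ alpha * train_sizes**beta by least squares in log-log space."""
    log_m = np.log(np.asarray(train_sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # A power law is a straight line in log-log space: log e = beta * log m + log alpha.
    beta, log_alpha = np.polyfit(log_m, log_e, deg=1)
    return float(np.exp(log_alpha)), float(beta)


def predict_error(alpha, beta, train_size):
    """Extrapolate the fitted curve to a (possibly larger) training set size."""
    return alpha * float(train_size) ** beta


# Synthetic, made-up measurements that follow an exact power law.
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = 5.0 * sizes ** -0.3

alpha, beta = fit_power_law(sizes, errors)
print(f"alpha={alpha:.3f}, beta={beta:.3f}")
print(f"extrapolated error at 1e7 samples: {predict_error(alpha, beta, 1e7):.4f}")
```

Real learning curves are noisy and may deviate from a power law at very small or very large data set sizes, so an extrapolation like this should be treated as a rough estimate.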

At Cerebras Systems, Dr. Hestness helps formulate strategies to support machine learning researchers/practitioners to use the hardware, and he leads some natural language understanding research. Previously, he was a research scientist at Baidu's Silicon Valley AI Lab, where he worked on techniques to understand and scale out deep learning speech and language model training. Dr. Hestness holds a PhD in computer architecture from the University of Wisconsin–Madison. He has broad experience with computing applications including numerical methods, graph analytics, and machine/deep learning.

### Active Optimization of Chemical Catalysts

Drastic changes in climate and global losses in biodiversity are increasing the need to shift the incumbent energy and chemical infrastructure from a fossil-fuel based system to a sustainable-energy based system. Such a system will require that the production of fuels and chemicals use only sustainable energy (e.g., solar) and simple, abundant feedstocks like carbon dioxide, water, or nitrogen.

In a DSI seminar on June 20, 2019, Kevin Tran joined electrochemistry and machine learning (ML) in a seminar titled “Active Optimization of Catalysts for Sustainable Energy and Chemistry.” *Active optimization* means iteratively using ML to decide which experiment to conduct. According to Tran, this approach can make a significant impact on the development of catalysts that could turn renewable electricity into sustainable fuels and chemicals.

Tran’s team at Carnegie Mellon University has developed a method for optimizing such chemistries. It combines an active optimization routine with a fully automated simulation framework—nicknamed GASpy—to screen the appropriate catalysts and reaction conditions. The seminar included an overview of the chemistry, simulation, and software aspects of this framework before detailing the team’s ML techniques, experimental designs, and statistical methods.

For example, the research team “tunes” variables (e.g., catalysts or voltages) and then uses density functional theory (DFT) to calculate the resulting effects on the performance of target chemistries. Thousands of calculations are needed, though, so Tran created a high-throughput, Python-based framework that automates these DFT calculations. Still, each calculation can take an hour or even days to run. GASpy speeds up this process by using ML models to automatically decide which calculations to perform next.

Tran noted, “Active optimization needs to balance exploitation of the model with exploration of the search space.” GASpy uses this balance to perform iterative DFT calculations and in recent tests found more than 100 high-performing catalyst surfaces. As planned, these results informed subsequent experiments: University of Toronto colleagues began experimenting with a number of these promising catalysts.
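The balance between exploitation and exploration can be sketched with an upper-confidence-bound selection rule. This rule is a common textbook choice swapped in for illustration; the source does not say which acquisition rule GASpy uses, and the function name, the `kappa` parameter, and the numbers below are hypothetical.

```python
import numpy as np


def select_next(pred_mean, pred_std, kappa=1.0):
    """Return the index of the candidate with the highest mean + kappa * std.

    pred_mean: model predictions of performance (exploitation term).
    pred_std:  model uncertainty for each candidate (exploration term).
    kappa:     hypothetical knob; 0 means pure exploitation.
    """
    score = np.asarray(pred_mean, dtype=float) + kappa * np.asarray(pred_std, dtype=float)
    return int(np.argmax(score))


# Made-up predictions for three candidate calculations.
mean = [0.9, 0.5, 0.2]
std = [0.05, 0.1, 0.9]

print(select_next(mean, std, kappa=0.0))  # ignores uncertainty
print(select_next(mean, std, kappa=1.0))  # rewards uncertain candidates
```

In an iterative loop, the selected calculation would be run, its result added to the training data, and the model refit before the next selection.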

The research team aims to broaden GASpy’s capabilities with multi-objective and multi-fidelity optimization, which will make the framework more scalable and holistic. For instance, the roadmap includes optimizing catalyst efficiency and stability simultaneously while varying catalyst composition and other processing conditions. Tran’s team is also improving quantification and calibration of the model’s uncertainty. He added, “We’re looking at ways to judge how different active optimization methods compare to each other via retrospective and prospective performance metrics.”

Tran is pursuing a PhD in chemical engineering at Carnegie Mellon University, advised by Dr. Zachary Ulissi, and interning this summer at LLNL under the mentorship of Dr. Joel Varley. Tran was previously a fluoropolymer processing engineer at W. L. Gore & Associates, working on implantable medical devices. He received his bachelor’s in chemical engineering from the University of Delaware, where his research focused on microkinetic modeling for biopharmaceutical applications.

### Artificial Intelligence at Google

In a standing-room-only DSI seminar on January 15, 2019, Dr. Massimo Mascaro reviewed what has changed dramatically in the world of machine learning (ML) in the last five years and how the new techniques have enabled unthinkable advances in applications of artificial intelligence (AI) at Google. He outlined some of the most interesting emerging techniques that have the potential of further revolutionizing AI usage in the near future, particularly in the world of engineering and science. The seminar closed with some consideration on hardware and software demands for large-scale modern AI workloads.

AI is “changing Google from the bone,” Dr. Mascaro said. Nearly every employee receives ML training, and every Google product has at least some ML component. Google Photos can now find images by a keyword search, Gmail can formulate its own automated responses by learning writing styles from the user, and deep neural networks are revolutionizing Google’s search rankings and Waymo’s self-driving cars. (Waymo is owned by Google’s parent company Alphabet.)

Google has also applied ML to science and engineering problems, helping NASA find exoplanets by recognizing signatures in data from the Transiting Exoplanet Survey Satellite (TESS). Deep learning has been used with brain imaging to analyze neural connections and better understand how the brain works, and it now performs some tasks better than humans, such as detecting diabetic retinopathy from retinal images.

The biggest advancement coming down the pike, Dr. Mascaro explained, is deep reinforcement learning, where programmers create a learning loop that allows the AI to come up with its own solutions to problems with zero input from humans. “Agents” based on computer models perform repetitive actions and receive feedback (rewards) for figuring out strategies that work, improving as the loop continues.

In his role as Technical Director of Applied AI in the Office of the CTO for Google Cloud, Dr. Mascaro helps VIP customers reimagine the production of goods and services and how value is exchanged in free markets by leveraging the power of AI and the Google technologies that enable it. Prior to Google, he worked at Intuit where he founded and led the data science group as Chief Data Scientist and Director of Data Engineering for the Consumer Group. In that role, he was responsible for all TurboTax analytics data ingestion systems and worked on many challenging but rewarding predictive analytics and personalization features that power TurboTax and help tens of millions of people do their taxes more easily. Before Intuit, Dr. Mascaro worked as lead of the R&D group of Intellisis, a small San Diego startup that builds advanced speech processing software for various U.S. government and defense entities.

### Cosmological Surveys and Deep Learning

Dr. François Lanusse, a postdoctoral fellow at the Berkeley Center for Cosmological Physics and the Foundation of Data Analysis institute at UC Berkeley, presented a DSI seminar on November 29, 2018. The upcoming generation of cosmological surveys such as the Large Synoptic Survey Telescope (LSST) will aim to shed some much-needed light on the physical nature of dark energy and dark matter by mapping the Universe in great detail and on an unprecedented scale. While this implies a great potential for discoveries, it also involves new and outstanding challenges at every step of the science analysis, from image processing to the cosmological inference.

Dr. Lanusse discussed how these challenges can be addressed with some of the latest developments in Deep Learning, in particular graph neural networks, deep generative models, and neural density estimation. At the image level, he demonstrated how deep convolutional networks can outperform human accuracy on tasks such as finding rare strong gravitational lenses, a problem which used to require significant human visual inspection. Another important aspect of the analysis of modern surveys is the ability to generate realistic mocks of the observations. In situations where physical models either do not exist or are intractable, he presented how deep generative models can be used as an alternative—for example, learning to generate realistic galaxy intrinsic alignments inside large-volume cosmological simulations. The presentation concluded with an explanation of how neural density estimation can be used for performing dimensionality reduction and inference in a likelihood-free setting. This allows the building of complex summary statistics of the data—which can be more sensitive to cosmological models than conventional 2pt statistics—for use in a consistent Bayesian framework.

Dr. Lanusse is a member of the LSST Dark Energy Science Collaboration (DESC). Most of his current research is focused on exploring new applications of the latest machine learning and statistical signal processing techniques for future large-scale cosmological surveys. He holds a PhD in astrophysics from Paris-Saclay University as well as an engineering degree from CentraleSupelec.

### Big Data in the Social Sciences

The DSI and LLNL’s Center for Global Security Research co-sponsored a November 7, 2018, seminar presented by Dr. Lisa Garcia Bedolla, director of the Institute of Governmental Studies and a professor in the Graduate School of Education at the University of California, Berkeley. The growth of data science, both in terms of the availability of massive data sources as well as powerful computational methods for analyzing them, opens up new possibilities for scientific advancement. In the social sciences, it raises the possibility that scholars can address a longstanding lack of high-quality information about the social, political, and economic status of marginal populations.

However, all social data has weaknesses and biases, regardless of the size of the data set. Dr. Bedolla’s talk explored the new possibilities big data has opened up within the social sciences with tools such as social network analyses and geospatial information systems, among others. Yet, the transformational potential of data science to advance social well-being can only be realized if scholars are mindful of the potential for these new approaches to re-inscribe bias and misrepresentations of vulnerable populations. The seminar concluded with practical suggestions for researchers to take into consideration as they embark on this work.

Dr. Bedolla studies why people choose to engage politically, using a variety of social science methods—field observation, in-depth interviews, survey research, field experiments, and geographic information systems—to shed light on this question. Her research focuses on how marginalization and inequality structure the political and educational opportunities available to members of ethno-racial groups, with a particular emphasis on the intersections of race, class, and gender. Her current projects include an analysis of how technology can facilitate voter mobilization among voters of color in California and a historical exploration of the race, gender, and class inequality at the heart of the founding of California’s public school system.

Watch a video of Dr. Bedolla's presentation on YouTube.

### Real-Time Traffic Prediction on ESnet Links

Predicting network traffic could help reroute traffic efficiently and prevent network crashes and link failures. In recent years, deep learning has been at the forefront of learning from sequential data, notably with the success of the long short-term memory (LSTM) network. In an October 29, 2018, DSI seminar, Dr. Mariam Kiran of Lawrence Berkeley National Lab discussed LSTM architecture.

While LSTMs have been applied to network traffic data, they have been limited to predicting a single bandwidth value, which does not provide enough context for a comprehensive traffic routing algorithm. The seminar presented a sequence-to-sequence (seq2seq) LSTM architecture that predicts network traffic over multiple hourly intervals into the future. Dr. Kiran’s method uses sliding windows with optimal lookback lengths to predict traffic bandwidth 8 hours into the future. The performance of this architecture is demonstrated on Simple Network Management Protocol (SNMP) data from the Energy Sciences Network (ESnet) to understand and predict traffic across its links.
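The sliding-window setup described above can be sketched as a data-preparation step that turns one traffic series into input and target sequences for a seq2seq model. This is an illustrative sketch, not Dr. Kiran's code: the function name, the 24-hour lookback, and the synthetic series are assumptions, and the LSTM itself is omitted.

```python
import numpy as np


def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (input, target) pairs.

    Each input holds `lookback` consecutive values; the matching target holds
    the `horizon` values that immediately follow it.
    """
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - lookback - horizon + 1
    if n_windows <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    inputs = np.stack([series[i:i + lookback] for i in range(n_windows)])
    targets = np.stack(
        [series[i + lookback:i + lookback + horizon] for i in range(n_windows)]
    )
    return inputs, targets


# Synthetic hourly bandwidth readings (made up for illustration).
hourly_traffic = np.arange(40, dtype=float)
X, Y = make_windows(hourly_traffic, lookback=24, horizon=8)
print(X.shape, Y.shape)
```

The lookback length is a tunable choice; the talk refers to optimal lookback lengths, which would be selected by validation rather than fixed at 24 as here.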

Dr. Kiran belongs to both the ESnet and computational research division groups. Her research focuses on automating and improving the use of distributed networks and related facilities to enable high-performance science applications. Drawing on methods from machine learning, multi-agent control, and optimization, her work aims to improve network operations and application performance in high-speed transfers.

### Transcriptional Signatures in Human Cells

Dr. Gerald Quon of the University of California at Davis visited LLNL on October 10, 2018, to present a seminar titled “Using Deep Neural Networks and Generative Models to Characterize Transcriptional Signatures in Human Cells.” Single-cell RNA sequencing (scRNA-seq) technologies are quickly advancing our ability to characterize the transcriptional heterogeneity of biological samples, given their ability to identify novel cell types and characterize precise transcriptional changes during previously difficult-to-observe processes such as differentiation and cellular reprogramming. An emerging challenge in scRNA-seq analysis is the characterization of cell type-specific transcriptional responses to stimuli, when similar collections of cells are assayed under two or more conditions, such as in control/treatment or cross-organism studies.

Quon presented a computational strategy for identifying cell type-specific responses using a novel deep neural network for domain adaptation and transfer learning. Compared to existing approaches, this one does not require identification of all cell types before alignment and can align more than two conditions simultaneously. He discussed ongoing applications of the model to two problem domains: (1) characterizing hematopoietic progenitor populations and their response to inflammatory challenges (LPS), in which Quon’s team has identified putative subpopulations of long-term HSCs that differentially respond to the challenge, and (2) characterizing the malaria cell cycle process, in which they identified transcriptional changes associated with sexual commitment. Quon also discussed his lab’s work in building deep generative models of transcriptional plasticity, which aims to reprogram cancer cells from a malignant to non-malignant phenotype.

### Localization and Anonymization with Indirect Supervision

On September 5, 2018, the DSI hosted Dr. Yong Jae Lee from the University of California at Davis for a seminar titled “Learning to Localize and Anonymize Objects with Indirect Supervision.” Lee’s computer science research team explores innovative approaches to visual recognition, including two indirect supervision methods he described for the LLNL audience: (1) scalable object localization and (2) anonymization while preserving action information.

Computer vision has made great strides for problems that can be learned with direct supervision, in which the goal can be precisely defined (e.g., drawing a box that tightly fits an object). However, direct supervision is often not only costly, but also challenging to obtain when the goal is more ambiguous. Lee discussed his team’s recent work on learning with indirect supervision by first presenting an approach that learns to focus on the relevant image regions given only indirect image-level supervision (e.g., an image tagged with “car”). This is enabled by a novel data augmentation technique that hides image patches randomly.
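The idea of hiding image patches at random during training can be sketched as follows. This is a minimal illustration, not the team's implementation: the function name, the grid layout, the zero fill value, and the default probability are assumptions.

```python
import numpy as np


def hide_patches(image, patch_size, hide_prob, rng):
    """Return a copy of `image` with randomly chosen square patches zeroed out.

    The image is divided into a grid of patch_size x patch_size cells, and each
    cell is hidden independently with probability `hide_prob`.
    """
    out = np.array(image, dtype=float, copy=True)
    height, width = out.shape[:2]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < hide_prob:
                # Assumption: hidden pixels are set to zero.
                out[top:top + patch_size, left:left + patch_size] = 0.0
    return out


rng = np.random.default_rng(0)
image = np.ones((8, 8))
augmented = hide_patches(image, patch_size=4, hide_prob=0.5, rng=rng)
print(augmented)
```

Because a different set of patches is hidden each time an image is seen, a model trained this way cannot rely on only the single most discriminative region of an object.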

Second, Lee described an approach that learns to anonymize sensitive video regions while preserving activity signals in an adversarial framework. It accomplishes this by simultaneously optimizing for the indirectly-related task of misclassifying face identity and maximizing activity detection accuracy. His team showed that their anonymization method leads to superior performance compared to conventional hand-crafted anonymization methods including masking, blurring, and noise adding.

### Fingerprints in Brain Networks

In the 17th century, physician Marcello Malpighi observed the existence of patterns of ridges and sweat glands on fingertips. This was a major breakthrough and originated a long and continuing quest for ways to uniquely identify individuals based on fingerprints. In the modern era, the concept of fingerprinting has expanded to other sources of data, such as voice recognition and retinal scans. It is only in the last few years that technologies and methodologies have achieved high-quality data for individual human brain imaging, and the subsequent estimation of structural and functional connectivity. In this context, the next challenge for human identifiability is posed on brain data, particularly on brain networks, both structural and functional.

In an August 9, 2018, DSI seminar, Dr. Joaquin Goni of Purdue University presented his work showing how the individual fingerprint of a connectome (as represented by a network) can be uncovered (or in a way, maximized) from a reconstruction procedure based on group-wise decomposition in a finite number of brain connectivity modes. By using data from the Human Connectome Project, Goni introduced different extensions of this work, including subject identifiability, heritability analysis of brain networks, as well as identifiability when assessing inter-task brain functional networks. Finally, results on this framework for inter-scan identifiability based on a second dataset acquired at Purdue University were also discussed.

### Data-Driven Method for Improving Brain Tumor Imaging

Brain tumor incidence is expected to rise by 6% over the next 20 years. Nearly 79,000 patients will be diagnosed in the U.S. this year alone. In a DSI seminar on August 2, 2018, Dr. Maryam Vareth outlined the University of California at San Francisco’s (UCSF’s) efforts to improve brain tumor outcomes through data-driven medicine.

Standard magnetic resonance imaging (MRI) is a mainstay of brain tumor diagnosis and evaluation, but it poses challenges when clinicians attempt to distinguish treatment effects from recurrent tumors. More advanced imaging is needed to better define tumor regions so that radiation treatments can target areas with high probability of recurrence.

Vareth described a potential solution called magnetic resonance spectroscopic imaging (MRSI)—static metabolic imaging that zeroes in on tumor chemistry. With MRSI, clinicians can identify metabolic changes in the brain earlier than when a recurrent tumor would show up with standard MRI. Moreover, MRSI is noninvasive and can be performed on a regular MRI machine.

The MRSI process creates indices of signals from choline, creatine, N-acetyl-aspartate, lipid, and lactate. Data can then be analyzed in a map of voxels (3D pixels). With an inherently low signal, however, MRSI scans take a long time, especially if clinicians need to scan the entire brain rather than just one region. Faster MRSI scans will help encourage clinicians to adopt this type of imaging.

Vareth’s team is developing a fast-trajectory MRSI analysis method to reduce scan time significantly. “An MRI is a very expensive Fourier transform machine,” she explained, so acceleration can be achieved through modified k-space sampling (below the Nyquist rate) of raw data. This process involves compressed sensing and parallel imaging as well as weighting images according to their sensor proximity (i.e., sensitivity is higher closer to a sensor within the machine).
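The Fourier-transform view of acceleration can be sketched with a toy example that keeps only a random subset of k-space rows and reconstructs by filling the missing rows with zeros. This is not the team's method: zero-filled reconstruction is a simple baseline swapped in for illustration, it omits the compressed sensing, parallel imaging, and sensitivity weighting steps described above, and all names and numbers are assumptions.

```python
import numpy as np


def undersample_and_reconstruct(image, keep_fraction, rng):
    """Keep a random subset of k-space rows and reconstruct with zero filling.

    Returns the reconstructed magnitude image and the boolean row mask.
    """
    kspace = np.fft.fft2(image)
    # Each row of k-space is kept independently with probability keep_fraction.
    mask = rng.random(image.shape[0]) < keep_fraction
    mask[0] = True  # always keep the row that holds the zero-frequency term
    undersampled = kspace * mask[:, None]
    reconstruction = np.abs(np.fft.ifft2(undersampled))
    return reconstruction, mask


rng = np.random.default_rng(0)
phantom = np.zeros((32, 32))
phantom[8:24, 8:24] = 1.0  # simple square "object"

recon, mask = undersample_and_reconstruct(phantom, keep_fraction=0.5, rng=rng)
print(f"rows kept: {mask.sum()} of {mask.size}")
print(f"mean absolute error: {np.abs(recon - phantom).mean():.4f}")
```

Zero filling leaves aliasing artifacts in the image; compressed sensing and parallel imaging methods are designed to recover the missing information instead of discarding it.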

Vareth and her UCSF colleagues are working toward “super-resolution” of MRSI and exploring the potential of deep learning to further enhance image quality while reducing scan duration. The team has developed software, called SIVIC, for processing automated prescription and reconstruction of MRSI data. SIVIC is available on GitHub.

### A Better Gradient Estimator for Black-Box Functions of Random Variables

On July 31, 2018, the DSI continued its seminar series with a talk on gradient optimization for black-box functions of random variables. University of Toronto Ph.D. candidate and LLNL alumnus Will Grathwohl presented “Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation.”

Existing gradient-based optimization methods have advantages and disadvantages, and none offers unbiased, low-variance gradient estimates for arbitrary black-box functions. For example, the REINFORCE estimator is unbiased and works on any function but has high variance. The REPARAMETERIZATION and CONCRETE estimators achieve lower variance but require the black-box function to be known and differentiable.

Grathwohl’s team created an improved general gradient estimator by combining REINFORCE and REPARAMETERIZATION in a control variate framework. Their approach, named LAX, begins with the REINFORCE estimator of the black-box function, introduces a surrogate function with desired properties, subtracts the REINFORCE estimator of the surrogate, and then adds the REPARAMETERIZATION estimator of the surrogate. These steps make it possible for LAX to achieve an unbiased, low-variance estimate of arbitrary black-box function gradients.
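The three steps above (REINFORCE on the function, minus REINFORCE on the surrogate, plus REPARAMETERIZATION on the surrogate) can be sketched on a toy problem. This is an illustrative sketch, not the authors' code: the Gaussian distribution, the quadratic "black box," and the fixed hand-picked surrogate are assumptions, whereas in the published method the surrogate is learned.

```python
import numpy as np


def gradient_estimates(theta, n_samples, surrogate_scale=0.8, seed=0):
    """Per-sample REINFORCE and LAX-style estimates of d/dtheta E[f(b)].

    Toy setup (all assumptions): b ~ Normal(theta, 1), f(b) = b**2, and a fixed
    surrogate c(b) = surrogate_scale * b**2 that only roughly matches f.
    The true gradient is 2 * theta.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    b = theta + eps                  # reparameterized sample, db/dtheta = 1
    score = b - theta                # d/dtheta log Normal(b; theta, 1)

    f_val = b ** 2                   # black-box function value
    c_val = surrogate_scale * b ** 2         # surrogate value
    dc_dtheta = 2.0 * surrogate_scale * b    # reparameterization gradient of c

    reinforce = f_val * score
    lax = f_val * score - c_val * score + dc_dtheta
    return reinforce, lax


reinforce, lax = gradient_estimates(theta=1.0, n_samples=200_000)
print(f"REINFORCE: mean={reinforce.mean():.3f}, variance={reinforce.var():.2f}")
print(f"LAX-style: mean={lax.mean():.3f}, variance={lax.var():.2f}")
```

Both estimators target the same gradient because the two surrogate terms cancel in expectation, while the variance falls as the surrogate tracks the function more closely.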

The team also developed an extension of this estimator, called RELAX, that introduces a relaxed distribution to handle black-box functions of discrete random variables. Both LAX and RELAX were tested against existing gradient estimators and attained lower variance, which significantly reduced optimization time. This work was recently published (PDF) at the 2018 ICLR Workshop (the Sixth International Conference on Learning Representations).

### Machine Learning Solutions for Detecting Synthetic Bacteria

The DSI hosted Dr. Paul Gamble from Lab 41 for a seminar on June 8, 2018. Gamble presented two machine learning-based approaches developed by his team to detect and distinguish various forms of genetic engineering. Synthetic biology has become increasingly common in recent years and has been used to drive down costs in perfumes, detect pollution, produce vaccines, and treat agricultural waste while reducing greenhouse emissions by 75% in some cases. However, its rapid rise has also created new dangers, including biohacking and engineered bioweapons with increased virulence.

At Lab 41, Gamble is developing a machine learning pipeline for detecting synthetically engineered DNA. He also studies methods for detecting and defending against adversarial attacks on neural networks. Gamble received an M.D. and a Master’s in Biomedical Engineering from Washington University in St. Louis. During medical school, he developed nerve-computer interfaces and tested them in animal models. His research also focused on applying machine learning to clinical practice—he built a computer vision system to assist radiation oncologists with organ contouring and radiation dosimetry planning.

### Machine Learning Meets Astrophysics

The DSI sponsored a seminar on May 22, 2018, featuring Dr. Andreas Zoglauer of the UC Berkeley Institute for Data Science. Zoglauer works with Berkeley’s Space Sciences Laboratory on the NASA-sponsored Compton Spectrometer and Imager (COSI), a balloon-borne gamma-ray telescope. COSI’s science objectives focus on galactic nucleosynthesis and the polarization of gamma-ray bursts caused by astronomical events such as neutron star mergers and core-collapse supernovae of rapidly rotating massive stars. COSI’s 2016 flight around the southern hemisphere generated data that Zoglauer’s team continues to analyze.

According to Zoglauer, gamma-ray astronomy research relies heavily on data science and statistics. To analyze the data from COSI’s detectors, he developed an open-source toolkit called MEGAlib (Medium-Energy Gamma-ray Astronomy Library), which has applications beyond astrophysics in nuclear medicine and nuclear monitoring. MEGAlib enables researchers to perform Monte Carlo simulations of their detectors, reconstruct Compton events, and create images based on Compton scattering data. Zoglauer stated that COSI’s biggest computational challenge is generating up to 9-dimensional response files with Monte Carlo simulations for the reconstruction of all-sky images. Those simulations were performed on Berkeley Lab’s Cori supercomputer.

With the help of data science undergraduates, Zoglauer is applying machine learning methods such as random forests and neural networks to COSI data. Research projects include determining photon paths in the germanium detectors, finding interaction locations in the detectors, and identifying not-contained gamma rays. Zoglauer outlined several lessons learned through his team’s work with machine learning tools, such as the importance of preparing data, splitting a big research question into smaller questions, and verifying that the trained neural networks have no “blind spots.” Researchers using machine learning algorithms should also expect “a lot of trial and error” in finding the best input data representation.

COSI is preparing for another flight in 2019–2020. An upgraded version, COSI-X, is planned for launch in 2022 with additional detectors, better shielding, and improved resolution.

#### In a Galaxy Not So Far Away

- In space, gamma rays are generated by radioactive decays, annihilation, and charged particle interactions. Astronomical sources include pulsars, supernovae, and the regions near black holes. In our Milky Way, the Crab Nebula, Cygnus X-1, and the region around the galactic center, known for its 511-keV positron annihilation emission, are of particular interest to the COSI team.
- Germanium (Ge, atomic number 32) is a semiconductor used to detect gamma rays. COSI’s detector array consists of 12 Ge detectors, each measuring 8 × 8 × 1.5 cm, combined with specialized cooling and shielding systems.
- Compton scattering refers to photons scattering off electrons and, thus, transferring momentum to them. Arthur Holly Compton received the Nobel Prize in Physics in 1927 for the discovery of this “Compton effect.” COSI measures gamma rays via multiple Compton interactions in its germanium detectors.
- Powered by solar panels and lifted by a 300-foot super-pressure helium balloon, COSI took off from New Zealand and flew around Antarctica and the Pacific Ocean before landing in Peru. The trip lasted 46 days. According to Zoglauer, the southern hemisphere provides a good view of the center of the Milky Way.
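
The Compton effect mentioned above follows a standard textbook relation between a photon's incoming energy, its scattering angle, and its outgoing energy. The short sketch below evaluates that relation; it is not taken from MEGAlib or the COSI analysis, and the function names are hypothetical.

```python
import math

# Electron rest energy (m_e * c^2) in keV.
ELECTRON_REST_ENERGY_KEV = 510.99895

def scattered_energy_kev(e_in_kev, angle_deg):
    """Photon energy after Compton scattering through angle_deg."""
    theta = math.radians(angle_deg)
    ratio = e_in_kev / ELECTRON_REST_ENERGY_KEV
    return e_in_kev / (1.0 + ratio * (1.0 - math.cos(theta)))

def electron_recoil_kev(e_in_kev, angle_deg):
    """Energy handed to the electron, by energy conservation."""
    return e_in_kev - scattered_energy_kev(e_in_kev, angle_deg)

# Example: a 511-keV annihilation photon scattering at several angles.
for angle in (0, 45, 90, 180):
    print(angle, scattered_energy_kev(511.0, angle), electron_recoil_kev(511.0, angle))
```

A photon that is not deflected loses no energy, while one that scatters straight back transfers the most energy to the electron.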

### Data Vulnerabilities in Machine Learning

The DSI welcomed Dr. Philip Kegelmeyer from Sandia National Laboratories on April 23, 2018, for a presentation titled “Machine Learning Adversarial Label Tampering: Design and Detection.” Attacks on machine learning include distortion, hiding, or manipulation of data. The presentation focused on falsely labeled data with examples of empirical methods for “quantified paranoia.”

The chief danger in a data label tampering attack is that even a small amount of tampering can greatly decrease accuracy in a fashion that cannot be detected in advance. Kegelmeyer’s team at Sandia has created several heuristics for generating such attacks. A simple but effective example is the “brute clustering” attack, in which all the data points in a single cluster are relabeled before moving on to the next cluster. Defenses against these attacks exist, though they are relatively weak. Kegelmeyer described one such defense, dubbed “quantified paranoia,” a statistical technique that uses pseudo-Bayes factors to signal the presence of label tampering.
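
The cluster-by-cluster relabeling idea can be illustrated with a toy sketch. This is not the Sandia team's code: it assumes binary labels, cluster assignments computed beforehand by some clustering step, a fixed tampering budget, and an arbitrary visiting order (ascending cluster id). The function name and parameters are hypothetical.

```python
import numpy as np

def brute_clustering_attack(labels, cluster_ids, budget):
    """Flip binary labels one cluster at a time until the budget is spent.

    labels      : array of 0/1 class labels (assumed binary for this sketch)
    cluster_ids : array of precomputed cluster assignments, one per point
    budget      : maximum number of labels the attacker may flip
    """
    labels = np.asarray(labels)
    cluster_ids = np.asarray(cluster_ids)
    tampered = labels.copy()
    remaining = budget

    # Assumption: clusters are visited in ascending id order.
    for cid in np.unique(cluster_ids):
        if remaining <= 0:
            break
        members = np.flatnonzero(cluster_ids == cid)
        chosen = members[:remaining]      # finish this cluster before the next
        tampered[chosen] = 1 - tampered[chosen]
        remaining -= len(chosen)
    return tampered

labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2])
tampered = brute_clustering_attack(labels, cluster_ids, budget=4)
print(tampered)
```

With a budget of four flips, every point in the first cluster is relabeled and only one point in the second, which concentrates the damage in one region of the data instead of spreading it evenly.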