Held at LLNL on an ongoing basis, our seminars feature speakers from other institutions around the Bay Area and beyond. We host these events to introduce new ideas and potential collaborators to the Lab. We are pleased to share seminar information here with the broader data science community.
Drastic changes in climate and global losses in biodiversity are increasing the need to shift the incumbent energy and chemical infrastructure from a fossil-fuel based system to a sustainable-energy based system. Such a system will require that the production of fuels and chemicals use only sustainable energy (e.g., solar) and simple, abundant feedstocks like carbon dioxide, water, or nitrogen.
In a DSI seminar on June 20, 2019, Kevin Tran joined electrochemistry and machine learning (ML) in a seminar titled “Active Optimization of Catalysts for Sustainable Energy and Chemistry.” Active optimization means iteratively using ML to decide which experiment to conduct. According to Tran, this approach can make a significant impact on the development of catalysts that could turn renewable electricity into sustainable fuels and chemicals.
Tran’s team at Carnegie Mellon University has developed a method for optimizing such chemistries. It combines an active optimization routine with a fully automated simulation framework—nicknamed GASpy—to screen the appropriate catalysts and reaction conditions. The seminar included an overview of the chemistry, simulation, and software aspects of this framework before detailing the team’s ML techniques, experimental designs, and statistical methods.
For example, the research team “tunes” variables (e.g., catalysts or voltages) and then uses density functional theory (DFT) to calculate the resulting effects on the performance of target chemistries. Thousands of calculations are needed, though, so Tran created a high-throughput, Python-based framework that automates these DFT calculations. Still, each calculation can take an hour or even days to run. GASpy speeds up this process by using ML models to automatically decide which calculations to perform next.
Tran noted, “Active optimization needs to balance exploitation of the model with exploration of the search space.” GASpy uses this balance to perform iterative DFT calculations and in recent tests found more than 100 high-performing catalyst surfaces. As planned, these results informed subsequent experiments: University of Toronto colleagues began experimenting with a number of these promising catalysts.
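The balance between exploitation and exploration can be illustrated with a toy loop. This is a minimal sketch and not GASpy's code: `expensive_calculation` is a hypothetical stand-in for a slow DFT run, the kernel-weighted surrogate and its distance-based uncertainty proxy are illustrative assumptions, and `kappa` is an assumed knob that trades exploration against exploitation.

```python
import numpy as np

def expensive_calculation(x):
    """Hypothetical stand-in for a slow simulation such as a DFT run."""
    return float(np.exp(-(x - 0.7) ** 2 / 0.02) + 0.5 * np.exp(-(x - 0.2) ** 2 / 0.01))

def surrogate(x_obs, y_obs, x_query, length=0.1):
    """Kernel-weighted mean prediction plus a crude uncertainty proxy."""
    d = np.abs(x_query[:, None] - x_obs[None, :])
    w = np.exp(-0.5 * (d / length) ** 2) + 1e-12
    mean = (w * y_obs).sum(axis=1) / w.sum(axis=1)
    uncertainty = d.min(axis=1)  # far from every sampled point -> less certain
    return mean, uncertainty

def active_optimize(candidates, n_init=3, n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    sampled = [int(i) for i in rng.choice(len(candidates), size=n_init, replace=False)]
    values = [expensive_calculation(candidates[i]) for i in sampled]
    for _ in range(n_iter):
        mean, unc = surrogate(candidates[sampled], np.array(values), candidates)
        score = mean + kappa * unc   # exploitation term + exploration term
        score[sampled] = -np.inf     # never repeat a calculation
        nxt = int(np.argmax(score))
        sampled.append(nxt)
        values.append(expensive_calculation(candidates[nxt]))
    best = int(np.argmax(values))
    return candidates[sampled[best]], values[best], sampled
```

A larger `kappa` sends more calculations to unexplored regions, while a smaller one concentrates them around the current best guess.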
The research team aims to broaden GASpy’s capabilities with multi-objective and multi-fidelity optimization, which will make the framework more scalable and holistic. For instance, the roadmap includes optimizing catalyst efficiency and stability simultaneously while varying catalyst composition and other processing conditions. Tran’s team is also improving quantification and calibration of the model’s uncertainty. He added, “We’re looking at ways to judge how different active optimization methods compare to each other via retrospective and prospective performance metrics.”
Tran is pursuing a PhD in chemical engineering at Carnegie Mellon University, advised by Dr. Zachary Ulissi, and interning this summer at LLNL under the mentorship of Dr. Joel Varley. Tran was previously a fluoropolymer processing engineer at W. L. Gore & Associates, working on implantable medical devices. He received his bachelor’s in chemical engineering from the University of Delaware, where his research focused on microkinetic modeling for biopharmaceutical applications.
In a standing-room-only DSI seminar on January 15, 2019, Dr. Massimo Mascaro reviewed what has changed dramatically in the world of machine learning (ML) in the last five years and how the new techniques have enabled previously unthinkable advances in applications of artificial intelligence (AI) at Google. He outlined some of the most interesting emerging techniques that have the potential to further revolutionize AI usage in the near future, particularly in engineering and science. The seminar closed with a discussion of hardware and software demands for large-scale modern AI workloads.
AI is “changing Google from the bone,” Dr. Mascaro said. Nearly every employee receives ML training, and every Google product has at least some ML component. Google Photos can now find images by a keyword search, Gmail can formulate its own automated responses by learning writing styles from the user, and deep neural networks are revolutionizing Google’s search rankings and Waymo’s self-driving cars. (Waymo is owned by Google’s parent company Alphabet.)
Google has also applied ML to science and engineering problems, helping NASA find exoplanets by recognizing signatures in data from the Transiting Exoplanet Survey Satellite (TESS). Deep learning has been used with brain imaging to analyze neural connections and better understand how the brain works, and it performs some tasks, such as detecting diabetic retinopathy from retinal images, better than humans.
The biggest advancement coming down the pike, Dr. Mascaro explained, is deep reinforcement learning, where programmers create a learning loop that allows the AI to come up with its own solutions to problems with zero input from humans. “Agents” based on computer models perform repetitive actions and receive feedback (rewards) for figuring out strategies that work, improving as the loop continues.
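The act-and-reward loop described above can be shown in a heavily simplified form. The sketch below is not deep reinforcement learning and is not drawn from the talk: it is an epsilon-greedy agent on a multi-armed bandit, chosen only because it exposes the same loop of acting, receiving a reward, and updating a strategy. The reward probabilities and the `epsilon` value are assumptions.

```python
import numpy as np

def run_agent(reward_probs, n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that learns which action pays off most often."""
    rng = np.random.default_rng(seed)
    n_actions = len(reward_probs)
    value = np.zeros(n_actions)              # running reward estimate per action
    count = np.zeros(n_actions, dtype=int)   # how often each action was tried
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))   # explore a random action
        else:
            action = int(np.argmax(value))          # exploit the best estimate
        reward = float(rng.random() < reward_probs[action])  # environment feedback
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]  # incremental mean
    return value, count
```

In a deep reinforcement learning system, the table of value estimates would be replaced by a neural network and the single decision by a sequence of states and actions.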
In his role as Technical Director of Applied AI in the Office of the CTO for Google Cloud, Dr. Mascaro helps VIP customers reimagine the production of goods and services and how value is exchanged in free markets by leveraging the power of AI and the Google technologies that enable it. Prior to Google, he worked at Intuit where he founded and led the data science group as Chief Data Scientist and Director of Data Engineering for the Consumer Group. In that role, he was responsible for all TurboTax analytics data ingestion systems and worked on many challenging but rewarding predictive analytics and personalization features that power TurboTax and help tens of millions of people do their taxes more easily. Before Intuit, Dr. Mascaro worked as lead of the R&D group of Intellisis, a small San Diego startup that builds advanced speech processing software for various U.S. government and defense entities.
Dr. François Lanusse, a postdoctoral fellow at the Berkeley Center for Cosmological Physics and the Foundation of Data Analysis institute at UC Berkeley, presented a DSI seminar on November 29, 2018. The upcoming generation of cosmological surveys such as the Large Synoptic Survey Telescope (LSST) will aim to shed some much-needed light on the physical nature of dark energy and dark matter by mapping the Universe in great detail and on an unprecedented scale. While this implies a great potential for discoveries, it also involves new and outstanding challenges at every step of the science analysis, from image processing to the cosmological inference.
Dr. Lanusse discussed how these challenges can be addressed with some of the latest developments in deep learning, in particular graph neural networks, deep generative models, and neural density estimation. At the image level, he demonstrated how deep convolutional networks can outperform humans on tasks such as finding rare strong gravitational lenses, a problem that used to require significant visual inspection by people. Another important aspect of the analysis of modern surveys is the ability to generate realistic mocks of the observations. For situations where physical models either do not exist or are intractable, he showed how deep generative models can be used as an alternative, for example by learning to generate realistic galaxy intrinsic alignments inside large-volume cosmological simulations. The presentation concluded with an explanation of how neural density estimation can be used for dimensionality reduction and inference in a likelihood-free setting. This allows researchers to build complex summary statistics of the data, which can be more sensitive to cosmological models than conventional two-point statistics, for use in a consistent Bayesian framework.
Dr. Lanusse is a member of the LSST Dark Energy Science Collaboration (DESC). Most of his current research is focused on exploring new applications of the latest machine learning and statistical signal processing techniques for future large-scale cosmological surveys. He holds a PhD in astrophysics from Paris-Saclay University as well as an engineering degree from CentraleSupelec.
The DSI and LLNL’s Center for Global Security Research co-sponsored a November 7, 2018, seminar presented by Dr. Lisa Garcia Bedolla, director of the Institute of Governmental Studies and a professor in the Graduate School of Education at the University of California, Berkeley. The growth of data science, in terms of both the availability of massive data sources and the powerful computational methods for analyzing them, opens up new possibilities for scientific advancement. In the social sciences, it raises the possibility that scholars can address a longstanding lack of high-quality information about the social, political, and economic status of marginal populations.
However, all social data has weaknesses and biases, regardless of the size of the data set. Dr. Bedolla’s talk explored the new possibilities big data has opened up within the social sciences with tools such as social network analyses and geospatial information systems, among others. Yet, the transformational potential of data science to advance social well-being can only be realized if scholars are mindful of the potential for these new approaches to re-inscribe bias and misrepresentations of vulnerable populations. The seminar concluded with practical suggestions for researchers to take into consideration as they embark on this work.
Dr. Bedolla studies why people choose to engage politically, using a variety of social science methods—field observation, in-depth interviews, survey research, field experiments, and geographic information systems—to shed light on this question. Her research focuses on how marginalization and inequality structure the political and educational opportunities available to members of ethno-racial groups, with a particular emphasis on the intersections of race, class, and gender. Her current projects include an analysis of how technology can facilitate voter mobilization among voters of color in California and a historical exploration of the race, gender, and class inequality at the heart of the founding of California’s public school system.
Watch a video of Dr. Bedolla's presentation on YouTube.
Predicting network traffic could enable efficient rerouting that prevents network crashes and link failures. In recent years, deep learning has been at the forefront of learning from sequential data, notably through the success of long short-term memory (LSTM) networks. In an October 29, 2018, DSI seminar, Dr. Mariam Kiran of Lawrence Berkeley National Laboratory discussed LSTM architecture.
While LSTMs have been applied to network traffic data, they have so far been limited to predicting a single bandwidth value, which does not provide enough context for a comprehensive traffic routing algorithm. The seminar presented a sequence-to-sequence (seq2seq) LSTM architecture that predicts network traffic over multiple hourly intervals into the future. Dr. Kiran’s method uses sliding windows with optimal lookback lengths to predict traffic bandwidth 8 hours into the future. The performance of this architecture was demonstrated on Simple Network Management Protocol (SNMP) data from the Energy Sciences Network (ESnet) to understand and predict traffic across ESnet’s links.
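The sliding-window step can be sketched as follows. This illustrates only how input and target pairs might be built for a seq2seq model, not Dr. Kiran's pipeline: the function name `make_windows` and the 24-hour lookback are assumptions, and only the 8-hour horizon comes from the talk.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Split a 1-D series into (input, target) pairs for seq2seq training."""
    series = np.asarray(series, dtype=float)
    n = len(series) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    # Each input covers `lookback` past hours; each target covers the next `horizon` hours.
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

# Toy hourly bandwidth readings (two days), standing in for SNMP measurements.
hourly = np.arange(48)
X, Y = make_windows(hourly, lookback=24, horizon=8)
```

Here `X` has shape (17, 24) and `Y` has shape (17, 8). In practice the lookback length would be tuned, since the talk refers to optimal lookback lengths.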
Dr. Kiran belongs to both the ESnet and Computational Research Division groups. Her research focuses on automating and improving the usage of distributed networks and related facilities to enable high-performance science applications. By developing methods based on machine learning, multi-agent control, and optimization, she aims to improve network operations and application performance in high-speed transfers.
Dr. Gerald Quon of the University of California at Davis visited LLNL on October 10, 2018, to present a seminar titled “Using Deep Neural Networks and Generative Models to Characterize Transcriptional Signatures in Human Cells.” Single-cell RNA sequencing (scRNA-seq) technologies are quickly advancing our ability to characterize the transcriptional heterogeneity of biological samples, given their ability to identify novel cell types and characterize precise transcriptional changes during previously difficult-to-observe processes such as differentiation and cellular reprogramming. An emerging challenge in scRNA-seq analysis is the characterization of cell type-specific transcriptional responses to stimuli when similar collections of cells are assayed under two or more conditions, such as in control/treatment or cross-organism studies.
Quon presented a novel computational strategy for identifying cell type-specific responses using a deep neural network that performs domain adaptation and transfer learning. Unlike existing approaches, this one does not require identification of all cell types before alignment and can align more than two conditions simultaneously. He discussed ongoing applications of the model to two problem domains: (1) characterizing hematopoietic progenitor populations and their response to inflammatory challenges (LPS), in which Quon’s team has identified putative subpopulations of long-term HSCs that differentially respond to the challenge, and (2) characterizing the malaria cell cycle process, in which they identified transcriptional changes associated with sexual commitment. Quon also discussed his lab’s work in building deep generative models of transcriptional plasticity, which aims to reprogram cancer cells from a malignant to a non-malignant phenotype.
On September 5, 2018, the DSI hosted Dr. Yong Jae Lee from the University of California at Davis for a seminar titled “Learning to Localize and Anonymize Objects with Indirect Supervision.” Lee’s computer science research team explores innovative approaches to visual recognition, including two indirect supervision methods he described for the LLNL audience: (1) scalable object localization and (2) anonymization while preserving action information.
Computer vision has made great strides for problems that can be learned with direct supervision, in which the goal can be precisely defined (e.g., drawing a box that tightly fits an object). However, direct supervision is often not only costly but also challenging to obtain when the goal is more ambiguous. Lee discussed his team’s recent work on learning with indirect supervision by first presenting an approach that learns to focus on the relevant image regions given only indirect image-level supervision (e.g., an image tagged with “car”). This is enabled by a novel data augmentation technique that hides image patches randomly.
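A random patch-hiding augmentation of this kind might look like the sketch below. It is an illustration under stated assumptions, not the team's implementation: the grid size, the hiding probability, and the constant `fill` value are all hypothetical choices.

```python
import numpy as np

def hide_patches(image, grid=4, hide_prob=0.5, fill=0.0, rng=None):
    """Divide an image into a grid and blank out each cell with probability hide_prob."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(image, dtype=float, copy=True)   # leave the input untouched
    h, w = out.shape[:2]
    ys = np.linspace(0, h, grid + 1, dtype=int)     # row boundaries of the grid cells
    xs = np.linspace(0, w, grid + 1, dtype=int)     # column boundaries of the grid cells
    mask = rng.random((grid, grid)) < hide_prob     # True where a cell is hidden
    for i in range(grid):
        for j in range(grid):
            if mask[i, j]:
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = fill
    return out, mask
```

Because a different set of patches is hidden each time an image is seen, a classifier trained this way cannot rely on only the most discriminative part of an object.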
Second, Lee described an approach that learns to anonymize sensitive video regions while preserving activity signals in an adversarial framework. It accomplishes this by simultaneously optimizing for the indirectly-related task of misclassifying face identity and maximizing activity detection accuracy. His team showed that their anonymization method leads to superior performance compared to conventional hand-crafted anonymization methods including masking, blurring, and noise adding.
In the 17th century, physician Marcello Malpighi observed the existence of patterns of ridges and sweat glands on fingertips. This was a major breakthrough and originated a long and continuing quest for ways to uniquely identify individuals based on fingerprints. In the modern era, the concept of fingerprinting has expanded to other sources of data, such as voice recognition and retinal scans. It is only in the last few years that technologies and methodologies have achieved high-quality data for individual human brain imaging, and the subsequent estimation of structural and functional connectivity. In this context, the next challenge for human identifiability is posed on brain data, particularly on brain networks, both structural and functional.
In an August 9, 2018, DSI seminar, Dr. Joaquin Goni of Purdue University presented his work showing how the individual fingerprint of a connectome (as represented by a network) can be uncovered (or in a way, maximized) from a reconstruction procedure based on group-wise decomposition in a finite number of brain connectivity modes. By using data from the Human Connectome Project, Goni introduced different extensions of this work, including subject identifiability, heritability analysis of brain networks, as well as identifiability when assessing inter-task brain functional networks. Finally, results on this framework for inter-scan identifiability based on a second dataset acquired at Purdue University were also discussed.
Brain tumor incidence is expected to rise by 6% over the next 20 years. Nearly 79,000 patients will be diagnosed in the U.S. this year alone. In a DSI seminar on August 2, 2018, Dr. Maryam Vareth outlined the University of California at San Francisco’s (UCSF’s) efforts to improve brain tumor outcomes through data-driven medicine.
Standard magnetic resonance imaging (MRI) is a mainstay of brain tumor diagnosis and evaluation, but it poses challenges when clinicians attempt to distinguish treatment effects from recurrent tumors. More advanced imaging is needed to better define tumor regions so that radiation treatments can target areas with high probability of recurrence.
Vareth described a potential solution called magnetic resonance spectroscopic imaging (MRSI)—static metabolic imaging that zeroes in on tumor chemistry. With MRSI, clinicians can identify metabolic changes in the brain earlier than when a recurrent tumor would show up with standard MRI. Moreover, MRSI is noninvasive and can be performed on a regular MRI machine.
The MRSI process creates indices of signals from choline, creatine, N-acetyl-aspartate, lipid, and lactate. Data can then be analyzed in a map of voxels (3D pixels). With an inherently low signal, however, MRSI scans take a long time, especially if clinicians need to scan the entire brain and not just one region. Faster MRSI scans will help encourage clinicians to adopt this type of imaging.
Vareth’s team is developing a fast-trajectory MRSI analysis method to reduce scan time significantly. “An MRI is a very expensive Fourier transform machine,” she explained, so acceleration can be achieved through modified k-space sampling (below the Nyquist rate) of raw data. This process involves compressed sensing and parallel imaging as well as weighting images according to their sensor proximity (i.e., sensitivity is higher closer to a sensor within the machine).
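The idea of sampling k-space below the Nyquist rate can be illustrated with a small NumPy sketch. It is not the team's method: it randomly keeps a fraction of k-space rows and reconstructs by zero-filling, whereas compressed sensing and parallel imaging would replace that naive reconstruction step. The function name and the `keep_fraction` and `center_lines` parameters are assumptions.

```python
import numpy as np

def undersample_kspace(image, keep_fraction=0.5, center_lines=4, seed=0):
    """Keep a random subset of k-space rows and reconstruct by zero-filling."""
    rng = np.random.default_rng(seed)
    kspace = np.fft.fftshift(np.fft.fft2(image))   # image -> centered k-space
    n_rows = kspace.shape[0]
    keep = rng.random(n_rows) < keep_fraction      # rows that are "acquired"
    mid = n_rows // 2
    # Always keep the central (low-frequency) rows, which carry most of the contrast.
    keep[mid - center_lines // 2: mid + center_lines // 2] = True
    sampled = kspace * keep[:, None]               # unacquired rows become zero
    recon = np.abs(np.fft.ifft2(np.fft.ifftshift(sampled)))
    return recon, keep
```

With `keep_fraction=1.0` the original image is recovered exactly. Lower fractions shorten the simulated scan but introduce aliasing artifacts that a more sophisticated reconstruction would need to remove.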
Vareth and her UCSF colleagues are working toward “super-resolution” of MRSI and exploring the potential of deep learning to further enhance image quality while reducing scan duration. The team has developed software, called SIVIC, for processing automated prescription and reconstruction of MRSI data. SIVIC is available on GitHub.
On July 31, 2018, the DSI continued its seminar series with a talk on gradient optimization for black-box functions of random variables. University of Toronto Ph.D. candidate and LLNL alumnus Will Grathwohl presented “Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation.”
Existing gradient-based optimization methods have advantages and disadvantages, and none offers unbiased, low-variance gradient estimates for arbitrary black-box functions. For example, the REINFORCE estimator is unbiased and works on any function but has high variance. The REPARAMETERIZATION and CONCRETE estimators achieve lower variance but require the black-box function to be known and differentiable.
Grathwohl’s team created an improved general gradient estimator by combining REINFORCE and REPARAMETERIZATION in a control variate framework. Their approach, named LAX, begins with the REINFORCE estimator of the black-box function, introduces a surrogate function with desired properties, subtracts the REINFORCE estimator of the surrogate, and then adds the REPARAMETERIZATION estimator of the surrogate. These steps make it possible for LAX to achieve an unbiased, low-variance estimate of arbitrary black-box function gradients.
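The combination of estimators described above can be checked numerically on a toy problem. The sketch below assumes a Gaussian variable b ~ N(theta, 1), a "black-box" function f(b) = b², and a fixed hand-picked surrogate c(b) = a·b². In the published method the surrogate is learned to reduce variance, so the fixed `a` here is purely an illustrative assumption.

```python
import numpy as np

def gradient_estimates(theta, n_samples=200_000, a=0.9, seed=0):
    """Per-sample REINFORCE and LAX estimates of d/dtheta E[f(b)], b ~ N(theta, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    b = theta + eps                     # reparameterized sample b = T(theta, eps)

    f = b ** 2                          # toy "black-box" function
    score = b - theta                   # d/dtheta log N(b; theta, 1)

    c = a * b ** 2                      # surrogate, assumed known and differentiable
    dc_dtheta = 2 * a * b               # reparameterization gradient of c (db/dtheta = 1)

    reinforce = f * score
    # REINFORCE[f] - REINFORCE[c] + REPARAMETERIZATION[c]
    lax = (f - c) * score + dc_dtheta
    return reinforce, lax

reinforce, lax = gradient_estimates(theta=1.5)
```

For this toy problem the true gradient is 2·theta. Both sample means should land near 3.0, while the per-sample variance of the LAX estimate is much smaller than that of plain REINFORCE.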
The team also developed an extension of this estimator, called RELAX, that introduces a relaxed distribution to handle black-box functions of discrete random variables. Both LAX and RELAX were tested alongside existing high-variance gradient estimators and attained lower variance, which significantly reduced optimization time compared with those methods. This work was recently published at the 2018 ICLR Workshop (the Sixth International Conference on Learning Representations).
The DSI hosted Dr. Paul Gamble from Lab 41 for a seminar on June 8, 2018. Gamble presented two machine learning-based approaches developed by his team to detect and distinguish various forms of genetic engineering. Synthetic biology has recently become increasingly common. It has been used to drive down costs in perfumes, detect pollution, produce vaccines, and treat agricultural waste while simultaneously reducing greenhouse emissions by 75% in some cases. However, its rapid rise has also created new dangers, including biohacking and engineered bioweapons with increased virulence.
At Lab 41, Gamble is developing a machine learning pipeline for detecting synthetically engineered DNA. He also studies methods for detecting and defending against adversarial attacks on neural networks. Gamble received an M.D. and a Master’s in Biomedical Engineering from Washington University in St. Louis. During medical school, he developed nerve-computer interfaces and tested them in animal models. His research also focused on applying machine learning to clinical practice—he built a computer vision system to assist radiation oncologists with organ contouring and radiation dosimetry planning.
The DSI sponsored a seminar on May 22, 2018, featuring Dr. Andreas Zoglauer of the UC Berkeley Institute for Data Science. Zoglauer works with Berkeley’s Space Sciences Laboratory on the NASA-sponsored project COSI—the Compton Spectrometer and Imager, a balloon-borne gamma-ray telescope. COSI’s science objectives focus on galactic nucleosynthesis and the polarization of gamma-ray bursts caused by astronomical events such as neutron star mergers and core-collapse supernovae of heavily rotating massive stars. COSI’s 2016 flight around the southern hemisphere generated data that Zoglauer’s team continues to analyze.
According to Zoglauer, gamma-ray astronomy research relies heavily on data science and statistics. To analyze the data from COSI’s detectors, he developed an open-source toolkit called MEGAlib (Medium-Energy Gamma-ray Astronomy Library), which has applications beyond astrophysics in nuclear medicine and nuclear monitoring. MEGAlib enables researchers to perform Monte Carlo simulations of their detectors, reconstruct Compton events, and create images based on Compton scattering data. Zoglauer stated that COSI’s biggest computational challenge is generating up to 9-dimensional response files with Monte Carlo simulations for the reconstruction of all-sky images. Those simulations were performed on Berkeley Lab’s Cori supercomputer.
With the help of data science undergraduates, Zoglauer is applying machine learning techniques such as random forests and neural networks to COSI data. Research projects include determining photon paths in the germanium detectors, finding interaction locations in the detectors, and identifying gamma rays that are not fully contained. Zoglauer outlined several lessons learned through his team’s work with machine learning tools, such as the importance of preparing data, splitting a big research question into smaller questions, and verifying that the trained neural networks have no “blind spots.” Researchers using machine learning algorithms should also expect “a lot of trial and error” in finding the best input data representation.
COSI is preparing for another flight in 2019–2020. An upgraded version, COSI-X, is planned for launch in 2022 with additional detectors, better shielding, and improved resolution.
In a Galaxy Not So Far Away
- In space, gamma rays are generated by radioactive decays, annihilation, and charged particle interactions. Astronomical sources include pulsars, supernovae, and the regions near black holes. In our Milky Way, the Crab Nebula, Cygnus X-1, and the area around the center of our galaxy known for its 511-keV positron annihilation emission, are of particular interest to the COSI team.
- Germanium (Ge, atomic number 32) is a semiconductor used to detect gamma rays. COSI’s detector array consists of 12 Ge detectors, each measuring 8 × 8 × 1.5 centimeters, combined with specialized cooling and shielding systems.
- Compton scattering refers to photons scattering off electrons and, thus, transferring momentum to them. Arthur Holly Compton received the Nobel Prize in Physics in 1927 for the discovery of this “Compton effect.” COSI measures gamma rays via multiple Compton interactions in its germanium detectors.
- Powered by solar panels and a 300-foot super-pressure helium balloon, COSI took off from New Zealand and flew around Antarctica and the Pacific Ocean before landing in Peru. The trip lasted 46 days. According to Zoglauer, the southern hemisphere provides a good view of the center of the Milky Way.
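For the Compton scattering item in the list above, the standard kinematic relation gives the photon's energy after it scatters off an electron at rest. The sketch below evaluates that textbook formula and is not taken from MEGAlib. The function name is hypothetical, and the constant is the electron rest energy in keV.

```python
import math

ELECTRON_REST_ENERGY_KEV = 510.99895  # m_e * c^2

def scattered_photon_energy(energy_kev, angle_rad):
    """Photon energy after Compton scattering through angle_rad off a free electron at rest."""
    ratio = energy_kev / ELECTRON_REST_ENERGY_KEV
    return energy_kev / (1.0 + ratio * (1.0 - math.cos(angle_rad)))
```

A 511-keV photon that scatters straight back (an angle of π) keeps about one third of its energy, and the rest goes to the electron. Compton telescopes use such energy deposits to constrain a gamma ray's origin.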
The DSI welcomed Dr. Philip Kegelmeyer from Sandia National Laboratories on April 23, 2018, for a presentation titled “Machine Learning Adversarial Label Tampering: Design and Detection.” Attacks on machine learning include distortion, hiding, or manipulation of data. The presentation focused on falsely labeled data with examples of empirical methods for “quantified paranoia.”
The chief danger in a data label tampering attack is that even a small amount of tampering can greatly decrease accuracy in a fashion that cannot be detected in advance. Kegelmeyer’s team at Sandia has created several heuristics for generating such attacks. A simple but effective example is the “brute clustering” attack, in which all the data points in a single cluster are relabeled before moving on to the next cluster. Defenses against these attacks exist, though they are relatively weak. Kegelmeyer described one such defense, dubbed “quantified paranoia,” a statistical technique that uses pseudo-Bayes factors to signal the presence of label tampering.
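To make the cluster-by-cluster relabeling concrete for robustness testing, the sketch below flips the labels of whole clusters until a tampering budget is reached. It is only an illustration of the idea as summarized here, not Sandia's heuristic: binary labels, precomputed cluster assignments, the order in which clusters are visited, and the `budget` parameter are all assumptions.

```python
import numpy as np

def brute_clustering_relabel(labels, cluster_ids, budget):
    """Flip binary labels one whole cluster at a time until the budget is used up."""
    labels = np.asarray(labels)
    cluster_ids = np.asarray(cluster_ids)
    tampered = labels.copy()
    changed = np.zeros(len(labels), dtype=bool)
    for cid in np.unique(cluster_ids):              # visit clusters in id order (assumed)
        members = np.flatnonzero(cluster_ids == cid)
        if changed.sum() + len(members) > budget:   # stop before exceeding the budget
            break
        tampered[members] = 1 - tampered[members]   # relabel the entire cluster
        changed[members] = True
    return tampered, changed
```

Generating tampered copies of a training set this way gives a detector such as the one described in the talk something to be evaluated against, since the positions in `changed` are known.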