Held at LLNL on an ongoing basis, our seminars feature speakers from other institutions around the Bay Area and beyond. We host these events to introduce new ideas and potential collaborators to the Lab. We are pleased to share seminar information here with the broader data science community.
The DSI and LLNL’s Center for Global Security Research co-sponsored a November 7, 2018, seminar presented by Dr. Lisa Garcia Bedolla, director of the Institute of Governmental Studies and a professor in the Graduate School of Education at the University of California, Berkeley. The growth of data science, both in the availability of massive data sources and in powerful computational methods for analyzing them, opens up new possibilities for scientific advancement. In the social sciences, it raises the possibility that scholars can address a longstanding lack of high-quality information about the social, political, and economic status of marginalized populations.
However, all social data has weaknesses and biases, regardless of the size of the data set. Dr. Bedolla’s talk explored the new possibilities big data has opened up within the social sciences with tools such as social network analyses and geospatial information systems, among others. Yet, the transformational potential of data science to advance social well-being can only be realized if scholars are mindful of the potential for these new approaches to re-inscribe bias and misrepresentations of vulnerable populations. The seminar concluded with practical suggestions for researchers to take into consideration as they embark on this work.
Dr. Bedolla studies why people choose to engage politically, using a variety of social science methods—field observation, in-depth interviews, survey research, field experiments, and geographic information systems—to shed light on this question. Her research focuses on how marginalization and inequality structure the political and educational opportunities available to members of ethno-racial groups, with a particular emphasis on the intersections of race, class, and gender. Her current projects include an analysis of how technology can facilitate voter mobilization among voters of color in California and a historical exploration of the race, gender, and class inequality at the heart of the founding of California’s public school system.
Being able to predict network traffic could help reroute traffic efficiently and so prevent network crashes and link failures. In recent years, deep learning has been at the forefront of learning from sequential data, notably through the success of the long short-term memory (LSTM) network. In an October 29, 2018, DSI seminar, Dr. Mariam Kiran of Lawrence Berkeley National Laboratory discussed LSTM architecture.
While LSTMs have been applied to network traffic data, they have so far been used to predict only a single bandwidth value, which does not provide enough context for a comprehensive traffic routing algorithm. The seminar presented a sequence-to-sequence (seq2seq) LSTM architecture for network traffic to predict multiple hourly intervals into the future. Dr. Kiran’s method uses sliding windows with optimal lookback lengths to predict traffic bandwidth 8 hours into the future. The performance of this architecture is demonstrated on Simple Network Management Protocol (SNMP) data from the Energy Sciences Network (ESnet) to understand and predict traffic across ESnet’s links.
Dr. Kiran belongs to both the ESnet and Computational Research Division groups. Her research focuses on automating and improving the usage of distributed networks and related facilities to enable high-performance science applications. Developing methods from machine learning, multi-agent control, and optimization, her work aims to optimize network operations and application performance in high-speed transfers.
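The sliding-window step can be sketched as follows. This is a minimal illustration only, not Dr. Kiran’s code: the function name `make_windows`, the 24-hour lookback, and the synthetic series are assumptions chosen for the example, and the seq2seq model itself is omitted. Each input window holds the most recent readings, and each target holds the 8 hourly values that follow it.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Split a 1-D series into (input, target) pairs.

    Each input holds `lookback` consecutive samples; each target holds the
    `horizon` samples that immediately follow that input.
    """
    series = np.asarray(series, dtype=float)
    n_pairs = len(series) - lookback - horizon + 1
    if n_pairs <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    inputs = np.stack([series[i:i + lookback] for i in range(n_pairs)])
    targets = np.stack(
        [series[i + lookback:i + lookback + horizon] for i in range(n_pairs)]
    )
    return inputs, targets

# Hypothetical hourly bandwidth readings (synthetic, not ESnet data).
hourly = np.arange(48, dtype=float)

# Assumed 24-hour lookback; the 8-hour horizon matches the summary above.
X, y = make_windows(hourly, lookback=24, horizon=8)
print(X.shape, y.shape)
```

A sequence model would then be trained to map each row of `X` to the matching row of `y`. In practice the lookback length would be tuned rather than fixed.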
Dr. Gerald Quon of the University of California at Davis visited LLNL on October 10, 2018, to present a seminar titled “Using Deep Neural Networks and Generative Models to Characterize Transcriptional Signatures in Human Cells.” Single-cell RNA sequencing (scRNA-seq) technologies are quickly advancing our ability to characterize the transcriptional heterogeneity of biological samples, given their ability to identify novel cell types and characterize precise transcriptional changes during previously difficult-to-observe processes such as differentiation and cellular reprogramming. An emerging challenge in scRNA-seq analysis is the characterization of cell type-specific transcriptional responses to stimuli, when similar collections of cells are assayed under two or more conditions, such as in control/treatment or cross-organism studies.
Quon presented a computational strategy for identifying cell type-specific responses using a novel deep neural network for domain adaptation and transfer learning. Unlike existing approaches, this one does not require identification of all cell types before alignment and can align more than two conditions simultaneously. He discussed ongoing applications of the model to two problem domains: (1) characterizing hematopoietic progenitor populations and their response to inflammatory challenges (LPS), in which Quon’s team has identified putative subpopulations of long-term HSCs that differentially respond to the challenge, and (2) characterizing the malaria cell cycle process, in which they identified transcriptional changes associated with sexual commitment. Quon also discussed his lab’s work in building deep generative models of transcriptional plasticity, which aims to reprogram cancer cells from a malignant to a non-malignant phenotype.
On September 5, 2018, the DSI hosted Dr. Yong Jae Lee from the University of California at Davis for a seminar titled “Learning to Localize and Anonymize Objects with Indirect Supervision.” Lee’s computer science research team explores innovative approaches to visual recognition, including two indirect supervision methods he described for the LLNL audience: (1) scalable object localization and (2) anonymization while preserving action information.
Computer vision has made great strides for problems that can be learned with direct supervision, in which the goal can be precisely defined (e.g., drawing a box that tightly fits an object). However, direct supervision is often not only costly, but also challenging to obtain when the goal is more ambiguous. Lee discussed his team’s recent work on learning with indirect supervision by first presenting an approach that learns to focus on the relevant image regions given only indirect image-level supervision (e.g., an image tagged with “car”). This is enabled by a novel data augmentation technique that hides image patches randomly.
Second, Lee described an approach that learns to anonymize sensitive video regions while preserving activity signals in an adversarial framework. It accomplishes this by simultaneously optimizing for the indirectly related task of misclassifying face identity and maximizing activity detection accuracy. His team showed that their anonymization method leads to superior performance compared to conventional hand-crafted anonymization methods, including masking, blurring, and adding noise.
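A rough sketch of what hiding image patches at random might look like is shown below. It is an illustration under stated assumptions, not the team’s implementation: the function name `hide_patches`, the 4 × 4 grid, the 50% hiding probability, and filling hidden cells with the image mean are all choices made for this example.

```python
import numpy as np

def hide_patches(image, grid=4, hide_prob=0.5, rng=None):
    """Return a copy of `image` with randomly chosen grid cells hidden.

    Hidden cells are filled with the mean pixel value of the image
    (an assumed fill value). Also returns the boolean grid of hidden cells.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.astype(float)
    height, width = image.shape[:2]
    fill = image.mean()
    # Cell boundaries along each axis.
    rows = np.linspace(0, height, grid + 1, dtype=int)
    cols = np.linspace(0, width, grid + 1, dtype=int)
    # Decide independently for each cell whether to hide it.
    hidden = rng.random((grid, grid)) < hide_prob
    for i in range(grid):
        for j in range(grid):
            if hidden[i, j]:
                out[rows[i]:rows[i + 1], cols[j]:cols[j + 1]] = fill
    return out, hidden

# Synthetic stand-in for a training image tagged only with a class label.
image = np.random.default_rng(1).random((32, 32, 3))
augmented, hidden = hide_patches(
    image, grid=4, hide_prob=0.5, rng=np.random.default_rng(0)
)
print(int(hidden.sum()), "of", hidden.size, "cells hidden")
```

Because a different set of cells is hidden each time an image is drawn, a classifier trained on such inputs cannot rely on a single most discriminative region and has to use evidence from the rest of the object.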
In the 17th century, physician Marcello Malpighi observed the existence of patterns of ridges and sweat glands on fingertips. This major breakthrough began a long and continuing quest for ways to uniquely identify individuals based on fingerprints. In the modern era, the concept of fingerprinting has expanded to other sources of data, such as voice recognition and retinal scans. Only in the last few years have technologies and methodologies yielded high-quality imaging data for individual human brains and, from it, estimates of structural and functional connectivity. In this context, the next challenge for human identifiability lies in brain data, particularly brain networks, both structural and functional.
In an August 9, 2018, DSI seminar, Dr. Joaquin Goni of Purdue University presented his work showing how the individual fingerprint of a connectome (as represented by a network) can be uncovered (or in a way, maximized) from a reconstruction procedure based on group-wise decomposition in a finite number of brain connectivity modes. Using data from the Human Connectome Project, Goni introduced different extensions of this work, including subject identifiability, heritability analysis of brain networks, and identifiability when assessing inter-task brain functional networks. Finally, he discussed results from applying this framework to inter-scan identifiability using a second dataset acquired at Purdue University.
Brain tumor incidence is expected to rise by 6% over the next 20 years. Nearly 79,000 patients will be diagnosed in the U.S. this year alone. In a DSI seminar on August 2, 2018, Dr. Maryam Vareth outlined the University of California at San Francisco’s (UCSF’s) efforts to improve brain tumor outcomes through data-driven medicine.
Standard magnetic resonance imaging (MRI) is a mainstay of brain tumor diagnosis and evaluation, but it poses challenges when clinicians attempt to distinguish treatment effects from recurrent tumors. More advanced imaging is needed to better define tumor regions so that radiation treatments can target areas with high probability of recurrence.
Vareth described a potential solution called magnetic resonance spectroscopic imaging (MRSI)—static metabolic imaging that zeroes in on tumor chemistry. With MRSI, clinicians can identify metabolic changes in the brain earlier than when a recurrent tumor would show up with standard MRI. Moreover, MRSI is noninvasive and can be performed on a regular MRI machine.
The MRSI process creates indices of signals from choline, creatine, N-acetyl-aspartate, lipid, and lactate. Data can then be analyzed in a map of voxels (3D pixels). With an inherently low signal, however, MRSI scans take a long time, especially if clinicians need to scan the entire brain, not just one region. Faster MRSI scans will help encourage clinicians to adopt this type of imaging.
Vareth’s team is developing a fast-trajectory MRSI analysis method to reduce scan time significantly. “An MRI is a very expensive Fourier transform machine,” she explained, so acceleration can be achieved through modified k-space sampling (below the Nyquist rate) of raw data. This process involves compressed sensing and parallel imaging as well as weighting images according to their sensor proximity (i.e., sensitivity is higher closer to a sensor within the machine).
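The “Fourier transform machine” remark can be made concrete with a small sketch. It is a toy illustration on synthetic data, not the UCSF pipeline: the 64 × 64 test image, the 30% row-sampling rate, and the fully sampled central band are assumptions for this example. It shows only the undersampling step followed by a plain zero-filled inverse transform. A compressed sensing or parallel imaging reconstruction would replace that last step with a more sophisticated solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D "image": a bright rectangle on a dark background.
image = np.zeros((64, 64))
image[24:40, 20:44] = 1.0

# Fully sampled k-space is the 2-D Fourier transform of the image.
kspace = np.fft.fftshift(np.fft.fft2(image))

# Keep a random ~30% of k-space rows plus a fully sampled central band,
# where most of the low-frequency signal energy sits (assumed scheme).
keep = rng.random(64) < 0.3
keep[28:36] = True
mask = np.zeros((64, 64), dtype=bool)
mask[keep, :] = True
undersampled = np.where(mask, kspace, 0)

# Zero-filled reconstruction: inverse transform with missing samples set to 0.
recon = np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))
# Sanity check: the fully sampled data reproduce the image.
full = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))

sampled_fraction = mask.mean()
error = np.linalg.norm(recon - image) / np.linalg.norm(image)
print(f"sampled {sampled_fraction:.0%} of k-space, relative error {error:.3f}")
```

Acquiring fewer k-space samples shortens the scan, and the nonzero reconstruction error is the price that more advanced reconstruction methods aim to reduce.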
Vareth and her UCSF colleagues are working toward “super-resolution” of MRSI and exploring the potential of deep learning to further enhance image quality while reducing scan duration. The team has developed software, called SIVIC, for processing automated prescription and reconstruction of MRSI data. SIVIC is available on GitHub.
On July 31, 2018, the DSI continued its seminar series with a talk on gradient optimization for black-box functions of random variables. University of Toronto Ph.D. candidate and LLNL alumnus Will Grathwohl presented “Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation.”
Existing gradient-based optimization methods have advantages and disadvantages, and none offers unbiased, low-variance gradient estimates for arbitrary black-box functions. For example, the REINFORCE estimator is unbiased and works on any function but has high variance. The REPARAMETERIZATION and CONCRETE estimators achieve lower variance but require the black-box function to be known and differentiable.
Grathwohl’s team created an improved general gradient estimator by combining REINFORCE and REPARAMETERIZATION in a control variate framework. Their approach, named LAX, begins with the REINFORCE estimator of the black-box function, introduces a surrogate function with desired properties, subtracts the REINFORCE estimator of the surrogate, and then adds the REPARAMETERIZATION estimator of the surrogate. These steps make it possible for LAX to achieve an unbiased, low-variance estimate of arbitrary black-box function gradients.
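The four steps can be checked numerically on a toy problem. The sketch below is not the authors’ code and makes several simplifying assumptions: the random variable is a unit-variance Gaussian with mean `theta`, the “black box” `f` is a simple quadratic, and the surrogate `c` is hand-picked and fixed rather than learned to minimize variance as in the published method. For each sample `b = theta + eps`, the combined estimate is `(f(b) - c(b)) * score + dc/db`, where `score` is the derivative of the log-density with respect to `theta`.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, target = 0.5, 2.0

def f(b):
    # Stand-in "black box": only its values are used, never its derivative.
    return (b - target) ** 2

def c(b):
    # Hand-picked differentiable surrogate, deliberately imperfect.
    return 0.8 * (b - 1.8) ** 2

def dc_db(b):
    # Derivative of the surrogate with respect to its input.
    return 1.6 * (b - 1.8)

eps = rng.standard_normal(200_000)
b = theta + eps          # reparameterized sample, so db/dtheta = 1
score = b - theta        # d/dtheta of log N(b; theta, 1)

# REINFORCE estimate of the black-box gradient.
g_reinforce = f(b) * score
# Subtract REINFORCE of the surrogate, add its reparameterization gradient.
g_lax = (f(b) - c(b)) * score + dc_db(b)

# For this toy problem E[f(b)] = (theta - target)^2 + 1.
true_grad = 2 * (theta - target)
print("true gradient :", true_grad)
print("REINFORCE     : mean", g_reinforce.mean(), "variance", g_reinforce.var())
print("LAX-style     : mean", g_lax.mean(), "variance", g_lax.var())
```

Both sample means should land close to the true gradient, since subtracting and adding the surrogate terms cancels in expectation. The per-sample variance of the combined estimator is lower because the surrogate tracks `f` reasonably well.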
The team also developed an extension of this estimator, called RELAX, that introduces a relaxed distribution to handle black-box functions of discrete random variables. Both LAX and RELAX were tested against existing gradient estimators and attained lower variance, which significantly reduced optimization time. This work was recently published at the 2018 ICLR Workshop (the Sixth International Conference on Learning Representations).
The DSI hosted Dr. Paul Gamble from Lab 41 for a seminar on June 8, 2018. Gamble presented two machine learning-based approaches developed by his team to detect and distinguish various forms of genetic engineering. Recently, synthetic biology has become increasingly common, having been used to drive down costs in perfumes, detect pollution, produce vaccines, and treat agricultural waste while reducing greenhouse emissions by 75% in some cases. However, its rapid rise has also created new dangers, including biohacking and engineered bioweapons with increased virulence.
At Lab 41, Gamble is developing a machine learning pipeline for detecting synthetically engineered DNA. He also studies methods for detecting and defending against adversarial attacks on neural networks. Gamble received an M.D. and a Master’s in Biomedical Engineering from Washington University in St. Louis. During medical school, he developed nerve-computer interfaces and tested them in animal models. His research also focused on applying machine learning to clinical practice—he built a computer vision system to assist radiation oncologists with organ contouring and radiation dosimetry planning.
The DSI sponsored a seminar on May 22, 2018, featuring Dr. Andreas Zoglauer of the UC Berkeley Institute for Data Science. Zoglauer works with Berkeley’s Space Sciences Laboratory on the NASA-sponsored project COSI—the Compton Spectrometer and Imager, a balloon-borne gamma-ray telescope. COSI’s science objectives focus on galactic nucleosynthesis and the polarization of gamma-ray bursts caused by astronomical events such as neutron star mergers and core-collapse supernovae of heavily rotating massive stars. COSI’s 2016 flight around the southern hemisphere generated data that Zoglauer’s team continues to analyze.
According to Zoglauer, gamma-ray astronomy research relies heavily on data science and statistics. To analyze the data from COSI’s detectors, he developed an open-source toolkit called MEGAlib (Medium-Energy Gamma-ray Astronomy Library), which has applications beyond astrophysics, in nuclear medicine and nuclear monitoring. MEGAlib enables researchers to perform Monte Carlo simulations of their detectors, reconstruct Compton events, and create images based on Compton scattering data. Zoglauer stated that COSI’s biggest computational challenge is generating up to 9-dimensional response files with Monte Carlo simulations for the reconstruction of all-sky images. Those simulations were performed on Berkeley Lab’s Cori supercomputer.
With the help of data science undergraduates, Zoglauer is applying machine learning methods such as random forests and neural networks to COSI data. Research projects include determining photon paths in the germanium detectors, finding interaction locations in the detectors, and identifying not-contained gamma rays. Zoglauer outlined several lessons learned through his team’s work with machine learning tools, such as the importance of preparing data, splitting a big research question into smaller questions, and verifying that the trained neural networks have no “blind spots.” Researchers using machine learning algorithms should also expect “a lot of trial and error” in finding the best input data representation.
COSI is preparing for another flight in 2019–2020. An upgraded version, COSI-X, is planned for launch in 2022 with additional detectors, better shielding, and improved resolution.
In a Galaxy Not So Far Away
- In space, gamma rays are generated by radioactive decays, annihilation, and charged particle interactions. Astronomical sources include pulsars, supernovae, and the regions near black holes. In our Milky Way, the Crab Nebula, Cygnus X-1, and the area around the center of our galaxy, known for its 511-keV positron annihilation emission, are of particular interest to the COSI team.
- Germanium (Ge, atomic number 32) is a semiconductor used to detect gamma rays. COSI’s detector array consists of 12 Ge detectors, each measuring 8 × 8 × 1.5 centimeters, combined with specialized cooling and shielding systems.
- Compton scattering refers to photons scattering off electrons and, thus, transferring momentum to them. Arthur Holly Compton received the Nobel Prize in Physics in 1927 for the discovery of this “Compton effect.” COSI measures gamma rays via multiple Compton interactions in its germanium detectors.
- Powered by solar panels and carried by a 300-foot super-pressure helium balloon, COSI took off from New Zealand and flew around Antarctica and the Pacific Ocean before landing in Peru. The trip lasted 46 days. According to Zoglauer, the southern hemisphere provides a good view of the center of the Milky Way.
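The Compton scattering item above has a simple quantitative form that a short sketch can illustrate. This is the textbook relation for a photon scattering off a free electron at rest, not MEGAlib code, and the function names are invented for this example. The photon’s energy after scattering through angle θ is E′ = E / (1 + (E / mₑc²)(1 − cos θ)). Inverting the relation gives the scattering angle from the measured energies, which is the kind of constraint a Compton telescope can use when reconstructing events.

```python
import math

ELECTRON_REST_ENERGY_KEV = 510.999  # m_e c^2 in keV

def scattered_energy_kev(energy_kev, angle_rad):
    """Photon energy after Compton scattering through `angle_rad`
    off a free electron at rest."""
    ratio = energy_kev / ELECTRON_REST_ENERGY_KEV
    return energy_kev / (1.0 + ratio * (1.0 - math.cos(angle_rad)))

def scattering_angle_rad(energy_in_kev, energy_out_kev):
    """Invert the relation: scattering angle from the photon energies
    before and after the interaction."""
    cos_theta = 1.0 - ELECTRON_REST_ENERGY_KEV * (
        1.0 / energy_out_kev - 1.0 / energy_in_kev
    )
    return math.acos(cos_theta)

# A 511-keV annihilation photon scattering through 60 degrees.
e_out = scattered_energy_kev(511.0, math.radians(60.0))
angle = scattering_angle_rad(511.0, e_out)
print(f"scattered energy {e_out:.1f} keV, "
      f"recovered angle {math.degrees(angle):.1f} degrees")
```

The energy lost by the photon is transferred to the electron and deposited in the detector, so measured energy deposits carry information about the scattering geometry.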
The DSI welcomed Dr. Philip Kegelmeyer from Sandia National Laboratories on April 23, 2018, for a presentation titled “Machine Learning Adversarial Label Tampering: Design and Detection.” Attacks on machine learning include distortion, hiding, or manipulation of data. The presentation focused on falsely labeled data with examples of empirical methods for “quantified paranoia.”
The chief danger in a data label tampering attack is that even a small amount of tampering can greatly decrease accuracy in a fashion that cannot be detected in advance. Kegelmeyer’s team at Sandia has created several heuristics for generating such attacks. A simple but effective example is the “brute clustering” attack, in which all the data points in a single cluster are relabeled before moving on to the next cluster. Defenses against these attacks exist, though they are relatively weak. Kegelmeyer described one such defense, dubbed “quantified paranoia,” a statistical technique that uses pseudo-Bayes factors to signal the presence of label tampering.
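To make the relabeling order concrete, the toy sketch below flips binary labels one whole cluster at a time until a tampering budget runs out. It is a guess at the general shape of such a heuristic for illustration only, not Sandia’s implementation: the function name `brute_cluster_flip`, the binary labels, the precomputed cluster assignments, the visiting order, and the fixed budget are all assumptions.

```python
import numpy as np

def brute_cluster_flip(labels, cluster_ids, budget):
    """Flip binary (0/1) labels one whole cluster at a time.

    Clusters are visited in increasing id order (an assumed ordering) and
    flipping stops once `budget` labels have been changed.
    """
    tampered = np.asarray(labels).copy()
    cluster_ids = np.asarray(cluster_ids)
    remaining = budget
    for cid in np.unique(cluster_ids):
        if remaining <= 0:
            break
        # Members of this cluster, truncated to the remaining budget.
        members = np.flatnonzero(cluster_ids == cid)[:remaining]
        tampered[members] = 1 - tampered[members]
        remaining -= len(members)
    return tampered

# Tiny synthetic example: six points in three precomputed clusters.
labels = np.array([0, 0, 1, 1, 0, 1])
cluster_ids = np.array([0, 0, 1, 1, 2, 2])
tampered = brute_cluster_flip(labels, cluster_ids, budget=3)
print("changed labels:", int((tampered != labels).sum()))
```

Concentrating the flips inside clusters, rather than scattering them uniformly, means the tampered labels stay locally consistent with their neighbors, which hints at why detecting such tampering is a statistical problem.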