Hosted onsite at LLNL—and now virtually—on an ongoing basis, our seminars feature speakers from other institutions around the Bay Area and beyond. We host these events to introduce new ideas and potential collaborators to the Lab. We are pleased to share seminar information here with the broader data science community.
Recent advances in artificial intelligence offer opportunities to disrupt the traditional strategies for discovery of new particles in high-energy collisions. Dr. Whiteson will describe new machine learning techniques, explain why they are particularly well suited for particle physics, show selected results that demonstrate their new capabilities, and present a strategy for translating what these techniques learn into human understanding.
Daniel Whiteson is a professor of experimental particle physics at the University of California, Irvine, and a fellow of the American Physical Society. He is part of the collaboration that built, maintains, and collects data from the ATLAS experiment at the Large Hadron Collider. His research has appeared widely in popular media outlets including The New Yorker, Ars Technica, VICE, and many others. Along with his colleagues he created popular comics including “What’s in the data? The Higgs Boson Explained” and “True Tales of Dark Matters,” which were all featured on PBS. Dr. Whiteson is the co-host of the Daniel & Jorge Explain the Universe podcast and holds a PhD in Physics from UC Berkeley.
Data-driven methods such as deep learning have achieved phenomenal success in a broad range of tasks. A key to the superior performance of data-driven methods is the availability of large-scale data that is carefully collected, cleaned, organized, and annotated. However, practical data often possess many nuances such as corruption, lack of annotations, and heavy-tailed distribution, which significantly compromise the performance of data-driven methods. This talk aims to demonstrate that the intrinsic low-dimensional structure of high-dimensional data can be leveraged to address the challenges in a principled and effective manner. First, I will show that by modeling a mixture of data by a union of low-dimensional manifolds, we can develop unsupervised clustering algorithms that not only are provably correct, but also can be made scalable without a performance loss and robust to data nuances with provable guarantees. Our methods obtain state-of-the-art performance for clustering MNIST (with 98.3% accuracy) and CIFAR10 (with 68.4% accuracy) datasets. Second, I will present a double over-parameterization method that addresses the overfitting issue in over-parameterized models by exploiting the implicit algorithmic bias of discrepant learning rates. We establish the theoretical correctness of the method for low-rank matrix recovery problems and demonstrate the practical effectiveness of the method for natural image recovery tasks. I will conclude the talk with the broader implication of low-dimensional modeling for deep learning, using generalization and architectural design as two illustrative examples.
Chong You is a postdoctoral scholar in the Department of EECS at the University of California, Berkeley. He received his PhD in 2018 from the Electrical and Computer Engineering Department at Johns Hopkins University. His research areas broadly include machine learning, computer vision, optimization, and signal processing. He is interested in the development of mathematical principles and practical numerical algorithms for analyzing and interpreting modern data, with the goal of addressing real-world challenges. He is the recipient of the Doctoral Dissertation Award from MINDS at Johns Hopkins University.
We develop a general approach to distill symbolic representations of a learned deep model by introducing strong inductive biases. We focus on graph neural networks (GNNs). The technique works as follows: We first encourage sparse latent representations when we train a GNN in a supervised setting, then we apply symbolic regression to components of the learned model to extract explicit physical relations. We find that the correct known equations, including force laws and Hamiltonians, can be extracted from the neural networks. We then apply our method to a non-trivial cosmology example, a detailed dark matter simulation, and discover a new analytic formula that can predict the concentration of dark matter from the mass distribution of nearby cosmic structures. The symbolic expressions extracted from the GNN using our technique also generalize to out-of-distribution data better than the GNN itself. Our approach offers alternative directions for interpreting neural networks and discovering novel physical principles from the representations they learn.
Dr. Shirley Ho’s research interests have ranged from using machine learning and statistics to tackle fundamental challenges in cosmology to finding new structures in the Milky Way. She has broad expertise in theoretical astrophysics, observational astronomy, and data science. Ho’s recent interest has been in understanding and developing novel machine learning tools and applying them to astrophysical challenges. Her goal is to understand the universe’s beginning, evolution, and ultimate fate. Ho works with international collaborators within the Cosmology X Data Science Group at the Flatiron Institute, at the Department of Astrophysical Sciences at Princeton University, and beyond. She holds a Ph.D. in Astrophysical Sciences from Princeton University.
Our current machine learning (ML) models achieve impressive performance on many benchmark tasks. Yet these models remain remarkably brittle and susceptible to manipulation. Why is this the case? In this talk, Dr. Madry will take a closer look at this question and pinpoint some of the roots of this observed brittleness. Specifically, the seminar will discuss how the way current ML models “learn” and are evaluated gives rise to widespread vulnerabilities, and then outline possible approaches to alleviate these deficiencies.
Dr. Aleksander Madry is a Professor of Computer Science in the EECS Department at the Massachusetts Institute of Technology (MIT) and a Principal Investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his Ph.D. from MIT in 2011; prior to joining the MIT faculty, he spent some time at Microsoft Research New England and on the faculty of EPFL. Madry’s research interests span algorithms, continuous optimization, science of deep learning, and understanding machine learning from robustness and deployability perspectives. His work has been recognized with a number of awards, including a National Science Foundation CAREER Award, an Alfred P. Sloan Research Fellowship, an ACM Doctoral Dissertation Award Honorable Mention, and a Presburger Award.
Scientific discovery is one of the primary factors underlying the advancement of the human race. However, the traditional discovery process is slow compared to the growing need for new inventions, such as antibiotic discovery or the design of next-generation energy materials. In recent years, data-driven approaches such as machine learning and especially deep learning have achieved remarkable performance in many domains including computer vision, speech recognition, audio synthesis, and natural language processing and generation. These methods have also infiltrated other scientific fields including physics, chemistry, and medicine. Despite these successes and the potential for huge societal impact, machine learning models are still in their infancy in terms of driving and transforming scientific discovery. This talk will introduce a closed-loop paradigm to accelerate scientific discovery, which can seamlessly integrate machine learning, physics-based simulations, and wet-lab experiments and enable new hypothesis and/or artefact generation and validation thereof. Development and use of deep generative models and reinforcement learning–based methods for designing novel peptides and materials with desired functionality will be discussed. Das will also examine the importance of adding creativity, robustness, and interpretability to machine learning models to enable and add value to artificial intelligence–driven discovery.
Dr. Payel Das is a research staff member and manager in the AI Science Department of IBM Thomas J. Watson Research Center in Yorktown Heights, NY. She is also an adjunct associate professor in the Department of Applied Physics and Applied Mathematics at Columbia University. At IBM, she leads and manages research projects related to artificial intelligence (AI) for creativity and discovery, with inspirations from and applications in materials science, chemistry, physics, biology, and neuroscience. Many of these projects lie at the intersection of data-driven and physics-based modeling. A major focus of her work is developing novel deep generative models for heterogeneous data, which is abundant in real-world applications. Das holds a PhD in Theoretical Physical Chemistry from Rice University and has won numerous awards including IBM’s highest award for Outstanding Technical Achievement. She has co-authored over 30 peer-reviewed publications and serves on the editorial advisory board of the ACS Central Science journal.
The increasing availability of data and raw computational power, along with recent developments in models and algorithms, are changing the way businesses, academics, and governments operate. However, this revolution has both created new ethical challenges and changed the nature of many familiar ones. For example, notions of informed consent, which were originally developed in the context of biomedical research after the atrocities of the Second World War, are a poor fit for an environment in which individuals are constantly monitored by scores of agents with vague (and often unenforceable) consent disclosures.
Similarly, notions of confidentiality and privacy—originally devised for a world in which governments were the only agents with detailed information about large numbers of individuals—are not necessarily appropriate for an environment in which this kind of data is in the hands of a myriad of private entities. This talk uses a number of recent case studies to explore these and other issues related to the ethics of data collection, management, and analysis, in an attempt to highlight issues that would appear to be relevant to the kinds of activities carried out by the national laboratories.
Dr. Abel Rodriguez is Professor of Statistics at the Baskin School of Engineering at the University of California, Santa Cruz (UCSC). He is also the Associate Director of the Center for Data, Discovery and Decisions (D3) and one of the PIs of the NSF-supported TRIPODS Center at UCSC. He received the DARPA Young Faculty Award in 2010 and was awarded the prestigious Donald D. Harrington Faculty Fellowship by the University of Texas at Austin in 2012. Dr. Rodriguez came to UCSC in 2007 after completing an M.A. in Economics and a Ph.D. in Statistics and Decision Sciences from Duke University. Before that, he received a B.A. in Law and a B.S. in Industrial Engineering in his native Venezuela. Dr. Rodriguez is an expert in Bayesian statistical methods and their applications in the biomedical and social sciences. His interests range widely and include nonparametric methods, spatiotemporal modeling, relational data, and extreme value theory. Starting September 1, he will be joining the University of Washington as Professor and Chair of the Statistics Department.
Although we are currently riding a technological wave of personal assistants, many of these agents still struggle to communicate appropriately. Humans are natural storytellers, so it would be fitting if artificial intelligence (AI) could tell stories as well. Automated story generation is an area of AI research that aims to create agents that tell “good” stories. Previous story-generation systems use planning to create new stories, but these systems require a vast amount of knowledge engineering. The stories created by these systems are coherent, but only a finite set of stories can be generated. In contrast, very large language models have recently made the headlines in the natural language processing community. Though impressive on the surface, these models begin to lose coherence over time. Lara Martin’s research looks at various techniques of automated story generation, focusing on the perceived creativity of the generated stories. In this talk, she will define a creative product as one that is both novel and useful, and show how a jointly probabilistic and causal model can produce stories from an improvisational storytelling system that readers perceive as more creative than those produced by solely probabilistic or solely causal models.
Lara J. Martin is a Human-Centered Computing Ph.D. Candidate in the College of Computing at Georgia Tech. Her work resides in human-centered AI with a focus on natural language applications. Lara has worked in the areas of automated story generation, speech processing, and affective computing, publishing in top-tier conferences such as AAAI and IJCAI. She earned a Masters of Language Technologies from Carnegie Mellon University in 2015 and a B.S. in Computer Science and Linguistics from Rutgers University–New Brunswick in 2013. In 2019, she received Georgia Tech’s prestigious Foley Scholar Award for her innovative research and the Best Doctoral Consortium Presentation award at the 2019 ACM Richard Tapia Celebration of Diversity in Computing Conference. She has also been featured in Wired.
Worldwide displacement due to war and conflict is at an all-time high. Unfortunately, determining if, when, and where people will move is a complex problem. This talk will describe a multi-university project that develops methods for blending variables constructed from publicly available organic data (social media and newspapers) with more traditional indicators of forced migration to better understand when and where people will move.
Dr. Singh will demonstrate this approach through a case study involving displacement in Iraq, and show that incorporating conversation and event variables generated from open-source data maintains or improves predictive accuracy over traditional variables alone. She will conclude with a discussion on strengths and limitations of leveraging organic big data for societal-scale problems.
Dr. Lisa Singh is a professor in the Department of Computer Science and a research professor in the Massive Data Institute at Georgetown University. She has co-authored over 70 peer-reviewed publications and book chapters related to data-centric computing. Current projects include studying privacy on the Web; identifying noise and poor-quality information on social media; developing methods and tools to better understand forced movement due to conflict; and learning from public, open-source big data to advance social science research of human behavior/opinion. Her research has been supported by the National Science Foundation, the Office of Naval Research, the Social Science and Humanities Research Council, the National Collaborative on Gun Violence Research, the Department of Defense, and the Department of State. Dr. Singh recently organized three workshops involving future directions of big data research and is currently involved in different organizations working on increasing participation of women in computing and integrating computational thinking into K-12 curricula. Dr. Singh received a BSE from Duke University and MS and PhD from Northwestern University.
Part identification plays a key role in vehicle prognostics and health management. Part identifiers are often expressed as nomenclature and buried in noisy free text data found in maintenance reports, supply chain management records, service and support communication logs, and manufacturing quality data. There is little consistency in how part names are actually described in noisy free text, with variations spawned by typos, ad hoc abbreviations, acronyms, and incomplete names. This makes search and analysis of parts involved in this data extremely challenging. In this talk, Kao will discuss Boeing’s tool PANDA (PArt Name Discovery Analytics), which combines statistical, linguistic, and machine learning techniques in a novel way to discover part names in noisy free text. Normalization of such terms is also crucial for many applications. Part names pose an additional major challenge because they tend to be in the form of multi-word terms. Kao’s team also developed a novel normalization method called UNAMER (Unification and Normalization Analysis, Misspelling Evaluation and Recognition) for identifying term variants, including variants of multi-word terms, and normalizing them under a canonical name. PANDA and UNAMER have been deployed in practical applications to extract and normalize part names in the aerospace domain.
Dr. Anne Kao is an internationally recognized expert in text analytics and natural language processing. As a Senior Technical Fellow at Boeing Research & Technology, she is responsible for coordinating R&D in data analytics and artificial intelligence, creating an intellectual property strategy with respect to these, leveraging data analytics and artificial intelligence as key Boeing technology differentiators for government programs, collaborating with national and international universities and laboratories, and building Boeing’s depth and breadth in the field. Dr. Kao has more than 25 years of success in analytics methods including artificial intelligence, data analytics, visual analytics, and social network analysis. She holds 17 U.S. patents, has published dozens of papers in peer-reviewed journals and conference proceedings, and is active in professional societies. Dr. Kao won the BEYA Senior Technology Fellow Award and the Asian American Engineer of the Year Award in 2015 as well as the National Women of Color in Technology Research Leadership Award in 2006. She holds a bachelor’s in philosophy from the National Chengchi University (Taiwan), a master’s and PhD in philosophy from the Chinese Culture University (Taiwan), and a master’s in computer science from San Diego State University.
David Gleich is the Jyoti and Aditya Mathur Associate Professor in the Computer Science Department at Purdue University whose research is on novel models and fast, large-scale algorithms for data-driven scientific computing including scientific data analysis, bioinformatics, and network analysis. He presented a November 6, 2019, DSI seminar titled “Engineering Data Science Objective Functions for Social Network Analysis.”
A common setting in many data science applications, from social network analysis to bioinformatics, is to be given a dataset in the form of a graph along with a small number of interesting sets in that graph. In social networks, these are often called communities. In protein interaction networks, these could be pathways or functional groups. Given these examples, the problem is then to find more like them. Gleich presented a technique to engineer an objective function that captures characteristic features of these examples, demonstrated a framework in the context of community-detection algorithms for graphs that will determine an objective function from a single example, and discussed how this can result in interesting findings about the structure of college social networks on Facebook. His presentation also touched on ongoing work using the same ideas in drug discovery and chemistry.
Gleich is committed to making software available based on this research and has written software packages such as MatlabBGL with thousands of users worldwide. He has received numerous awards for his research including a Society for Industrial and Applied Mathematics (SIAM) Outstanding Publication prize (2018), a Sloan Research Fellowship (2016), a National Science Foundation (NSF) CAREER Award (2011), and the John von Neumann postdoctoral fellowship at Sandia National Laboratories in Livermore (2009). His research is funded by the NSF, DOE, DARPA, and NASA.
Francois Nadeau is an analytics and business intelligence veteran who has worked more than a decade in various roles from analyst to business intelligence developer to data scientist in the telecommunications, manufacturing, and entertainment industries. His October 29, 2019, DSI seminar—titled “Machine Learning Applied Research and Challenges at Ubisoft”—reviewed some of the applied research conducted at Ubisoft and associated challenges.
The projects discussed include how Ubisoft is solving metadata standardization by recognizing 3D models, helping art managers find 3D models based on photos, and assisting the browsing of internal text documents by tagging relevant abstract concepts. The presentation also showed how Ubisoft has tackled some of the challenges associated with those projects, such as risk-averse stakeholders, automation fear, and cold starts.
Nadeau has been studying artificial intelligence and machine learning since 2009 and cofounded an applied research group within Ubisoft specializing in machine learning. Since then, he has researched, developed, and put in production many learning systems covering computer vision, natural language understanding, and predictive analysis.
Dr. Ryan Goldhahn, an LLNL computation engineer, presented a September 18, 2019, DSI seminar titled “Decentralized Autonomous Networks for Cooperative Estimation.” Collaborative autonomous networks have recently been used in national security, critical infrastructure, and commercial applications such as the Internet of Things. Decentralized approaches in particular offer scalable, low-cost solutions that are robust to failures in multiple individual agents. However, such networks face challenges related to latency, bandwidth, scalability, and adversarial attacks, and new decentralized approaches are needed for distributed data processing and optimization. Effective solutions push as much of the data processing and intelligence as possible to the individual agents and efficiently communicate information, fuse data while allowing for the possibility of unreliable information from neighboring agents, and achieve scalable network behaviors from only local coordination of actions between agents. This talk summarized recent work on signal processing and network intelligence algorithms for decentralized sensor networks, results of simulations in large (~10K agents) networks, and current efforts toward the implementation of these algorithms in low size, weight, and power embedded systems.
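As a rough illustration of how a network-wide estimate can emerge from purely local coordination, the sketch below runs a textbook average-consensus iteration on a ring of agents. It is not the algorithm presented in the seminar: the ring topology, the step size, the noise model, and all function names are assumptions made for this example, and it does not handle unreliable or adversarial neighbors.

```python
import random

def ring_neighbors(n):
    """Hypothetical topology: each agent talks only to its two ring neighbors."""
    return [((i - 1) % n, (i + 1) % n) for i in range(n)]

def consensus_step(values, neighbors, step=0.3):
    """One synchronous round: each agent nudges its estimate toward its neighbors'."""
    new_values = []
    for i, x in enumerate(values):
        correction = sum(values[j] - x for j in neighbors[i])
        new_values.append(x + step * correction)
    return new_values

def run_consensus(values, neighbors, rounds=200, step=0.3):
    """Repeat local exchanges; no agent ever sees the whole network."""
    for _ in range(rounds):
        values = consensus_step(values, neighbors, step)
    return values

# Synthetic data: ten agents each take one noisy measurement of the same quantity.
random.seed(0)
n_agents = 10
measurements = [5.0 + random.gauss(0, 1) for _ in range(n_agents)]

estimates = run_consensus(measurements, ring_neighbors(n_agents))
network_mean = sum(measurements) / n_agents
```

Because the exchange weights are symmetric, the sum of the estimates is preserved in every round, so each agent's value drifts toward the mean of all measurements even though it only ever communicates with two neighbors.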
Dr. Goldhahn has a BE in engineering from Dartmouth College and a PhD in electrical and computer engineering from Duke University. Before joining LLNL, he led a project at the NATO Centre for Maritime Research and Experimentation (CMRE) using multiple unmanned underwater vehicles (UUVs) to detect and track submarines. This work developed collaborative autonomous behaviors to collectively detect targets and optimally reposition UUVs to improve tracking performance without human intervention, and tested these autonomous sensor networks at sea with submarines from multiple NATO nations. At LLNL, Dr. Goldhahn has continued to work in collaborative autonomy and model-based and statistical signal processing in various applications. He has specifically focused on decentralized detection/estimation/tracking and optimization algorithms for autonomous sensor networks.
Dr. Joel Hestness is a senior research scientist at Cerebras Systems, an artificial intelligence (AI)–focused hardware startup. His August 22, 2019, DSI seminar—titled “Deep Learning Scaling is Predictable, Your Data is (Probably) Hierarchical”—focused on deep learning (DL) scaling. DL creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. A common belief in DL is that growing training sets and models should improve accuracy. Dr. Hestness described Baidu’s large-scale empirical studies: As training set size increases, DL model generalization error and model sizes scale as particular power-law relationships (not entirely consistent with theoretical results). As model size grows, training time remains roughly constant—larger models require fewer steps to converge to the same accuracy. With these scaling relationships, the expected accuracy and training time can be accurately predicted for models trained on larger data sets. In the second part of his talk, Dr. Hestness touched on more recent studies in model architecture search: DL models are overparameterized but can still generalize well. Most DL models are inductively biased, designed to capture hierarchy or fractal structures in data, indicating that most real-world data must be hierarchical.
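To make the power-law idea concrete, the sketch below fits a curve of the form error = alpha * size^beta by ordinary least squares in log-log space and then extrapolates to a larger training set. The data points are synthetic and the function names are hypothetical; the sketch also assumes a pure power law with no noise, which real learning curves need not follow.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(alpha) + beta * log(size)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    beta = covariance / variance
    alpha = math.exp(y_mean - beta * x_mean)
    return alpha, beta

def predict_error(alpha, beta, size):
    """Extrapolate the fitted curve to a new training set size."""
    return alpha * size ** beta

# Synthetic measurements generated from alpha = 2.0, beta = -0.35 (made-up values).
sizes = [1_000, 4_000, 16_000, 64_000]
errors = [2.0 * m ** -0.35 for m in sizes]

alpha, beta = fit_power_law(sizes, errors)
predicted = predict_error(alpha, beta, 256_000)
```

A negative exponent means error falls as data grows, and the magnitude of the exponent indicates how quickly additional data pays off.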
At Cerebras Systems, Dr. Hestness helps formulate strategies to support machine learning researchers/practitioners to use the hardware, and he leads some natural language understanding research. Previously, he was a research scientist at Baidu's Silicon Valley AI Lab, where he worked on techniques to understand and scale out deep learning speech and language model training. Dr. Hestness holds a PhD in computer architecture from the University of Wisconsin–Madison. He has broad experience with computing applications including numerical methods, graph analytics, and machine/deep learning.
Drastic changes in climate and global losses in biodiversity are increasing the need to shift the incumbent energy and chemical infrastructure from a fossil-fuel based system to a sustainable-energy based system. Such a system will require that the production of fuels and chemicals use only sustainable energy (e.g., solar) and simple, abundant feedstocks like carbon dioxide, water, or nitrogen.
In a DSI seminar on June 20, 2019, titled “Active Optimization of Catalysts for Sustainable Energy and Chemistry,” Kevin Tran brought together electrochemistry and machine learning (ML). Active optimization means iteratively using ML to decide which experiment to conduct. According to Tran, this approach can make a significant impact on the development of catalysts that could turn renewable electricity into sustainable fuels and chemicals.
Tran’s team at Carnegie Mellon University has developed a method for optimizing such chemistries. It combines an active optimization routine with a fully automated simulation framework—nicknamed GASpy—to screen the appropriate catalysts and reaction conditions. The seminar included an overview of the chemistry, simulation, and software aspects of this framework before detailing the team’s ML techniques, experimental designs, and statistical methods.
For example, the research team “tunes” variables (e.g., catalysts or voltages) and then uses density functional theory (DFT) to calculate the resulting effects on the performance of target chemistries. Thousands of calculations are needed, though, so Tran created a high-throughput, Python-based framework that automates these DFT calculations. Still, each calculation can take an hour or even days to run. GASpy speeds up this process by using ML models to automatically decide which calculations to perform next.
Tran noted, “Active optimization needs to balance exploitation of the model with exploration of the search space.” GASpy uses this balance to perform iterative DFT calculations and in recent tests found more than 100 high-performing catalyst surfaces. As planned, these results informed subsequent experiments: University of Toronto colleagues began experimenting with a number of these promising catalysts.
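The balance between exploitation and exploration can be sketched with a toy loop that scores each untested candidate by a predicted value plus a bonus for uncertainty. This is not GASpy's code or its ML models: the nearest-neighbor surrogate, the `kappa` weight, the one-dimensional candidate grid, and the stand-in for an expensive calculation are all assumptions chosen to keep the example short.

```python
def expensive_calculation(x):
    """Hypothetical stand-in for a costly simulation; its best value is at x = 0.7."""
    return -(x - 0.7) ** 2

def surrogate(x, observed):
    """Toy model: predict the nearest observed value; uncertainty grows with distance."""
    nearest_x = min(observed, key=lambda seen: abs(seen - x))
    return observed[nearest_x], abs(nearest_x - x)

def choose_next(candidates, observed, kappa):
    """Pick the untested candidate maximizing prediction + kappa * uncertainty."""
    def score(x):
        mean, uncertainty = surrogate(x, observed)
        return mean + kappa * uncertainty  # exploitation term + exploration term
    remaining = [x for x in candidates if x not in observed]
    return max(remaining, key=score)

def active_optimization(candidates, initial, budget, kappa=1.0):
    """Spend a fixed budget of expensive calculations, one model-chosen point at a time."""
    observed = {x: expensive_calculation(x) for x in initial}
    for _ in range(budget):
        x_next = choose_next(candidates, observed, kappa)
        observed[x_next] = expensive_calculation(x_next)
    return observed

candidates = [i / 20 for i in range(21)]
observed = active_optimization(candidates, initial=[0.0, 1.0], budget=8)
best_x = max(observed, key=observed.get)
```

Raising `kappa` pushes the loop toward unexplored regions, while lowering it concentrates calculations near the best result found so far.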
The research team aims to broaden GASpy’s capabilities with multi-objective and multi-fidelity optimization, which will make the framework more scalable and holistic. For instance, the roadmap includes optimizing catalyst efficiency and stability simultaneously while varying catalyst composition and other processing conditions. Tran’s team is also improving quantification and calibration of the model’s uncertainty. He added, “We’re looking at ways to judge how different active optimization methods compare to each other via retrospective and prospective performance metrics.”
Tran is pursuing a PhD in chemical engineering at Carnegie Mellon University, advised by Dr. Zachary Ulissi, and interning this summer at LLNL under the mentorship of Dr. Joel Varley. Tran was previously a fluoropolymer processing engineer at W. L. Gore & Associates, working on implantable medical devices. He received his bachelor’s in chemical engineering from the University of Delaware, where his research focused on microkinetic modeling for biopharmaceutical applications.
In a standing-room-only DSI seminar on January 15, 2019, Dr. Massimo Mascaro reviewed what has changed dramatically in the world of machine learning (ML) in the last five years and how the new techniques have enabled unthinkable advances in applications of artificial intelligence (AI) at Google. He outlined some of the most interesting emerging techniques that have the potential of further revolutionizing AI usage in the near future, particularly in the world of engineering and science. The seminar closed with some consideration on hardware and software demands for large-scale modern AI workloads.
AI is “changing Google from the bone,” Dr. Mascaro said. Nearly every employee receives ML training, and every Google product has at least some ML component. Google Photos can now find images by a keyword search, Gmail can formulate its own automated responses by learning writing styles from the user, and deep neural networks are revolutionizing Google’s search rankings and Waymo’s self-driving cars. (Waymo is owned by Google’s parent company Alphabet.)
Google has also applied ML to science and engineering problems, helping NASA find exoplanets by recognizing signatures in data from the Transiting Exoplanet Survey Satellite (TESS). Deep learning has been used with brain imaging to analyze neural connections and better understand how the brain works, and it is performing some tasks better than humans, such as detecting diabetic retinopathy from retinal images.
The biggest advancement coming down the pike, Dr. Mascaro explained, is deep reinforcement learning, where programmers create a learning loop that allows the AI to come up with its own solutions to problems with zero input from humans. “Agents” based on computer models perform repetitive actions and receive feedback (rewards) for figuring out strategies that work, improving as the loop continues.
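The learning loop described above can be sketched in a few lines with tabular Q-learning on a toy corridor, where an agent is rewarded only for reaching the last cell. This is a simplified illustration rather than anything shown in the seminar: it uses a lookup table where deep reinforcement learning would use a neural network, and the environment, hyperparameters, and names are all assumptions.

```python
import random

N_STATES = 5            # corridor cells 0..4; reaching cell 4 ends the episode
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1

def step(state, action):
    """Toy environment: move one cell; reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice, breaking ties between equal values at random."""
    if rng.random() < epsilon or q_row[LEFT] == q_row[RIGHT]:
        return rng.choice((LEFT, RIGHT))
    return LEFT if q_row[LEFT] > q_row[RIGHT] else RIGHT

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Act, observe the reward, update the value estimate, and repeat."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(1000):  # safety cap on episode length
            action = choose_action(q[state], epsilon, rng)
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == GOAL:
                break
    return q

q_table = train()
greedy_policy = ["right" if row[RIGHT] > row[LEFT] else "left"
                 for row in q_table[:GOAL]]
```

No one tells the agent which way to walk; the preference for moving toward the goal emerges from the rewards fed back through the loop.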
In his role as Technical Director of Applied AI in the Office of the CTO for Google Cloud, Dr. Mascaro helps VIP customers reimagine the production of goods and services and how value is exchanged in free markets by leveraging the power of AI and the Google technologies that enable it. Prior to Google, he worked at Intuit where he founded and led the data science group as Chief Data Scientist and Director of Data Engineering for the Consumer Group. In that role, he was responsible for all TurboTax analytics data ingestion systems and worked on many challenging but rewarding predictive analytics and personalization features that power TurboTax and help tens of millions of people do their taxes more easily. Before Intuit, Dr. Mascaro worked as lead of the R&D group of Intellisis, a small San Diego startup that builds advanced speech processing software for various U.S. government and defense entities.
Dr. François Lanusse, a postdoctoral fellow at the Berkeley Center for Cosmological Physics and the Foundation of Data Analysis institute at UC Berkeley, presented a DSI seminar on November 29, 2018. The upcoming generation of cosmological surveys such as the Large Synoptic Survey Telescope (LSST) will aim to shed some much-needed light on the physical nature of dark energy and dark matter by mapping the Universe in great detail and on an unprecedented scale. While this implies a great potential for discoveries, it also involves new and outstanding challenges at every step of the science analysis, from image processing to the cosmological inference.
Dr. Lanusse discussed how these challenges can be addressed with some of the latest developments in Deep Learning, in particular graph neural networks, deep generative models, and neural density estimation. At the image level, he demonstrated how deep convolutional networks can outperform human accuracy on tasks such as finding rare strong gravitational lenses, a problem which used to require significant human visual inspection. Another important aspect of the analysis of modern surveys is the ability to generate realistic mocks of the observations. In situations where physical models either do not exist or are intractable, he presented how deep generative models can be used as an alternative—for example, learning to generate realistic galaxy intrinsic alignments inside large-volume cosmological simulations. The presentation concluded with an explanation of how neural density estimation can be used for performing dimensionality reduction and inference in a likelihood-free setting. This allows the building of complex summary statistics of the data—which can be more sensitive to cosmological models than conventional 2pt statistics—for use in a consistent Bayesian framework.
Dr. Lanusse is a member of the LSST Dark Energy Science Collaboration (DESC). Most of his current research is focused on exploring new applications of the latest machine learning and statistical signal processing techniques for future large-scale cosmological surveys. He holds a PhD in astrophysics from Paris-Saclay University as well as an engineering degree from CentraleSupelec.
The DSI and LLNL’s Center for Global Security Research co-sponsored a November 7, 2018, seminar presented by Dr. Lisa Garcia Bedolla, director of the Institute of Governmental Studies and a professor in the Graduate School of Education at the University of California, Berkeley. The growth of data science, both in terms of the availability of massive data sources as well as powerful computational methods for analyzing them, opens up new possibilities for scientific advancement. In the social sciences, it raises the possibility that scholars can address a longstanding lack of high-quality information about the social, political, and economic status of marginal populations.
However, all social data has weaknesses and biases, regardless of the size of the data set. Dr. Bedolla’s talk explored the new possibilities big data has opened up within the social sciences with tools such as social network analyses and geospatial information systems, among others. Yet, the transformational potential of data science to advance social well-being can only be realized if scholars are mindful of the potential for these new approaches to re-inscribe bias and misrepresentations of vulnerable populations. The seminar concluded with practical suggestions for researchers to take into consideration as they embark on this work.
Dr. Bedolla studies why people choose to engage politically, using a variety of social science methods—field observation, in-depth interviews, survey research, field experiments, and geographic information systems—to shed light on this question. Her research focuses on how marginalization and inequality structure the political and educational opportunities available to members of ethno-racial groups, with a particular emphasis on the intersections of race, class, and gender. Her current projects include an analysis of how technology can facilitate voter mobilization among voters of color in California and a historical exploration of the race, gender, and class inequality at the heart of the founding of California’s public school system.
Watch a video of Dr. Bedolla's presentation on YouTube.
Accurate network traffic prediction could help reroute traffic efficiently to prevent network crashes and link failures. In recent years, deep learning has been at the forefront of learning from sequential data, notably with the success of the long short-term memory (LSTM) network. In an October 29, 2018, DSI seminar, Dr. Mariam Kiran of Lawrence Berkeley National Lab discussed LSTM architecture.
While LSTMs have been applied to network traffic data, their capabilities have only extended to predicting a single bandwidth value, which does not provide enough context for a comprehensive traffic routing algorithm. The seminar presented a sequence-to-sequence (seq2seq) LSTM architecture that predicts network traffic over multiple hourly intervals into the future. Dr. Kiran’s method uses sliding windows with optimal lookback lengths to predict traffic bandwidth 8 hours into the future. The performance of this architecture is demonstrated on Simple Network Management Protocol (SNMP) data from the Energy Sciences Network (ESnet) to understand and predict traffic across ESnet links.
Dr. Kiran belongs to both the ESnet and Computational Research Division groups. Her research is focused on automating and improving usage of distributed networks and related facilities to enable high-performance science applications. Developing methods from machine learning, multi-agent control, and optimization, her work aims to improve network operations and application performance in high-speed transfers.
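The sliding-window setup mentioned above can be sketched as follows. This is a minimal illustration of how an hourly series might be cut into input and target windows for a sequence-to-sequence model; the helper `make_windows`, the 24-hour lookback, and the synthetic data are assumptions, and only the 8-hour horizon comes from the seminar summary. The LSTM itself is not shown.

```python
def make_windows(series, lookback, horizon):
    """Split an hourly series into (input window, target window) pairs."""
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        x = series[start:start + lookback]                       # past values fed to the model
        y = series[start + lookback:start + lookback + horizon]  # future values to predict
        pairs.append((x, y))
    return pairs

# Synthetic stand-in for three days of hourly bandwidth readings.
hourly_traffic = [float(hour % 24) for hour in range(72)]
pairs = make_windows(hourly_traffic, lookback=24, horizon=8)
```

Each pair gives the model `lookback` past hours as input and the next `horizon` hours as the target, so a single forward pass yields a multi-hour forecast instead of one value.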
Dr. Gerald Quon of the University of California at Davis visited LLNL on October 10, 2018, to present a seminar titled “Using Deep Neural Networks and Generative Models to Characterize Transcriptional Signatures in Human Cells.” Single-cell RNA sequencing (scRNA-seq) technologies are quickly advancing our ability to characterize the transcriptional heterogeneity of biological samples, given their ability to identify novel cell types and characterize precise transcriptional changes during previously difficult-to-observe processes such as differentiation and cellular reprogramming. An emerging challenge in scRNA-seq analysis is the characterization of cell type-specific transcriptional responses to stimuli, when similar collections of cells are assayed under two or more conditions, such as in control/treatment or cross-organism studies.
Quon presented a computational strategy for identifying cell type-specific responses using a novel deep neural network for domain adaptation and transfer learning. Compared to other existing approaches, this one does not require identification of all cell types before alignment and can align more than two conditions simultaneously. He discussed ongoing applications of the model to two problem domains: (1) characterizing hematopoietic progenitor populations and their response to inflammatory challenges (LPS), in which Quon’s team has identified putative subpopulations of long-term HSCs that differentially respond to the challenge, and (2) characterizing the malaria cell cycle process, in which they identified transcriptional changes associated with sexual commitment. Quon also discussed his lab’s work in building deep generative models of transcriptional plasticity, which aims to reprogram cancer cells from a malignant to non-malignant phenotype.
On September 5, 2018, the DSI hosted Dr. Yong Jae Lee from the University of California at Davis for a seminar titled “Learning to Localize and Anonymize Objects with Indirect Supervision.” Lee’s computer science research team explores innovative approaches to visual recognition, including two indirect supervision methods he described for the LLNL audience: (1) scalable object localization and (2) anonymization while preserving action information.
Computer vision has made great strides for problems that can be learned with direct supervision, in which the goal can be precisely defined (e.g., drawing a box that tightly fits an object). However, direct supervision is often not only costly, but also challenging to obtain when the goal is more ambiguous. Lee discussed his team’s recent work on learning with indirect supervision by first presenting an approach that learns to focus on the relevant image regions given only indirect image-level supervision (e.g., an image tagged with “car”). This is enabled by a novel data augmentation technique that hides image patches randomly.
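The patch-hiding augmentation described above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: it assumes a grayscale image stored as a list of rows, a fixed square patch grid, and a constant fill value, and the names `hide_patches`, `patch_size`, `hide_prob`, and `fill` are hypothetical.

```python
import random

def hide_patches(image, patch_size, hide_prob, rng, fill=0):
    """Return a copy of a 2D image with randomly chosen grid patches set to `fill`."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input image untouched
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Each grid cell is hidden independently with probability hide_prob.
            if rng.random() < hide_prob:
                for r in range(top, min(top + patch_size, height)):
                    for c in range(left, min(left + patch_size, width)):
                        out[r][c] = fill
    return out

rng = random.Random(0)
image = [[1] * 8 for _ in range(8)]  # toy 8x8 image
masked = hide_patches(image, patch_size=4, hide_prob=0.5, rng=rng)
```

Because a different set of patches is hidden each time an image is seen, a classifier trained on such images cannot rely on a single most discriminative region, which is one plausible reading of why the technique helps localization.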
Second, Lee described an approach that learns to anonymize sensitive video regions while preserving activity signals in an adversarial framework. It accomplishes this by simultaneously optimizing for the indirectly-related task of misclassifying face identity and maximizing activity detection accuracy. His team showed that their anonymization method leads to superior performance compared to conventional hand-crafted anonymization methods including masking, blurring, and noise adding.
In the 17th century, physician Marcello Malpighi observed the existence of patterns of ridges and sweat glands on fingertips. This was a major breakthrough and originated a long and continuing quest for ways to uniquely identify individuals based on fingerprints. In the modern era, the concept of fingerprinting has expanded to other sources of data, such as voice recognition and retinal scans. It is only in the last few years that technologies and methodologies have achieved high-quality data for individual human brain imaging, and the subsequent estimation of structural and functional connectivity. In this context, the next challenge for human identifiability is posed on brain data, particularly on brain networks, both structural and functional.
In an August 9, 2018, DSI seminar, Dr. Joaquin Goni of Purdue University presented his work showing how the individual fingerprint of a connectome (as represented by a network) can be uncovered (or in a way, maximized) from a reconstruction procedure based on group-wise decomposition in a finite number of brain connectivity modes. By using data from the Human Connectome Project, Goni introduced different extensions of this work, including subject identifiability, heritability analysis of brain networks, as well as identifiability when assessing inter-task brain functional networks. Finally, results on this framework for inter-scan identifiability based on a second dataset acquired at Purdue University were also discussed.
Brain tumor incidence is expected to rise by 6% over the next 20 years. Nearly 79,000 patients will be diagnosed in the U.S. this year alone. In a DSI seminar on August 2, 2018, Dr. Maryam Vareth outlined the University of California at San Francisco’s (UCSF’s) efforts to improve brain tumor outcomes through data-driven medicine.
Standard magnetic resonance imaging (MRI) is a mainstay of brain tumor diagnosis and evaluation, but it poses challenges when clinicians attempt to distinguish treatment effects from recurrent tumors. More advanced imaging is needed to better define tumor regions so that radiation treatments can target areas with high probability of recurrence.
Vareth described a potential solution called magnetic resonance spectroscopic imaging (MRSI)—static metabolic imaging that zeroes in on tumor chemistry. With MRSI, clinicians can identify metabolic changes in the brain earlier than when a recurrent tumor would show up with standard MRI. Moreover, MRSI is noninvasive and can be performed on a regular MRI machine.
The MRSI process creates indices of signals from choline, creatine, N-acetyl-aspartate, lipid, and lactate. Data can then be analyzed in a map of voxels (3D pixels). With an inherently low signal, however, MRSI scans take a long time, especially if clinicians need to scan the entire brain, not just one region. Faster MRSI scans will help encourage clinicians to adopt this type of imaging.
Vareth’s team is developing a fast-trajectory MRSI analysis method to reduce scan time significantly. “An MRI is a very expensive Fourier transform machine,” she explained, so acceleration can be achieved through modified k-space sampling (below the Nyquist rate) of raw data. This process involves compressed sensing and parallel imaging as well as weighting images according to their sensor proximity (i.e., sensitivity is higher closer to a sensor within the machine).
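To see why sampling k-space below the Nyquist rate needs a dedicated reconstruction step, consider the following one-dimensional toy example. It is a sketch under stated assumptions, not the team's method: the signal, the regular skip-every-other-sample pattern, and the function names are all hypothetical, and no compressed sensing or parallel imaging is performed. It only shows the aliasing that a naive zero-filled reconstruction produces.

```python
import cmath

def dft(signal):
    """Discrete Fourier transform: signal domain to k-space."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse discrete Fourier transform: k-space back to the signal domain."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def undersample(spectrum, keep_every):
    """Keep every `keep_every`-th k-space sample and zero-fill the rest."""
    return [value if k % keep_every == 0 else 0 for k, value in enumerate(spectrum)]

# Toy 1D "image" with two bright points.
signal = [0.0] * 16
signal[3] = 1.0
signal[10] = 0.5

kspace = dft(signal)
full = [abs(v) for v in idft(kspace)]                         # all samples: exact recovery
zero_filled = [abs(v) for v in idft(undersample(kspace, 2))]  # half the samples: aliased
```

With half of the samples missing, each bright point appears at half intensity together with a ghost copy shifted by half the field of view. Techniques such as those named in the paragraph exist to remove this kind of ambiguity while keeping the shorter acquisition.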
Vareth and her UCSF colleagues are working toward “super-resolution” of MRSI and exploring the potential of deep learning to further enhance image quality while reducing scan duration. The team has developed software, called SIVIC, for processing automated prescription and reconstruction of MRSI data. SIVIC is available on GitHub.
On July 31, 2018, the DSI continued its seminar series with a talk on gradient optimization for black-box functions of random variables. University of Toronto Ph.D. candidate and LLNL alumnus Will Grathwohl presented “Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation.”
Existing gradient-based optimization methods have advantages and disadvantages, and none offers unbiased, low-variance gradient estimates for arbitrary black-box functions. For example, the REINFORCE estimator is unbiased and works on any function but has high variance. The REPARAMETERIZATION and CONCRETE estimators achieve lower variance but require the black-box function to be known and differentiable.
Grathwohl’s team created an improved general gradient estimator by combining REINFORCE and REPARAMETERIZATION in a control variate framework. Their approach, named LAX, begins with the REINFORCE estimator of the black-box function, introduces a surrogate function with desired properties, subtracts the REINFORCE estimator of the surrogate, and then adds the REPARAMETERIZATION estimator of the surrogate. These steps make it possible for LAX to achieve an unbiased, low-variance estimate of arbitrary black-box function gradients.
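The combination of estimators described above can be tried on a one-dimensional toy problem. The sketch below assumes a Gaussian variable b ~ N(theta, 1), a quadratic black-box function, and a fixed, hand-picked surrogate rather than a learned one; all function names and constants are hypothetical and chosen only so the true gradient is known in closed form.

```python
import random
import statistics

def black_box(b):
    # Treated as a black box: only evaluated, never differentiated.
    return (b - 3.0) ** 2

def surrogate(b):
    # Hand-picked differentiable approximation of the black box (an assumption).
    return (b - 2.5) ** 2

def surrogate_grad(b):
    return 2.0 * (b - 2.5)

def gradient_samples(theta, n, seed=0):
    """Draw n single-sample gradient estimates from REINFORCE and from the combined estimator."""
    rng = random.Random(seed)
    reinforce, lax = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        b = theta + eps        # reparameterized sample, b ~ N(theta, 1)
        score = b - theta      # d/dtheta of log N(b; theta, 1)
        reinforce.append(black_box(b) * score)
        # REINFORCE of f, minus REINFORCE of the surrogate,
        # plus the reparameterization gradient of the surrogate (db/dtheta = 1 here).
        lax.append((black_box(b) - surrogate(b)) * score + surrogate_grad(b))
    return reinforce, lax

reinforce, lax = gradient_samples(theta=0.0, n=20000)
true_grad = 2.0 * (0.0 - 3.0)  # d/dtheta E[(b - 3)^2] = 2 * (theta - 3)
```

Both lists average out to roughly the true gradient, but the combined estimator's samples are far less spread out. `statistics.variance(lax)` and `statistics.variance(reinforce)` make the difference visible, and a closer surrogate shrinks the variance further.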
The team also developed an extension of this estimator—called RELAX—that introduces a relaxed distribution to handle black-box functions of discrete random variables. Both LAX and RELAX were tested alongside other high-variance gradient estimators and attained lower variance, which significantly reduced optimization time compared with other existing methods. This work was recently published (PDF) at the 2018 ICLR Workshop (the Sixth International Conference on Learning Representations).
The DSI hosted Dr. Paul Gamble from Lab 41 for a seminar on June 8, 2018. Gamble presented two machine learning-based approaches developed by his team to detect and distinguish various forms of genetic engineering. Synthetic biology has recently become increasingly common: it has been used to drive down costs in perfumes, detect pollution, produce vaccines, and treat agricultural waste while simultaneously reducing greenhouse emissions by 75% in some cases. However, its rapid rise has also created new dangers, including biohacking and engineered bioweapons with increased virulence.
At Lab 41, Gamble is developing a machine learning pipeline for detecting synthetically engineered DNA. He also studies methods for detecting and defending against adversarial attacks on neural networks. Gamble received an M.D. and a Master’s in Biomedical Engineering from Washington University in St. Louis. During medical school, he developed nerve-computer interfaces and tested them in animal models. His research also focused on applying machine learning to clinical practice—he built a computer vision system to assist radiation oncologists with organ contouring and radiation dosimetry planning.
The DSI sponsored a seminar on May 22, 2018, featuring Dr. Andreas Zoglauer of the UC Berkeley Institute for Data Science. Zoglauer works with Berkeley’s Space Sciences Laboratory on the NASA-sponsored project COSI—the Compton Spectrometer and Imager, a balloon-borne gamma-ray telescope. COSI’s science objectives focus on galactic nucleosynthesis and the polarization of gamma-ray bursts caused by astronomical events such as neutron star mergers and core-collapse supernovae of heavily rotating massive stars. COSI’s 2016 flight around the southern hemisphere generated data that Zoglauer’s team continues to analyze.
According to Zoglauer, gamma-ray astronomy research relies heavily on data science and statistics. To analyze the data from COSI’s detectors, he developed an open-source toolkit called MEGAlib (Medium-Energy Gamma-ray Astronomy Library), which has applications beyond astrophysics in nuclear medicine and nuclear monitoring. MEGAlib enables researchers to perform Monte Carlo simulations of their detectors, reconstruct Compton events, and create images based on Compton scattering data. Zoglauer stated that COSI’s biggest computational challenge is generating up to 9-dimensional response files with Monte Carlo simulations for the reconstruction of all-sky images. Those simulations were performed on Berkeley Lab’s Cori supercomputer.
With the help of data science undergraduates, Zoglauer is applying machine learning methods such as random forests and neural networks to COSI data. Research projects include determining photon paths in the germanium detectors, finding interaction locations in the detectors, and identifying not-contained gamma rays. Zoglauer outlined several lessons learned through his team’s work with machine learning tools, such as the importance of preparing data, splitting a big research question into smaller questions, and verifying that the trained neural networks have no “blind spots.” Researchers using machine learning algorithms should also expect “a lot of trial and error” in finding the best input data representation.
COSI is preparing for another flight in 2019–2020. An upgraded version, COSI-X, is planned for launch in 2022 with additional detectors, better shielding, and improved resolution.
In a Galaxy Not So Far Away
- In space, gamma rays are generated by radioactive decays, annihilation, and charged particle interactions. Astronomical sources include pulsars, supernovae, and the regions near black holes. In our Milky Way, the Crab Nebula, Cygnus X-1, and the area around the center of our galaxy, known for its 511-keV positron annihilation emission, are of particular interest to the COSI team.
- Germanium (Ge, atomic number 32) is a semiconductor used to detect gamma rays. COSI’s detector array consists of 12 Ge detectors, each measuring 8 × 8 × 1.5 centimeters, combined with specialized cooling and shielding systems.
- Compton scattering refers to photons scattering off electrons and, thus, transferring momentum to them. Arthur Holly Compton received the Nobel Prize in 1927 for the discovery of this “Compton effect.” COSI measures gamma rays via multiple Compton interactions in its germanium detectors.
- Powered by solar panels and a 300-foot super-pressure helium balloon, COSI took off from New Zealand and flew around Antarctica and the Pacific Ocean before landing in Peru. The trip lasted 46 days. According to Zoglauer, the southern hemisphere provides a good view of the center of the Milky Way.
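The Compton scattering mentioned in the list above follows a standard textbook relation between the photon's energy loss and its scattering angle. The sketch below evaluates that relation for a free electron at rest; it is not taken from MEGAlib, the function names are hypothetical, and the electron rest energy is rounded to 511 keV.

```python
import math

ELECTRON_REST_ENERGY_KEV = 511.0  # m_e * c^2, rounded

def scattered_photon_energy(energy_kev, angle_rad):
    """Photon energy after Compton scattering off a free electron at rest."""
    ratio = energy_kev / ELECTRON_REST_ENERGY_KEV
    return energy_kev / (1.0 + ratio * (1.0 - math.cos(angle_rad)))

def electron_recoil_energy(energy_kev, angle_rad):
    """Energy handed to the electron, by energy conservation."""
    return energy_kev - scattered_photon_energy(energy_kev, angle_rad)

# A 511-keV photon scattering forward (0 degrees) and straight back (180 degrees).
forward = scattered_photon_energy(511.0, 0.0)
backscatter = scattered_photon_energy(511.0, math.pi)
```

A forward-scattered photon keeps all of its energy, while a 511-keV photon that scatters straight back keeps one third of it. Measuring the energy deposited at each interaction is what lets a Compton telescope constrain the direction of the incoming gamma ray.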
The DSI welcomed Dr. Philip Kegelmeyer from Sandia National Laboratories on April 23, 2018, for a presentation titled “Machine Learning Adversarial Label Tampering: Design and Detection.” Attacks on machine learning include distortion, hiding, or manipulation of data. The presentation focused on falsely labeled data with examples of empirical methods for “quantified paranoia.”
The chief danger in a data label tampering attack is that even a small amount of tampering can greatly decrease accuracy in a fashion that cannot be detected in advance. Kegelmeyer’s team at Sandia has created several heuristics for generating such attacks. A simple but effective example is the “brute clustering” attack, in which all the data points in a single cluster are relabeled before moving on to the next cluster. Defenses against these attacks exist, though they are relatively weak. Kegelmeyer described one such defense, dubbed “quantified paranoia,” a statistical technique that uses pseudo-Bayes factors to signal the presence of label tampering.
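The cluster-by-cluster relabeling idea described above can be sketched as follows. This is a loose illustration under several assumptions that the summary does not specify: labels are binary, clusters are supplied as lists of indices in the order they should be attacked, and the attacker has a fixed budget of flips. The function name `brute_clustering_attack` is hypothetical and does not refer to Sandia's code.

```python
def brute_clustering_attack(labels, clusters, budget):
    """Flip binary labels one cluster at a time until the flip budget is spent."""
    tampered = list(labels)  # leave the clean labels untouched
    flipped = []
    for cluster in clusters:
        for index in cluster:
            if len(flipped) >= budget:
                return tampered, flipped
            tampered[index] = 1 - tampered[index]
            flipped.append(index)
    return tampered, flipped

# Toy data set: eight points grouped into three clusters.
labels = [0, 0, 0, 1, 1, 1, 0, 1]
clusters = [[0, 1, 2], [3, 4, 5], [6, 7]]
tampered, flipped = brute_clustering_attack(labels, clusters, budget=4)
```

Concentrating the flips inside one cluster at a time corrupts a whole region of feature space, whereas scattering the same number of flips at random would leave each region mostly clean.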