Guiding students as they tackle challenging problems is a lot of fun. Our students are incredibly talented and usually only need small hints to discover new ways of doing things that I didn’t anticipate.
—Dave Buttler, LLNL computer scientist and DSSI mentor
As a DSSI intern, you'll work on real projects with real data that represent the breadth and depth of data science research at LLNL. Our students tackle Challenge Problems that leverage large and varied datasets used in or generated from actual LLNL projects. Examples include building networks from interaction data, large-scale data mining for predictive medicine, drug discovery using HPC simulations, video data summarization and classification, energy efficiency analysis using HPC, and classification and forward modeling of hyperspectral data. This curriculum helps students build technical experience and teamwork skills.
Our interns are also paired with mentors—experts across many data science fields at the Lab—whose projects align with students' skills and interests. Check out some of the recent Challenge Problems and mentors below.
2020 Challenge Problems
The Materials Science Division employs state-of-the-art experimental, theoretical, and computational tools to support the Lab's missions. Success requires timely development and deployment of diverse materials including nanomaterials, small molecules, polymers, metals, alloys, composites, ceramics, and semiconductors.
Discovering, optimizing, and deploying materials at scale takes significant time and effort, and data science techniques play a large role in shortening the development cycle. We combine design, simulation, experiments, characterization, data analytics, and machine learning to accelerate materials discovery, optimize processes, and tune materials performance. One method we’re using is natural language processing, which extracts relevant text and information from scientific literature so that we can evaluate the synthesis conditions, chemicals, and ingredients used to create new materials. For example, from a research paper describing the synthesis of silver nanowires, we can extract the key chemicals (AgNO3, ethanol, and PVP).
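As a rough illustration of the extraction step described above, here is a minimal sketch that pulls chemical names out of a sentence. The example sentence, the `KNOWN_REAGENTS` list, and the `extract_chemicals` helper are all hypothetical and written only for this illustration. A real pipeline would likely need a curated chemical dictionary or a trained entity recognizer rather than a regular expression.

```python
import re

# Hypothetical sentence written for illustration (not quoted from a paper).
SENTENCE = (
    "Silver nanowires were prepared by dissolving AgNO3 and PVP in ethanol "
    "and heating the mixture."
)

# Assumed toy list of reagents that are written as ordinary words.
KNOWN_REAGENTS = {"ethanol", "ethylene glycol"}

# Rough pattern for formula-like tokens: two or more element-style pieces,
# each an uppercase letter, an optional lowercase letter, and optional digits.
# It also matches capitalized abbreviations such as PVP, and it will produce
# false positives on unrelated acronyms.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")


def extract_chemicals(text):
    """Return formula-like tokens, then dictionary reagents found in text."""
    found = [match.group(0) for match in FORMULA.finditer(text)]
    lowered = text.lower()
    for name in sorted(KNOWN_REAGENTS):
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append(name)
    return found


print(extract_chemicals(SENTENCE))  # ['AgNO3', 'PVP', 'ethanol']
```

The pattern is deliberately simple, so its main use is to show why extraction quality matters: every missed or spurious chemical becomes noise in the downstream prediction task.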
Challenge: Use data science techniques to identify key ingredients for making nanomaterials.
- Chemical compositions: Au, Ag, Cu, Pt, Pd, Fe2O3
- Morphologies: particles, tubes, wires, rods, crystals, sheets, spheres, cubes, octahedral, flower, ribbon, star, and triangle
- Data: chemical composition, morphologies, and chemicals extracted from 35,000 nanomaterials synthesis papers
- Develop a nanomaterials synthesis prediction algorithm (90%+ accuracy on evaluation).
- Provide a ranked list of chemical importance for each class of composition/morphology.
- Provide a list of dominant chemicals for each morphology, regardless of composition.
- Evaluate the data thoroughly. Is the data sufficient to make good predictions? If not, what else would you need? Develop a tool to get it!
- What should you do about a data skew problem? Does one class (particles) dominate the others? How should you sample your data? How should you set up training and validation/testing?
- Extra credit: Provide uncertainty quantification for each prediction (how confident the model is in the prediction it made).
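One possible way to approach the data skew questions above is sketched here: a stratified train/test split plus inverse-frequency class weights. The label counts are made up for illustration, and the helper names `stratified_split` and `inverse_frequency_weights` are hypothetical. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), and is only one of several reasonable choices.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps a similar share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one example when the class has two or more.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Made-up, deliberately skewed morphology labels.
labels = ["particles"] * 80 + ["wires"] * 15 + ["cubes"] * 5

train_idx, test_idx = stratified_split(labels)
weights = inverse_frequency_weights([labels[i] for i in train_idx])

print(Counter(labels[i] for i in test_idx))
print(weights)
```

Splitting per class keeps rare morphologies present in both sets, and the weights give rare classes more influence during training. Resampling the minority classes is an alternative worth comparing.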
Mentors: Alan Kaplan, Uttara Tipnis, Duy Duong-Tran
Magnetic resonance imaging (MRI) is used in diagnostic radiology to acquire images of the human brain. Functional MRI (fMRI) can be used to infer activity in different brain regions. Connectomics is the study of how regions in the brain interact, and MRI technology is especially valuable for this work because functional connectivity can be derived from fMRI data. The multi-institutional Human Connectome Project aims to build a “network map” of anatomical and functional connectivity within the human brain, producing data that will advance research on brain-related disorders such as schizophrenia, dyslexia, autism, and Alzheimer’s disease. For example, one of the project’s datasets includes comprehensive, high-quality MRI results for 1,200 healthy young adults aged 22–35. One fundamental question is how much of the fMRI activity is unique to an individual and how much is common among individuals. This question can be addressed with data-driven machine learning approaches.
Challenge: Identify whether two given MRI sequences came from the same person.
- Uniqueness of brain function: How different is brain function among healthy young adults? Is the MRI "fingerprint" signature unique across individuals?
- Stability of brain function: What does the brain do during rest or a task? Does the fingerprint change over time?
- Data: parcellated time-series data, 3D images and video (spatial patterns over time), connectome matrices of functional connectivity
- Decide how to process, transform, normalize, and sample the data.
- Build a function/classifier to label the data. Which method(s) work better than others?
- Determine your algorithm's accuracy. How are the data skewed? What improvements can be made?
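The steps above can be sketched with a simple correlation-based baseline: build a functional connectivity vector from each scan's regional time series, then score two scans by how strongly those vectors correlate. This is a toy under stated assumptions, not the method used in the challenge. The data are simulated, and `simulate_scan`, `connectome_vector`, and `same_person_score` are hypothetical names introduced here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def connectome_vector(series):
    """Upper triangle of the region-by-region correlation matrix."""
    r = len(series)
    return [pearson(series[i], series[j]) for i in range(r) for j in range(i + 1, r)]


def same_person_score(scan_a, scan_b):
    """Similarity of two scans' connectivity patterns (higher = more alike)."""
    return pearson(connectome_vector(scan_a), connectome_vector(scan_b))


def simulate_scan(groups, rng, n_timepoints=200, noise=0.5):
    """Toy scan: regions in the same group share one latent signal plus noise."""
    latents = {
        g: [rng.gauss(0, 1) for _ in range(n_timepoints)]
        for g in sorted(set(groups))
    }
    return [
        [latents[g][t] + rng.gauss(0, noise) for t in range(n_timepoints)]
        for g in groups
    ]


rng = random.Random(42)
person_a = [0, 0, 0, 1, 1, 1]  # assumed grouping of six regions for person A
person_b = [0, 1, 2, 0, 1, 2]  # a different assumed grouping for person B

a1 = simulate_scan(person_a, rng)
a2 = simulate_scan(person_a, rng)  # second session, same person
b1 = simulate_scan(person_b, rng)

same = same_person_score(a1, a2)
diff = same_person_score(a1, b1)
print(f"same person: {same:.2f}, different people: {diff:.2f}")
```

Thresholding the score turns it into a same-person classifier. On real data the gap between the two scores would be much smaller than in this simulation, which is where preprocessing, normalization, and accuracy evaluation come in.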
After completing a PhD in Statistics at Penn State in 2016, Jason joined the Applied Statistics Group (ASG) at LLNL as a postdoc and is now a staff member in the group. He enjoys working with DSSI interns on research problems that are application driven and involve statistical computing, uncertainty quantification, and machine learning. For the first virtual DSSI program in 2020, Jason worked with a graduate student on applying reinforcement learning to spacecraft in stochastic environments. In 2018, Jason and fellow ASG member Katie Schmidt worked with a student on Bayesian calibration of material strength models using different data types. On the weekends, he enjoys exploring San Francisco and hiking around the Bay Area.