Together with LLNL’s Center for Applied Scientific Computing (CASC), the DSI welcomed a new academic partner to the 2021 Data Science Challenge (DSC) internship program: the University of California (UC) Riverside campus. The intensive program has run for three years with UC Merced, and it tasks undergraduate and graduate students with addressing a real-world scientific problem using data science techniques such as machine learning (ML) and statistical analysis.
In past years, students have tackled publicly available materials science and precision medicine datasets, using ML to classify potential explosive materials by their molecular structures and to predict antibody/antigen interactions, respectively. This year’s Challenge focused on detecting near-Earth objects (NEOs), such as asteroids, that could threaten global security. UC Riverside’s DSC followed the same three-week virtual format as UC Merced’s DSC earlier in the summer, including technical seminars, virtual tours of Lab facilities, one-on-one mentoring, and team-based activities.
DSC director Brian Gallagher joined forces with UC Riverside associate professor Vagelis Papalexakis to guide the 23 students from August 30 to September 17. Lab mentors included Indra Chakraborty, Ryan Dana, Hyojin Kim, Kerianne Pruett, and Braden Soper. Researchers James Buchanan, Will Dawson, Amanda Muyskens, Brenden Petersen, Marisa Torres, and Travis Yeager joined Dana and Pruett in supporting the program with tutorials and seminars. DSI administrator Jennifer Bellig coordinated the program’s onboarding and activities.
When not working on the Challenge itself, students heard a panel discussion by former interns who are now full-time LLNL staff. They also attended a resume-writing seminar and learned about the DSI and CASC from the respective directors, Michael Goldman and Jeffrey Hittinger.
Object Classification and Detection
Smart planetary defense strategies require space situational awareness. Of the millions of orbiting asteroids and comets in the Solar System, several thousand are potentially hazardous to Earth. “The larger the NEO, the easier it is to detect, but the more damage it could inflict,” Dana explained to the students. Finding these objects within telescope images is difficult: They reflect sunlight instead of emitting their own light, and brightness alone does not correspond to size.
The first phase of the DSC involved distinguishing stars and galaxies within a dataset of 34,000 images from the Subaru Telescope in Hawaii. Working in five teams, each led by a graduate student, interns built image classifiers using convolutional neural networks (CNNs) and other deep learning techniques. The exercise included normalizing the data, labeling the images, and training the classifier. The second phase of the Challenge required students to create and apply detection algorithms to a dataset “injected” with asteroids. This dataset, from California’s Zwicky Transient Facility, contained 1,000 difference images, each seeded with 20 asteroids. Again, the students processed the data, developed and trained a deep learning model, and validated their results.
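One advantage of an injected dataset is that the true asteroid positions are known, so detections can be scored directly. The following is a minimal sketch of that validation idea, not the students' actual method: the function name `score_detections`, the pixel tolerance, and the coordinates are all hypothetical and chosen only for illustration.

```python
import math


def score_detections(injected, detected, tolerance=2.0):
    """Match injected sources to detections and return (recall, precision).

    Each injected (x, y) position is greedily paired with the nearest
    unused detection lying within `tolerance` pixels (an assumed value).
    """
    unused = list(detected)
    matched = 0
    for ix, iy in injected:
        best = None
        best_dist = tolerance
        for cand in unused:
            dist = math.hypot(cand[0] - ix, cand[1] - iy)
            if dist <= best_dist:
                best, best_dist = cand, dist
        if best is not None:
            unused.remove(best)  # a detection can explain only one source
            matched += 1
    recall = matched / len(injected) if injected else 0.0
    precision = matched / len(detected) if detected else 0.0
    return recall, precision


# Hypothetical pixel coordinates: four injected sources, three detections.
injected = [(10.0, 12.0), (40.0, 8.0), (25.0, 30.0), (5.0, 45.0)]
detected = [(10.5, 11.8), (39.0, 8.5), (60.0, 60.0)]
recall, precision = score_detections(injected, detected)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case two of the four injected sources are recovered and one detection is spurious. Recall measures how many seeded asteroids a model finds, while precision measures how many of its detections are real.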
Dana instructed the students on the basics of CNNs and provided a tutorial on the PyTorch and TensorFlow software libraries used to create and train neural networks. In addition, Gallagher noted, “The UC Merced Challenge taught us we needed to provide more astronomy context on the first day.” Accordingly, Pruett presented the principles of asteroid formation, optical astronomy technologies, and difference imaging, a technique that uncovers transient phenomena (such as asteroids) by comparing observations of the same region of space at different times.
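The core idea of difference imaging can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline used in the Challenge: real difference imaging also aligns the two exposures and matches their blur before subtracting, and the function names, the 5-sigma threshold, and the tiny 5×5 synthetic images below are hypothetical.

```python
import statistics


def difference_image(science, reference):
    """Subtract a reference exposure from a new exposure, pixel by pixel."""
    return [[s - r for s, r in zip(srow, rrow)]
            for srow, rrow in zip(science, reference)]


def find_transients(diff, n_sigma=5.0):
    """Return (row, col) of pixels well above the residual noise level.

    The noise scale is estimated from the median absolute deviation (MAD),
    which is not skewed by the few bright pixels being searched for.
    If the MAD is zero, every pixel above the median is flagged.
    """
    flat = [v for row in diff for v in row]
    center = statistics.median(flat)
    mad = statistics.median([abs(v - center) for v in flat])
    threshold = center + n_sigma * 1.4826 * mad  # 1.4826 * MAD ~ Gaussian sigma
    return [(r, c) for r, row in enumerate(diff)
            for c, v in enumerate(row) if v > threshold]


# Synthetic data: a static star field, small noise, and one new source.
reference = [[100, 101, 99, 100, 102],
             [98, 100, 101, 99, 100],
             [100, 102, 100, 98, 101],
             [101, 99, 100, 100, 99],
             [100, 100, 98, 101, 100]]
noise = [[1, -1, 0, 2, -1],
         [0, 1, -2, 1, 0],
         [-1, 0, 1, 0, 2],
         [1, -1, 0, -2, 1],
         [0, 2, -1, 1, -1]]
science = [[r + n for r, n in zip(rrow, nrow)]
           for rrow, nrow in zip(reference, noise)]
science[2][3] += 50  # the transient appears only in the new exposure

candidates = find_transients(difference_image(science, reference))
print(candidates)
```

Because the static sources cancel in the subtraction, only the pixel that changed between exposures stands out against the residual noise.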
“This is not a competition,” Dana emphasized during the DSC kickoff. “The challenge is to push yourself, work with data you haven’t worked with before, and learn new skills. If you can walk away from this program accomplishing more than you thought you ever could in only three weeks, that’s success.”
Rutuja Gurav, a PhD student in computer science and a DSC team leader, was excited to apply data science techniques to astronomy problems. “I have a chance to innovate and perhaps shape the field in a small way, which will hopefully be useful in making scientific progress,” she stated. Gurav’s team tried multiple approaches to classification besides CNNs, including Gaussian process and naïve Bayes models. Other teams explored probabilistic models and decision trees.
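As an illustration of one non-CNN approach mentioned above, here is a minimal Gaussian naïve Bayes classifier. It is a sketch under assumptions rather than any team's implementation: the two features (a concentration measure and an apparent size) and their values are hypothetical stand-ins for quantities that would have to be extracted from the images first.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(features, labels):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (math.log(len(rows) / len(labels)), means, variances)
    return model


def predict(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, var in zip(row, means, variances):
            # Log of a Gaussian density, with features treated as independent.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical features per object: [concentration, apparent size].
train_x = [[0.90, 1.1], [0.85, 1.0], [0.95, 1.2],   # compact sources
           [0.40, 3.0], [0.50, 3.5], [0.35, 2.8]]   # extended sources
train_y = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]

model = fit_gaussian_nb(train_x, train_y)
print(predict(model, [0.88, 1.05]), predict(model, [0.45, 3.2]))
```

Unlike a CNN, this model needs hand-chosen features, but it trains instantly and its per-class means and variances are easy to inspect.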
During final presentations, teams explained why they chose their methods, how they developed and trained their models, the accuracy of their results, and what they would do differently with additional time. Some results were more accurate than others, prompting Gallagher to encourage students to work through failure and formulate the right question when building ML models.
With students having different levels of experience, the DSC’s team format echoes the value of multidisciplinary teamwork at the Lab. Papalexakis noted that the variety within teams contributed to the breadth of solutions they presented. “Cross-disciplinary projects take a bit longer in the beginning when everyone has to establish a common language, but they tend to be extremely rewarding. A lot of cool stuff happens along the way,” he said. Pruett added, “There is something for everyone to learn, and it’s a different experience for each student.”
The program’s immersion and intensity enabled teams to make tangible progress in both technical and “soft” skills in just three weeks. Gallagher stated, “It’s not unusual for students to encounter doubts about their own abilities, as we all do. It’s great seeing them present their work at the end, having realized they can face a challenging situation and overcome it, as well as contribute to a larger team effort.” Gurav agreed: “Research is seldom a solitary endeavor, and the DSC provided me an opportunity to hone my team-building skills.”
Expanding the Academic Collaboration
DSC organizers aim to provide opportunities for students who may not otherwise consider a graduate degree or career in STEM (science, technology, engineering, math) fields. LLNL’s long-standing partnership with the UC system was a natural avenue for this effort, and leaders from the DSI and CASC were eager to expand their organizations’ academic outreach.
Another university means another chance to reach a different community of future data scientists. According to Goldman, “One benefit of the pandemic is that we’ve learned how to scale our activities to include more students.” Each year’s Challenge problem, and the effort of designing it, can be reused.
Gallagher, who leads CASC’s Data Science and Analytics Group, explained, “UC Merced’s student body represents multiple groups that are historically underrepresented in STEM. We looked to UC Riverside next because it also has a diverse student population, and because I already knew Professor Papalexakis in the Computer Science and Engineering Department.”
Papalexakis had met Gallagher at data science conferences and via overlapping collaborations. “When Brian reached out to me about organizing the Challenge with UC Riverside, I was happy to get on board,” Papalexakis said. “It was the perfect mix of working on an interesting project with him while building a program in the service of our students.”
Both Papalexakis and Gallagher look forward to continuing the DSC in 2022 and have high expectations for this generation of data scientists. “When you see all of the students’ presentations together in one day, their accomplishments are impressive,” Gallagher stated. “It’s great for the students to see how much they can accomplish in a short time with just a little bit of structure and teamwork.”