Department of Energy researchers share data management strategies at first-ever “Data Day”
It’s become something of a mantra of the digital age: Data is the new currency. Especially in science, where it’s hard to find a single project that doesn’t involve generating or consuming massive amounts of data.
Reflecting growing awareness of how critical data management has become across the Department of Energy complex, more than 100 researchers from DOE national laboratories across the country came to Lawrence Livermore National Laboratory (LLNL) Sept. 25–26 for the inaugural DOE Data Day (D3) workshop, a meeting of the minds aimed at discussing challenges and solutions in accessing, curating and sharing data.
During the two-day event, attendees and presenters from 10 national laboratories, the National Nuclear Security Administration (NNSA), Mission Support and Test Services, Pantex Plant, the Y-12 National Security Complex, the DOE Office of Scientific and Technical Information (OSTI) and elsewhere in DOE shared project overviews, examples of their challenges and successes, and lessons learned in handling massive amounts of data. Interspersed among the nearly two dozen talks were focused breakout sessions, working groups and a poster session.
D3 lead organizer Jessie Gaylord said the workshop stemmed from a meeting she had at Sandia Albuquerque, where she realized that computer scientists at other labs were addressing the same issues in dealing with data but doing so in isolation. She proposed Data Day as a venue for DOE scientists to raise awareness about their approaches and pitfalls, in hopes they would see what tools are being applied both at other labs and in disparate program areas.
“What we’re often doing is appropriating data management platforms and methods developed to address commercial or academic needs for use at the national labs, which is not a straight one-to-one fit,” Gaylord explained. “We regularly deal with compliance, sensitivity and legacy data issues that outside data management efforts don’t have. The purpose of D3 is to bring people who are working in something related to data management at the labs together to talk about what they’re doing. It’s about raising awareness. Even if it’s a different mission area, sponsor or set of tools, you can learn from what others are doing.”
Gaylord said she was pleased with the “overwhelming response” and breadth of institutions that engaged with the event. Organizers filled their space at the National Atmospheric Release Advisory Center to capacity and received nearly 50 abstracts for talks, selecting 18 of them to show the diversity of institutions and mission spaces affected by big data challenges.
LLNL speakers included computer scientist Sasha Ames, on managing the data collected and distributed by the global climate modeling project Earth System Grid Federation (ESGF); National Ignition Facility chief technical officer Philip Adams, on curating data at NIF; and computer scientist Mark Miller, on the fusion of big data and traditional visualization tools. Other presentations came from scientists and researchers at the National Energy Technology Laboratory, the SLAC National Accelerator Laboratory, OSTI and the Sandia, Pacific Northwest, Argonne, Oak Ridge, Lawrence Berkeley and Los Alamos national laboratories.
The event’s organizing committee represented a cross-section of mission areas at LLNL, including Global Security computer scientist Stan Ruppert, Computing senior data scientist Ghaleb Abdulla, and Weapons & Complex Integration (WCI) computer scientist Dan Laney. Abdulla, who is the principal investigator on the ESGF project, said the planning group recognized the urgent need for the data community to come together to discuss pressing issues.
“The data models used for climate are not suitable for other domains. Creating the synergy between the data science and the domain is one of the most important challenges — if you can do that, then they can work together to define different models and tools that need to be used,” Abdulla said. “The second challenge is data sharing, and the willingness to share the data. Sometimes it could be classified or have some tangible value. The environment and tools that you work with for data management also evolve very quickly, and you need to keep an eye on those tools and find out how you can integrate them in a way to help your project.”
Abdulla expressed his desire for attendees to come away from the D3 workshop with an understanding of existing challenges and use the breakout sessions to come up with recommendations and propose solutions.
LLNL computer scientist and data storage specialist Elsa Gonsiorowski, who presented a poster on “Tools for HPC File Management,” said high-performance computing is increasingly being used for data science. The problem isn’t that there’s too much data, she said; it’s getting access to it and making it usable. She hoped attending the event would inform her about future data challenges and allow her to gauge what users are most concerned about.
“It’s super beneficial just to meet other people who are interested in these sorts of issues,” Gonsiorowski said. “We’re facing the same challenges — too often we think we’re the first person who’s ever seen this problem. Usually that’s not the case. Usually someone’s solved it in some other way or seen other challenges and thought of something we haven’t.”
Gaylord said other laboratories have expressed interest in organizing more Data Day workshops. While the initial event focused on awareness, she said data scientists could use future D3 events to devise best practices or standards.
“I really feel like we’re facilitating a broader discussion that obviously needs to happen, and I’m glad we’re able to provide the lightning rod for these discussions,” Gaylord said.
The Data Day event was sponsored by the National Nuclear Security Administration and the WCI directorate.