The DSI’s Open Data Initiative (ODI) enables us to share LLNL’s rich, challenging, and unique datasets with the larger data science community. Our goal is for these datasets to help support curriculum development, raise awareness around LLNL’s data science efforts, foster new collaborations, and be leveraged across other learning opportunities.
As we develop this catalog over time, the data will represent a wide variety of key LLNL mission areas and may include subsets of some of the world’s largest datasets. We plan to provide data ranging in complexity from dense, featureful, labeled datasets with well understood solutions to those that are sparse, noisy, and largely unexplored. These datasets can also be used to test novel hardware solutions for scalable machine learning (ML) platforms.
The dataset comprises trace files from high-performance computing (HPC) simulations. The trace files contain records of every I/O operation executed during a simulation application run, including I/O operations from HDF5, MPI-IO, and POSIX, along with all of the parameters supplied to those operations (e.g., file name, offset, and flags). The traces are generated by executing a simulation application that is linked with the Recorder tracing tool. Recorder intercepts each I/O call made by the application, writes a trace record for it, and then invokes the intended I/O call so that the operation executes.
This dataset includes 20 years of records, from 1996 to 2015, of the Community Atmospheric Model v5 (CAM5) dataset. It contains snapshots of the global atmospheric state every 3 hours (1 timestep = 3 hours). Each snapshot contains multiple physical variables, of which we use the six climate variables that the scientific literature identifies as most important for defining hurricanes:
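The intercept, record, and forward pattern described above can be illustrated with a minimal Python sketch. This is not Recorder's implementation or its trace format: the `TRACE` list, the `traced` decorator, and the record fields are hypothetical names chosen for illustration, and only a few POSIX-level calls from Python's `os` module are wrapped.

```python
import functools
import os
import tempfile
import time

# Hypothetical in-memory trace; the real Recorder tool writes its own trace files.
TRACE = []


def traced(layer):
    """Wrap an I/O function so each call is recorded before it is forwarded."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Record the call together with every parameter supplied to it.
            TRACE.append({
                "timestamp": time.time(),
                "layer": layer,
                "function": func.__name__,
                "args": args,
                "kwargs": kwargs,
            })
            # Forward to the intended call so the operation still executes.
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Wrap three POSIX-level calls exposed by Python's os module.
posix_open = traced("POSIX")(os.open)
posix_write = traced("POSIX")(os.write)
posix_close = traced("POSIX")(os.close)


def run_example():
    """Perform a small traced write and return the number of bytes written."""
    path = os.path.join(tempfile.mkdtemp(), "example.dat")
    fd = posix_open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    written = posix_write(fd, b"hello")
    posix_close(fd)
    return written
```

After `run_example()` returns, `TRACE` holds one record per intercepted call, in call order, with the file name, flags, and data that were passed.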
- PSL (Sea level pressure)
- U850 (Zonal wind)
- V850 (Meridional wind)
- PRECT (Precipitation)
- TS (Surface temperature)
- QREFHT (Reference height humidity)
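As a minimal sketch of how the 3-hour timestep and the six-variable selection might be handled in code, the example below converts a timestep index to a timestamp and filters a snapshot down to the variables listed above. The start time of 1996-01-01 00:00, the use of the standard Gregorian calendar (model output may use a different calendar), and the dictionary representation of a snapshot are all assumptions, as are the function names.

```python
from datetime import datetime, timedelta

HOURS_PER_TIMESTEP = 3  # from the description: 1 timestep = 3 hours
VARIABLES = ("PSL", "U850", "V850", "PRECT", "TS", "QREFHT")

# Assumed start of the record; the actual first timestamp may differ.
ASSUMED_START = datetime(1996, 1, 1)


def timestep_to_datetime(step, start=ASSUMED_START):
    """Convert a zero-based timestep index to a timestamp."""
    return start + timedelta(hours=HOURS_PER_TIMESTEP * step)


def select_variables(snapshot):
    """Keep only the six hurricane-related variables from a snapshot.

    `snapshot` is assumed to be a mapping from variable name to data.
    """
    return {name: snapshot[name] for name in VARIABLES}
```

Under these assumptions there are 24 / 3 = 8 snapshots per day, so timestep 8 falls at the start of the second day.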
The Cars Overhead With Context (COWC) dataset is a large set of annotated cars from overhead imagery. It is useful for training a model such as a deep neural network to detect and/or count cars. More information is available via the researchers’ paper and poster.
The data includes wide area imagery with annotations as well as precompiled image sets for training/validation of classification and counting. The dataset was created, and the research behind it conducted, by members of the Computer Vision group within LLNL’s Computation Engineering Division under a grant from NA-22 in the Global Security Directorate.
The dataset has the following attributes:
- Overhead imagery at a ground resolution of 15 cm per pixel.
- Data from 6 distinct locations: Toronto, Canada; Selwyn, New Zealand; Potsdam and Vaihingen, Germany; Columbus, Ohio, USA; and Utah, USA.
- 32,716 unique annotated cars. 58,247 unique negative examples.
- Intentional selection of hard negative examples.
- Established baseline for detection and counting tasks.
- Extra testing scenes for use after validation.
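To give a sense of scale for the 15 cm per pixel resolution, the following sketch converts between ground distance and pixels. The function names are hypothetical, and the 4.5 m car length used in the usage note is an illustrative assumption rather than a figure from the dataset.

```python
GROUND_RESOLUTION_CM = 15  # from the dataset attributes: 15 cm per pixel


def meters_to_pixels(length_m):
    """Number of pixels spanned by a ground distance given in meters."""
    return (length_m * 100) / GROUND_RESOLUTION_CM


def pixels_to_meters(length_px):
    """Ground distance in meters covered by a run of pixels."""
    return (length_px * GROUND_RESOLUTION_CM) / 100
```

For example, a car assumed to be 4.5 m long spans `meters_to_pixels(4.5)` = 30 pixels, and a 256-pixel image edge covers `pixels_to_meters(256)` = 38.4 m on the ground.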
The JAG model has been designed to give a rapid description of the observables from inertial confinement fusion (ICF) experiments, which are all generated very late in the implosion. This code contains pre-trained ML models, architectures, and implementations for building surrogate models in scientific ML. The provided dataset is intended for testing/training the models. It is provided as a tarball inside 'data/' and contains .npy files for images, scalars, and the corresponding input parameters. This 10K dataset is representative of a larger 100M dataset, which is beginning to push the 1B threshold, with more to come.
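A minimal sketch of reading the .npy files out of such a tarball with NumPy is shown below. The tarball's file name and the names of its members are not given here, so the function simply loads every `.npy` member it finds. The function name `load_npy_from_tarball` is hypothetical.

```python
import io
import tarfile

import numpy as np


def load_npy_from_tarball(tar_path):
    """Return {member name: array} for every .npy file in the tarball."""
    arrays = {}
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".npy"):
                # np.load needs a seekable file object, so buffer the bytes.
                data = tar.extractfile(member).read()
                arrays[member.name] = np.load(io.BytesIO(data))
    return arrays
```

The returned dictionary can then be inspected (for example, by printing each array's `shape`) to see which entries hold the images, scalars, and input parameters.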