The DSI’s Open Data Initiative (ODI) enables us to share LLNL’s rich, challenging, and unique datasets with the larger data science community. Our goal is for these datasets to help support curriculum development, raise awareness around LLNL’s data science efforts, foster new collaborations, and be leveraged across other learning opportunities.
As we develop this catalog over time, the data will represent a wide variety of key LLNL mission areas and may include subsets of some of the world’s largest datasets. We plan to provide data ranging in complexity from dense, featureful, labeled datasets with well-understood solutions to datasets that are sparse, noisy, and largely unexplored. These datasets can also be used to test novel hardware solutions for scalable machine learning (ML) platforms.
The Cars Overhead With Context (COWC) dataset is a large set of cars annotated in overhead imagery. It is useful for training a model, such as a deep neural network, to detect and/or count cars. More information is available in the researchers’ paper and poster.
The data includes wide-area imagery with annotations as well as precompiled image sets for training and validating classification and counting models. The dataset and the research behind it were produced by members of the Computer Vision group within LLNL’s Computation Engineering Division under a grant from NA-22 in the Global Security Directorate.
The dataset has the following attributes:
- Overhead imagery at a ground resolution of 15 cm per pixel.
- Data from 6 distinct locations: Toronto, Canada; Selwyn, New Zealand; Potsdam and Vaihingen, Germany; Columbus, Ohio, USA; and Utah, USA.
- 32,716 unique annotated cars and 58,247 unique negative examples.
- Intentional selection of hard negative examples.
- Established baseline for detection and counting tasks.
- Extra testing scenes for use after validation.
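As a quick illustration of what these attributes imply in practice, the minimal sketch below converts a patch size in pixels to its ground extent at 15 cm per pixel and computes the share of positive (car) examples from the counts listed above. The function names and the 256-pixel patch size are assumptions chosen for illustration; they are not taken from the COWC release.

```python
# Figures quoted in the attribute list above.
GROUND_RESOLUTION_M_PER_PX = 0.15   # 15 cm per pixel
NUM_ANNOTATED_CARS = 32_716
NUM_NEGATIVE_EXAMPLES = 58_247


def ground_extent_m(pixels, resolution_m_per_px=GROUND_RESOLUTION_M_PER_PX):
    """Return the ground distance in meters covered by `pixels` image pixels."""
    return pixels * resolution_m_per_px


def positive_fraction(num_positive, num_negative):
    """Return the fraction of examples that are positives."""
    return num_positive / (num_positive + num_negative)


if __name__ == "__main__":
    patch_px = 256  # hypothetical patch size, not specified by the dataset description
    side_m = ground_extent_m(patch_px)
    print(f"A {patch_px} px patch spans about {side_m:.1f} m on a side")

    share = positive_fraction(NUM_ANNOTATED_CARS, NUM_NEGATIVE_EXAMPLES)
    print(f"Cars make up about {share:.1%} of the labeled examples")
```

Because negatives outnumber positives, a classifier trained on these counts may benefit from class weighting or balanced sampling; whether that is needed depends on the model and task.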
The JAG model has been designed to give a rapid description of the observables from inertial confinement fusion (ICF) experiments, which are all generated very late in the implosion. The code contains pre-trained ML models, architectures, and implementations for building surrogate models in scientific ML. The provided dataset, intended for training and testing the models, is a tarball inside 'data/' that contains .npy files for images, scalars, and the corresponding input parameters. This 10K dataset is representative of a larger 100M dataset that is beginning to push the 1B threshold, with more to come.
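A tarball of .npy files like the one described above can be read without unpacking it to disk. The following is a minimal sketch, assuming NumPy is installed; the helper name `load_npy_from_tarball` and the archive path in the usage comment are hypothetical, since the actual tarball and file names are not given here.

```python
import io
import tarfile
from pathlib import PurePosixPath

import numpy as np


def load_npy_from_tarball(tar_path):
    """Load every .npy member of a tarball into a dict keyed by file stem.

    Members are read into memory one at a time; nothing is extracted to disk.
    """
    arrays = {}
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.endswith(".npy")):
                continue
            raw = archive.extractfile(member).read()
            # np.load needs a seekable file object, so wrap the bytes.
            key = PurePosixPath(member.name).stem
            arrays[key] = np.load(io.BytesIO(raw), allow_pickle=False)
    return arrays


# Usage (hypothetical path; substitute the tarball found inside 'data/'):
# arrays = load_npy_from_tarball("data/<tarball name>")
# for name, array in arrays.items():
#     print(name, array.shape, array.dtype)
```

Printing the shapes is a simple sanity check: the image, scalar, and input-parameter arrays would normally be expected to share the same number of samples along the first axis, although that layout is an assumption to verify against the actual files.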