Building knowledge and insights using machine learning of scientific articles
Nanomaterials are widely used at Lawrence Livermore National Laboratory (LLNL) and in industry for applications ranging from catalysis to optics to additive manufacturing. The combination of a nanomaterial’s shape, size, and composition can impart the unique optical, electrical, mechanical, or catalytic properties needed for a specific application. However, synthesizing a specific nanomaterial and scaling up its production is often challenging because a small change in the process or the addition of a specific chemical can have a dramatic effect on what is made. These effects are discovered only through time-consuming trial-and-error experimentation and by reading others’ reported experiments in the scientific literature, which together build experts’ intuition.
To help address these challenges, a team of LLNL materials and computer scientists developed machine learning tools that extract and structure information from the text and figures of nanomaterials articles using state-of-the-art natural language processing, image analysis, computer vision, and visualization techniques. These tools have enabled the creation of a personalized knowledge base for nanomaterials synthesis that can be mined to inform further nanomaterials development. Starting with a corpus of approximately 35,000 nanomaterials-related articles, the team developed models to classify articles according to nanomaterial composition and morphology; extract synthesis protocols from the articles’ text; and extract, normalize, and categorize chemical terms within those synthesis protocols.
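To make the last of those steps concrete, the following is a minimal sketch of normalizing and categorizing chemical terms with a simple lookup table. It is an illustration only, not the team's models, which the article describes as machine learning based. The `SYNONYMS` and `CATEGORIES` tables and the function names are hypothetical and were made up for this example.

```python
import re

# Hypothetical synonym table: maps spelling variants (lowercased) to one
# canonical name. A real system would need a far larger vocabulary.
SYNONYMS = {
    "haucl4": "chloroauric acid",
    "ctab": "cetyltrimethylammonium bromide",
    "nabh4": "sodium borohydride",
}

# Hypothetical category table keyed by canonical name.
CATEGORIES = {
    "chloroauric acid": "precursor",
    "cetyltrimethylammonium bromide": "surfactant",
    "sodium borohydride": "reducing agent",
}


def normalize_term(term: str) -> str:
    """Lowercase, collapse whitespace, and map known variants to a canonical name."""
    key = re.sub(r"\s+", " ", term.strip().lower())
    return SYNONYMS.get(key, key)


def categorize_terms(terms: list[str]) -> dict[str, str]:
    """Return {canonical name: category}; unrecognized terms are marked 'unknown'."""
    return {
        name: CATEGORIES.get(name, "unknown")
        for name in (normalize_term(t) for t in terms)
    }
```

Calling `categorize_terms(["CTAB", " NaBH4 ", "ethanol"])` would map the first two terms to their canonical names and categories and label "ethanol" as "unknown", since it is absent from the toy tables.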
In addition to processing the articles’ text, the tools automatically identify and analyze microscopy images of nanomaterials within the articles to determine the nanomaterials’ morphologies and size distributions. To let users easily explore the database, the team also developed a complementary browser-based visualization tool that offers flexibility in comparing subsets of articles of interest.
“These tools and information can be used to identify trends in nanomaterials synthesis, such as the correlation of certain reagents with various nanomaterial morphologies, which is useful in guiding hypotheses and reducing the potential parameter space during experimental design,” said Anna Hiszpanski, the lead author of the paper. The team’s work was recently published in the Journal of Chemical Information and Modeling, an American Chemical Society publication.
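The kind of trend Hiszpanski describes, reagents correlating with morphologies, can be illustrated with a simple co-occurrence count over structured records. This is a hedged sketch under stated assumptions, not the analysis from the paper: the record layout, the function names, and the `TOY_RECORDS` values are invented placeholders and are not data from the team's knowledge base.

```python
from collections import Counter, defaultdict

# Invented placeholder records: (morphology label, reagents in the protocol).
TOY_RECORDS = [
    ("rod", ["ctab", "silver nitrate"]),
    ("rod", ["ctab"]),
    ("sphere", ["citrate"]),
    ("sphere", ["ctab"]),
]


def reagent_morphology_counts(records):
    """Count, per reagent, how many protocols produced each morphology."""
    counts = defaultdict(Counter)
    for morphology, reagents in records:
        # set() ensures a reagent listed twice in one protocol counts once.
        for reagent in set(reagents):
            counts[reagent][morphology] += 1
    return counts


def morphology_fractions(counts, reagent):
    """Fraction of a reagent's protocols that yielded each morphology."""
    per_morphology = counts.get(reagent, Counter())
    total = sum(per_morphology.values())
    if total == 0:
        return {}
    return {m: n / total for m, n in per_morphology.items()}
```

With the placeholder records, `morphology_fractions(reagent_morphology_counts(TOY_RECORDS), "ctab")` reports what share of the protocols containing that reagent produced each morphology. A co-occurrence like this only suggests a hypothesis to test; it does not by itself show that the reagent causes the morphology.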
“While the focus of our work was on the synthesis of nanomaterials, the framework in terms of model selection and training methodology can be used for extracting other targeted data of interest from scientific articles in other materials and chemistry subfields and beyond,” said the project’s principal investigator, Yong Han. The team is currently applying its machine learning tools to SARS-CoV-2 and COVID-19 literature to determine whether information extracted from that large body of literature can aid in accelerating COVID-19 research.
Co-authors of the work include Brian Gallagher, Karthik Chellappan, Peggy Li, Shusen Liu, Hyojin Kim, Jinkyu Han, Bhavya Kailkhura, and David J. Buttler. This work was funded by LLNL’s Laboratory Directed Research and Development (LDRD) program.