Publications

Stevenson G.A., Jones D., Kim H., et al. (2021). “High-Throughput Virtual Screening of Small Molecule Inhibitors for SARS-CoV-2 Protein Targets with Deep Fusion Models.” International Conference for High Performance Computing, Networking, Storage and Analysis, SC. []

Structure-based Deep Fusion models were recently shown to outperform several physics and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (SARS-CoV-2), which causes COVID-19. Three enhancements to Deep Fusion were made in order to evaluate more than 5 billion docked poses on SARS-CoV-2 protein targets. First, the Deep Fusion concept was refined by formulating the architecture as one coherently backpropagated model (Coherent Fusion) to improve binding affinity prediction accuracy. Second, the model was trained using a distributed, genetic hyper-parameter optimization. Finally, a scalable, high-throughput screening capability was developed to maximize the number of ligands evaluated and expedite the path to experimental evaluation. In this work, we present both the methods developed for machine learning-based high-throughput screening and results from using our computational pipeline to find SARS-CoV-2 inhibitors.
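
The distributed, genetic hyper-parameter optimization step can be illustrated with a minimal single-process sketch; the search space, mutation scheme, and toy fitness function below are placeholders, not the configuration used in the paper.

```python
import random

# Illustrative hyper-parameter search space (not the paper's ranges).
SPACE = {
    "learning_rate": (1e-5, 1e-2),
    "dropout": (0.0, 0.5),
    "hidden_units": (64, 1024),
}

def sample_individual():
    return {k: random.uniform(*bounds) for k, bounds in SPACE.items()}

def fitness(params):
    # Placeholder: in practice this would train the binding-affinity model
    # and return a validation metric. Here a toy surrogate stands in.
    return -((params["learning_rate"] - 1e-3) ** 2
             + (params["dropout"] - 0.2) ** 2
             + ((params["hidden_units"] - 512) / 512) ** 2)

def mutate(params, rate=0.3):
    child = dict(params)
    for k, (lo, hi) in SPACE.items():
        if random.random() < rate:
            child[k] = min(hi, max(lo, child[k] + random.gauss(0, (hi - lo) * 0.1)))
    return child

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def evolve(pop_size=20, generations=10, elite=4):
    population = [sample_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:elite]                      # keep the best individuals
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - elite)]  # breed the rest
        population = parents + children
    return max(population, key=fitness)

print(evolve())
```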

Bhatia H., Natale F.D., Moon J.Y., et al. (2021). “Generalizable Coordination of Large Multiscale Workflows: Challenges and Learnings at Scale.” International Conference for High Performance Computing, Networking, Storage and Analysis, SC. []

The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite the recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration of using three scales of resolution, we present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback. We expand upon the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a recent, award-winning workflow, and generalize the framework beyond its original design. We discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time. We present innovations to enable several orders of magnitude scaling, including simultaneously coordinating 24,000 jobs, and managing several TBs of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and, with an upcoming open-source release, discuss how the presented framework may be used for new applications.

Bhatia H., Kirby R.M., Pascucci V., Bremer P.-T. (2021). “Vector Field Decompositions Using Multiscale Poisson Kernel.” IEEE Transactions on Visualization and Computer Graphics. []

Extraction of multiscale features using scale-space is one of the fundamental approaches to analyze scalar fields. However, similar techniques for vector fields are much less common, even though it is well known that, for example, turbulent flows contain cascades of nested vortices at different scales. The challenge is that the ideas related to scale-space are based upon iteratively smoothing the data to extract features at progressively larger scale, making it difficult to extract overlapping features. Instead, we consider spatial regions of influence in vector fields as scale, and introduce a new approach for the multiscale analysis of vector fields. Rather than smoothing the flow, we use the natural Helmholtz-Hodge decomposition to split it into small-scale and large-scale components using progressively larger neighborhoods. Our approach creates a natural separation of features by extracting local flow behavior, for example, a small vortex, from large-scale effects, for example, a background flow. We demonstrate our technique on large-scale, turbulent flows, and show multiscale features that cannot be extracted using state-of-the-art techniques.
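
The Helmholtz-Hodge split that the multiscale analysis builds on can be sketched on a periodic 2-D grid with a standard Fourier-space projection; this shows only the basic decomposition, not the paper's natural HHD with progressively larger neighborhoods.

```python
import numpy as np

def helmholtz_hodge_2d(u, v):
    """Split a periodic 2-D vector field into curl-free and divergence-free parts
    by projecting each Fourier coefficient onto the wavevector direction."""
    ny, nx = u.shape
    kx = np.fft.fftfreq(nx)
    ky = np.fft.fftfreq(ny)
    KX, KY = np.meshgrid(kx, ky)
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0                        # avoid 0/0; the mean flow stays in the remainder

    u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
    proj = (KX * u_hat + KY * v_hat) / k2
    u_cf = np.real(np.fft.ifft2(KX * proj))    # curl-free (gradient) component
    v_cf = np.real(np.fft.ifft2(KY * proj))
    return (u_cf, v_cf), (u - u_cf, v - v_cf)  # remainder is divergence-free

# Example: a rotational flow superposed on a source-like flow.
x, y = np.meshgrid(np.linspace(0, 2 * np.pi, 128, endpoint=False),
                   np.linspace(0, 2 * np.pi, 128, endpoint=False))
u = -np.sin(y) + np.cos(x)
v = np.sin(x) + np.cos(y)
(cf_u, cf_v), (df_u, df_v) = helmholtz_hodge_2d(u, v)
print(np.abs(cf_u + df_u - u).max(), np.abs(cf_v + df_v - v).max())  # parts sum back to the input
```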

Islam T.Z., Liang P.W., Sweeney F., et al. (2021). “College Life Is Hard! - Shedding Light on Stress Prediction for Autistic College Students using Data-Driven Analysis.” Proceedings - 2021 IEEE 45th Annual Computers, Software, and Applications Conference, COMPSAC 2021. []

Autistic college students face significant challenges in college settings and have a higher dropout rate than neurotypical college students. High physiological distress, depression, and anxiety are identified as critical challenges that contribute to this less than optimal college experience. In this paper, we leverage affordable mobile and wearable devices to collect large amounts of physiological and contextual data (biomarkers) and apply a data-driven analysis approach for building stress prediction models. Such models can be used to provide real-time intervention for better stress management. We conducted a mixed-method study where we collected physiological and contextual data from 20 college students (10 neurotypical and 10 autistic). Our proposed data-driven analysis pipeline leverages an unsupervised representation learning technique with a semi-supervised label approximation method to predict the onset of stress based on biomarkers for autistic students, neurotypical students, and both populations with accuracies of 69%, 72%, and 70%, respectively.
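
A rough sketch of the pipeline's shape, using off-the-shelf scikit-learn pieces as stand-ins: PCA for the unsupervised representation, label spreading for the semi-supervised label approximation, and a random forest as the stress predictor. The synthetic data and all parameter choices below are illustrative assumptions, not the study's models.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.semi_supervised import LabelSpreading
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for windowed biomarker features (heart rate, EDA, context, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
y_true = (X[:, :4].sum(axis=1) > 0).astype(int)   # toy "stress onset" label

# 1) Unsupervised representation learning (PCA here as a simple stand-in).
Z = PCA(n_components=8).fit_transform(X)

# 2) Semi-supervised label approximation: only a small fraction of windows are labeled.
y_partial = np.full(len(y_true), -1)
labeled = rng.choice(len(y_true), size=100, replace=False)
y_partial[labeled] = y_true[labeled]
y_approx = LabelSpreading(kernel="knn", n_neighbors=10).fit(Z, y_partial).transduction_

# 3) Supervised stress predictor trained on the approximated labels.
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y_approx, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Z_tr, y_tr)
print("held-out accuracy vs. approximated labels:", clf.score(Z_te, y_te))
```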

Huang X., Klacansky P., Petruzza S., et al. (2021). “Distributed Merge Forest: A New Fast and Scalable Approach for Topological Analysis at Scale.” Proceedings of the International Conference on Supercomputing. []

Topological analysis is used in several domains to identify and characterize important features in scientific data, and is now one of the established classes of techniques of proven practical use in scientific computing. The growth in parallelism and problem size tackled by modern simulations poses a particular challenge for these approaches. Fundamentally, the global encoding of topological features necessitates interprocess communication that limits their scaling. In this paper, we extend a new topological paradigm to the case of distributed computing, where the construction of a global merge tree is replaced by a distributed data structure, the merge forest, trading slower individual queries on the structure for faster end-to-end performance and scaling. Empirically, the queries that are most negatively affected also tend to have limited practical use. Our experimental results demonstrate the scalability of both the merge forest construction and the parallel queries needed in scientific workflows, and contrast this scalability with the two established alternatives that construct variations of a global tree.

McDonald T., Shrestha R., Yi X., Bhatia H., et al. (2021). “Leveraging Topological Events in Tracking Graphs for Understanding Particle Diffusion.” Computer Graphics Forum. []

Single particle tracking (SPT) of fluorescent molecules provides significant insights into the diffusion and relative motion of tagged proteins and other structures of interest in biology. However, despite the latest advances in high-resolution microscopy, individual particles are typically not distinguished from clusters of particles. This lack of resolution obscures potential evidence for how merging and splitting of particles affect their diffusion and any implications on the biological environment. The particle tracks are typically decomposed into individual segments at observed merge and split events, and analysis is performed without knowing the true count of particles in the resulting segments. Here, we address the challenges in analyzing particle tracks in the context of cancer biology. In particular, we study the tracks of KRAS protein, which is implicated in nearly 20% of all human cancers, and whose clustering and aggregation have been linked to the signaling pathway leading to uncontrolled cell growth. We present a new analysis approach for particle tracks by representing them as tracking graphs and using topological events – merging and splitting – to disambiguate the tracks. Using this analysis, we infer a lower bound on the count of particles as they cluster and create conditional distributions of diffusion speeds before and after merge and split events. Using thousands of time-steps of simulated and in-vitro SPT data, we demonstrate the efficacy of our method, as it offers the biologists a new, detailed look into the relationship between KRAS clustering and diffusion speeds.
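
A minimal sketch of the tracking-graph representation: detections become nodes of a directed graph, merge and split events are simply nodes with in-degree or out-degree of at least two, and diffusion speeds can be conditioned on those events. The toy detections and graph layout below are assumptions for illustration, not the paper's data model.

```python
import networkx as nx
import numpy as np

# Toy tracking graph: nodes are detections (t, x, y); edges link consecutive detections.
G = nx.DiGraph()
detections = {
    "a0": (0, 0.0, 0.0), "a1": (1, 0.5, 0.1),
    "b0": (0, 2.0, 0.0), "b1": (1, 1.4, 0.1),
    "m2": (2, 1.0, 0.2),                        # tracks 'a' and 'b' merge here
    "s3": (3, 1.1, 0.6),
    "c4": (4, 0.6, 1.0), "d4": (4, 1.6, 1.1),   # split into two tracks
}
G.add_nodes_from((n, {"t": t, "pos": np.array(p)}) for n, (t, *p) in detections.items())
G.add_edges_from([("a0", "a1"), ("b0", "b1"), ("a1", "m2"), ("b1", "m2"),
                  ("m2", "s3"), ("s3", "c4"), ("s3", "d4")])

# Topological events: merges have in-degree >= 2, splits have out-degree >= 2.
merges = [n for n in G if G.in_degree(n) >= 2]
splits = [n for n in G if G.out_degree(n) >= 2]

def step_speed(u, v):
    du = G.nodes[v]["pos"] - G.nodes[u]["pos"]
    dt = G.nodes[v]["t"] - G.nodes[u]["t"]
    return np.linalg.norm(du) / dt

# Conditional speed distributions around merge events.
before_merge = [step_speed(u, m) for m in merges for u in G.predecessors(m)]
after_merge = [step_speed(m, v) for m in merges for v in G.successors(m)]
print("merges:", merges, "splits:", splits)
print("mean speed before merges:", np.mean(before_merge),
      "after merges:", np.mean(after_merge))
```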

Song H., Thiagarajan J.J., Kailkhura B. (2021). “Preventing Failures by Dataset Shift Detection in Safety-Critical Graph Applications.” Frontiers in Artificial Intelligence. []

Dataset shift refers to the problem where the input data distribution may change over time (e.g., between training and test stages). Since this can be a critical bottleneck in several safety-critical applications such as healthcare, drug-discovery, etc., dataset shift detection has become an important research issue in machine learning. Though several existing efforts have focused on image/video data, applications with graph-structured data have not received sufficient attention. Therefore, in this paper, we investigate the problem of detecting shifts in graph structured data through the lens of statistical hypothesis testing. Specifically, we propose a practical two-sample test based approach for shift detection in large-scale graph structured data. Our approach is very flexible in that it is suitable for both undirected and directed graphs, and eliminates the need for equal sample sizes. Using empirical studies, we demonstrate the effectiveness of the proposed test in detecting dataset shifts. We also corroborate these findings using real-world datasets, characterized by directed graphs and a large number of nodes.
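
A hedged sketch of two-sample shift detection on graph data, using a maximum mean discrepancy (MMD) permutation test over simple per-graph descriptors as a stand-in for the test proposed in the paper; the descriptors, kernel bandwidth, and random-graph example are illustrative assumptions.

```python
import numpy as np
import networkx as nx

def graph_descriptor(G):
    """Fixed-length summary of a graph (degree statistics, transitivity, density)."""
    deg = np.array([d for _, d in G.degree()])
    return np.array([deg.mean(), deg.std(), nx.transitivity(G), nx.density(G)])

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD with an RBF kernel."""
    Z = np.vstack([X, Y])
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

def shift_test(graphs_ref, graphs_new, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    X = np.array([graph_descriptor(g) for g in graphs_ref])
    Y = np.array([graph_descriptor(g) for g in graphs_new])
    stat = mmd2(X, Y)
    Z, n = np.vstack([X, Y]), len(X)
    null = []
    for _ in range(n_perm):                       # permutation null distribution
        idx = rng.permutation(len(Z))
        null.append(mmd2(Z[idx[:n]], Z[idx[n:]]))
    p_value = (np.sum(np.array(null) >= stat) + 1) / (n_perm + 1)
    return stat, p_value

ref = [nx.erdos_renyi_graph(100, 0.05, seed=i) for i in range(30)]
new = [nx.erdos_renyi_graph(100, 0.08, seed=i) for i in range(30)]   # shifted distribution
print(shift_test(ref, new))
```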

Anirudh R., Thiagarajan J.J., Sridhar R., Bremer P.-T. (2021). “MARGIN: Uncovering Deep Neural Networks Using Graph Signal Analysis.” Frontiers in Big Data. []

Interpretability has emerged as a crucial aspect of building trust in machine learning systems, aimed at providing insights into the working of complex neural networks that are otherwise opaque to a user. There are a plethora of existing solutions addressing various aspects of interpretability ranging from identifying prototypical samples in a dataset to explaining image predictions or explaining mis-classifications. While all of these diverse techniques address seemingly different aspects of interpretability, we hypothesize that a large family of interpretability tasks are variants of the same central problem, which is identifying relative change in a model’s prediction. This paper introduces MARGIN, a simple yet general approach to address a large set of interpretability tasks. MARGIN exploits ideas rooted in graph signal analysis to determine influential nodes in a graph, which are defined as those nodes that maximally describe a function defined on the graph. By carefully defining task-specific graphs and functions, we demonstrate that MARGIN outperforms existing approaches in a number of disparate interpretability challenges.
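
A loose illustration of the underlying idea: build a task-specific graph over samples, define a function on it, and score nodes by how sharply that function changes in their neighborhood (a high-pass graph filter). The graph construction and function below are placeholders, not MARGIN's actual protocols.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.datasets import make_moons

# Samples and a per-sample function defined on the graph (here: class labels as a
# stand-in; MARGIN defines task-specific graphs and functions per interpretability task).
X, y = make_moons(n_samples=300, noise=0.15, random_state=0)
f = y.astype(float)

# Symmetric kNN graph and its combinatorial Laplacian.
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = (0.5 * (W + W.T)).toarray()
L = np.diag(W.sum(axis=1)) - W

# Influence score: magnitude of the high-pass filtered signal |L f|.
# High values mark nodes whose function value differs most from their neighbors
# (here, samples near the class boundary).
influence = np.abs(L @ f)
top = np.argsort(influence)[::-1][:10]
print("most influential sample indices:", top)
```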

Bhatia H., Carpenter T.S., Ingólfsson H.I., et al. (2021). “Machine-Learning-Based Dynamic-Importance Sampling for Adaptive Multiscale Simulations.” Nature Machine Intelligence. []

Multiscale simulations are a well-accepted way to bridge the length and time scales required for scientific studies with the solution accuracy achievable through available computational resources. Traditional approaches either solve a coarse model with selective refinement or coerce a detailed model into faster sampling, both of which have limitations. Here, we present a paradigm of adaptive, multiscale simulations that couple different scales using a dynamic-importance sampling approach. Our method uses machine learning to dynamically and exhaustively sample the phase space explored by a macro model using microscale simulations and enables an automatic feedback from the micro to the macro scale, leading to a self-healing multiscale simulation. As a result, our approach delivers macro length and time scales, but with the effective precision of the micro scale. Our approach is arbitrarily scalable as well as transferable to many different types of simulations. Our method made possible a multiscale scientific campaign of unprecedented scale to understand the interactions of RAS proteins with a plasma membrane in the context of cancer research running over several days on Sierra, which is currently the second-most-powerful supercomputer in the world.
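
As a toy illustration of "importance as novelty" in a learned latent space (not the actual dynamic-importance sampling implementation), the sketch below uses farthest-point selection to pick which macro-scale configurations to refine with micro-scale simulations; the latent encodings are random placeholders.

```python
import numpy as np

def farthest_point_selection(latent, n_select, seed=0):
    """Pick configurations whose latent encodings are least like anything selected
    so far -- a simple proxy for prioritizing 'important' (novel) macro-scale patches."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(latent)))]
    dist = np.linalg.norm(latent - latent[selected[0]], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dist))                 # farthest from everything selected
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(latent - latent[nxt], axis=1))
    return selected

# Stand-in for encoder outputs of candidate macro-model patches.
latent = np.random.default_rng(1).normal(size=(10000, 16))
to_simulate = farthest_point_selection(latent, n_select=50)
print("launch micro-scale simulations for patches:", to_simulate[:10])
```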

Jones D., Kim H., Zhang X., et al. (2021). “Improved Protein-Ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference.” Journal of Chemical Information and Modeling. []

Predicting accurate protein-ligand binding affinities is an important task in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the application of deep convolutional and graph neural network-based approaches, it remains unclear what the relative advantages of each approach are and how they compare with physics-based methodologies that have found more mainstream success in virtual screening pipelines. We present fusion models that combine features and inference from complementary representations to improve binding affinity prediction. This, to our knowledge, is the first comprehensive study that uses a common series of evaluations to directly compare the performance of three-dimensional (3D)-convolutional neural networks (3D-CNNs), spatial graph neural networks (SG-CNNs), and their fusion. We use temporal and structure-based splits to assess performance on novel protein targets. To test the practical applicability of our models, we examine their performance in cases that assume that the crystal structure is not available. In these cases, binding free energies are predicted using docking pose coordinates as the inputs to each model. In addition, we compare these deep learning approaches to predictions based on docking scores and molecular mechanics/generalized Born surface area (MM/GBSA) calculations. Our results show that the fusion models make more accurate predictions than their constituent neural network models as well as docking scoring and MM/GBSA rescoring, with the benefit of greater computational efficiency than the MM/GBSA method. Finally, we provide the code to reproduce our results and the parameter files of the trained models used in this work.
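
A simplified late-fusion sketch in PyTorch: features from a 3D-CNN branch and a graph branch are concatenated and passed to a small regression head. Layer sizes, feature dimensions, and inputs are placeholders, not the published Fusion architecture.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Combine per-complex features from a 3D-CNN branch and a graph branch
    to predict binding affinity (a simplified stand-in for a fusion model)."""
    def __init__(self, cnn_dim=128, graph_dim=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(cnn_dim + graph_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, cnn_feat, graph_feat):
        return self.head(torch.cat([cnn_feat, graph_feat], dim=-1)).squeeze(-1)

model = LateFusion()
cnn_feat = torch.randn(8, 128)     # e.g. penultimate activations of a 3D-CNN
graph_feat = torch.randn(8, 64)    # e.g. pooled node embeddings of a graph network
affinity = model(cnn_feat, graph_feat)
loss = nn.functional.mse_loss(affinity, torch.randn(8))   # toy regression target
loss.backward()
print(affinity.shape)
```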

Ramadan T., Islam T.Z., Phelps C., et al. (2021). “Comparative Code Structure Analysis using Deep Learning for Performance Prediction.” Proceedings - 2021 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2021. []

Performance analysis has always been an afterthought during the application development process, focusing on application correctness first. The learning curve of existing static and dynamic analysis tools is steep, requiring an understanding of low-level details to interpret the findings for actionable optimizations. Additionally, application performance is a function of many unknowns stemming from the application, the runtime, and interactions between the OS and underlying hardware, making it difficult to model using any deep learning technique, especially without a large labeled dataset. In this paper, we address both of these problems by presenting a large labeled dataset for the community and taking a comparative analysis approach to mitigate all unknowns except the source code differences between different correct implementations of the same problem. We put the power of deep learning to the test for automatically extracting information from the hierarchical structure of abstract syntax trees to represent source code. This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure. This research will enable performance-aware application development since every version of the application will continue to contribute to the corpora, which will enhance the performance of the model. We evaluate several deep learning-based representation learning techniques for source code. Our results show that tree-based Long Short-Term Memory (LSTM) models can leverage source code's hierarchical structure to discover latent representations. Specifically, LSTM-based predictive models built using a single problem and a combination of multiple problems can correctly predict whether a source code will perform better or worse up to 84% and 73% of the time, respectively.
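
A much-simplified stand-in for the comparative setup: featurize two implementations by their AST node-type counts and train a classifier on the difference to predict which performs better. The paper uses tree-based LSTMs rather than this bag-of-nodes featurization, and the two toy programs below only illustrate the data layout.

```python
import ast
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ast_node_counts(source):
    """Bag-of-node-types featurization of a program's abstract syntax tree."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def pair_features(src_a, src_b, vocab):
    ca, cb = ast_node_counts(src_a), ast_node_counts(src_b)
    return np.array([ca.get(t, 0) - cb.get(t, 0) for t in vocab])

# Two toy implementations of the same problem (summing squares).
fast = "def f(n):\n    return sum(i * i for i in range(n))\n"
slow = "def f(n):\n    total = 0\n    for i in range(n):\n        total = total + i ** 2\n    return total\n"

vocab = sorted(set(ast_node_counts(fast)) | set(ast_node_counts(slow)))
# A real corpus would contain many implementation pairs labeled by measured runtimes;
# here the two orderings of one pair stand in for that training data.
X = np.array([pair_features(fast, slow, vocab), pair_features(slow, fast, vocab)])
y = np.array([1, 0])   # 1: the first program of the pair is faster
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([pair_features(fast, slow, vocab)]))
```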

Muniraju G., Kailkhura B., Thiagarajan J.J., et al. (2021). “Coverage-Based Designs Improve Sample Mining and Hyperparameter Optimization.” IEEE Transactions on Neural Networks and Learning Systems. []

Sampling one or more effective solutions from large search spaces is a recurring idea in machine learning (ML), and sequential optimization has become a popular solution. Typical examples include data summarization, sample mining for predictive modeling, and hyperparameter optimization. Existing solutions attempt to adaptively trade off between global exploration and local exploitation, in which the initial exploratory sample is critical to their success. While discrepancy-based samples have become the de facto approach for exploration, results from computer graphics suggest that coverage-based designs, e.g., Poisson disk sampling, can be a superior alternative. Coverage-based sample designs were originally developed for 2-D image analysis; in order to adopt them successfully in ML applications, we propose fundamental advances by constructing a parameterized family of designs with provably improved coverage characteristics and developing algorithms for effective sample synthesis. Using experiments in sample mining and hyperparameter optimization for supervised learning, we show that our approach consistently outperforms the existing exploratory sampling methods in both blind exploration and sequential search with Bayesian optimization.
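
A minimal dart-throwing sketch of Poisson disk sampling in a unit hypercube, usable as an exploratory design for hyperparameter search; the radius and dimensionality are arbitrary choices, and the paper's parameterized designs and synthesis algorithms are considerably more sophisticated.

```python
import numpy as np

def poisson_disk_samples(n_samples, dim, r_min, max_tries=10000, seed=0):
    """Dart-throwing Poisson disk sampling in the unit hypercube: every pair of
    accepted points is at least r_min apart, giving good space coverage."""
    rng = np.random.default_rng(seed)
    points = []
    tries = 0
    while len(points) < n_samples and tries < max_tries:
        candidate = rng.random(dim)
        if all(np.linalg.norm(candidate - p) >= r_min for p in points):
            points.append(candidate)   # accept only candidates far from all others
        tries += 1
    return np.array(points)

# Exploratory design for a 3-D hyperparameter space (e.g. log-lr, dropout, depth),
# to be rescaled afterwards to the actual parameter ranges.
design = poisson_disk_samples(n_samples=32, dim=3, r_min=0.18)
print(design.shape)
```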

Hoang D., Summa B., Bhatia H., et al. (2021). “Efficient and Flexible Hierarchical Data Layouts for a Unified Encoding of Scalar Field Precision and Resolution.” IEEE Transactions on Visualization and Computer Graphics. []

To address the problem of ever-growing scientific data sizes making data movement a major hindrance to analysis, we introduce a novel encoding for scalar fields: a unified tree of resolution and precision, specifically constructed so that valid cuts correspond to sensible approximations of the original field in the precision-resolution space. Furthermore, we introduce a highly flexible encoding of such trees that forms a parameterized family of data hierarchies. We discuss how different parameter choices lead to different trade-offs in practice, and show how specific choices result in known data representation schemes such as zfp [52], idx [58], and jpeg2000 [76]. Finally, we provide system-level details and empirical evidence on how such hierarchies facilitate common approximate queries with minimal data movement and time, using real-world data sets ranging from a few gigabytes to nearly a terabyte in size. Experiments suggest that our new strategy of combining reductions in resolution and precision is competitive with state-of-the-art compression techniques with respect to data quality, while being significantly more flexible and orders of magnitude faster, and requiring significantly reduced resources.
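
A toy sketch of the two axes the encoding unifies: resolution (downsampling) and precision (mantissa bit-plane truncation). The tree structure, data layouts, and streaming machinery of the paper are not represented here; the field and parameter choices are illustrative.

```python
import numpy as np

def truncate_precision(field, mantissa_bits):
    """Keep only the top `mantissa_bits` of each float32 mantissa (bit-plane truncation)."""
    bits = field.astype(np.float32).view(np.uint32)
    mask = np.uint32(0xFFFFFFFF) << np.uint32(23 - mantissa_bits)
    return (bits & mask).view(np.float32)

def reduce_resolution(field, level):
    """Downsample by 2**level in each dimension (simple striding)."""
    step = 2 ** level
    return field[::step, ::step]

# A toy 2-D scalar field; each (resolution, precision) pair below is one point in the
# precision-resolution space over which the paper's hierarchy defines valid cuts.
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
field = np.sin(8 * x) * np.cos(8 * y)

for level in range(3):
    ref = reduce_resolution(field, level)
    for bits in (4, 10, 23):
        approx = truncate_precision(ref, bits)
        err = np.abs(approx - ref).max()
        print(f"resolution 1/{2**level:<2d} precision {bits:2d} bits  max error {err:.2e}")
```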

Kesavan S., Bhatia H., Bhatele A., et al. (2021). “Scalable Comparative Visualization of Ensembles of Call Graphs.” IEEE Transactions on Visualization and Computer Graphics. []

Optimizing the performance of large-scale parallel codes is critical for efficient utilization of computing resources. Code developers often explore various execution parameters, such as hardware configurations, system software choices, and application parameters, and are interested in detecting and understanding bottlenecks in different executions. They often collect hierarchical performance profiles represented as call graphs, which combine performance metrics with their execution contexts. The crucial task of exploring multiple call graphs together is tedious and challenging because of the many structural differences in the execution contexts and significant variability in the collected performance metrics (e.g., execution runtime). In this paper, we present EnsembleCallFlow to support the exploration of ensembles of call graphs using new types of visualizations, analysis, graph operations, and features. We introduce ensemble-Sankey, a new visual design that combines the strengths of resource-flow (Sankey) and box-plot visualization techniques. Whereas the resource-flow visualization can easily and intuitively describe the graphical nature of the call graph, the box plots overlaid on the nodes of Sankey convey the performance variability within the ensemble. Our interactive visual interface provides linked views to help explore ensembles of call graphs, e.g., by facilitating the analysis of structural differences, and identifying similar or distinct call graphs. We demonstrate the effectiveness and usefulness of our design through case studies on large-scale parallel codes.
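
A minimal sketch of the ensemble view's ingredients: a union ("super graph") over per-run call graphs and, for each call site, the runtime distribution across the ensemble that the overlaid box plots summarize. The toy profiles below are assumptions; real inputs would come from profiling tools.

```python
import networkx as nx
import numpy as np

# Toy ensemble: the same call tree executed under three configurations, with
# per-node runtimes attached as node attributes.
def make_run(scale):
    g = nx.DiGraph()
    g.add_edges_from([("main", "solve"), ("main", "io"), ("solve", "mpi_wait")])
    runtimes = {"main": 100 * scale, "solve": 70 * scale, "io": 10, "mpi_wait": 25 * scale}
    nx.set_node_attributes(g, runtimes, "time")
    return g

ensemble = [make_run(s) for s in (1.0, 1.3, 2.1)]

# Union of all call sites; per-node runtime distributions are what the box plots
# overlaid on the Sankey nodes would convey.
super_nodes = set().union(*[g.nodes for g in ensemble])
for node in sorted(super_nodes):
    times = np.array([g.nodes[node]["time"] for g in ensemble if node in g])
    q1, med, q3 = np.percentile(times, [25, 50, 75])
    print(f"{node:10s} median {med:7.1f}  IQR [{q1:.1f}, {q3:.1f}]")
```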

Narayanaswamy V.S., Thiagarajan J.J., Spanias A. (2021). “On the Design of Deep Priors for Unsupervised Audio Restoration.” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. []

Unsupervised deep learning methods for solving audio restoration problems extensively rely on carefully tailored neural architectures that carry strong inductive biases for defining priors in the time or spectral domain. In this context, much recent success has been achieved with sophisticated convolutional network constructions that recover audio signals in the spectral domain. However, in practice, audio priors require careful engineering of the convolutional kernels to be effective at solving ill-posed restoration tasks, while also being easy to train. To this end, in this paper, we propose a new U-Net based prior that does not impact either the network complexity or convergence behavior of existing convolutional architectures, yet leads to significantly improved restoration. In particular, we advocate the use of carefully designed dilation schedules and dense connections in the U-Net architecture to obtain powerful audio priors. Using empirical studies on standard benchmarks and a variety of ill-posed restoration tasks, such as audio denoising, in-painting and source separation, we demonstrate that our proposed approach consistently outperforms widely adopted audio prior architectures.

Narayanaswamy V., Thiagarajan J.J., Spanias A. (2021). “Using Deep Image Priors to Generate Counterfactual Explanations.” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. []

Through the use of carefully tailored convolutional neural network architectures, a deep image prior (DIP) can be used to obtain pre-images from latent representation encodings. Though DIP inversion has been known to be superior to conventional regularized inversion strategies such as total variation, such an over-parameterized generator is able to effectively reconstruct even images that are not in the original data distribution. This limitation makes it challenging to utilize such priors for tasks such as counterfactual reasoning, wherein the goal is to generate small, interpretable changes to an image that systematically lead to changes in the model prediction. To this end, we propose a novel regularization strategy based on an auxiliary loss estimator jointly trained with the predictor, which efficiently guides the prior to recover natural pre-images. Our empirical studies with a real-world ISIC skin lesion detection problem clearly evidence the effectiveness of the proposed approach in synthesizing meaningful counterfactuals. In comparison, we find that the standard DIP inversion often proposes visually imperceptible perturbations to irrelevant parts of the image, thus providing no additional insights into the model behavior.

Jacobs, S. A., Moon, T., McLoughlin, et al. (2021). “Enabling Rapid COVID-19 Small Molecule Drug Design through Scalable Deep Learning of Generative Models.” The International Journal of High Performance Computing Applications. []

We improved the quality and reduced the time to produce machine-learned models for use in small molecule antiviral design. Our globally asynchronous multi-level parallel training approach strong-scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher-quality model, trained on 1.613 billion compounds in 23 minutes, whereas the previous state of the art took a day on 1 million compounds. Reducing training time from a day to minutes shifts the model creation bottleneck from computer job turnaround time to human innovation time. Our implementation achieves 318 PFLOPs for 17.1% of half-precision peak. We will incorporate this model into our molecular design loop, enabling the generation of more diverse compounds; this improves the search for novel candidate antiviral drugs and reduces the time to synthesize compounds for testing in the lab.

Zhang, Z., Kailkhura, B., Han, T. Y.-J. (2021). “Leveraging Uncertainty from Deep Learning for Trustworthy Material Discovery Workflows.” ACS Omega. []

In this paper, we leverage predictive uncertainty of deep neural networks to answer challenging questions material scientists usually encounter in machine learning-based material application workflows. First, we show that by leveraging predictive uncertainty, a user can determine the required training data set size to achieve a certain classification accuracy. Next, we propose uncertainty-guided decision referral to detect and refrain from making decisions on confusing samples. Finally, we show that predictive uncertainty can also be used to detect out-of-distribution test samples. We find that this scheme is accurate enough to detect a wide range of real-world shifts in data, e.g., changes in the image acquisition conditions or changes in the synthesis conditions. Using microstructure information from scanning electron microscope (SEM) images as an example use case, we show that leveraging uncertainty-aware deep learning can significantly improve the performance and dependability of classification models.
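
A small sketch of uncertainty-guided decision referral using a deep ensemble: average the members' predicted probabilities, use predictive entropy as the uncertainty score, and abstain on the most uncertain fraction. The tabular toy data and model sizes are stand-ins for the SEM image classifiers used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for microstructure features; the paper works with SEM images.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Deep ensemble: independently initialized networks give a predictive distribution.
ensemble = [MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=s).fit(X_tr, y_tr)
            for s in range(5)]
probs = np.mean([m.predict_proba(X_te) for m in ensemble], axis=0)
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # predictive uncertainty

# Uncertainty-guided referral: abstain on the most uncertain 20% of samples.
keep = entropy <= np.quantile(entropy, 0.8)
preds = probs.argmax(axis=1)
print("accuracy on all samples:  ", (preds == y_te).mean())
print("accuracy on retained 80%: ", (preds[keep] == y_te[keep]).mean())
```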

Nguyen, P., Loveland, D., Kim, J. T., et al. (2021). “Predicting Energetics Materials’ Crystalline Density from Chemical Structure by Machine Learning.” Journal of Chemical Information and Modeling. []

To expedite new molecular compound development, a long-sought goal within the chemistry community has been to predict molecules’ bulk properties of interest a priori to synthesis from a chemical structure alone. In this work, we demonstrate that machine learning methods can indeed be used to directly learn the relationship between chemical structures and bulk crystalline properties of molecules, even in the absence of any crystal structure information or quantum mechanical calculations. We focus specifically on a class of organic compounds categorized as energetic materials called high explosives (HE) and on predicting their crystalline density. An ongoing challenge within the chemistry machine learning community is deciding how best to featurize molecules as inputs to machine learning models: whether expert handcrafted features or learned molecular representations via graph-based neural network models yield better results, and why. We evaluate both types of representations in combination with a number of machine learning models to predict the crystalline densities of HE-like molecules curated from the Cambridge Structural Database, and we report the performance and pros and cons of our methods. Our message passing neural network (MPNN) based models with learned molecular representations generally perform best, outperforming current state-of-the-art methods at predicting crystalline density and performing well even when testing on a data set not representative of the training data. However, these models are traditionally considered black boxes and less easily interpretable. To address this common challenge, we also provide a comparison analysis between our MPNN-based model and models with fixed feature representations that provides insights as to what features are learned by the MPNN to accurately predict density.
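
A bare-bones, untrained message-passing sketch showing how atom features and a molecular adjacency matrix can be turned into a fixed-length embedding that a density regressor could consume; the real MPNN models are trained end-to-end and considerably richer, and all weights and features below are placeholders.

```python
import numpy as np

def message_passing_embedding(atom_features, adjacency, n_rounds=3, seed=0):
    """Minimal message passing: each round, atoms aggregate transformed neighbor
    states through a shared linear map; the molecule embedding is the node sum."""
    rng = np.random.default_rng(seed)
    d = atom_features.shape[1]
    W_msg = rng.normal(scale=0.1, size=(d, d))    # shared message weights (untrained)
    W_upd = rng.normal(scale=0.1, size=(d, d))    # shared update weights (untrained)
    h = atom_features.copy()
    for _ in range(n_rounds):
        messages = adjacency @ (h @ W_msg)        # sum of transformed neighbor states
        h = np.tanh(h @ W_upd + messages)
    return h.sum(axis=0)                          # permutation-invariant readout

# Toy molecule: 4 atoms in a chain, 5 features per atom (e.g. one-hot element type).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X_atoms = np.eye(4, 5)
embedding = message_passing_embedding(X_atoms, A)
print(embedding.shape)   # a fixed-length input for a downstream density regressor
```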

Hatfield, P. W., Gaffney, J. A., Anderson, G. J., et al. (2021). “The Data-Driven Future of High-Energy-Density Physics.” Nature. []

High-energy-density physics is the field of physics concerned with studying matter at extremely high temperatures and densities. Such conditions produce highly nonlinear plasmas, in which several phenomena that can normally be treated independently of one another become strongly coupled. The study of these plasmas is important for our understanding of astrophysics, nuclear fusion and fundamental physics—however, the nonlinearities and strong couplings present in these extreme physical systems make them very difficult to understand theoretically or to optimize experimentally. Here we argue that machine learning models and data-driven methods are in the process of reshaping our exploration of these extreme systems that have hitherto proved far too nonlinear for human researchers. From a fundamental perspective, our understanding can be improved by the way in which machine learning models can rapidly discover complex interactions in large datasets. From a practical point of view, the newest generation of extreme physics facilities can perform experiments multiple times a second (as opposed to approximately daily), thus moving away from human-based control towards automatic control based on real-time interpretation of diagnostic data and updates of the physics model. To make the most of these emerging opportunities, we suggest proposals for the community in terms of research design, training, best practice and support for synthetic diagnostics and data analysis.

Djordjević, B. Z., Kemp, A. J., Kim, J., et al. (2021). “Modeling Laser-Driven Ion Acceleration with Deep Learning.” Physics of Plasmas. []

Developments in machine learning promise to ameliorate some of the challenges of modeling complex physical systems through neural-network-based surrogate models. High-intensity, short-pulse lasers can be used to accelerate ions to mega-electronvolt energies, but to model such interactions requires computationally expensive techniques such as particle-in-cell simulations. Multilayer neural networks allow one to take a relatively sparse ensemble of simulations and generate a surrogate model that can be used to rapidly search the parameter space of interest. In this work, we created an ensemble of over 1,000 simulations modeling laser-driven ion acceleration and developed a surrogate to study the resulting parameter space. A neural-network-based approach allows for rapid feature discovery not possible for traditional parameter scans given the computational cost. A notable observation made during this study was the dependence of ion energy on the pre-plasma gradient length scale. While this methodology harbors great promise for ion acceleration, it has ready application to all topics in which large-scale parameter scans are restricted by significant computational cost or relatively large, but sparse, domains.
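
A hedged sketch of the surrogate workflow: fit a small neural network to a sparse ensemble of (parameters, output) pairs, then scan the parameter space densely with the cheap surrogate. The toy function below stands in for the particle-in-cell simulations; it is not the paper's physics, and the parameter names are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def toy_simulation(params):
    """Stand-in for an expensive PIC run: maps (intensity, thickness, gradient length)
    to a maximum ion energy. Purely illustrative, not real laser-plasma physics."""
    intensity, thickness, gradient = params.T
    return intensity * np.exp(-thickness) * (1 + 0.5 * np.tanh(2 - gradient))

# Sparse ensemble of "simulations" over a 3-D parameter space.
params_train = rng.uniform(low=[0.1, 0.1, 0.0], high=[1.0, 2.0, 4.0], size=(400, 3))
energy_train = toy_simulation(params_train)

surrogate = make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                                       random_state=0)).fit(params_train, energy_train)

# Dense scan that would be far too costly with the full simulation.
params_scan = rng.uniform(low=[0.1, 0.1, 0.0], high=[1.0, 2.0, 4.0], size=(100000, 3))
pred = surrogate.predict(params_scan)
best = params_scan[np.argmax(pred)]
print("predicted optimum (intensity, thickness, gradient):", best)
```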