🗄️

hMOF-ML Dataset

Version v1.0 | ML-Ready Dataset for High-Throughput Screening

📊 Dataset Overview

A machine learning-ready dataset of 137,953 hypothetical MOF structures with structural and geometric features. It supports high-throughput computational screening, materials discovery, and gas adsorption prediction. Note: This dataset does not include hydrogen storage labels and is primarily designed for inference and candidate screening tasks.

Key Statistics

  • Hypothetical MOFs: 137,953
  • Building Blocks: 102
  • Screening Ready: HT (high-throughput)
  • ML-Ready Format: JSON

📊 Data Sources

🧱 Structural Data

From Northwestern University databases: 137,953 hypothetically constructed MOF CIF files assembled from 102 building blocks. The underlying hMOF database was originally developed by Christopher E. Wilmer and collaborators and is now curated by the Snurr Research Lab.

🔬 Geometric Features

From the MOFX-DB API, including the original geometric descriptors.

🔧 Data Processing Pipeline

We processed the raw hMOF-ML data into an ML-ready format in three steps:

⭐ 1. Automated Structure Data Extraction

  • Developed Pymatgen-based Python automation scripts
  • Automatic CIF parsing from Northwestern databases
  • Parsed and extracted: unit cell parameters (a, b, c, α, β, γ), atomic types and coordinates
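
The extraction step above can be illustrated with a small sketch. The pipeline itself is described as Pymatgen-based; the snippet below is not that code. It is a dependency-free stand-in that reads only the six unit cell parameters from CIF text with a regular expression, assuming each standard `_cell_*` tag sits on one line next to a plain decimal value. The function name `parse_cell_parameters` and the sample CIF content are hypothetical.

```python
import re

# Standard CIF tags for the six unit cell parameters.
CELL_TAGS = {
    "a": "_cell_length_a",
    "b": "_cell_length_b",
    "c": "_cell_length_c",
    "alpha": "_cell_angle_alpha",
    "beta": "_cell_angle_beta",
    "gamma": "_cell_angle_gamma",
}


def parse_cell_parameters(cif_text):
    """Return the six unit cell parameters found in a CIF string."""
    cell = {}
    for name, tag in CELL_TAGS.items():
        # Matches e.g. "_cell_length_a  25.832(3)"; the optional "(3)" is a
        # standard-uncertainty suffix and is dropped.
        match = re.search(
            rf"^\s*{re.escape(tag)}\s+([-+]?(?:\d+\.?\d*|\.\d+))(?:\(\d+\))?\s*$",
            cif_text,
            flags=re.MULTILINE,
        )
        if match is None:
            raise ValueError(f"missing CIF tag: {tag}")
        cell[name] = float(match.group(1))
    return cell


# Hypothetical CIF fragment, for illustration only.
sample_cif = """\
data_hypothetical_example
_cell_length_a    25.832
_cell_length_b    25.832
_cell_length_c    25.832
_cell_angle_alpha 90
_cell_angle_beta  90
_cell_angle_gamma 90
"""

cell = parse_cell_parameters(sample_cif)
```

A full parser such as Pymatgen's also handles atom-site loops, symmetry operations, and quoting rules, which this sketch deliberately ignores.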

⭐ 2. Derived Feature Calculation

Computed additional features to enrich the feature space:

  • Total metal atom count
  • Metal mass fraction
  • Metal molar fraction
  • Lattice volume
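
As a rough sketch of how the derived features listed above could be computed, the snippet below uses the general triclinic cell volume formula and simple element counting. The exact definitions used in the dataset are not stated here, so the choice of metal elements, the small atomic mass table, and the function names `lattice_volume` and `metal_features` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Small illustrative table of standard atomic masses (g/mol); a real pipeline
# would take these from a periodic-table library.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "V": 50.942, "Cu": 63.546, "Zn": 65.38, "Zr": 91.224}
# Assumption: the elements treated as "metal" in this sketch.
METALS = {"V", "Cu", "Zn", "Zr"}


def lattice_volume(a, b, c, alpha, beta, gamma):
    """Volume of a general (triclinic) cell; angles are given in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


def metal_features(elements):
    """Compute metal count, mass fraction, and molar fraction from atom symbols."""
    counts = Counter(elements)
    total_atoms = sum(counts.values())
    total_mass = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    metal_count = sum(n for el, n in counts.items() if el in METALS)
    metal_mass = sum(ATOMIC_MASS[el] * n for el, n in counts.items() if el in METALS)
    return {
        "metal_count": metal_count,
        "metal_mass_fraction": metal_mass / total_mass,
        "metal_molar_fraction": metal_count / total_atoms,
    }


# Toy input: one Zn4 O13 C24 H12 formula unit, used only as an example.
toy_elements = ["Zn"] * 4 + ["O"] * 13 + ["C"] * 24 + ["H"] * 12
features = metal_features(toy_elements)
volume = lattice_volume(10.0, 10.0, 10.0, 90.0, 90.0, 90.0)
```

Here the molar fraction is taken as metal atoms over all atoms, and the mass fraction as metal mass over total mass; if the dataset uses a different convention, the two ratios would need adjusting.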

⭐ 3. Data Integration & Standardization

Unified JSON database containing:

  • Atomic composition features
  • Lattice parameter features
  • Pore geometry features

For a complete list of all accessible attributes and their descriptions, please refer to the Data Format documentation.
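
To make the unified layout concrete, here is a minimal sketch of what one record might look like and how it could be flattened into a single feature row for a model. Every key name and value below is a placeholder invented for this example; the actual attribute names are those given in the Data Format documentation.

```python
import json

# Hypothetical record: the three groups mirror the feature categories above,
# but the key names and numbers are placeholders, not the real schema or data.
record = {
    "name": "hypothetical_mof_0",
    "composition": {"metal_count": 4, "metal_mass_fraction": 0.34},
    "lattice": {"a": 10.0, "b": 10.0, "c": 10.0,
                "alpha": 90.0, "beta": 90.0, "gamma": 90.0},
    "pore_geometry": {"void_fraction": 0.8},
}


def flatten(record, sep="."):
    """Flatten one level of nesting into 'group.key' column names."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Round-trip through JSON text to mimic loading the database from a file.
database = json.loads(json.dumps([record]))
rows = [flatten(r) for r in database]
```

Flattening to `group.key` columns keeps the category of each feature visible while giving tabular tools one value per column.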

🎯 Applications

💡 Key Advantages

✅ ML-Ready

Structured JSON format with comprehensive features, ready for direct model inference

📊 Feature-Rich

Comprehensive structural and geometric features

🔬 Massive Scale

137,953 hypothetical structures for comprehensive screening

🎯 Ready-to-Use

Unified format optimized for inference tasks

⚠️ Important Note

This dataset does NOT include hydrogen storage labels (UG_at_PS, UV_at_PS, UG_at_TPS, UV_at_TPS). It is primarily designed for inference tasks and candidate screening. For training models on hydrogen storage prediction, please use the H2MOF-ML Dataset.
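
A short sketch of how code shared between the two datasets might handle the missing labels: it separates the four label fields named above from the features and flags records that can only be used for inference. It assumes each record is a flat dictionary keyed by those label names, which is not confirmed here; the helper names and example values are hypothetical.

```python
# Label names as listed in the note above.
LABEL_KEYS = ("UG_at_PS", "UV_at_PS", "UG_at_TPS", "UV_at_TPS")


def split_features_and_labels(record):
    """Separate label fields (if any) from the remaining feature fields."""
    features = {k: v for k, v in record.items() if k not in LABEL_KEYS}
    labels = {k: record[k] for k in LABEL_KEYS if k in record}
    return features, labels


def is_inference_only(record):
    """True when a record carries none of the hydrogen storage labels."""
    return not any(k in record for k in LABEL_KEYS)


# Placeholder records: the first mimics an unlabeled hMOF-ML entry, the
# second a labeled entry with a made-up value.
unlabeled = {"name": "hypothetical_mof_0", "volume": 1000.0}
labeled = dict(unlabeled, UG_at_PS=5.0)

features, labels = split_features_and_labels(labeled)
```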

Download

The dataset is available in JSON format. You can download it via:

📚 Citation

If you use this dataset, please cite the following references:

Recent Curation (2023)

N. Scott Bobbitt, Kaihang Shi, Benjamin J. Bucior, Haoyuan Chen, Nathaniel Tracy-Amoroso, Zhao Li, Yangzesheng Sun, Julia H. Merlin, J. Ilja Siepmann, Daniel W. Siderius, and Randall Q. Snurr.
MOFX-DB: An Online Database of Computational Adsorption Data for Nanoporous Materials.
Journal of Chemical & Engineering Data, 2023, 68 (2), 483–498.
DOI: 10.1021/acs.jced.2c00583

Foundational Generation Study (2012)

Christopher E. Wilmer, Michael Leaf, Chang Yeon Lee, Omar K. Farha, Brad G. Hauser, Joseph T. Hupp, and Randall Q. Snurr.
Large-scale screening of hypothetical metal–organic frameworks.
Nature Chemistry, 2012, 4, 83–89.
DOI: 10.1038/nchem.1192

🔗 Related Dataset

For training machine learning models on hydrogen storage prediction with labeled GCMC data, see: H2MOF-ML Dataset