💧

H2MOF-ML Dataset

Version v1.0 | ML-Ready Dataset for Hydrogen Storage Prediction

📊 Dataset Overview

A machine learning-ready dataset of 4,034 experimentally synthesised metal-organic frameworks (MOFs). Each entry combines structural features with four hydrogen adsorption capacity metrics derived from Grand Canonical Monte Carlo (GCMC) simulations, so the data can be used directly to train and run models that predict hydrogen storage performance.

Key Statistics

  • 4,034 MOF structures
  • 4 GCMC target values
  • 100% traceable to CSD codes
  • JSON, ML-ready format

📊 Data Sources

🧱 Structural Data

From the CSD MOF Collection (Non-Commercial), which provides CIF files for experimentally synthesised MOFs.

🔬 Adsorption Performance & Geometric Features

From HyMARC Datahub (Ahmed et al., 2017) including:

🔧 Data Processing Pipeline

We processed the raw data in five steps to produce an ML-ready format:

⭐ 1. Data Filtering & Cleaning

Retained only MOFs with CSD codes (4,034 samples) to ensure data traceability and structure reproducibility.
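
The filter described above can be sketched in a few lines. This is only an illustration, not the authors' script: the `csd_code` field name, the record layout, and the placeholder codes are all assumptions made for the example.

```python
# Hypothetical raw records. The "csd_code" key is an assumed field name,
# and "AAAAAA" / "BBBBBB" are placeholders, not real CSD reference codes.
raw_records = [
    {"name": "mof_a", "csd_code": "AAAAAA"},
    {"name": "mof_b", "csd_code": None},
    {"name": "mof_c", "csd_code": "  "},
    {"name": "mof_d"},
    {"name": "mof_e", "csd_code": "BBBBBB"},
]


def has_csd_code(record):
    """Return True when the record carries a non-empty CSD reference code."""
    code = record.get("csd_code")
    return isinstance(code, str) and code.strip() != ""


def keep_traceable(records):
    """Keep only the records that can be traced back to a CSD entry."""
    return [r for r in records if has_csd_code(r)]


traceable = keep_traceable(raw_records)
print([r["name"] for r in traceable])
```

Treating `None`, a missing key, and a whitespace-only string the same way avoids keeping entries that only appear to have a code.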

⭐ 2. Automated Structure Data Extraction

  • Developed Pymatgen-based Python automation scripts
  • Automatic CIF retrieval via CCDC API
  • Parsed and extracted: unit cell parameters (a, b, c, α, β, γ), atomic types and coordinates, crystal bonding information
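
To show which CIF fields the unit cell parameters come from, here is a minimal sketch that reads them with the Python standard library only. It is a stand-in for illustration: the pipeline described above used Pymatgen and the CCDC API, neither of which appears here, and the CIF fragment and the `parse_cell_parameters` helper are made up for the example.

```python
import re

# A tiny made-up CIF fragment; real CIF files contain many more fields.
CIF_TEXT = """\
data_example
_cell_length_a    10.000(2)
_cell_length_b    12.500
_cell_length_c    8.000(1)
_cell_angle_alpha 90
_cell_angle_beta  90.00(3)
_cell_angle_gamma 120
"""

# Standard CIF tags for the six unit cell parameters.
CELL_TAGS = {
    "a": "_cell_length_a",
    "b": "_cell_length_b",
    "c": "_cell_length_c",
    "alpha": "_cell_angle_alpha",
    "beta": "_cell_angle_beta",
    "gamma": "_cell_angle_gamma",
}


def parse_cell_parameters(cif_text):
    """Extract the six unit cell parameters from CIF text.

    Standard uncertainties in parentheses, such as 10.000(2), are dropped.
    """
    cell = {}
    for key, tag in CELL_TAGS.items():
        match = re.search(
            rf"^{re.escape(tag)}\s+([-+]?\d*\.?\d+)(?:\(\d+\))?\s*$",
            cif_text,
            flags=re.MULTILINE,
        )
        if match is None:
            raise ValueError(f"missing CIF tag: {tag}")
        cell[key] = float(match.group(1))
    return cell


cell = parse_cell_parameters(CIF_TEXT)
print(cell)
```

A full CIF parser also has to handle loops, symmetry operations, and atom sites, which is why a library such as Pymatgen is the practical choice for the real pipeline.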

⭐ 3. Derived Feature Calculation

Computed additional features to enrich the feature space:

  • Total metal atom count
  • Metal mass fraction
  • Metal molar fraction
  • Lattice volume
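
The four derived features listed above could be computed roughly as follows. This is a sketch under stated assumptions: the metal molar fraction is taken to mean metal atoms divided by all atoms, the element tables are deliberately tiny, and the function names, feature keys, and example counts are hypothetical. The volume formula is the standard expression for a general (triclinic) cell.

```python
import math

# Small illustrative lookup tables; a real pipeline would cover the periodic table.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999, "Zn": 65.38}
METALS = {"Zn"}


def metal_features(atom_counts):
    """Compute metal atom count, mass fraction, and molar fraction.

    atom_counts maps element symbols to the number of atoms of that element.
    Molar fraction is assumed to mean metal atoms / all atoms.
    """
    total_atoms = sum(atom_counts.values())
    total_mass = sum(ATOMIC_MASS[el] * n for el, n in atom_counts.items())
    metal_atoms = sum(n for el, n in atom_counts.items() if el in METALS)
    metal_mass = sum(
        ATOMIC_MASS[el] * n for el, n in atom_counts.items() if el in METALS
    )
    return {
        "metal_atom_count": metal_atoms,
        "metal_mass_fraction": metal_mass / total_mass,
        "metal_molar_fraction": metal_atoms / total_atoms,
    }


def lattice_volume(a, b, c, alpha, beta, gamma):
    """Volume of a general unit cell; lengths in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# Hypothetical element counts and cell parameters, chosen only for the example.
counts = {"Zn": 4, "O": 13, "C": 24, "H": 12}
features = metal_features(counts)
volume = lattice_volume(10.0, 10.0, 10.0, 90.0, 90.0, 90.0)
print(features, volume)
```

For a cubic cell the square-root term equals 1, so the volume reduces to a·b·c, which makes a convenient sanity check.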

⭐ 4. Data Integration & Standardization

Unified JSON database containing:

  • Atomic composition features
  • Lattice parameter features
  • Pore geometry features
  • Crystal bonding information
  • GCMC performance labels
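
To make the five groups above concrete, the sketch below builds one record and round-trips it through JSON. Every key name is an assumption chosen for illustration and does not reproduce the dataset's documented schema (see the Data Format documentation for the real attribute names). The numeric values are placeholders.

```python
import json

# Hypothetical record layout: all key names are assumptions for illustration,
# and the numeric values are placeholders rather than real data.
example_record = {
    "csd_code": "AAAAAA",  # placeholder, not a real CSD reference code
    "composition": {
        "metal_atom_count": 4,
        "metal_mass_fraction": 0.0,
        "metal_molar_fraction": 0.0,
    },
    "lattice": {
        "a": 10.0, "b": 10.0, "c": 10.0,
        "alpha": 90.0, "beta": 90.0, "gamma": 90.0,
        "volume": 1000.0,
    },
    "pore_geometry": {},
    "bonds": [],
    "gcmc_labels": {"metric_1": 0.0, "metric_2": 0.0, "metric_3": 0.0, "metric_4": 0.0},
}

REQUIRED_GROUPS = ("composition", "lattice", "pore_geometry", "bonds", "gcmc_labels")


def validate_record(record):
    """Check that a record has a CSD code and all five feature groups."""
    missing = [key for key in ("csd_code", *REQUIRED_GROUPS) if key not in record]
    if missing:
        raise KeyError(f"record is missing: {missing}")
    return True


# Round-trip through JSON text, the same way a file on disk would be read back.
serialized = json.dumps([example_record], indent=2)
records = json.loads(serialized)
assert all(validate_record(r) for r in records)
```

Keeping the labels in their own group makes it harder to feed a target value to a model as an input feature by accident.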

⭐ 5. Quality Control

Missing values are replaced with default values to keep model training stable.
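
Default filling of this kind might look like the sketch below. The field names and the default values are assumptions, since the actual defaults used for the dataset are not stated here.

```python
# Assumed defaults for illustration; the dataset's real defaults may differ.
DEFAULTS = {
    "metal_atom_count": 0,
    "metal_mass_fraction": 0.0,
    "lattice_volume": 0.0,
}


def fill_missing(record, defaults=DEFAULTS):
    """Return a copy of record where absent or None fields take their default."""
    filled = dict(record)
    for key, default in defaults.items():
        if filled.get(key) is None:
            filled[key] = default
    return filled


row = {"metal_atom_count": 4, "metal_mass_fraction": None}
clean_row = fill_missing(row)
print(clean_row)
```

The function copies the record instead of changing it in place, so the raw values stay available for checking how many fields were filled.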

For a complete list of all accessible attributes and their descriptions, please refer to the Data Format documentation.

🎯 Applications

💡 Key Advantages

✅ ML-Ready

Structured JSON format with comprehensive features, ready for direct model training

📊 Feature-Rich

Multi-dimensional feature space: structural + geometric + performance

🔍 Traceable

Preserved CSD codes for reliable data provenance

🎯 Ready-to-Use

Unified format with comprehensive feature engineering

Download

The dataset is available in JSON format. You can download it via:

📚 Citation

If you use this dataset, please cite both of the original data sources:

Hydrogen Adsorption Data

Ahmed, A., Liu, Y., Purewal, J., Tran, L. D., Wong-Foy, A. G., Veenstra, M., Matzger, A. J., & Siegel, D. J.
Balancing gravimetric and volumetric hydrogen density in MOFs.
Energy & Environmental Science, 2017, 10, 2459–2471.

Structural Data

From CSD MOF Collection (Non-Commercial), licensed under CC BY-NC-SA 4.0

📄 License

This modified dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

License Terms