MatText: A framework for text-based materials modeling

MatText is a framework for text-based materials modeling. It supports:

conversion of crystal structures into text representations
transformations of crystal structures for sensitivity analyses
decoding of text representations to crystal structures
tokenization of text representations of crystal structures
pre-training, fine-tuning, and testing of language models on text representations of crystal structures
analysis of language models trained on text representations of crystal structures

Local Installation

Requirements:

Python 3.10 or 3.11 (tested and supported)
uv package manager (recommended)

We recommend using uv for fast and reliable Python package management. To install uv, follow the installation instructions.

Install latest release

uv pip install git+https://github.com/lamalab-org/mattext.git

Install development version

Clone this repository. This requires git; if the git command is missing, you can install it on Ubuntu/Debian with sudo apt-get install git.

git clone https://github.com/lamalab-org/mattext.git
cd mattext

Create a virtual environment and install:

uv venv --python 3.10
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

Install pre-commit hooks (optional, for development):

pre-commit install

If you want to use the Local Env representation, you will also need to install OpenBabel. You can install it via conda/mamba:

conda install openbabel -c conda-forge

or on Ubuntu/Debian:

sudo apt-get install openbabel

Getting started

Converting crystals into text

from mattext.representations import TextRep
from pymatgen.core import Structure

# Load structure from a CIF file (pymatgen infers the format from the file extension)
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file)

# Initialize TextRep Class
text_rep = TextRep.from_input(structure)

requested_reps = [
    "cif_p1",
    "slices",
    "atom_sequences",
    "atom_sequences_plusplus",
    "crystal_text_llm",
    "zmatrix"
]

# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
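
To inspect the result, a small helper can print one line per representation. This is a minimal sketch that assumes get_requested_text_reps returns a dict mapping each representation name to its string; the helper name preview_text_reps and the placeholder strings are hypothetical and are not real MatText output.

```python
def preview_text_reps(text_reps, max_chars=60):
    """Return one summary line per representation: name, length, short preview."""
    lines = []
    for name, text in text_reps.items():
        text = "" if text is None else str(text)
        flat = " ".join(text.split())  # collapse newlines and repeated spaces
        snippet = flat[:max_chars] + ("..." if len(flat) > max_chars else "")
        lines.append(f"{name} ({len(text)} chars): {snippet}")
    return lines


# Placeholder strings standing in for requested_text_reps (assumption, not real output)
example_reps = {"rep_a": "In Cu S S", "rep_b": "line one\nline two"}
for line in preview_text_reps(example_reps):
    print(line)
```

With a real structure, the dict returned by text_rep.get_requested_text_reps(requested_reps) would be passed in place of example_reps.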

Pretrain

python main.py -cn=pretrain model=pretrain_example +model.representation=composition +model.dataset_type=pretrain30k +model.context_length=32

Running a benchmark

python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
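
When benchmarking several representations, the command can be assembled programmatically. The sketch below only rebuilds the command line shown above; the function name benchmark_command is hypothetical, the representation names other than composition are taken from the conversion example, and the checkpoint path is left as the placeholder from the original command. Each run would presumably need a checkpoint pretrained on the matching representation.

```python
import shlex


def benchmark_command(representation, dataset="perovskites",
                      checkpoint="path/to/checkpoint"):
    """Build the argument list for one benchmark run (hypothetical helper)."""
    return [
        "python", "main.py", "-cn=benchmark", "model=benchmark_example",
        "+model.dataset_type=filtered",
        f"+model.representation={representation}",
        f"+model.dataset={dataset}",
        f"+model.checkpoint={checkpoint}",
    ]


# Print one command per representation; pass the list to subprocess.run to execute it.
for rep in ["composition", "cif_p1", "slices"]:
    print(shlex.join(benchmark_command(rep)))
```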

The + symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.

Use ++ to override a key that already exists in the default configuration (the key is added if it is missing), for example ++model.pretrain.training_arguments.per_device_train_batch_size=32. Refer to the docs for more examples and advanced ways to use the configs with config groups.
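
The difference between the prefixes can be illustrated with a toy model in plain Python. The -cn, + and ++ syntax matches Hydra's override grammar, so this sketch assumes main.py is Hydra-based; it is not Hydra's implementation, it keeps all values as strings, and the default batch size of 8 is a made-up value.

```python
import copy


def apply_override(config, override):
    """Apply 'key=value', '+key=value' or '++key=value' to a nested dict (toy model)."""
    if override.startswith("++"):
        mode, body = "force", override[2:]   # override if present, add otherwise
    elif override.startswith("+"):
        mode, body = "add", override[1:]     # add only; the key must not exist yet
    else:
        mode, body = "set", override         # override only; the key must exist
    dotted_key, value = body.split("=", 1)
    *parents, leaf = dotted_key.split(".")

    result = copy.deepcopy(config)           # leave the caller's config untouched
    node = result
    for part in parents:
        if part not in node:
            if mode == "set":
                raise KeyError(f"{dotted_key} is not in the config; use +{dotted_key}")
            node[part] = {}
        node = node[part]
    if mode == "set" and leaf not in node:
        raise KeyError(f"{dotted_key} is not in the config; use +{dotted_key}")
    if mode == "add" and leaf in node:
        raise KeyError(f"{dotted_key} already exists; use ++{dotted_key}")
    node[leaf] = value
    return result


# Hypothetical defaults; the value 8 is an assumption for illustration only.
defaults = {"model": {"pretrain": {"training_arguments": {"per_device_train_batch_size": "8"}}}}
cfg = apply_override(defaults, "+model.representation=composition")
cfg = apply_override(cfg, "++model.pretrain.training_arguments.per_device_train_batch_size=32")
print(cfg)
```

In this toy model, repeating +model.representation=... on cfg raises an error because the key already exists, which mirrors why ++ is the safer choice when it is unclear whether a key is part of the defaults.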

Using data

The MatText datasets are available on the HuggingFace Hub. For example:

from datasets import load_dataset

dataset = load_dataset("n0w0f/MatText", "pretrain300k")
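
Once loaded, it can be useful to check how long the text in a given column is, for instance when choosing a context length. The following is a rough sketch: the helper names, the split name "train", and the column name "composition" are assumptions and should be checked against the dataset card. The loader needs network access and the datasets package; the statistics helper works on any iterable of dict-like rows.

```python
def load_mattext(config="pretrain300k", split="train"):
    """Load one MatText config from the HuggingFace Hub (split name is an assumption)."""
    from datasets import load_dataset  # imported lazily so length_stats works offline
    return load_dataset("n0w0f/MatText", config, split=split)


def length_stats(records, column):
    """Character-length statistics for one text column, skipping empty entries."""
    lengths = [len(row[column]) for row in records if row.get(column)]
    if not lengths:
        return {"count": 0, "min": None, "max": None, "mean": None}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Offline demonstration with made-up rows; with network access one could instead call
# length_stats(load_mattext(), "composition")
rows = [{"composition": "InCuS2"}, {"composition": "NaCl"}, {"composition": None}]
print(length_stats(rows, "composition"))
```

Character counts are only a proxy for token counts, which depend on the tokenizer used for the chosen representation.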

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

Citation

If you use MatText in your work, please cite:

@misc{alampara2024mattextlanguagemodelsneed,
      title={MatText: Do Language Models Need More than Text \& Scale for Materials Modeling?},
      author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
      year={2024},
      eprint={2406.17295},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2406.17295},
}

⚖️ License

The code in this package is licensed under the MIT License.

💰 Funding

This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.
