ProSST


Key Features

This repository provides the official implementation of ProSST: A Pre-trained Protein Sequence and Structure Transformer with Disentangled Attention.

The paper introduces several key contributions to protein language modeling:

  1. Integration of Protein Sequences and Structures: The ProSST model integrates both protein sequences and structures using a structure quantization module and a Transformer architecture with disentangled attention, effectively capturing the relationship between protein residues and their structural context.

  2. Structure Quantization Module: This module converts a 3D protein structure into discrete tokens: residue-level local structures are serialized, embedded into a dense vector space, and then quantized with a pre-trained clustering model, yielding effective protein structure representations.

  3. Disentangled Attention Mechanism: ProSST uses a disentangled attention mechanism to explicitly learn the relationships between protein sequence tokens and structure tokens. This improves the model’s ability to capture complex features of protein sequences and structures and leads to state-of-the-art performance on a range of protein function prediction tasks (a minimal sketch of the idea follows this list).
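The following is a minimal, self-contained sketch of the disentangled-attention idea, not the ProSST implementation: the projection names, the particular score terms, and the DeBERTa-style scaling are illustrative assumptions.

```python
# Illustrative sketch of disentangled attention between sequence and structure tokens.
# Assumption: attention scores are a sum of sequence-to-sequence, sequence-to-structure,
# and structure-to-sequence terms (DeBERTa-style); this is NOT the exact ProSST code.
import torch

def disentangled_scores(h_seq, h_struct, w_q, w_k, w_qs, w_ks):
    """h_seq: (L, d) residue hidden states; h_struct: (L, d) structure-token embeddings."""
    q_c, k_c = h_seq @ w_q, h_seq @ w_k          # content (sequence) projections
    q_s, k_s = h_struct @ w_qs, h_struct @ w_ks  # structure projections
    scores = q_c @ k_c.T + q_c @ k_s.T + q_s @ k_c.T
    return scores / (3 * h_seq.shape[-1]) ** 0.5  # scale over the three score terms

# Toy example: 10 residues, hidden size 64
L, d = 10, 64
weights = [torch.randn(d, d) / d**0.5 for _ in range(4)]
attn = torch.softmax(disentangled_scores(torch.randn(L, d), torch.randn(L, d), *weights), dim=-1)
```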

Links

Dataset Links

ProteinGYM Benchmark: download the dataset from Google Drive.

Get Started

Installation

git clone https://github.com/ginnm/ProSST.git
cd ProSST
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)

Structure quantizer

from prosst.structure.quantizer import PdbQuantizer
processor = PdbQuantizer(structure_vocab_size=2048) # can be 20, 128, 512, 1024, 2048, 4096
result = processor("example_data/p1.pdb", return_residue_seq=False)

Output:

[407, 998, 1841, 1421, 653, 450, 117, 822, ...]

Download Model

ProSST models are available on the Hugging Face 🤗 Hub and can be loaded with Transformers:

from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)

See AI4Protein/ProSST-* for more models.
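To combine the two inputs, the structure tokens produced by the quantizer are fed to the model alongside the tokenized amino-acid sequence. The sketch below is a hedged outline of that step: the ss_input_ids keyword, the placeholder sequence, and any offset or padding the structure tokens may need for special tokens are assumptions here, so consult the model card and the example notebook for the exact interface.

```python
# Hedged sketch: passing a sequence plus quantized structure tokens to ProSST.
# Assumption: the remote model accepts structure tokens via `ss_input_ids`;
# check the AI4Protein/ProSST-2048 model card for the actual argument name.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from prosst.structure.quantizer import PdbQuantizer

model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)

sequence = "MTVKQNLP"  # hypothetical placeholder; use the real residue sequence of the PDB file
struct_tokens = PdbQuantizer(structure_vocab_size=2048)("example_data/p1.pdb", return_residue_seq=False)

inputs = tokenizer(sequence, return_tensors="pt")
ss_input_ids = torch.tensor([struct_tokens], dtype=torch.long)  # may need padding/offset for special tokens

with torch.no_grad():
    logits = model(**inputs, ss_input_ids=ss_input_ids).logits  # (1, seq_len, vocab_size)
```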

Zero-shot mutant effect prediction

See the example notebook: Zero-shot mutant effect prediction.
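As a rough outline of what the notebook computes: a common zero-shot scoring rule for masked protein language models is the log-odds of the mutant versus the wild-type amino acid at each mutated position, summed over mutations. The helper below sketches that scoring step only; the function name, the special-token offset, and the exact rule used by the notebook are assumptions here.

```python
# Hedged sketch of log-odds mutant scoring, given per-position logits for the
# wild-type sequence (e.g. from the masked-LM forward pass shown above).
import torch

def mutant_log_odds(logits, tokenizer, mutations, offset=1):
    """logits: (seq_len, vocab_size); mutations like ["A24G"] (wild type, 1-based position, mutant).
    `offset` accounts for special tokens prepended by the tokenizer (assumption)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    score = 0.0
    for mut in mutations:
        wt, pos, mt = mut[0], int(mut[1:-1]), mut[-1]
        wt_id = tokenizer.convert_tokens_to_ids(wt)
        mt_id = tokenizer.convert_tokens_to_ids(mt)
        score += (log_probs[pos - 1 + offset, mt_id] - log_probs[pos - 1 + offset, wt_id]).item()
    return score  # higher score = mutation predicted to be more favorable
```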

Run ProteinGYM Benchmark

Download the dataset from Google Drive (the archive contains the quantized structures used for ProteinGYM).

cd example_data
unzip proteingym_benchmark.zip
python zero_shot/proteingym_benchmark.py --model_path AI4Protein/ProSST-2048 \
--structure_dir example_data/structure_sequence/2048

🛡️ License

This project is licensed under GPL-3.0. See LICENSE for details.

📝 Citation

If you find this repository useful, please consider citing this paper:

@article{Li2024.04.15.589672,
	author = {Li, Mingchen and Tan, Yang and Ma, Xinzhu and Zhong, Bozitao and Zhou, Ziyi and Yu, Huiqun and Ouyang, Wanli and Hong, Liang and Zhou, Bingxin and Tan, Pan},
	title = {ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention},
	elocation-id = {2024.04.15.589672},
	year = {2024},
	doi = {10.1101/2024.04.15.589672},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/05/17/2024.04.15.589672.1},
	eprint = {https://www.biorxiv.org/content/early/2024/05/17/2024.04.15.589672.1.full.pdf},
	journal = {bioRxiv}
}