Protein design

Problem definition

Datasets

CATH

domain-wise single-chain design

PDB

multi-chain dataset: the ProteinMPNN dataset, derived from high-resolution PDB structures with fewer than 10,000 residues. Sequences are clustered at 30% identity, resulting in 25,361 clusters.

During training, one member is sampled at random from each cluster.
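A minimal sketch of the per-cluster sampling noted above, assuming clusters are stored as a mapping from cluster id to member chain ids. The cluster ids, chain ids, and the `sample_epoch` helper are hypothetical and only illustrate the idea; this is not the ProteinMPNN data loader.

```python
import random

def sample_epoch(clusters, rng):
    """Return one randomly chosen member per cluster.

    `clusters` maps a cluster id to the list of chain ids it contains.
    """
    return [rng.choice(members) for members in clusters.values()]

# Hypothetical toy clusters; the real dataset has 25,361 of them.
clusters = {
    "c1": ["1abc_A", "2xyz_B"],
    "c2": ["3def_A"],
    "c3": ["4ghi_A", "4ghi_B", "5jkl_C"],
}

rng = random.Random(0)          # fixed seed so the draw is reproducible
epoch = sample_epoch(clusters, rng)
print(epoch)                    # one chain id per cluster
```

Calling `sample_epoch` again at the start of each epoch draws a fresh member per cluster, so large clusters do not dominate training.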

TS50

T500

TS45

de novo protein structures from CASP15

Free Modeling (FM), Template-Based Modeling (TBM), TBM-easy, TBM-hard

18 FM, 25+2 TBM (20 TBM-easy, 5 TBM-hard, 2 FM/TBM)

Baselines

transformer-based

StructGNN: C-alpha geometric features

GraphTrans: C-alpha geometric features

GCA: C-alpha geometric features, global attention

GVP: novel GNN layer for invariant and equivariant features

AlphaDesign: replaces the decoder with an iterative 1D CNN

ProteinMPNN: incorporates additional structural information

PiFold: a combination of AlphaDesign and ProteinMPNN

KWDesign: ensemble model that uses PiFold to create a prompt template and draws on pre-trained knowledge (ESM-650M and the encoder of the structure-pretrained ESMIF)

Autoregressive

slow for generating long proteins

Iterative

generate residues in parallel, then iteratively refine the generated sequence (AlphaDesign and KWDesign). The number of refinement steps t determines the inference time cost.

One-shot

generate the whole protein sequence in parallel in a single pass (PiFold)

Metrics

Perplexity

computation
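The notes do not spell out the computation. A minimal sketch, assuming the usual definition of perplexity as the exponential of the mean negative log-probability that the model assigns to the native residues (the function name and inputs are illustrative):

```python
import math

def perplexity(native_probs):
    """exp of the mean negative log-probability of the native residues.

    `native_probs[i]` is the model's predicted probability of the native
    amino acid at position i (assumed to be strictly positive).
    """
    nll = -sum(math.log(p) for p in native_probs) / len(native_probs)
    return math.exp(nll)

# A model that guesses uniformly over 20 amino acids has perplexity ~20;
# a model that is always certain and correct has perplexity 1.
uniform = perplexity([1 / 20] * 10)
perfect = perplexity([1.0, 1.0])
```

Lower is better: the value can be read as the effective number of amino acids the model is still choosing between at each position.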

Recovery

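Recovery is the primary metric. It measures how well the designed protein recovers its original (native) residues, i.e. the fraction of positions where the designed residue matches the native one: $\mathrm{Rec} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{s}_i = s_i]$, where $s_i$ is the native residue and $\hat{s}_i$ the designed residue at position $i$ of an $N$-residue protein.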
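As a sketch, recovery can be computed as the fraction of positions where the designed residue equals the native one. The two sequences below are made-up examples.

```python
def recovery(designed, native):
    """Fraction of positions at which the designed residue matches the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Two of the eight positions differ, so 6 / 8 residues are recovered.
rec = recovery("MKTAYIAK", "MKTVYIAR")
```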

Confidence

When no reference sequence is available, measure confidence instead, defined as the average predicted probability of the designed amino acids.
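A minimal sketch of this definition, assuming the model exposes a per-position probability table (here a list of dicts, which is an illustrative data layout and not a specific model's output format):

```python
def confidence(position_probs, designed):
    """Mean predicted probability of the residues actually chosen."""
    chosen = [probs[aa] for probs, aa in zip(position_probs, designed)]
    return sum(chosen) / len(chosen)

# Toy two-letter alphabet; real models predict over 20 amino acids.
position_probs = [
    {"A": 0.9, "G": 0.1},
    {"A": 0.5, "G": 0.5},
    {"A": 0.3, "G": 0.7},
]
conf = confidence(position_probs, "AAG")  # mean of 0.9, 0.5, 0.7
```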

Diversity

Design should explore a set of proteins rather than a single sequence, so generating diverse sequences is important for reasonable optimization in protein space.

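The pairwise diversity between two designed sequences $S^a$ and $S^b$ of length $N$ is $\mathrm{Div}(S^a, S^b) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[s^a_i \neq s^b_i]$. The overall diversity is the average pairwise diversity over the set of designed sequences.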
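A sketch of both quantities, assuming pairwise diversity is the fraction of differing positions and overall diversity is its mean over all distinct pairs of designs. The exact normalization used in the source is not shown in these notes, so the averaging over unordered pairs is an assumption.

```python
from itertools import combinations

def pairwise_diversity(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def overall_diversity(seqs):
    """Mean pairwise diversity over all distinct pairs (assumed normalization)."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diversity(a, b) for a, b in pairs) / len(pairs)

# Pairwise values are 0.25, 0.5 and 0.25, so the mean is 1/3.
designs = ["AAAA", "AAAB", "ABAB"]
div = overall_diversity(designs)
```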

sc-TM

structural similarity

self-consistent TM-score

Robustness

Apply small Gaussian perturbations to the Cartesian coordinates of the structure, then check how much the designed sequence changes.
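A minimal sketch of the perturbation test. `design_fn` is a hypothetical stand-in for any inverse-folding model (coordinates in, sequence out), and scoring robustness as the fraction of unchanged positions is an assumption, since the notes do not fix the score.

```python
import random

def perturb(coords, sigma, rng):
    """Add independent N(0, sigma^2) noise to every x, y, z coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in xyz) for xyz in coords]

def robustness(design_fn, coords, sigma, rng):
    """Fraction of positions unchanged between designs on clean and noisy input.

    design_fn is a hypothetical callable: list of (x, y, z) -> sequence string.
    """
    clean = design_fn(coords)
    noisy = design_fn(perturb(coords, sigma, rng))
    return sum(a == b for a, b in zip(clean, noisy)) / len(clean)

# Toy backbone with two C-alpha atoms 3.8 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
rng = random.Random(0)
noisy_coords = perturb(coords, 0.1, rng)
```

Sweeping `sigma` over a few small values shows how quickly the designed sequence drifts as the input structure gets noisier.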

Efficiency

computational resources and time required for a design

indicates the model's scalability and practicality
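One simple way to measure the time side of efficiency is wall-clock time per design. The sketch below assumes a hypothetical `design_fn` callable and reports the best of a few repeats to reduce noise; memory and hardware cost would need separate tooling.

```python
import time

def time_design(design_fn, structure, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls to design_fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        design_fn(structure)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Toy stand-in for a model: emits one alanine per input position.
elapsed = time_design(lambda s: "A" * len(s), [0] * 10)
```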