Protein design
Problem definition
Datasets
CATH
domain-wise single-chain design
PDB
multi-chain dataset: the ProteinMPNN dataset, derived from high-resolution PDB entries with fewer than 10,000 residues; sequences are clustered at 30% identity, resulting in 25,361 clusters.
During training, one entry is randomly sampled from each cluster!
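The per-cluster sampling above can be sketched as follows. This is a minimal illustration, not the ProteinMPNN data loader: the cluster ids, chain names, and the `sample_epoch` helper are hypothetical.

```python
import random

def sample_epoch(clusters, rng):
    """Pick one member uniformly at random from every cluster."""
    return [rng.choice(members) for members in clusters.values()]

# Hypothetical toy clusters (cluster id -> chain ids); not real dataset entries.
clusters = {
    "c1": ["chainA", "chainB", "chainC"],
    "c2": ["chainD"],
    "c3": ["chainE", "chainF"],
}

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
epoch = sample_epoch(clusters, rng)
print(epoch)  # one chain id per cluster, in cluster order
```

Resampling every epoch means each cluster contributes exactly one example per pass, so large clusters do not dominate training.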
TS50
T500
TS45
de novo protein structures from CASP15
Free Modeling (FM), Template-Based Modeling (TBM), TBM-easy, TBM-hard
18 FM, 25+2 TBM (20 TBM-easy, 5 TBM-hard, 2 FM/TBM)
Baselines
transformer-based
StructGNN C-alpha geometric features
GraphTrans C-alpha geometric features
GCA C-alpha geometric features, global attention
GVP novel GNN layer for invariant and equivariant features
AlphaDesign replaces the decoder with an iterative 1D CNN
ProteinMPNN incorporate additional structural information
PiFold a combination of AlphaDesign and ProteinMPNN
KWDesign ensemble model that uses PiFold to create a prompt template and draws on pre-trained knowledge (ESM-650M and the structure-pretrained ESMIF encoder)
Autoregressive
slow for generating long proteins
Iterative
generate residues in parallel and iteratively refine the generated sequence (AlphaDesign and KWDesign)
One-shot
generate the whole protein sequence in parallel in a single pass (PiFold)
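The three decoding schemes differ mainly in how many model forward passes they need. The sketch below makes that concrete with a hypothetical `ToyModel` that only counts calls; it is an assumption for illustration and does not reproduce any of the baselines listed above.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

class ToyModel:
    """Hypothetical stand-in that counts forward passes; not a real design model."""
    def __init__(self):
        self.calls = 0

    def predict_next(self, prefix):
        # Autoregressive step: one residue conditioned on the prefix so far.
        self.calls += 1
        return ALPHABET[len(prefix) % len(ALPHABET)]

    def predict_all(self, length):
        # One-shot step: every position predicted in a single pass.
        self.calls += 1
        return [ALPHABET[i % len(ALPHABET)] for i in range(length)]

    def refine(self, seq):
        # Iterative step: re-predict all positions given the previous sequence.
        self.calls += 1
        return list(seq)

def decode_autoregressive(model, length):
    seq = []
    for _ in range(length):
        seq.append(model.predict_next(seq))
    return "".join(seq)

def decode_one_shot(model, length):
    return "".join(model.predict_all(length))

def decode_iterative(model, length, steps):
    seq = model.predict_all(length)
    for _ in range(steps):
        seq = model.refine(seq)
    return "".join(seq)

ar_model, os_model, it_model = ToyModel(), ToyModel(), ToyModel()
ar_seq = decode_autoregressive(ar_model, 50)
os_seq = decode_one_shot(os_model, 50)
it_seq = decode_iterative(it_model, 50, steps=3)
print(ar_model.calls, os_model.calls, it_model.calls)
```

The autoregressive loop needs one pass per residue, which is why it is slow on long proteins; the iterative scheme needs a fixed number of passes regardless of length.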
Metrics
Perplexity
computation
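Perplexity is commonly computed as the exponential of the average negative log-likelihood that the model assigns to the native residues. A minimal sketch, assuming the model output is available as one probability dictionary per position (the `probs` layout and the function name are assumptions):

```python
import math

def perplexity(probs, native):
    """probs: list of dicts (residue -> probability), one per position.
    native: the reference sequence as a string of the same length."""
    if len(probs) != len(native):
        raise ValueError("probs and native must have the same length")
    nll = -sum(math.log(p[aa]) for p, aa in zip(probs, native)) / len(native)
    return math.exp(nll)

# A model that is uniform over the 20 amino acids is as uncertain as possible.
uniform = {aa: 1.0 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
ppl = perplexity([uniform] * 4, "ACDE")
print(ppl)
```

Lower is better: a uniform predictor scores about 20, and a model that puts all probability on the native residue scores 1.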
Recovery
Recovery is the primary metric: the fraction of positions at which the designed sequence reproduces the native residue.
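Recovery reduces to a per-position identity count. A minimal sketch, assuming the designed and native sequences are already aligned and of equal length (the function name is illustrative):

```python
def recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

rec = recovery("ACDE", "ACDF")
print(rec)
```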
Confidence
When no reference sequence is available, measure confidence instead, defined as the average predictive probability of the designed amino acids.
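Following the definition above, confidence averages the probability the model assigned to each residue it actually chose. A minimal sketch, assuming per-position probability dictionaries (the `probs` layout and the toy numbers are assumptions):

```python
def confidence(probs, designed):
    """Mean predicted probability of the designed residue at each position."""
    if len(probs) != len(designed):
        raise ValueError("probs and designed must have the same length")
    return sum(p[aa] for p, aa in zip(probs, designed)) / len(designed)

# Hypothetical two-position output over a two-letter alphabet.
probs = [{"A": 0.9, "G": 0.1}, {"A": 0.4, "G": 0.6}]
conf = confidence(probs, "AG")
print(conf)
```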
Diversity
Explore a set of proteins rather than a single sequence; diverse sequence generation is important for reasonable optimization in protein space.
Measured as pairwise diversity among the designed sequences.
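The notes do not give a formula for pairwise diversity. One common choice, assumed here, is the mean fraction of differing positions over all pairs of equal-length designs:

```python
from itertools import combinations

def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pairwise_diversity(seqs):
    """Mean pairwise Hamming fraction over all unordered pairs (assumed definition)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)

div = pairwise_diversity(["AAAA", "AAAB", "AABB"])
print(div)
```

A value of 0 means all designs are identical; a value of 1 means no two designs share a residue at any position.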
sc-TM
structural similarity
self-consistent TM-score
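For sc-TM, the designed sequence is folded with a structure predictor and the prediction is compared with the input backbone by TM-score. The sketch below covers only the scoring term for one fixed superposition, using the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8; the full TM-score takes the maximum over superpositions, and the folding step is omitted. The function names and the distance-list input are assumptions.

```python
def tm_d0(target_length):
    """Length-dependent distance scale used by TM-score (meaningful for L > 15)."""
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8

def tm_score_from_distances(distances, target_length):
    """TM-score term for one given superposition.
    distances: C-alpha distances (in angstroms) between aligned residue pairs."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

L = 100
perfect = tm_score_from_distances([0.0] * L, L)      # identical structures
half = tm_score_from_distances([tm_d0(L)] * L, L)    # every pair exactly d0 apart
print(perfect, half)
```

In practice an established tool such as TM-align would be used for the superposition search; this sketch only shows how the score is accumulated.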
Robustness
Apply small Gaussian perturbations to the Cartesian coordinates of the structure and observe how the designed sequence changes.
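The perturbation step can be sketched as below. The coordinates and noise scale are hypothetical, and the design model that would consume the perturbed structure is left out.

```python
import random

def perturb(coords, sigma, rng):
    """Add independent Gaussian noise N(0, sigma^2) to every x, y, z coordinate."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in atom) for atom in coords]

# Hypothetical C-alpha trace (angstroms); not taken from a real structure.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
noisy = perturb(coords, sigma=0.1, rng=rng)
unchanged = perturb(coords, sigma=0.0, rng=rng)
print(noisy)
```

Robustness is then read off by designing on both `coords` and `noisy` and comparing the two sequences, for example by the fraction of positions that stay identical.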
Efficiency
computational resources and time for a design
reflects the model's scalability and practicality