ESM3 Structure-Conditioned Loop Design: TIMP3 C-Loop
Assignment Setup
Create your GitHub repository
- Create a private GitHub repository named ESM3-TIMP3-StructCond
- Initialize with a README.md
- Add Dr. Yim as a collaborator (GitHub username: wyim-pgl)
- Your repository structure should look like this when complete:
```
ESM3-TIMP3-StructCond/
    README.md
    environment.yml
    01_seq_only_generation.py
    02_struct_conditioned_generation.py
    03_validation.py
    04_analysis.ipynb            # Figures + written analysis
    results/
        seq_only_loops.csv
        struct_cond_loops.csv
        validation_metrics.csv
    figures/
        fig1_plddt_distribution.png
        fig2_rmsd_vs_plddt.png
        fig3_aa_frequency_heatmap.png
```
Environment setup
```yaml
# environment.yml
name: esm3-timp3
channels:
  - conda-forge
  - pytorch
  - nvidia
dependencies:
  - python=3.10
  - pytorch>=2.0
  - pytorch-cuda=12.1
  - pip
  - pip:
      - esm
      - biotite
      - py3Dmol
      - matplotlib
      - seaborn
      - pandas
      - scipy
```

```
conda env create -f environment.yml
conda activate esm3-timp3
```
CRITICAL: The BOS/EOS Token Indexing Trap
Read this before writing any code.
When you call `model.encode()`, ESM3 wraps the sequence with special tokens:

```
Your sequence:   M    C    P    Q   ...  res_62  ...  res_187
Position:        0    1    2    3   ...  62      ...  187

After encode():
Token index:     0    1    2    3    4  ...  63  ...  188      189
Token:           BOS  M    C    P    Q  ...  res_62  res_187   EOS
                 4098                                          4097
```

Sequence position `i` = token index `i + 1`. Tensor length = `len(sequence) + 2`.
When copying structure tokens between tensors, either:
- Work entirely in token space (index-to-index copy), OR
- Work in residue space and add +1 for every index
Mixing conventions shifts everything by one residue. This is the #1 source of silent bugs in multi-track prompts.
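The offset rule above can be captured in two tiny helpers. This is an illustrative sketch (the function names are not from the esm package); keeping all conversions behind functions like these is one way to avoid mixing conventions:

```python
# Illustrative helpers for the +1 BOS offset described above.

def res_to_tok(i: int) -> int:
    """Residue position i (0-based) -> token index in the encoded tensor."""
    return i + 1  # skip the BOS token at index 0

def tok_to_res(j: int, seq_len: int) -> int:
    """Token index j -> residue position; BOS/EOS carry no residue."""
    if j == 0 or j == seq_len + 1:
        raise ValueError("token index points at BOS/EOS, not a residue")
    return j - 1

# C-loop residues 62-67 live at token indices 63-68
loop_token_indices = [res_to_tok(i) for i in range(62, 68)]
```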
Special token values
| Token | Value | Meaning |
|---|---|---|
| BOS | 4098 | Beginning of sequence (index 0) |
| EOS | 4097 | End of sequence (last index) |
| MASK | 4096 | Masked position (model generates here) |
| 0-4095 | – | Structure codebook (real structural info) |
Part 1: Sequence-Only Generation [15 pts]
Question 1.1: ESM3 Sequence-Only Generation
Write 01_seq_only_generation.py that:
- Loads ESM3 (`ESM3.from_pretrained("esm3_sm_open_v1")`)
- Loads the TIMP3 mature sequence (UniProt P35625, signal peptide removed)
- Masks the C-loop at positions 62-67 with `_`
- Generates 50 sequences using `GenerationConfig(track="sequence")`
- Extracts the 6-mer loop from each generated sequence
- Saves results to `results/seq_only_loops.csv` with columns: `index`, `full_sequence`, `loop_sequence`
- Prints summary statistics: number of unique loops, wildtype matches, per-position entropy
Note
This replicates your current pipeline. The point is to have a clean baseline for comparison.
Hint
For reproducibility, set `temperature=0.7` and `num_steps=8`. Save the model loading and sequence setup in a shared utility if you want to avoid duplication with Part 2.
Reference
Your existing code in `Generation/Training_with_Gen.ipynb`, Step 5b.
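The per-position entropy summary can be computed with the standard library alone. A minimal sketch, where `loops` stands in for the 6-mer loop sequences you read back from `results/seq_only_loops.csv` (the toy data here is illustrative):

```python
# Per-position Shannon entropy over generated loop sequences.
import math
from collections import Counter

def position_entropy(loops, pos):
    """Shannon entropy (bits) of the amino acids observed at one loop position."""
    counts = Counter(seq[pos] for seq in loops)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

loops = ["ACDEFG", "ACDEFG", "ACDWFG", "PCDEFG"]  # toy data, not real results
for pos in range(6):
    print(f"pos {pos}: {position_entropy(loops, pos):.2f} bits")
```

A perfectly conserved position gives 0 bits; a uniform distribution over all 20 amino acids gives log2(20) ≈ 4.32 bits.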
Question 1.2 (Optional Bonus, +5 pts)
Build a PSSM baseline from your experimental C-loop library (~47,000 sequences). Compute a position frequency matrix for the 6 loop positions, sample 50 sequences from that distribution, and save to results/pssm_loops.csv.
If you complete this, include the PSSM baseline in all downstream comparisons (Parts 3-4) for a 3-way analysis.
Note
This directly feeds into Q4.4. If you have the data, your answer to “when does a PSSM beat ESM3?” becomes empirical instead of speculative.
Hint
With 85% positive binders and 47K sequences, the PSSM captures the experimental fitness landscape well. The question is whether ESM3 adds anything beyond what this simple model already knows.
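One way to sample from the position frequency matrix, assuming the PSSM is represented as a list of six dicts mapping amino acid to frequency (the toy PSSM below is illustrative; build yours from the ~47,000 experimental loops):

```python
# Sample 6-mer loops from a position frequency matrix (independent columns).
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"

def sample_from_pssm(pssm, n, seed=0):
    """Draw n sequences, one independent draw per loop position."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n):
        seq = "".join(
            rng.choices(AAS, weights=[col.get(aa, 0.0) for aa in AAS])[0]
            for col in pssm
        )
        seqs.append(seq)
    return seqs

toy_pssm = [{"A": 0.5, "G": 0.5}] * 6  # toy two-letter PSSM
samples = sample_from_pssm(toy_pssm, 3)
```

Note this treats positions as independent; any pairwise couplings in your library are deliberately ignored, which is part of what Q4.4 asks you to think about.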
Part 2: Structure-Conditioned Generation [25 pts]
Question 2.1: Load the template structure
Write the structure-loading section of 02_struct_conditioned_generation.py.
Download PDB 3CKI and extract the N-TIMP3 chain.
Note
3CKI is the TACE(ADAM17)-N-TIMP3 complex (Wisniewska et al., 2008, J. Mol. Biol.). It has two chains:
- Chain A: TACE catalytic domain
- Chain B: N-TIMP3 (~125 residues, N-terminal domain only)
This is NOT the full-length TIMP3 (188 residues). The C-loop (positions 62-67) falls within the N-terminal domain, so 3CKI covers your target region.
CAUTION
PDB residue numbering may differ from your full-length sequence indexing. After loading, align the 3CKI chain B sequence against your TIMP3 sequence to find the correct position mapping. Do not assume they match.
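Because the N-terminal domain is a contiguous stretch of the full-length sequence, a simple substring search is often enough to find the offset; fall back to a real alignment (e.g. `biotite.sequence.align`) if there are mismatches or gaps. A sketch with toy sequences (not the real TIMP3 sequence):

```python
# Locate the PDB chain sequence inside the full-length sequence.
def map_offset(full_seq: str, pdb_seq: str) -> int:
    """Return the 0-based offset of pdb_seq within full_seq, or raise."""
    offset = full_seq.find(pdb_seq)
    if offset < 0:
        raise ValueError("PDB sequence not found verbatim; use an alignment")
    return offset

# Toy example: residue i of the PDB chain is residue i + offset of full_seq.
full = "MCTCSPSHPQDAFCNSDIVIRA"
pdb = "SPSHPQDAFC"
offset = map_offset(full, pdb)
```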
Hint
```python
timp3_chain = ProteinChain.from_rcsb("3CKI", chain_id="B")
template = ESMProtein.from_protein_chain(timp3_chain)
template_tokens = model.encode(template)
```
Question 2.2: Construct the multi-track prompt
In the same script, build a prompt with:
- Sequence track: C-loop masked, everything else is wildtype
- Structure track: C-loop masked (4096), flanking regions have real structure tokens copied from the 3CKI template
Your code must include a sanity check that prints:
- Number of structure-conditioned positions (should be all non-loop residues)
- Number of masked positions (should be exactly 6)
- A visual map showing which positions have structure conditioning
CAUTION
This is where the BOS/EOS offset matters. If your sanity check shows anything other than 6 masked structure positions, your indexing is wrong. Go back and re-read the BOS/EOS section above.
Hint
Initialize an all-masked structure track, set BOS/EOS, then copy flanking tokens:
```python
prompt_tokens.structure = torch.full_like(prompt_tokens.sequence, 4096)
prompt_tokens.structure[0] = 4098   # BOS
prompt_tokens.structure[-1] = 4097  # EOS
# Copy flanking structure: residue i -> token index i + 1
```
Hint
Sanity check pattern:
```python
n_masked = (prompt_tokens.structure == 4096).sum().item()
assert n_masked == 6, f"Expected 6 masked, got {n_masked}"
```
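The required "visual map" can be a one-line-per-prompt string. A hypothetical helper (the name and encoding are illustrative, not from the esm package), which accepts the structure track as a tensor or a plain list of token values:

```python
# Print one character per token index:
# B/E = BOS/EOS, M = masked (4096), S = structure-conditioned.
def structure_map(struct_tokens) -> str:
    toks = list(struct_tokens)  # works for a 1-D torch tensor or a list
    chars = []
    for j, tok in enumerate(toks):
        if j == 0:
            chars.append("B")
        elif j == len(toks) - 1:
            chars.append("E")
        elif tok == 4096:
            chars.append("M")
        else:
            chars.append("S")
    return "".join(chars)
```

A correct prompt shows exactly six `M` characters, flanked by `S` on both sides.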
Question 2.3: Single-pass multi-track generation
Using the multi-track prompt from Q2.2, generate 50 sequences. Both sequence and structure tracks are masked at the loop positions, so ESM3 fills in both simultaneously in a single generate() call.
Save results to results/struct_cond_loops.csv with the same format as Part 1.
Note
The key difference from Part 1 is that the structure track now carries real structural context from 3CKI for the flanking regions. ESM3 sees the loop’s structural neighborhood when generating the sequence, not just the amino acid context.
Hint
Use `temperature=0.7` and `num_steps=8` (same as Part 1 for a fair comparison).
Reference
ESM3 GFP tutorial – https://virtualcellmodels.cziscience.com/tutorial/esm3-tutorial
Part 3: Structural Validation [25 pts]
Question 3.1
Write 03_validation.py that:
- Reads both CSV files from `results/`
- Folds every generated sequence using ESM3 (`track="structure"`, `temperature=0.0` for deterministic prediction)
- For each folded structure, calculates:
  - Mean pLDDT of the C-loop region (positions 62-67)
  - Loop backbone RMSD vs. the 3CKI template (superimpose on non-loop residues first, then measure loop RMSD)
  - Overall backbone RMSD
- Saves all metrics to `results/validation_metrics.csv` with columns: `method`, `index`, `loop_sequence`, `loop_plddt`, `loop_rmsd`, `backbone_rmsd`
Note
Superimpose on the conserved (non-loop) backbone first, THEN measure loop RMSD without re-superimposing. This isolates loop geometry differences from global alignment noise.
Hint
For pLDDT extraction:
```python
folded_chain = folded_protein.to_protein_chain()
loop_plddt = folded_chain.confidence[62:68].mean()
```
Hint
For RMSD with biotite:
```python
from biotite.structure import superimpose, rmsd
# Select CA atoms; exclude loop for superimposition, include loop for RMSD
```
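To make the superimpose-then-measure protocol concrete, here is a self-contained numpy sketch of the same logic (a Kabsch fit on the non-loop CA coordinates, then loop RMSD without re-fitting). `biotite.structure.superimpose` does this job in the real script; this sketch only illustrates the order of operations:

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Optimal rotation R (3x3) with R @ p ~ q, for centred (N,3) arrays."""
    H = P.T @ Q                      # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def loop_rmsd(model_ca, ref_ca, loop_idx):
    """Fit on non-loop CA atoms, then RMSD over loop CA atoms only."""
    mask = np.ones(len(ref_ca), dtype=bool)
    mask[loop_idx] = False           # exclude the loop from the fit
    pm, qm = model_ca[mask].mean(0), ref_ca[mask].mean(0)
    R = kabsch_rotation(model_ca[mask] - pm, ref_ca[mask] - qm)
    moved = (model_ca - pm) @ R.T + qm
    diff = moved[loop_idx] - ref_ca[loop_idx]
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

If the loop were included in the fit, a badly modelled loop would drag the superposition and hide its own error, which is exactly what the note above warns against.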
Question 3.2
In 04_analysis.ipynb, create three figures and save them to figures/:
Figure 1 (fig1_plddt_distribution.png): Loop pLDDT distribution
- Box plot or violin plot, one per method
- Horizontal line at pLDDT = 70
Figure 2 (fig2_rmsd_vs_plddt.png): Loop RMSD vs. pLDDT scatter
- Color by method
- Label the high-pLDDT / low-RMSD quadrant as the ideal region
Figure 3 (fig3_aa_frequency_heatmap.png): Amino acid frequency comparison
- Side-by-side heatmaps (6 positions x 20 amino acids)
- Same color scale for both panels
Hint
Use `matplotlib.pyplot.subplots(1, 2, ...)` with `sharey=True` so the heatmap axes line up, and pass the same `vmin`/`vmax` to both panels so the color scale is identical (`sharey` alone only aligns the axes, not the colors).
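A sketch for Figure 3, assuming `seq_loops` and `struct_loops` are the lists of 6-mer strings read from the two CSVs (function names are illustrative). The shared `vmin`/`vmax` is what keeps the two panels comparable:

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def freq_matrix(loops):
    """6 x 20 matrix of amino acid frequencies per loop position."""
    mat = np.zeros((6, len(AAS)))
    for seq in loops:
        for pos, aa in enumerate(seq):
            mat[pos, AAS.index(aa)] += 1
    return mat / len(loops)

def plot_heatmaps(seq_loops, struct_loops, out_path):
    import matplotlib
    matplotlib.use("Agg")               # headless-safe backend
    import matplotlib.pyplot as plt
    mats = [freq_matrix(seq_loops), freq_matrix(struct_loops)]
    vmax = max(m.max() for m in mats)   # one color scale for both panels
    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
    for ax, mat, title in zip(axes, mats, ["seq-only", "struct-cond"]):
        im = ax.imshow(mat, vmin=0.0, vmax=vmax, aspect="auto")
        ax.set_xticks(range(len(AAS)), labels=list(AAS))
        ax.set_yticks(range(6), labels=[str(62 + i) for i in range(6)])
        ax.set_title(title)
    fig.colorbar(im, ax=axes, label="frequency")
    fig.savefig(out_path, dpi=200)
```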
Part 4: Critical Analysis [30 pts]
Answer these in 04_analysis.ipynb as markdown cells. Each answer: 3-5 sentences, supported by specific numbers from your results.
Question 4.1: Sequence divergence [6 pts]
Does structure conditioning produce measurably different loop sequences compared to sequence-only generation? Show evidence from your amino acid frequency data. If the distributions are similar, what does that tell you about what ESM3’s sequence track already knows about structure?
Question 4.2: Structural quality [6 pts]
Compare pLDDT and RMSD distributions between methods. Define a threshold for “good” (justify it), report the pass rate for each method, and run a statistical test (Mann-Whitney U or t-test).
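The statistical test can use `scipy.stats.mannwhitneyu` (scipy is already in `environment.yml`). A sketch with toy arrays standing in for the `loop_plddt` columns of the two methods, using pLDDT ≥ 70 as the example threshold from Figure 1:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy pLDDT values, not real results.
seq_only = np.array([62.0, 68.0, 71.0, 74.0, 66.0, 70.5])
struct_cond = np.array([78.0, 81.0, 75.0, 83.0, 79.0, 76.0])

# One-sided test: is struct-conditioned pLDDT stochastically greater?
stat, p = mannwhitneyu(struct_cond, seq_only, alternative="greater")

# Pass rate at the chosen "good" threshold.
pass_rate = {"seq_only": (seq_only >= 70).mean(),
             "struct_cond": (struct_cond >= 70).mean()}
```

Mann-Whitney U is the safer default here: pLDDT distributions are bounded and often skewed, so the normality assumption behind a t-test may not hold.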
Question 4.3: The 85% baseline problem [6 pts]
The experimental TIMP3 C-loop library has ~47,000 sequences with ~85% positive binders. That means almost any 6-mer works. How would you demonstrate that ESM3 generates sequences that are genuinely better than random? What metric beyond binary binder/non-binder would you use?
Hint
Think about what “better” could mean – binding affinity (Ki), thermal stability, specificity for one MMP over another, resistance to proteolysis. The binary hit rate is not enough when the baseline is 85%.
Question 4.4: Model appropriateness [6 pts]
1.4 billion parameters for 6 amino acids. The experimental data already gives you a PSSM from 47,000 sequences. When would the PSSM win? When would ESM3 win? Consider data availability, in-distribution vs. out-of-distribution design, and compute cost.
Question 4.5: Scaling beyond loop redesign [6 pts]
If you applied multi-track conditioning to a de novo protein design (not loop redesign), what changes? In this assignment you used single-pass generation where ESM3 fills sequence and structure simultaneously. Under what circumstances would a two-stage approach (generate structure first, then design sequence) be more appropriate? What fails when the target fold has no close homolog in ESM3’s training data?
Reference
Hayes et al. (2025), particularly the discussion of how esmGFP was designed at 58% identity to known GFPs – a case where structure conditioning was essential because sequence homology alone was insufficient.
Submission
- Push all code, results, and figures to your GitHub repository
- Ensure `README.md` includes:
  - Brief description of the project
  - How to set up the environment (`conda env create -f environment.yml`)
  - How to run each script in order
  - Summary of key findings (2-3 sentences)
- Verify that all scripts run without errors on a clean checkout
- Submit the repository URL
Due date: Check the course schedule
Grading
| Component | Points |
|---|---|
| Part 1: `01_seq_only_generation.py` + output CSV | 15 |
| Part 2: `02_struct_conditioned_generation.py` + output CSV + sanity checks | 25 |
| Part 3: `03_validation.py` + metrics CSV + 3 figures | 25 |
| Part 4: Written analysis (Q4.1-Q4.5, 6 pts each) | 30 |
| Code quality, README, repo organization | 5 |
| Total | 100 |
References
- Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model. Science. DOI: 10.1126/science.ads0018
- Wisniewska, M. et al. (2008). Structural determinants of the ADAM inhibition by TIMP-3: crystal structure of the TACE-N-TIMP-3 complex. J. Mol. Biol., 381(5), 1307-1319. PDB: 3CKI
- ESM3 GitHub: https://github.com/evolutionaryscale/esm
- ESM3 GFP Tutorial: https://virtualcellmodels.cziscience.com/tutorial/esm3-tutorial
- Lin, Z. et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.
Compute Notes
- 50 sequences per method on A100: ~30-60 min
- Structure-conditioned generation takes somewhat longer per sequence (the prompt carries an additional structure track)
- If using T4/V100, reduce to 20 sequences and note in README
- Folding 100 sequences (validation): ~20-30 min additional