Module 3 — Weeks 6-8
The ML that matters for your career: protein/nucleotide embeddings, foundation models, transcriptomics + ML, and deep learning for cell images. This is where biology meets state-of-the-art AI.
Embeddings convert biological sequences into coordinates in high-dimensional space where similar sequences are close together. Think flow cytometry: you project cells into multi-dimensional space based on marker expression, and clusters emerge. Embeddings do the same for sequences, but the "markers" are learned from millions of examples.
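To make "sequences as coordinates" concrete, here is a minimal sketch that uses amino acid composition as a hand-made 20-dimensional embedding and compares sequences by cosine similarity. The three sequences are toy strings invented for illustration, and the helper names are hypothetical. A learned model such as ESM-2 replaces the hand-made features with learned ones, but the "nearby means similar" idea is the same.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(seq):
    """Map a protein sequence to a 20-dimensional vector of residue frequencies."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy sequences invented for illustration; these are not real proteins.
seq_a = "MKKLLAAGGKKLLAAG"
seq_b = "MKKLLAAGGKKLIAAG"  # one substitution relative to seq_a
seq_c = "DDEEDDEEWWDDEEDD"  # shares no residues with seq_a

vec_a, vec_b, vec_c = map(composition_vector, (seq_a, seq_b, seq_c))
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

The near-identical pair scores far higher than the unrelated pair. Composition ignores residue order, which is one reason learned embeddings can separate functional groups that this baseline cannot.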
| Model | Input | Why You Care |
|---|---|---|
| ESM-2 (Meta) | Protein sequences | Predict function, structure, and interactions without wet-lab experiments |
| Nucleotide Transformer | DNA/RNA | Regulatory element prediction, variant effects |
| scGPT | Single-cell expression | Cell type classification, perturbation response |
| DNABERT-2 | DNA sequences | Promoter prediction, variant effects |
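Transformer models such as ESM-2 return one vector per token, so getting a single embedding per sequence needs a pooling step. The sketch below shows mean pooling over the length dimension in plain Python, skipping padded positions. The function name `mean_pool` and the toy numbers are assumptions for illustration, not part of any library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors into one vector, skipping padded positions.

    token_embeddings: list of L vectors, each a list of D floats
    attention_mask:   list of L ints, 1 for a real token, 0 for padding
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("mask removes every position")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy per-token embeddings (L=4, D=3); the last position is padding.
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
    [9.0, 9.0, 9.0],  # padding row, must not affect the result
]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)
```

With Hugging Face `transformers`, the same arithmetic is typically applied to the model's last hidden state together with the tokenizer's attention mask. Averaging without the mask lets padding leak into the embeddings of shorter sequences in a batch.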
GitHub repo: protein-embedding-explorer
What to build:
Library: transformers

1. Write a script that downloads all human complement pathway proteins from UniProt using the REST API. Save sequences and metadata (protein name, gene name, function annotation, subcellular location) as a DataFrame.
2. Generate ESM-2 embeddings for this list of protein sequences using the facebook/esm2_t6_8M_UR50D model from Hugging Face. Use mean pooling over the sequence length dimension. Save the embedding matrix as a NumPy array. Show a progress bar.
3. Run UMAP on these protein embeddings. Color by protein subfamily/function. Compare with UMAP on simple features (amino acid composition, length, molecular weight (MW), isoelectric point (pI)). Show both plots side by side. Which representation separates functional groups better?

GitHub repo: immune-signature-classifier
What to build:
Library: pydeseq2

1. Run differential expression analysis using pydeseq2 comparing IFN-treated vs. untreated samples. Then train a Random Forest (RF) classifier on the top 500 most variable genes. Compare the top 20 differentially expressed genes (DEGs, ranked by p-value) with the top 20 features (ranked by RF importance). Show the overlap as a Venn diagram. How many genes appear in both lists?
2. Create a clustermap of the top 30 ML-selected genes across all samples. Annotate rows: green if the gene is a known interferon-stimulated gene (ISG) from this list [MX1, IFIT1, ISG15, OAS1, STAT1, IRF7, IFITM1, BST2, IFI44, RSAD2], red if not. Are there genes the model finds important that aren't canonical ISGs?

Watch 3Blue1Brown Neural Networks videos 1-4 (~60 min). Best visual DL explanation available.
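For the immune-signature-classifier comparison, the overlap between the DEG ranking and the RF feature ranking comes down to set arithmetic, and the clustermap row annotation is a lookup against the ISG list. The sketch below assumes you already have one dictionary of p-values and one of feature importances keyed by gene name. All numbers and the `GENE_*` names are hypothetical placeholders, and the helper names are illustrative.

```python
# ISG list taken from the project prompt.
KNOWN_ISGS = {"MX1", "IFIT1", "ISG15", "OAS1", "STAT1",
              "IRF7", "IFITM1", "BST2", "IFI44", "RSAD2"}


def top_n(scores, n, largest_first):
    """Return the n gene names with the best scores."""
    ranked = sorted(scores, key=scores.get, reverse=largest_first)
    return ranked[:n]


def compare_rankings(p_values, importances, n):
    """Compare the top-n DEGs (smallest p-value) with the top-n RF features."""
    if n < 1:
        raise ValueError("n must be at least 1")
    degs = set(top_n(p_values, n, largest_first=False))
    feats = set(top_n(importances, n, largest_first=True))
    shared = degs & feats
    return {
        "shared": shared,
        "deg_only": degs - feats,
        "rf_only": feats - degs,
        "jaccard": len(shared) / len(degs | feats),
    }


def row_colors(genes):
    """Green for genes in the known ISG list, red otherwise."""
    return ["green" if gene in KNOWN_ISGS else "red" for gene in genes]


# Hypothetical values for illustration only; not real results.
p_values = {"MX1": 1e-30, "IFIT1": 1e-25, "ISG15": 1e-20,
            "GENE_X": 1e-10, "GENE_Y": 0.5, "GENE_Z": 0.8}
importances = {"MX1": 0.30, "ISG15": 0.25, "GENE_X": 0.20,
               "IFIT1": 0.05, "GENE_Y": 0.02, "GENE_Z": 0.01}

result = compare_rankings(p_values, importances, n=3)
colors = row_colors(["MX1", "GENE_X"])
print(result)
print(colors)
```

The three sets (`deg_only`, `shared`, `rf_only`) are the counts a Venn diagram needs. The genes in `rf_only` that come out red are the ones worth a closer look.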
| Architecture | Good For | Bio Example |
|---|---|---|
| CNN | Images, sequences | Cell classification, motif detection |
| Transformer | Sequences, language | AlphaFold, ESM-2, scGPT |
| GNN | Molecular graphs | Drug-target interaction |
| VAE | Generation, compression | Single-cell analysis, molecule design |
When DL makes sense: large datasets (roughly >10K samples) OR transfer learning, and complex inputs (images, sequences, graphs). When it doesn't: small tabular data (use XGBoost), when interpretability matters for regulatory review, or when a simple model gets you 90% of the way there.
GitHub repo: cell-phenotype-classifier
What to build:
1. Download BBBC021 dataset images. Load them with their treatment labels. Show 3 example images per treatment condition side by side. Resize to 224x224 for the model. Apply standard ImageNet normalization.
2. Fine-tune a pre-trained ResNet18 to classify cell images by treatment. Freeze all layers except the final fully connected (FC) layer. Use data augmentation (random flips, rotation, brightness). Train for 10 epochs with early stopping on validation loss. Plot train/val accuracy and loss curves.
3. Apply Grad-CAM to the trained model. Show the original image alongside the Grad-CAM heatmap for 5 correct and 5 misclassified examples. What cellular features is the model focusing on? Are they biologically meaningful (nucleus shape, organelle distribution, cell morphology)?

You should have 3 new GitHub repos and understand embeddings, transcriptomics ML, and DL fundamentals.
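For the cell-phenotype-classifier training step, early stopping on validation loss is framework-independent bookkeeping: track the best loss seen so far and stop once it has not improved for a set number of epochs. The sketch below is one common way to write it. The class name, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration, not taken from a specific library.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses for a 10-epoch run.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50, 0.40, 0.30, 0.20]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    # In a real loop, training and validation for this epoch would run here.
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

In a real run you would also save the model weights whenever `best_epoch` updates, then reload that checkpoint for evaluation and Grad-CAM. The example losses show the trade-off: a short patience halts the run before the later improvement, so the value deserves some tuning.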