Module 3 — Weeks 6-8

Bio-Specific ML

The ML that matters for your career: protein/nucleotide embeddings, foundation models, transcriptomics + ML, and deep learning for cell images. This is where biology meets state-of-the-art AI.

Week 6

Protein Embeddings with ESM-2

Goal: Understand embeddings — the most important concept in modern bio-AI. Use pre-trained protein models to create meaningful sequence representations.
Concept: What Are Embeddings?

Embeddings convert biological sequences into coordinates in high-dimensional space where similar sequences are close together. Think flow cytometry: you project cells into multi-dimensional space based on marker expression, and clusters emerge. Embeddings do the same for sequences, but the "markers" are learned from millions of examples.

Model | Input | Why You Care
ESM-2 (Meta) | Protein sequences | Predict function, structure, interactions without wet lab
Nucleotide Transformer | DNA/RNA | Regulatory element prediction, variant effect
scGPT | Single-cell expression | Cell type classification, perturbation response
DNABERT-2 | DNA sequences | Promoter prediction, variant effects
Project: Protein Function Clustering

GitHub repo: protein-embedding-explorer

Dataset: UniProt — Download sequences for a protein family. Good options: all human kinases (~500), all human GPCRs (~800), or any well-defined family with clear functional subgroups. 200-500 sequences is plenty. Avoid families that overlap with your day-job work if portfolio independence matters to you.

What to build:

  1. Download protein sequences from UniProt with metadata
  2. Generate ESM-2 embeddings using Hugging Face transformers
  3. Visualize with UMAP — do functional subfamilies cluster?
  4. Compare: ESM-2 embeddings vs. simple features (length, MW, amino acid composition)
  5. Build nearest-neighbor search: "given this protein, find the most similar"
  6. Interactive plotly visualization with hover labels
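Steps 4 and 5 can be prototyped before any model is downloaded. The sketch below is a minimal, standard-library illustration: it builds the "simple features" baseline from amino acid composition only (length, MW, and pI are left out) and runs a cosine-similarity nearest-neighbor search over it. The function names and the toy sequences are invented for illustration. The same search would work on ESM-2 embeddings by swapping in a different vector per protein.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(seq):
    """Return 20 amino acid fractions; non-standard residues are ignored."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence has no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_id, vectors, k=3):
    """Rank every other protein by cosine similarity to the query."""
    query = vectors[query_id]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy sequences, invented for illustration (not real proteins).
toy = {
    "lysine_rich_1": "MKKKAKKLKKAKK",
    "lysine_rich_2": "MKKAKKKLKAKKK",
    "acidic_1": "MDEEDDEEDLEDD",
}
vectors = {name: composition_features(seq) for name, seq in toy.items()}
hits = nearest_neighbors("lysine_rich_1", vectors, k=2)
print(hits[0][0])  # lysine_rich_2
```

The two lysine-rich toy sequences differ in residue order but have identical composition, so this baseline cannot tell them apart. That blind spot is one reason to compare it against learned embeddings.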
Bio-Specific Prompts
Write a script that downloads all human complement pathway proteins from UniProt using the REST API. Save sequences and metadata (protein name, gene name, function annotation, subcellular location) as a DataFrame.
Generate ESM-2 embeddings for this list of protein sequences using the facebook/esm2_t6_8M_UR50D model from Hugging Face. Use mean pooling over the sequence length dimension. Save the embedding matrix as a numpy array. Show a progress bar.
Run UMAP on these protein embeddings. Color by protein subfamily/function. Compare with UMAP on simple features (amino acid composition, length, MW, pI). Show both plots side by side. Which representation separates functional groups better?
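The second prompt asks for "mean pooling over the sequence length dimension". The sketch below shows what that means on a toy example, using plain Python lists rather than tensors. In a Hugging Face pipeline the per-token vectors would typically come from the model output's last_hidden_state and the mask from the tokenizer's attention_mask. Here both are hand-written toy values, and the function name mean_pool is an illustrative choice.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over the positions where the mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [s / kept for s in summed]


# Toy "model output": 4 positions x 3 dimensions; the last position is padding.
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding row, must not affect the result
]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 2.0, 2.0]
```

Masking matters when sequences of different lengths are batched together: without it, padding vectors leak into the average. Whether the special start and end tokens are also excluded is a design choice worth stating in the repo README.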
ML Concepts You'll Pick Up
  • Embeddings — The representation powering all modern bio-AI. Sequences become numbers that capture biological meaning.
  • Foundation models — Pre-trained on massive data, adapted to specific tasks. You don't train them; you use them.
  • Transfer learning — ESM-2 learned from millions of proteins. Your 200 sequences benefit from all that knowledge.
  • Representation learning — Learned features (embeddings) vs. hand-crafted features. The core question in modern ML.
Week 7

Immune Gene Signature Discovery

Goal: Apply ML to transcriptomics. Discover gene signatures that classify immune states — directly building on your published IFN research.
Project: Immune Gene Signature Classifier

GitHub repo: immune-signature-classifier

Dataset: GEO — Search for interferon stimulation RNA-seq datasets (IFN-treated vs. untreated across cell types). Or use DEE2 for uniformly processed data.

What to build:

  1. Download bulk RNA-seq with IFN treatment conditions
  2. Traditional DEG analysis using pydeseq2
  3. ML classification: can a model distinguish IFN-treated from untreated?
  4. Compare ML feature importance vs. known ISGs (interferon-stimulated genes)
  5. Does the model discover non-obvious genes? — YOUR biological insight shines here
  6. Venn diagram: DEGs vs. ML-selected features vs. known ISG list
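The three-way comparison in step 6 comes down to set arithmetic, and computing the regions explicitly makes it easy to list the genes behind each number in the diagram. A minimal sketch follows; the function name and the GENE_A / GENE_B placeholders are hypothetical, and the ISG names are taken from the list used in the prompts.

```python
def venn_regions(degs, ml_features, known_isgs):
    """Split three gene lists into the seven regions of a 3-set Venn diagram."""
    a, b, c = set(degs), set(ml_features), set(known_isgs)
    return {
        "deg_only": a - b - c,
        "ml_only": b - a - c,
        "isg_only": c - a - b,
        "deg_and_ml": (a & b) - c,
        "deg_and_isg": (a & c) - b,
        "ml_and_isg": (b & c) - a,
        "all_three": a & b & c,
    }


# Hypothetical analysis outputs; GENE_A and GENE_B are placeholders.
degs = ["MX1", "IFIT1", "ISG15", "GENE_A"]
ml_features = ["MX1", "ISG15", "GENE_B"]
known_isgs = ["MX1", "IFIT1", "ISG15", "OAS1"]

regions = venn_regions(degs, ml_features, known_isgs)
for name, genes in regions.items():
    print(name, sorted(genes))
```

The "ml_only" region is the one to inspect by hand: those are the genes the model ranks highly that neither the DEG analysis nor the canonical ISG list flags.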
Bio-Specific Prompts
Run differential expression analysis using pydeseq2 comparing IFN-treated vs. untreated samples. Then train a Random Forest classifier on the top 500 most variable genes. Compare the top 20 DEGs (by adjusted p-value) with the top 20 features (by RF importance). Show overlap as a Venn diagram. How many genes appear in both lists?
Create a clustermap of the top 30 ML-selected genes across all samples. Annotate rows: green if the gene is a known ISG from this list [MX1, IFIT1, ISG15, OAS1, STAT1, IRF7, IFITM1, BST2, IFI44, RSAD2], red if not. Are there genes the model finds important that aren't canonical ISGs?
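The first prompt above restricts the classifier to the "most variable genes". A minimal sketch of that filter is shown below, assuming expression values are already normalized and log-transformed. The function name, the input layout (gene name mapped to per-sample values), and the toy numbers are illustrative assumptions, not output from a real dataset.

```python
from statistics import pvariance


def top_variable_genes(expression, n):
    """Return the n gene names with the highest variance across samples.

    expression: dict mapping gene name -> list of per-sample values.
    """
    variances = {gene: pvariance(values) for gene, values in expression.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n]


# Toy values for four samples (two untreated, two treated); GENE_FLAT is a placeholder.
toy_expression = {
    "MX1": [1.0, 1.2, 6.0, 6.3],
    "GENE_FLAT": [5.0, 5.1, 5.0, 4.9],
    "ISG15": [2.0, 2.1, 5.0, 5.2],
}
selected = top_variable_genes(toy_expression, 2)
print(selected)  # ['MX1', 'ISG15']
```

If the classifier is evaluated on held-out samples, it is usually safer to compute the variances on the training samples only, so the feature filter does not see the test data.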
ML Concepts You'll Pick Up
  • ML vs. traditional statistics — DEGs and feature importance answer different questions. They're complementary.
  • High-dimensional data — Thousands of genes, few samples. ML handles this with regularization and feature selection.
  • Biological validation — Checking if ML results make biological sense. Your domain expertise IS the validation.
Week 8

Deep Learning & Cell Image Analysis

Goal: Understand deep learning conceptually and build a cell image classifier using transfer learning. Know when DL is appropriate vs. overkill.
Learning: Deep Learning Demystified (1 hr)

Watch 3Blue1Brown Neural Networks videos 1-4 (~60 min). Best visual DL explanation available.

Architecture | Good For | Bio Example
CNN | Images, sequences | Cell classification, motif detection
Transformer | Sequences, language | AlphaFold, ESM-2, scGPT
GNN | Molecular graphs | Drug-target interaction
VAE | Generation, compression | Single-cell analysis, molecule design

When DL makes sense: large datasets (>10K labeled examples) or transfer learning from a pre-trained model, combined with complex inputs (images, sequences, graphs). When it doesn't: small tabular data (use XGBoost), cases where interpretability matters for regulatory review, or cases where a simple model already gets you 90% of the way there.

Project: Cell Phenotype Classifier

GitHub repo: cell-phenotype-classifier

Dataset: Broad Bioimage Benchmark Collection — BBBC021 (MCF-7 cells, compound-treated, fluorescence images with phenotype labels) or BBBC038 (nuclei segmentation).

What to build:

  1. Download cell images with treatment labels from BBBC
  2. Fine-tune a pre-trained CNN (ResNet18) on cell images
  3. Classify cells by treatment condition or phenotype
  4. GradCAM visualization — what does the model "see"?
  5. Compare: does the CNN find the same features a biologist would?
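For step 4, it helps to see how small the core Grad-CAM computation is: each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, and negative values are clipped to zero. The sketch below does this on tiny hand-written arrays. In a real project the activations and gradients would be captured from the last convolutional layer of the trained network, and the heatmap would be upsampled to the image size. Those steps are omitted here, and all names and values are illustrative.

```python
def grad_cam(activations, gradients):
    """Toy Grad-CAM: activations and gradients are K feature maps of size H x W."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = gradient of the class score, averaged over all positions.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for y in range(height):
            for x in range(width):
                heatmap[y][x] += weight * fmap[y][x]
    # ReLU: keep only regions that push the class score up.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Two 2x2 feature maps: channel 0 supports the class, channel 1 opposes it.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

The bottom-right position is strongly activated in channel 1, but that channel lowers the class score, so it is clipped out of the heatmap. Only evidence in favor of the predicted class lights up.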
Bio-Specific Prompts
Download BBBC021 dataset images. Load them with their treatment labels. Show 3 example images per treatment condition side by side. Resize to 224x224 for the model. Apply standard ImageNet normalization.
Fine-tune a pre-trained ResNet18 to classify cell images by treatment. Freeze all layers except the final FC layer. Use data augmentation (random flips, rotation, brightness). Train 10 epochs with early stopping on validation loss. Plot train/val accuracy and loss curves.
Apply GradCAM to the trained model. Show original image alongside GradCAM heatmap for 5 correct and 5 misclassified examples. What cellular features is the model focusing on? Are they biologically meaningful (nucleus shape, organelle distribution, cell morphology)?
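The fine-tuning prompt above asks for "early stopping on validation loss". A minimal, framework-independent sketch of that rule is shown below, assuming a patience of 3 epochs. The class name, its parameters, and the validation losses are hypothetical. In a real training loop, step would be called once per epoch with the measured validation loss, and the model weights from the best epoch would be saved.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses for a 10-epoch run.
val_losses = [0.90, 0.70, 0.62, 0.64, 0.63, 0.65, 0.61, 0.60, 0.59, 0.58]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch)  # 6 3
```

In this toy run, training stops at epoch 6 and keeps the epoch 3 weights, even though the loss would have improved again from epoch 7. That is the trade-off behind the patience setting: a small value saves compute but can stop during a temporary plateau.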
ML Concepts You'll Pick Up
  • Transfer learning in practice — A ResNet trained on natural images works for cells because low-level features (edges, textures) transfer across domains.
  • Fine-tuning — Freeze base, retrain top layers. This is how most biotech DL works in practice.
  • Explainability (GradCAM) — Seeing what the model "sees" builds trust and reveals biology.
  • Data augmentation — Artificially expanding your dataset with transformations. Essential when labeled bio images are scarce.
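To make the augmentation idea concrete, the sketch below applies a random flip, a random multiple-of-90-degree rotation, and a brightness change to a tiny grayscale "image" stored as nested lists with intensities in [0, 1]. It is a standard-library illustration only. Real pipelines would use an image library's transforms, and the function names and the 0.8 to 1.2 brightness range are assumptions.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, factor):
    """Scale pixel intensities and clip them to the [0, 1] range."""
    return [[min(1.0, max(0.0, px * factor)) for px in row] for row in image]


def augment(image, rng):
    """Apply a random flip, a random number of quarter turns, and brightness jitter."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    for _ in range(rng.randrange(4)):
        image = rotate_90(image)
    return adjust_brightness(image, rng.uniform(0.8, 1.2))


image = [[0.1, 0.2], [0.3, 0.4]]
rng = random.Random(0)  # seeded so runs are repeatable
variants = [augment(image, rng) for _ in range(5)]
print(rotate_90(image))  # [[0.3, 0.1], [0.4, 0.2]]
```

Flips and quarter turns are usually safe for microscopy because cells have no canonical orientation. Brightness changes deserve more care when absolute fluorescence intensity carries the biological signal.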

You should have 3 new GitHub repos and understand embeddings, transcriptomics ML, and DL fundamentals.