Module 3: Bio-Specific ML | Bench to Bytes

Week 6

Protein Embeddings with ESM-2

Goal: Understand embeddings — the most important concept in modern bio-AI. Use pre-trained protein models to create meaningful sequence representations.

Concept: What Are Embeddings?

Embeddings convert biological sequences into coordinates in high-dimensional space where similar sequences are close together. Think flow cytometry: you project cells into multi-dimensional space based on marker expression, and clusters emerge. Embeddings do the same for sequences, but the "markers" are learned from millions of examples.

Model	Input	Why You Care
ESM-2 (Meta)	Protein sequences	Predict function, structure, interactions without wet lab
Nucleotide Transformer	DNA/RNA	Regulatory element prediction, variant effect
scGPT	Single-cell expression	Cell type classification, perturbation response
DNABERT-2	DNA sequences	Promoter prediction, variant effects

Project: Protein Function Clustering

GitHub repo: protein-embedding-explorer

Dataset: UniProt — Download sequences for a protein family. Good options: all human kinases (~500), all human GPCRs (~800), or any well-defined family with clear functional subgroups. 200-500 sequences is plenty. Avoid families that overlap with your day-job work if portfolio independence matters to you.

What to build:

Download protein sequences from UniProt with metadata
Generate ESM-2 embeddings using Hugging Face transformers
Visualize with UMAP — do functional subfamilies cluster?
Compare: ESM-2 embeddings vs. simple features (length, MW, amino acid composition)
Build nearest-neighbor search: "given this protein, find the most similar"
Interactive plotly visualization with hover labels
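
The nearest-neighbor step in the list above can be sketched with cosine similarity over the embedding matrix. This is a minimal illustration, assuming the embeddings are already in a NumPy array with one row per protein; the function name `nearest_neighbors` and the toy vectors are made up for the example.

```python
import numpy as np

def nearest_neighbors(embeddings, query_index, k=5):
    """Return (row index, cosine similarity) for the k rows closest to the query row."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # exclude the query protein itself
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy 2-D vectors standing in for real ESM-2 embeddings.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
hits = nearest_neighbors(toy, query_index=0, k=2)
print(hits)
```

For a few hundred proteins this brute-force comparison is fast enough; an approximate index only becomes worth considering at much larger scales.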

Bio-Specific Prompts

Write a script that downloads all human complement pathway proteins from UniProt using the REST API. Save sequences and metadata (protein name, gene name, function annotation, subcellular location) as a DataFrame.
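
A rough sketch of what such a script might look like, using only the standard library. The endpoint, the return-field names in `FIELDS`, and the example query string are assumptions to verify against the current UniProt REST documentation; the helper names are hypothetical. Only one page of results is fetched here, so larger families would need pagination.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"
# Assumed return-field names; check them against the UniProt documentation.
FIELDS = ["accession", "protein_name", "gene_names", "cc_function",
          "cc_subcellular_location", "sequence"]

def build_search_url(query, fields=FIELDS, size=500):
    """Build a TSV search URL for the given UniProt query string."""
    params = {"query": query, "format": "tsv",
              "fields": ",".join(fields), "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"

def parse_tsv(text):
    """Turn a TSV response body into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def fetch_records(query):
    """Download one page of results (requires network access)."""
    with urlopen(build_search_url(query)) as response:
        return parse_tsv(response.read().decode("utf-8"))

# Hypothetical query; adjust the keyword to match your protein family.
url = build_search_url(
    'organism_id:9606 AND reviewed:true AND keyword:"Complement pathway"'
)
print(url)
```

The list of dicts returned by `fetch_records` can be passed straight to `pandas.DataFrame(...)` to get the table the prompt asks for.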

Generate ESM-2 embeddings for this list of protein sequences using the facebook/esm2_t6_8M_UR50D model from Hugging Face. Use mean pooling over the sequence length dimension, excluding padding tokens so that shorter sequences in a batch are not diluted. Save the embedding matrix as a numpy array. Show a progress bar.
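
One way the embedding step could look is sketched below. It assumes the `transformers` and `torch` packages are installed and that the model is loaded through `AutoModel`; the function names and batch size are illustrative choices. The pooling here masks padding only, so the special start and end tokens still contribute to the average, which is a simplification.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, tokens, dim) float array; mask: (batch, tokens) of 0/1.
    """
    mask = mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def embed_sequences(sequences, model_name="facebook/esm2_t6_8M_UR50D", batch_size=8):
    """Return one mean-pooled embedding per sequence as a (n, dim) array."""
    # Imported here so the pooling helper can be tried without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    chunks = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        tokens = tokenizer(batch, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**tokens)
        chunks.append(masked_mean_pool(out.last_hidden_state.numpy(),
                                       tokens["attention_mask"].numpy()))
    return np.vstack(chunks)
```

Wrapping the `range(...)` in `tqdm` would add the progress bar, and `np.save("embeddings.npy", matrix)` would store the result (the file name is just an example).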

Run UMAP on these protein embeddings. Color by protein subfamily/function. Compare with UMAP on simple features (amino acid composition, length, MW, pI). Show both plots side by side. Which representation separates functional groups better?
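
For the baseline side of this comparison, the hand-crafted features can be computed directly from the sequence string. The sketch below covers length and amino acid composition only; molecular weight and pI are left out and could be added with a library such as Biopython. The function name and feature order are assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def simple_features(sequence):
    """Return [length, fraction of each standard amino acid] for one protein."""
    seq = sequence.upper()
    counts = Counter(seq)
    length = len(seq)
    composition = [counts[aa] / length if length else 0.0 for aa in AMINO_ACIDS]
    return [float(length)] + composition

features = simple_features("MKKLLA")
print(len(features), features[0])
```

Because length is on a very different scale from the composition fractions, standardizing the columns before running UMAP is usually a sensible precaution.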

ML Concepts You'll Pick Up

Embeddings: The representation behind most modern bio-AI. Sequences become vectors of numbers that capture biological meaning.
Foundation models: Pre-trained on massive data, then adapted to specific tasks. You don't train them from scratch; you reuse them, sometimes with light fine-tuning.
Transfer learning: ESM-2 learned from millions of proteins. Your 200 sequences benefit from all that knowledge.
Representation learning: Learned features (embeddings) vs. hand-crafted features. The core question in modern ML.

Week 7

Immune Gene Signature Discovery

Goal: Apply ML to transcriptomics. Discover gene signatures that classify immune states — directly building on your published IFN research.

Project: Immune Gene Signature Classifier

GitHub repo: immune-signature-classifier

Dataset: GEO — Search for interferon stimulation RNA-seq datasets (IFN-treated vs. untreated across cell types). Or use DEE2 for uniformly processed data.

What to build:

Download bulk RNA-seq with IFN treatment conditions
Traditional DEG analysis using pydeseq2
ML classification: can a model distinguish IFN-treated from untreated?
Compare ML feature importance vs. known ISGs (interferon-stimulated genes)
Does the model discover non-obvious genes? — YOUR biological insight shines here
Venn diagram: DEGs vs. ML-selected features vs. known ISG list
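
The three-way comparison in the last item reduces to set arithmetic. Here is a small sketch that counts the seven Venn regions; the function name and the gene lists are made up for illustration and are not real results.

```python
def venn3_regions(degs, ml_features, known_isgs):
    """Count genes in each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(degs), set(ml_features), set(known_isgs)
    return {
        "DEG only": len(a - b - c),
        "ML only": len(b - a - c),
        "ISG only": len(c - a - b),
        "DEG+ML": len((a & b) - c),
        "DEG+ISG": len((a & c) - b),
        "ML+ISG": len((b & c) - a),
        "all three": len(a & b & c),
    }

# Made-up gene lists, for illustration only.
regions = venn3_regions(
    degs=["MX1", "IFIT1", "ISG15", "GENE_A"],
    ml_features=["MX1", "ISG15", "GENE_B"],
    known_isgs=["MX1", "IFIT1", "ISG15", "OAS1"],
)
print(regions)
```

The "ML only" region is the interesting one: those are the candidates for non-obvious genes that deserve a closer biological look.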

Bio-Specific Prompts

Run differential expression analysis using pydeseq2 comparing IFN-treated vs. untreated samples. Then train a Random Forest classifier on the top 500 most variable genes. Compare the top 20 DEGs (ranked by adjusted p-value) with the top 20 features (ranked by RF importance). Show overlap as a Venn diagram. How many genes appear in both lists?
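
The "most variable genes" filter in this prompt can be sketched in a few lines. The example assumes a samples-by-genes NumPy array of normalized, log-transformed expression; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def top_variable_genes(expression, gene_names, n_top=500):
    """Return the names of the n_top genes with the highest variance across samples.

    expression: (samples, genes) array, ideally log-transformed normalized counts.
    """
    variances = expression.var(axis=0)
    order = np.argsort(-variances)[:n_top]  # indices sorted by descending variance
    return [gene_names[i] for i in order]

# Toy matrix: 3 samples x 3 genes.
toy = np.array([[1.0, 5.0, 2.0],
                [1.0, 9.0, 4.0],
                [1.0, 1.0, 6.0]])
selected = top_variable_genes(toy, ["A", "B", "C"], n_top=2)
print(selected)
```

Variance filtering ignores the treatment labels, so it does not by itself leak label information into the classifier, but raw counts should be normalized first or highly expressed genes will dominate.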

Create a clustermap of the top 30 ML-selected genes across all samples. Annotate rows: green if the gene is a known ISG from this list [MX1, IFIT1, ISG15, OAS1, STAT1, IRF7, IFITM1, BST2, IFI44, RSAD2], red if not. Are there genes the model finds important that aren't canonical ISGs?
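
The row annotation in this prompt is a simple lookup against the ISG list. A minimal sketch follows, with a hypothetical helper name; the resulting list is assumed to be passed to seaborn's `clustermap` through its `row_colors` argument.

```python
# Known ISGs taken from the prompt above.
KNOWN_ISGS = {"MX1", "IFIT1", "ISG15", "OAS1", "STAT1",
              "IRF7", "IFITM1", "BST2", "IFI44", "RSAD2"}

def isg_row_colors(genes, known_isgs=KNOWN_ISGS):
    """Return 'green' for known ISGs and 'red' for everything else, in gene order."""
    return ["green" if gene.upper() in known_isgs else "red" for gene in genes]

# ACTB is used here only as an example of a gene outside the ISG list.
colors = isg_row_colors(["MX1", "ACTB", "isg15"])
print(colors)
```

The colors must be in the same order as the rows of the matrix being plotted, so build the list from the same gene index.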

ML Concepts You'll Pick Up

ML vs. traditional statistics — DEGs and feature importance answer different questions. They're complementary.
High-dimensional data — Thousands of genes, few samples. ML handles this with regularization and feature selection.
Biological validation — Checking if ML results make biological sense. Your domain expertise IS the validation.

Week 8

Deep Learning & Cell Image Analysis

Goal: Understand deep learning conceptually and build a cell image classifier using transfer learning. Know when DL is appropriate vs. overkill.

Learning: Deep Learning Demystified (1 hr)

Watch 3Blue1Brown Neural Networks videos 1-4 (~60 min). Best visual DL explanation available.

Architecture	Good For	Bio Example
CNN	Images, sequences	Cell classification, motif detection
Transformer	Sequences, language	AlphaFold, ESM-2, scGPT
GNN	Molecular graphs	Drug-target interaction
VAE	Generation, compression	Single-cell analysis, molecule design

When DL makes sense: large datasets (more than ~10K labeled examples) or transfer learning from a pre-trained model, combined with complex inputs (images, sequences, graphs). When it doesn't: small tabular data (use XGBoost), cases where interpretability matters for regulatory review, or when a simple model gets you 90% of the way there.

Project: Cell Phenotype Classifier

GitHub repo: cell-phenotype-classifier

Dataset: Broad Bioimage Benchmark Collection — BBBC021 (MCF-7 cells, compound-treated, fluorescence images with phenotype labels) or BBBC038 (nuclei segmentation).

What to build:

Download cell images with treatment labels from BBBC
Fine-tune a pre-trained CNN (ResNet18) on cell images
Classify cells by treatment condition or phenotype
GradCAM visualization — what does the model "see"?
Compare: does the CNN find the same features a biologist would?

Bio-Specific Prompts

Download BBBC021 dataset images. Load them with their treatment labels. Show 3 example images per treatment condition side by side. Resize to 224x224 for the model. Apply standard ImageNet normalization.
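
"Standard ImageNet normalization" means scaling pixels to [0, 1] and then standardizing each channel with the ImageNet mean and standard deviation. The sketch below assumes the image has already been converted to an 8-bit, 3-channel array; microscopy files are often higher bit depth and stored one stain per file, so that conversion is a separate step not shown here.

```python
import numpy as np

# Per-channel statistics conventionally used with ImageNet-pretrained models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def imagenet_normalize(image_uint8):
    """Scale an (H, W, 3) uint8 image to [0, 1], then standardize each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# A uniform mid-gray image standing in for a resized 224x224 cell image.
gray = np.full((224, 224, 3), 128, dtype=np.uint8)
normalized = imagenet_normalize(gray)
print(normalized.shape)
```

In a PyTorch pipeline the same effect is typically achieved with torchvision's `transforms.Normalize` using these mean and std values.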

Fine-tune a pre-trained ResNet18 to classify cell images by treatment. Freeze all layers except the final FC layer. Use data augmentation (random flips, rotation, brightness). Train for up to 10 epochs with early stopping on validation loss. Plot train/val accuracy and loss curves.
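
Two pieces of this prompt are easy to sketch: freezing everything but the final layer, and stopping when validation loss stalls. The code below is an illustration under assumptions: it uses the torchvision weights API available in recent releases (0.13 and later), and the `EarlyStopping` class, the `patience` value, and the loss numbers are all made up for the example. The training loop itself is not shown.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def build_frozen_resnet18(n_classes):
    """ImageNet-pretrained ResNet18 with only a new final FC layer left trainable."""
    # Imported here so the early-stopping helper works without PyTorch installed.
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():
        param.requires_grad = False  # freeze the pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, n_classes)  # new layer is trainable
    return model


# Made-up validation losses: improvement stalls after the third epoch.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.90, 0.70, 0.65, 0.66, 0.67, 0.64], start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

When training, pass only `model.fc.parameters()` to the optimizer, and save a checkpoint whenever `stopper.best` improves so the best epoch can be restored.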

Apply GradCAM to the trained model. Show original image alongside GradCAM heatmap for 5 correct and 5 misclassified examples. What cellular features is the model focusing on? Are they biologically meaningful (nucleus shape, organelle distribution, cell morphology)?
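
The core Grad-CAM computation is short: average the gradients of the class score over each feature map to get channel weights, take the weighted sum of the feature maps, and keep only the positive part. The sketch below assumes the activations and gradients of one convolutional layer have already been captured for a single image (for example with PyTorch hooks, not shown); the function name and toy arrays are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap for one image.

    activations, gradients: (channels, H, W) arrays from the same conv layer,
    where gradients are d(class score) / d(activations).
    """
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum, shape (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.array([np.ones((2, 2)), -np.ones((2, 2))])
heatmap = grad_cam(acts, grads)
print(heatmap)
```

The heatmap has the spatial size of the chosen layer, so it needs to be upsampled to the input resolution before being overlaid on the original image.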

ML Concepts You'll Pick Up

Transfer learning in practice — A ResNet trained on natural images works for cells because low-level features (edges, textures) transfer across domains.
Fine-tuning — Freeze base, retrain top layers. This is how most biotech DL works in practice.
Explainability (GradCAM) — Seeing what the model "sees" builds trust and reveals biology.
Data augmentation — Artificially expanding your dataset with transformations. Essential when labeled bio images are scarce.