Module 2 — Weeks 3-5

Data Science Foundations

Upgrade your Python from lab scripts to structured data science. Three projects covering EDA, classification, and regression — all with biological data and all vibe coded with Claude.

Week 3

Gene Expression Explorer — EDA & Visualization

Goal: Build your first data analysis project end-to-end using RNA-seq data. Learn the EDA workflow that ML teams expect before any modeling.
Project: Gene Expression Explorer

GitHub repo: gene-expression-explorer

Dataset: GEO — Pick an RNA-seq dataset relevant to your interests. Suggestions: IFN-stimulated immune cells, drug-treated hepatocytes, stress-response in yeast, or any treatment vs. control comparison. Or use DEE2 for pre-processed data. Pick a dataset that's not a conflict of interest with your day job.
If you'd rather work with synthetic data: ask Claude to generate a synthetic RNA-seq count matrix for you with a specified differential expression structure. Example prompt: "Generate a synthetic RNA-seq count matrix: 5000 genes × 12 samples (6 treated, 6 control). Use negative binomial counts with realistic mean-variance relationship. Spike in 200 truly differentially expressed genes (100 up, 100 down, log2FC between 1 and 4). Save as data/counts.csv with a separate data/sample_metadata.csv."
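For a sense of what that prompt might produce, here is a minimal NumPy sketch of a negative binomial count simulator. The function name `simulate_counts`, the log-normal baseline means, and the single shared dispersion of 0.2 are illustrative assumptions, not part of the course material; a more realistic simulator would vary dispersion per gene.

```python
import numpy as np

def simulate_counts(n_genes=5000, n_per_group=6, n_de=200, dispersion=0.2, seed=0):
    """Simulate a genes x samples count matrix with spiked-in DE genes (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Baseline mean per gene; the log-normal shape is an assumption for illustration.
    base_mean = rng.lognormal(mean=4.0, sigma=1.5, size=n_genes)
    # log2 fold changes between 1 and 4: first half of the DE genes up, second half down.
    log2fc = np.zeros(n_genes)
    magnitudes = rng.uniform(1.0, 4.0, size=n_de)
    log2fc[: n_de // 2] = magnitudes[: n_de // 2]
    log2fc[n_de // 2 : n_de] = -magnitudes[n_de // 2 :]
    control_mean = np.tile(base_mean[:, None], (1, n_per_group))
    treated_mean = control_mean * (2.0 ** log2fc)[:, None]
    mu = np.hstack([control_mean, treated_mean])
    # NumPy parameterizes the negative binomial by (n, p). With n = 1/dispersion and
    # p = n / (n + mu), the mean is mu and the variance is mu + dispersion * mu**2.
    n = 1.0 / dispersion
    p = n / (n + mu)
    counts = rng.negative_binomial(n, p)
    labels = np.array(["control"] * n_per_group + ["treated"] * n_per_group)
    return counts, labels, log2fc

counts, labels, true_log2fc = simulate_counts()
```

Because the true fold changes are returned, you can later check how many of the 200 spiked-in genes your analysis recovers.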

What to build:

  1. Load an RNA-seq count matrix into pandas, explore shape and structure
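  2. Filter low-expression genes, then normalize: log2(CPM+1) from raw counts, log2(TPM+1) if gene lengths are available, or DESeq2-style size factors
  3. Run a differential expression test (treated vs. control) and draw a volcano plot of log2 fold change vs. adjusted p-value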
  4. PCA plot colored by experimental condition
  5. Heatmap of top 50 most variable genes with hierarchical clustering
  6. Export summary table and publication-quality figures
  7. Write README explaining the biology, methods, and what you found
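Steps 1 and 2 can be sketched in a few lines of pandas. This is a minimal illustration that assumes a genes × samples DataFrame of raw counts and uses counts-per-million (CPM), which needs no gene lengths; the function name `filter_and_normalize` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def filter_and_normalize(counts: pd.DataFrame, min_mean_count: float = 10.0) -> pd.DataFrame:
    """Drop low-expression genes, then return log2(CPM + 1) (hypothetical helper)."""
    # Keep genes whose mean raw count across samples reaches the threshold.
    keep = counts.mean(axis=1) >= min_mean_count
    filtered = counts.loc[keep]
    # Scale each sample (column) by its library size to get counts per million.
    library_size = filtered.sum(axis=0)
    cpm = filtered.div(library_size, axis=1) * 1e6
    return np.log2(cpm + 1.0)

# Toy example: gene g2 has a mean count of 2.5 and is filtered out.
toy = pd.DataFrame({"s1": [100, 0, 900], "s2": [200, 5, 800]}, index=["g1", "g2", "g3"])
normalized = filter_and_normalize(toy)
```

Library sizes here are computed after filtering; computing them before filtering is also common, so state your choice in the README.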
Bio-Specific Prompts
Load the RNA-seq count matrix from [file]. Filter out genes with a mean count below 10 across all samples. Apply log2(CPM+1) normalization (or log2(TPM+1) if gene lengths are available). Show me the distribution of expression values before and after normalization as side-by-side histograms.
Create a volcano plot from this differential expression results table. Highlight genes with |log2FC| > 1 and padj < 0.05, using separate colors for up- and down-regulated genes. Label the top 10 genes by significance. Use a clean publication-style theme with thin axes. Save as a vector PDF and as a 300 DPI PNG.
Run PCA on this normalized expression matrix. Color samples by treatment condition. Add 95% confidence ellipses for each group. Show the variance explained by PC1 and PC2 in the axis labels. Does PC1 separate treated from untreated?
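The volcano-plot thresholds in the prompts above depend on adjusted p-values (padj). As a rough sketch of what sits behind that column, the code below applies the Benjamini-Hochberg correction and then assigns each gene to a volcano category. The function names and cutoff defaults are illustrative assumptions; dedicated tools such as DESeq2 compute padj for you.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def volcano_category(log2fc, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label each gene 'up', 'down', or 'ns' (not significant)."""
    log2fc = np.asarray(log2fc, dtype=float)
    padj = np.asarray(padj, dtype=float)
    significant = padj < padj_cutoff
    category = np.full(log2fc.shape, "ns", dtype=object)
    category[significant & (log2fc > lfc_cutoff)] = "up"
    category[significant & (log2fc < -lfc_cutoff)] = "down"
    return category

padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
labels = volcano_category([2.0, -3.0, 0.5, 2.0], [0.01, 0.001, 0.001, 0.2])
```

A gene needs both a large fold change and a small adjusted p-value to be highlighted, which is why the third and fourth toy genes end up as "ns".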
ML Concepts You'll Pick Up
  • Dimensionality reduction (PCA) — Compressing thousands of genes into 2-3 components. You'll see this constantly in ML.
  • Feature filtering — Removing low-information features (low-expression genes). ML engineers call this feature selection.
  • Data normalization — Making data comparable across samples. Same principle as housekeeping gene normalization.
  • EDA — The mandatory first step before any modeling. No ML engineer builds a model without exploring the data first.
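To make the dimensionality reduction idea concrete, here is a minimal PCA sketch using NumPy's singular value decomposition. It assumes a samples × genes matrix of already-normalized expression values; the function name `pca` is a hypothetical helper, and in a real project scikit-learn's `PCA` class is the more usual choice.

```python
import numpy as np

def pca(expr, n_components=2):
    """PCA via SVD on a samples x genes matrix (hypothetical helper)."""
    X = np.asarray(expr, dtype=float)
    # Center each gene (column) so components describe variation around the mean.
    X = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Sample coordinates on the principal components.
    scores = U[:, :n_components] * S[:n_components]
    # Fraction of total variance captured by each component.
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]
```

The second return value is what goes into axis labels such as "PC1 (62% variance)". Note that the sign of each component is arbitrary, so a flipped plot is not an error.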
Week 4

Toxicity Classifier — Your First ML Model

Goal: Build, train, and evaluate your first classification model. Understand the full ML workflow: split → train → evaluate → interpret.
Project: Compound Toxicity Classifier

GitHub repo: toxicity-classifier

Dataset: Therapeutics Data Commons — Use the hERG toxicity or hepatotoxicity ADMET datasets (pre-formatted for ML). Or use Open TG-GATEs toxicogenomics data (public, small-molecule).

What to build:

  1. Frame a biological question as classification: "Given molecular features, can we predict toxicity?"
  2. Proper train/test split (stratified — critical for imbalanced bio data)
  3. Train 3 models: Logistic Regression, Random Forest, XGBoost
  4. Evaluate with AUC-ROC, precision-recall curves, confusion matrix
  5. Feature importance — which molecular features predict toxicity?
  6. Biological interpretation in README
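The split, train, and evaluate steps above can be sketched with scikit-learn as follows. The data here is synthetic (`make_classification` with roughly 15% positives), standing in for a real toxicity table, and XGBoost is left out to keep the example dependency-light; treat the hyperparameters as placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a toxicity dataset (assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, weights=[0.85, 0.15], random_state=0
)

# stratify=y keeps the toxic/non-toxic ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score with predicted probabilities, not hard labels.
    auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

The test set is touched only once, at the end; any tuning should happen with cross-validation inside the training split.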
Bio-Specific Prompts
Split this toxicity dataset into 80/20 train/test with stratification on the toxicity label (the classes are imbalanced). Show me the class distribution in both splits to verify the stratification worked.
Train a Random Forest classifier to predict hepatotoxicity from molecular descriptors. Use 5-fold stratified cross-validation. Report AUC-ROC, precision, recall, and F1 for each fold. Plot the ROC curves overlaid on one figure.
Extract SHAP values from the best model on the test set. Show: (1) SHAP beeswarm summary plot, (2) top 20 features bar plot, (3) SHAP dependence plot for the top 3 features. For each top feature, explain what it might mean biologically.
ML Concepts You'll Pick Up
  • Classification workflow — The standard ML pipeline: data → split → train → evaluate → interpret.
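  • Class imbalance — In biological datasets, negatives often far outnumber positives, so naive accuracy is misleading. Report AUC-ROC alongside precision-recall curves, which are more informative when positives are rare.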
  • Feature importance — YOUR superpower. ML engineers see "feature_42 matters." You see "that's lipophilicity (logP), which makes sense because..."
  • Model comparison — Always start with a simple baseline. If logistic regression gets 0.85 AUC, the fancy model needs to beat that.
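The class imbalance point is easy to see with a toy example: a model that never predicts "toxic" scores 95% accuracy on a 95/5 dataset while ranking compounds no better than chance. The sketch below implements AUC-ROC directly from its pairwise definition; the `roc_auc` helper is written for illustration, and in practice you would call `sklearn.metrics.roc_auc_score`.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# 95 non-toxic compounds, 5 toxic ones, and a model that always outputs 0.
y_true = np.array([0] * 95 + [1] * 5)
always_negative = np.zeros(100)

predicted = (always_negative >= 0.5).astype(int)
accuracy = np.mean(predicted == y_true)   # high, despite finding no toxic compounds
auc = roc_auc(y_true, always_negative)    # chance level
```

The pairwise comparison is quadratic in dataset size, which is fine for teaching but not for large datasets.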
Week 5

Compound Activity Predictor — Regression & Feature Engineering

Goal: Move from categories to continuous values. Predict a compound's potency (pIC50) from molecular descriptors — the bread-and-butter regression task in early drug discovery.
Project: Compound Activity Predictor

GitHub repo: compound-activity-predictor

Dataset: Therapeutics Data Commons ADMET regression sets (e.g., Lipophilicity_AstraZeneca, Solubility_AqSolDB, Caco2_Wang) or pull a single ChEMBL target's bioactivity table via the ChEMBL API. Pick a target that's not a conflict of interest with your day job; a well-studied target such as CDK2, JAK1, or HSP90 from public ChEMBL data is a safe choice.
If you'd rather work with synthetic data: "Generate a synthetic compound bioactivity dataset of 500 compounds. Columns: compound_id, smiles (use rdkit to generate plausible drug-like SMILES), molecular_weight, logP, num_h_bond_donors, num_h_bond_acceptors, tpsa, num_rotatable_bonds, num_aromatic_rings, pIC50 (target value, generated as a noisy linear combination of logP, MW, and TPSA so there's signal to recover). Save to data/compound_activity.csv."

What to build:

  1. Load a dataset with a measured continuous endpoint (pIC50, logSolubility, logP, etc.)
  2. Engineer molecular descriptors:
    • Lipinski's Rule of 5 features: molecular weight (MW), logP, H-bond donors (HBD), H-bond acceptors (HBA)
    • TPSA, rotatable bonds, aromatic rings, fraction sp3
    • Morgan fingerprints (2048-bit) as a high-dimensional alternative
  3. Train regression models: Linear, Ridge, Random Forest, XGBoost
  4. Evaluate with R², MAE, RMSE + predicted vs. actual scatter
  5. Residual analysis — where does the model fail and why?
  6. Compare descriptor-based vs. fingerprint-based representations
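For the Rule of 5 features in step 2, the pass/fail logic is simple once descriptors exist. The sketch below assumes descriptors have already been computed (for example with RDKit) and stored in a dict; the key names mirror the column names used in this module, and the aspirin values are approximate figures typed in by hand for illustration.

```python
from typing import Mapping

# Lipinski's thresholds: MW <= 500, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10.
RO5_LIMITS = {
    "molecular_weight": 500.0,
    "logP": 5.0,
    "num_h_bond_donors": 5,
    "num_h_bond_acceptors": 10,
}

def passes_lipinski(descriptors: Mapping[str, float]) -> bool:
    """Return True only if all four Rule of 5 criteria are met (strict variant)."""
    return all(descriptors[name] <= limit for name, limit in RO5_LIMITS.items())

# Approximate, hand-entered values for aspirin (not computed here).
aspirin = {
    "molecular_weight": 180.16,
    "logP": 1.3,
    "num_h_bond_donors": 1,
    "num_h_bond_acceptors": 4,
}

# A hypothetical large, greasy compound that breaks two rules.
too_big = {
    "molecular_weight": 720.0,
    "logP": 6.2,
    "num_h_bond_donors": 3,
    "num_h_bond_acceptors": 9,
}
```

This strict version requires all four criteria; many practitioners allow one violation, so say which convention you used.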
Bio-Specific Prompts
Write `compute_compound_descriptors(smiles: str) -> dict` using RDKit. Return: molecular_weight, logP (Crippen), num_h_bond_donors, num_h_bond_acceptors, num_rotatable_bonds, tpsa, num_aromatic_rings, fraction_sp3, passes_lipinski (bool, all four Ro5 rules). Include type hints, raise ValueError for invalid SMILES, and provide a worked example using aspirin.
Train an XGBoost regressor to predict pIC50 from descriptors. Use shuffled 5-fold cross-validation with random_state=42 and collect the out-of-fold predictions. Plot predicted vs. actual with a diagonal reference line, and annotate R² and RMSE. Color points by molecular_weight. Add marginal histograms on each axis.
Analyze model residuals (actual - predicted). Plot residuals vs. each top-5 feature. Are there systematic patterns? Highlight regions where the model consistently over- or under-predicts. What chemistry might explain those failures (e.g., out-of-distribution scaffolds, large macrocycles)?
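Before plotting, a quick numeric check for systematic residual patterns is to bin a feature into quantiles and average the residuals in each bin. The sketch below is one simple way to do that with NumPy; the helper name `mean_residual_by_bin` and the toy data are illustrative assumptions.

```python
import numpy as np

def mean_residual_by_bin(feature, actual, predicted, n_bins=4):
    """Average residual (actual - predicted) within quantile bins of one feature."""
    feature = np.asarray(feature, dtype=float)
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    # Quantile edges give bins with roughly equal numbers of compounds.
    edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    # Map each value to its bin; the maximum value is clipped into the last bin.
    bin_index = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    means = np.array([residuals[bin_index == b].mean() for b in range(n_bins)])
    return edges, means

# Toy case: the model under-predicts only for the largest feature values.
feature = np.arange(100.0)
actual = np.where(feature >= 75, 1.0, 0.0)
predicted = np.zeros(100)
edges, means = mean_residual_by_bin(feature, actual, predicted)
```

A bin mean far from zero flags a region where the model is consistently off. With heavily tied feature values a bin can be empty, in which case its mean is NaN.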
ML Concepts You'll Pick Up
  • Regression vs. Classification — Predicting a number vs. a category. Different metrics (R² vs. AUC).
  • Feature engineering — Hand-crafting input features from raw data. This is where chemistry knowledge is critical.
  • Residual analysis — Understanding model failures reveals chemistry the model can't capture.
  • Representation choice — Hand-crafted descriptors (interpretable, low-dim) vs. fingerprints (high-dim, less interpretable). Different trade-offs.
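The regression metrics used this week are short enough to write out by hand, which helps when interpreting them. Below is a minimal NumPy sketch; the function name `regression_metrics` and the example pIC50 values are made up for illustration, and scikit-learn provides equivalent functions in `sklearn.metrics`.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Return R², MAE, and RMSE for paired actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    ss_res = np.sum(residuals**2)                    # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals**2)),
    }

# Made-up pIC50 values for four compounds.
metrics = regression_metrics([5.0, 6.0, 7.0, 8.0], [5.5, 6.0, 6.5, 8.0])
```

MAE and RMSE are in the same units as the target (here, pIC50 log units), while R² is unitless and can go negative when a model does worse than predicting the mean.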

You should have 3 new GitHub repos after this module.