Module 2: Data Science Foundations

Week 3

Gene Expression Explorer — EDA & Visualization

Goal: Build your first data analysis project end-to-end using RNA-seq data. Learn the EDA workflow that ML teams expect before any modeling.

Project: Gene Expression Explorer

GitHub repo: gene-expression-explorer

Dataset: GEO — Pick an RNA-seq dataset that is relevant to your interests and poses no conflict of interest with your day job. Suggestions: IFN-stimulated immune cells, drug-treated hepatocytes, stress response in yeast, or any treatment vs. control comparison. Alternatively, use DEE2 for pre-processed data.

If you'd rather work with synthetic data: ask Claude to generate a synthetic RNA-seq count matrix for you with a specified differential expression structure. Example prompt: "Generate a synthetic RNA-seq count matrix: 5000 genes × 12 samples (6 treated, 6 control). Use negative binomial counts with a realistic mean-variance relationship. Spike in 200 truly differentially expressed genes (100 up, 100 down, |log2FC| between 1 and 4). Save as data/counts.csv with a separate data/sample_metadata.csv."

What to build:

Load an RNA-seq count matrix into pandas, explore shape and structure
Filter low-expression genes, then normalize: log2(TPM+1) if gene lengths are available, log2(CPM+1) otherwise, or DESeq2-style size factors
Volcano plot of differential expression
PCA plot colored by experimental condition
Heatmap of top 50 most variable genes with hierarchical clustering
Export summary table and publication-quality figures
Write README explaining the biology, methods, and what you found
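
The filtering, normalization, and variable-gene steps in the list above can be sketched with NumPy alone. This is a minimal sketch on a synthetic count matrix, not a prescribed pipeline: the matrix dimensions, the mean-count cutoff of 10, and the use of log2(CPM+1) (chosen here because TPM needs gene lengths) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a count matrix: genes as rows, samples as columns.
# Real data would be loaded from a file such as data/counts.csv instead.
n_genes, n_samples = 1000, 12
gene_means = rng.lognormal(mean=3.0, sigma=2.0, size=n_genes)
dispersion_n = 5  # negative binomial "size" parameter (assumed value)
p = dispersion_n / (dispersion_n + gene_means[:, None])
counts = rng.negative_binomial(dispersion_n, p, size=(n_genes, n_samples))

# 1. Filter: keep genes whose mean raw count across samples is at least 10.
keep = counts.mean(axis=1) >= 10
filtered = counts[keep]

# 2. Normalize: counts per million (CPM) within each sample, then log2(CPM + 1).
library_sizes = filtered.sum(axis=0)
cpm = filtered / library_sizes * 1e6
log_cpm = np.log2(cpm + 1)

# 3. Rank genes by variance across samples and keep the 50 most variable.
top_n = 50
top_idx = np.argsort(log_cpm.var(axis=1))[::-1][:top_n]
top_variable = log_cpm[top_idx]

print(f"kept {filtered.shape[0]} of {n_genes} genes")
print(f"heatmap input shape: {top_variable.shape}")
```

The `top_variable` matrix is the kind of input a clustered heatmap would take. Library sizes are computed here after filtering; computing them on the unfiltered counts is an equally common choice.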

Bio-Specific Prompts

Load the RNA-seq count matrix from [file]. Filter out genes with mean counts < 10 across all samples. Apply log2(TPM+1) normalization (use log2(CPM+1) if gene lengths are not available). Show me the distribution of expression values before and after normalization as side-by-side histograms.

Create a volcano plot from this differential expression results table. Highlight genes with |log2FC| > 1 and padj < 0.05, using separate colors for up- and down-regulated genes. Label the top 10 genes by significance. Use a clean publication-style theme with thin axes. Save as a vector PDF, plus a 300 DPI PNG.
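
For reference, the thresholding this prompt asks for amounts to a few lines of NumPy. The sketch below is illustrative only: `classify_volcano` and `top_by_significance` are hypothetical helper names, and the five-gene table is made up.

```python
import numpy as np

def classify_volcano(log2fc, padj, fc_cutoff=1.0, padj_cutoff=0.05):
    """Label each gene 'up', 'down', or 'ns' (not significant)."""
    log2fc = np.asarray(log2fc, dtype=float)
    padj = np.asarray(padj, dtype=float)
    # A NaN padj compares as False, so such genes stay 'ns'.
    significant = padj < padj_cutoff
    labels = np.full(log2fc.shape, "ns", dtype=object)
    labels[significant & (log2fc > fc_cutoff)] = "up"
    labels[significant & (log2fc < -fc_cutoff)] = "down"
    return labels

def top_by_significance(padj, k=10):
    """Indices of the k genes with the smallest adjusted p-values."""
    padj = np.asarray(padj, dtype=float)
    return np.argsort(padj)[:k]  # np.argsort places NaN values last

# Made-up results table with five genes.
log2fc = np.array([2.5, -1.8, 0.3, 1.2, -3.0])
padj = np.array([0.001, 0.01, 0.0001, 0.2, np.nan])

labels = classify_volcano(log2fc, padj)
y_axis = -np.log10(padj)          # the usual volcano-plot y-axis
to_label = top_by_significance(padj, k=2)

print(list(labels))    # ['up', 'down', 'ns', 'ns', 'ns']
print(list(to_label))  # [2, 0]
```

The third gene has the smallest padj but a small fold change, so it is not highlighted; both thresholds must hold.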

Run PCA on this normalized expression matrix. Color samples by treatment condition. Add 95% confidence ellipses for each group. Show the variance explained by PC1 and PC2 in the axis labels. Does PC1 separate treated from untreated?
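
To make the "variance explained" part of this prompt concrete, here is a minimal sketch of PCA via singular value decomposition. It assumes samples as rows and genes as columns (transpose a genes × samples matrix first). The data are synthetic with a hypothetical treatment effect on 50 genes, and `pca` is an illustrative helper; scikit-learn's `PCA` class would be the usual choice in a real project.

```python
import numpy as np

def pca(expr, n_components=2):
    """PCA via SVD. expr is a samples x genes matrix of normalized expression."""
    centered = expr - expr.mean(axis=0)  # center each gene across samples
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # sample coordinates
    var_explained = s ** 2 / np.sum(s ** 2)          # fraction per component
    return scores, var_explained[:n_components]

rng = np.random.default_rng(1)
n_per_group, n_genes = 6, 500
control = rng.normal(5.0, 1.0, size=(n_per_group, n_genes))
treated = rng.normal(5.0, 1.0, size=(n_per_group, n_genes))
treated[:, :50] += 3.0  # hypothetical treatment effect on the first 50 genes

expr = np.vstack([control, treated])
scores, var_explained = pca(expr)

xlabel = f"PC1 ({var_explained[0]:.1%} variance)"
ylabel = f"PC2 ({var_explained[1]:.1%} variance)"
print(xlabel, "|", ylabel)
print("control PC1:", np.round(scores[:6, 0], 1))
print("treated PC1:", np.round(scores[6:, 0], 1))
```

If the two groups land on opposite sides of zero along PC1, the treatment is the dominant source of variation. The sign of a principal component is arbitrary, so which group is positive carries no meaning.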

ML Concepts You'll Pick Up

Dimensionality reduction (PCA) — Compressing thousands of genes into 2-3 components. You'll see this constantly in ML.
Feature filtering — Removing low-information features (low-expression genes). ML engineers call this feature selection.
Data normalization — Making data comparable across samples. Same principle as housekeeping gene normalization.
EDA — The mandatory first step before any modeling. No ML engineer builds a model without exploring the data first.

Week 4

Toxicity Classifier — Your First ML Model

Goal: Build, train, and evaluate your first classification model. Understand the full ML workflow: split → train → evaluate → interpret.

Project: Compound Toxicity Classifier

GitHub repo: toxicity-classifier

Dataset: Therapeutics Data Commons — Use the hERG toxicity or hepatotoxicity ADMET datasets (pre-formatted for ML). Or use Open TG-GATEs toxicogenomics data (public, small-molecule).

What to build:

Frame a biological question as classification: "Given molecular features, can we predict toxicity?"
Proper train/test split (stratified — critical for imbalanced bio data)
Train 3 models: Logistic Regression, Random Forest, XGBoost
Evaluate with AUC-ROC, precision-recall curves, confusion matrix
Feature importance — which molecular features predict toxicity?
Biological interpretation in README

Bio-Specific Prompts

Split this toxicity dataset into 80/20 train/test with stratification on the toxicity label (the classes are imbalanced). Show me the class distribution in both splits to verify the stratification worked.
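
What stratification does can be shown in a few lines. The sketch below is a hand-rolled illustration on made-up labels (100 toxic, 400 non-toxic); `stratified_split` is a hypothetical helper. In practice, scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import numpy as np

def stratified_split(y, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) that preserve the class proportions of y."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # positions of this class
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_fraction))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.sort(np.concatenate(train_parts)), np.sort(np.concatenate(test_parts))

# Made-up imbalanced labels: 20% toxic (1), 80% non-toxic (0).
y = np.array([1] * 100 + [0] * 400)
train_idx, test_idx = stratified_split(y)

# Verify the split: both subsets should keep the 20% positive rate.
print(len(train_idx), len(test_idx))           # 400 100
print(y[train_idx].mean(), y[test_idx].mean()) # 0.2 0.2
```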

Train a Random Forest classifier to predict hepatotoxicity from molecular descriptors. Use 5-fold stratified cross-validation. Report AUC-ROC, precision, recall, and F1 for each fold. Plot the ROC curves overlaid on one figure.

Extract SHAP values from the best model on the test set. Show: (1) SHAP beeswarm summary plot, (2) top 20 features bar plot, (3) SHAP dependence plot for the top 3 features. For each top feature, explain what it might mean biologically.

ML Concepts You'll Pick Up

Classification workflow — The standard ML pipeline: data → split → train → evaluate → interpret.
Class imbalance — In biology, negatives often far outnumber positives, so naive accuracy is misleading. Use AUC-ROC, and check precision-recall curves when positives are rare.
Feature importance — YOUR superpower. ML engineers see "feature_42 matters." You see "that's logP, which makes sense because..."
Model comparison — Always start with a simple baseline. If logistic regression gets 0.85 AUC, the fancy model needs to beat that.
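
The class-imbalance point is easy to verify numerically. In this minimal sketch, a "model" that never predicts toxicity scores 95% accuracy on made-up labels with 5% positives, yet its AUC-ROC is 0.5 (no better than chance). The `roc_auc` helper is an illustrative pairwise implementation for small arrays; scikit-learn's `roc_auc_score` is the practical choice.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    # Compare every positive with every negative; ties count as half a win.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Made-up labels: 95 non-toxic (0), 5 toxic (1).
y_true = np.array([0] * 95 + [1] * 5)

# A useless model: the same score (0.0) for every compound.
always_negative = np.zeros(100)
predicted_toxic = always_negative >= 0.5

accuracy = np.mean(predicted_toxic == y_true.astype(bool))
auc = roc_auc(y_true, always_negative)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"AUC-ROC  = {auc:.2f}")       # AUC-ROC  = 0.50
```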

Week 5

Compound Activity Predictor — Regression & Feature Engineering

Goal: Move from categories to continuous values. Predict a compound's potency (pIC50) from molecular descriptors — the bread-and-butter regression task in early drug discovery.

Project: Compound Activity Predictor

GitHub repo: compound-activity-predictor

Dataset: Therapeutics Data Commons ADMET regression sets (e.g., Lipophilicity_AstraZeneca, Solubility_AqSolDB, Caco2_Wang), or pull a single ChEMBL target's bioactivity table via the ChEMBL API. Choose a well-studied target from public ChEMBL data, such as CDK2, JAK1, or HSP90, that poses no conflict of interest with your day job.

If you'd rather work with synthetic data, ask Claude to generate it. Example prompt: "Generate a synthetic compound bioactivity dataset of 500 compounds. Columns: compound_id, smiles (plausible drug-like SMILES, validated with RDKit), molecular_weight, logP, num_h_bond_donors, num_h_bond_acceptors, tpsa, num_rotatable_bonds, num_aromatic_rings, pIC50 (target value, generated as a noisy linear combination of logP, MW, and TPSA so there's signal to recover). Save to data/compound_activity.csv."

What to build:

Load a dataset with a measured continuous endpoint (pIC50, logSolubility, logP, etc.)
Engineer molecular descriptors:
- Lipinski's Rule of 5 features: MW, logP, HBD, HBA
- TPSA, rotatable bonds, aromatic rings, fraction sp3
- Morgan fingerprints (2048-bit) as a high-dimensional alternative
Train regression models: Linear, Ridge, Random Forest, XGBoost
Evaluate with R², MAE, RMSE + predicted vs. actual scatter
Residual analysis — where does the model fail and why?
Compare descriptor-based vs. fingerprint-based representations

Bio-Specific Prompts

Write `compute_compound_descriptors(smiles: str) -> dict` using RDKit. Return: molecular_weight, logP (Crippen), num_h_bond_donors, num_h_bond_acceptors, num_rotatable_bonds, tpsa, num_aromatic_rings, fraction_sp3, passes_lipinski (bool, all four Ro5 rules). Include type hints, raise ValueError for invalid SMILES, and provide a worked example using aspirin.
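
The `passes_lipinski` flag in this prompt is plain threshold logic once the descriptors exist. The sketch below assumes the descriptors have already been computed and uses no RDKit. The aspirin values are approximate and hand-entered, the second compound is hypothetical, and `count_ro5_violations` is an illustrative helper name.

```python
def count_ro5_violations(d: dict) -> int:
    """Count how many of the four Rule-of-5 criteria a compound violates."""
    return sum([
        d["molecular_weight"] > 500,
        d["logP"] > 5,
        d["num_h_bond_donors"] > 5,
        d["num_h_bond_acceptors"] > 10,
    ])

def passes_lipinski(d: dict) -> bool:
    """True only if all four criteria hold (strict reading: no violations)."""
    return count_ro5_violations(d) == 0

# Approximate, hand-entered values for aspirin (not computed here).
aspirin = {
    "molecular_weight": 180.16,
    "logP": 1.3,
    "num_h_bond_donors": 1,
    "num_h_bond_acceptors": 4,
}

# Hypothetical large, lipophilic compound that breaks every rule.
big_greasy = {
    "molecular_weight": 720.0,
    "logP": 6.2,
    "num_h_bond_donors": 6,
    "num_h_bond_acceptors": 12,
}

print(passes_lipinski(aspirin))          # True
print(count_ro5_violations(big_greasy))  # 4
```

Lipinski's original formulation tolerates one violation; the strict version here follows the prompt's "all four Ro5 rules" wording.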

Train an XGBoost regressor to predict pIC50 from descriptors. Use 5-fold cross-validation with random_state=42. Plot predicted vs. actual with a diagonal reference line, R² and RMSE annotated. Color points by molecular_weight. Add marginal histograms on each axis.
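
The evaluation half of this prompt (5-fold cross-validation, then R², MAE, and RMSE on out-of-fold predictions) can be sketched without any ML library. Note the swap: ordinary least squares stands in for XGBoost here so the example stays self-contained. The three descriptor columns and the pIC50 values are synthetic, and the helper names are illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, MAE, and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

def cross_val_predict_linear(X, y, k=5, seed=42):
    """Out-of-fold predictions from least squares with an intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    y_pred = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False  # hold this fold out of training
        coef, *_ = np.linalg.lstsq(X1[train_mask], y[train_mask], rcond=None)
        y_pred[test_idx] = X1[test_idx] @ coef
    return y_pred

# Synthetic data: 200 compounds, 3 standardized descriptors, noisy linear pIC50.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 6.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(0.0, 0.3, size=200)

y_pred = cross_val_predict_linear(X, y)
metrics = regression_metrics(y, y_pred)
print({name: round(float(value), 3) for name, value in metrics.items()})
```

Because every prediction comes from a model that never saw that compound, these metrics estimate performance on new data rather than fit quality on the training set.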

Analyze model residuals (actual - predicted). Plot residuals vs. each top-5 feature. Are there systematic patterns? Highlight regions where the model consistently over- or under-predicts. What chemistry might explain those failures (e.g., out-of-distribution scaffolds, large macrocycles)?
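
One simple way to surface systematic patterns is to average the residuals within bins of a feature. The sketch below is illustrative: it fits a straight line to synthetic data with a deliberately quadratic relationship, so the binned residuals show the U-shape that signals a missed nonlinearity. `mean_residual_by_bin` is a hypothetical helper.

```python
import numpy as np

def mean_residual_by_bin(feature, residuals, n_bins=5):
    """Mean residual within equal-frequency bins of one feature."""
    feature = np.asarray(feature, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    # Map each point to a bin index in [0, n_bins - 1].
    bins = np.searchsorted(edges, feature, side="right") - 1
    bins = np.clip(bins, 0, n_bins - 1)
    means = np.array([residuals[bins == b].mean() for b in range(n_bins)])
    return edges, means

# Synthetic data: the true relationship is quadratic, the model is linear.
rng = np.random.default_rng(7)
x = rng.uniform(-2.0, 2.0, size=500)
y = x ** 2 + rng.normal(0.0, 0.1, size=500)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)  # actual - predicted

edges, bin_means = mean_residual_by_bin(x, residuals)
for lo, hi, m in zip(edges[:-1], edges[1:], bin_means):
    direction = "under-predicts" if m > 0 else "over-predicts"
    print(f"x in [{lo:+.2f}, {hi:+.2f}]: mean residual {m:+.2f} ({direction})")
```

With residuals defined as actual minus predicted, a positive bin mean means the model under-predicts in that region. Random scatter around zero in every bin is the pattern to hope for.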

ML Concepts You'll Pick Up

Regression vs. Classification — Predicting a number vs. a category. Different metrics (R² vs. AUC).
Feature engineering — Hand-crafting input features from raw data. This is where chemistry knowledge is critical.
Residual analysis — Understanding model failures reveals chemistry the model can't capture.
Representation choice — Hand-crafted descriptors (interpretable, low-dim) vs. fingerprints (high-dim, less interpretable). Different trade-offs.