Module 5 — Weeks 11-12
From individual contributor to scientific data lead. Data strategy, a capstone pipeline, portfolio polish, and positioning yourself at the intersection of biology and AI.
GitHub repo: ml-ready-bio-data
What to build:
| Question | Why It Matters |
|---|---|
| What exactly are we predicting? | Vague labels produce useless models |
| How much labeled data do we have? | 100 samples won't train a neural network |
| What's the baseline? | If a simple rule gets 80%, you might not need ML |
| Are there batch effects? | Model might learn batch, not biology |
| What would we do differently if the model works? | No actionable outcome = no business value |
| How will we validate prospectively? | Retrospective CV is necessary but not sufficient |
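The baseline and batch-effect questions in the table can be answered with simple counting before any model is trained. The sketch below is illustrative only: the two-batch sample layout and the helper names `majority_baseline` and `batch_only_accuracy` are assumptions made for this example, not part of any library. It compares the accuracy of always guessing the most common label with the accuracy of a rule that sees nothing but batch membership.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def batch_only_accuracy(batches, labels):
    """Accuracy of a rule that predicts each batch's most common label.

    This is the best score reachable from batch membership alone, so it
    indicates how much of the label is recoverable without any biology.
    """
    per_batch = {}
    for batch, label in zip(batches, labels):
        per_batch.setdefault(batch, Counter())[label] += 1
    correct = sum(max(c.values()) for c in per_batch.values())
    return correct / len(labels)


# Hypothetical design (an assumption for illustration): two batches
# with unequal treated/untreated splits.
design = {
    ("batch1", "treated"): 30,
    ("batch1", "untreated"): 20,
    ("batch2", "treated"): 20,
    ("batch2", "untreated"): 30,
}
batches, labels = [], []
for (batch, label), n in design.items():
    batches += [batch] * n
    labels += [label] * n

baseline = majority_baseline(labels)
batch_acc = batch_only_accuracy(batches, labels)
print(f"majority-class baseline: {baseline:.2f}")
print(f"batch-only rule:         {batch_acc:.2f}")
```

In this layout the batch-only rule scores 0.60 against a 0.50 majority baseline, so batch carries some label information but cannot explain a high accuracy on its own. A batch-only score close to 1.0 would instead mean that batch and condition are nearly indistinguishable, and that any model accuracy should be treated with suspicion until it is checked on batch-balanced data.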
Generate synthetic RNA-seq data: 100 samples, 1000 genes, 2 conditions (treated/untreated, 50 each). Add 50 differentially expressed genes with log2FC ~2. Then add a batch effect by assigning samples to batch 1 (30 treated, 20 untreated) and batch 2 (20 treated, 30 untreated), so that batch is partially confounded with condition. Train a Random Forest classifier and show it achieves >95% accuracy. Then check whether the model learned batch rather than treatment by examining feature importances and evaluating on batch-balanced data.

Apply ComBat batch correction to the synthetic data. Retrain the model. Show: (1) PCA before and after correction, colored by batch and by condition, (2) model performance before/after, (3) feature importances before/after. Does the model now learn biology instead of batch?

GitHub repo: compound-ml-pipeline
What to build: A pipeline combining skills from every module:
Write a GitHub profile README for a molecular biology scientist building at the intersection of wet lab and ML. Include: (1) a 2-sentence bio, (2) a "Featured Projects" section with 4 cards showing project name, one-line description, and tech stack badges (Python, scikit-learn, Streamlit, PyTorch, etc.), (3) a "Skills" section split into Wet Lab / Computational / Tools columns. Keep it clean and professional.

Draft a 600-word LinkedIn article: "What I Learned Building ML Projects as a Wet-Lab Scientist." Structure: (1) why I did this (the gap I saw), (2) the biggest surprise (feature engineering matters more than model choice for biological data), (3) where domain expertise beats algorithms (interpreting feature importance, catching batch effects, knowing which questions matter), (4) what I'd tell other scientists. Make it personal, specific, and link to 2-3 GitHub projects.

Write a 1-page proposal for an ML project at a pharma company: "Predicting Compound Hepatotoxicity Risk from Molecular Descriptors and In Vitro Assay Data." Include: Business Case (why), Data Available (what we have), Approach (how, including baseline and ML models), Expected Impact (what changes if it works), Timeline (4-6 weeks), Risks (batch effects, sample size, label quality). Keep it concise enough for a director to read in 3 minutes.

| Package | Purpose |
|---|---|
| pandas, numpy | Data manipulation |
| scikit-learn, xgboost | ML |
| matplotlib, seaborn, plotly | Viz |
| streamlit | Data apps |
| transformers | Bio foundation models |
| biopython | Sequence handling |
| pydeseq2 | Differential expression |
| umap-learn, shap | Viz + explainability |
| anthropic | Claude API |

| Project | Focus |
|---|---|
| bench2bytes-workspace | Dev setup |
| lab-data-processor | Plate QC + dose-response |
| gene-expression-explorer | RNA-seq EDA |
| toxicity-classifier | First ML model |
| compound-activity-predictor | Regression |
| protein-embedding-explorer | ESM-2 |
| immune-signature-classifier | Transcriptomics ML |
| cell-phenotype-classifier | Deep learning |
| drug-safety-dashboard | Streamlit app |
| biopaper-mining-tool | LLM extraction |
| ml-ready-bio-data | Data strategy |
| compound-ml-pipeline | Capstone |

Congratulations. You've built 12 projects, deployed 2 apps, and positioned yourself at the intersection of biology and AI.