Module 5 — Weeks 11-12

Leadership & Portfolio Launch

From individual contributor to scientific data lead. Data strategy, a capstone pipeline, portfolio polish, and positioning yourself at the intersection of biology and AI.

Week 11

ML-Ready Data Strategy

Goal: Create a practical guide for generating ML-ready biological data. This knowledge is valuable because many ML projects in biotech fail on bad data, not bad models.
Project: ML-Ready Data Playbook

GitHub repo: ml-ready-bio-data

What to build:

  1. Written Guide — A document covering:
    • Sample size planning for ML (not the same as traditional power analysis)
    • Label quality — your assay readout IS the ML label
    • Batch effects — the #1 confounder in biological ML
    • Feature richness — multi-modal data often beats more samples of one type
    • Metadata standards — FAIR principles in practice
    • Data format — tidy format, stable gene identifiers (e.g., Ensembl or Entrez IDs) rather than gene symbols, which get renamed
  2. Demo Notebook: How Batch Effects Corrupt ML Models
    • Generate synthetic gene expression data with a real biological signal
    • Add batch effects confounded with the experimental condition
    • Show that the ML model learns batch, not biology
    • Apply ComBat batch correction and show the difference
  3. Checklist Template for wet-lab teams generating data for ML downstream
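The core of the demo notebook can be sketched in a few lines. This is a minimal illustration, not the full notebook: the gene counts, effect sizes, and the `simulate` helper are all assumptions chosen to make the failure obvious. The training data is fully confounded (every treated sample is in batch 1), and the model is then scored on data where that pairing is flipped.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_genes, n_signal, n_batch_genes = 200, 10, 50  # illustrative sizes (assumed)


def simulate(cells, rng):
    """cells maps (condition, batch) -> sample count; returns X, y, batch."""
    X, y, b = [], [], []
    for (cond, batch), n in cells.items():
        expr = rng.normal(0.0, 1.0, size=(n, n_genes))
        # Weak but real biological signal in the first n_signal genes
        expr[:, :n_signal] += 1.0 * cond
        # Strong technical shift in the next n_batch_genes genes
        expr[:, n_signal:n_signal + n_batch_genes] += 4.0 * batch
        X.append(expr)
        y += [cond] * n
        b += [batch] * n
    return np.vstack(X), np.array(y), np.array(b)


# Training set: treated samples all in batch 1, untreated all in batch 0
X_train, y_train, _ = simulate({(1, 1): 50, (0, 0): 50}, rng)
# Test set with the same confounding (what naive validation sees)
X_same, y_same, _ = simulate({(1, 1): 50, (0, 0): 50}, rng)
# Test set with the pairing flipped (what a new batch might look like)
X_flip, y_flip, _ = simulate({(1, 0): 50, (0, 1): 50}, rng)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

acc_same = model.score(X_same, y_same)
acc_flip = model.score(X_flip, y_flip)
batch_importance = model.feature_importances_[
    n_signal:n_signal + n_batch_genes
].sum()

print(f"accuracy, same confounding:    {acc_same:.2f}")
print(f"accuracy, flipped confounding: {acc_flip:.2f}")
print(f"share of importance on batch-only genes: {batch_importance:.2f}")
```

The gap between the two accuracies is the point of the demo: a model that looks excellent under the original batch structure falls apart once batch no longer tracks condition, and the feature importances show it was leaning on the batch-shifted genes.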
Key Questions to Ask When Someone Proposes an ML Project
Question | Why It Matters
What exactly are we predicting? | Vague labels produce useless models
How much labeled data do we have? | 100 samples won't train a neural network
What's the baseline? | If a simple rule gets 80%, you might not need ML
Are there batch effects? | Model might learn batch, not biology
What would we do differently if the model works? | No actionable outcome = no business value
How will we validate prospectively? | Retrospective CV is necessary but not sufficient
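The baseline question is easy to make concrete. The sketch below uses only the standard library; the label counts and the 0.83 model accuracy are made-up numbers for illustration, and the function names are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def lift_over_baseline(model_accuracy, labels):
    """How much a model adds over the trivial majority-class rule."""
    return model_accuracy - majority_baseline_accuracy(labels)


# Hypothetical screen: 80 inactive compounds, 20 active
labels = ["inactive"] * 80 + ["active"] * 20

baseline = majority_baseline_accuracy(labels)
print(baseline)                                     # 0.8
print(round(lift_over_baseline(0.83, labels), 2))   # 0.03
```

An 83% accurate model sounds respectable until it sits next to an 80% rule that predicts "inactive" for everything. Reporting the lift alongside the headline number keeps that comparison visible.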
Bio-Specific Prompts
Generate synthetic RNA-seq data: 100 samples, 1000 genes, 2 conditions (treated/untreated, 50 each). Add 50 differentially expressed genes with log2FC ~2. Then add a batch effect: assign samples to batch 1 (30 treated, 20 untreated) and batch 2 (20 treated, 30 untreated), so batch is partially confounded with condition. Train a Random Forest classifier and show it achieves >95% accuracy. Then check whether the model learned batch rather than treatment by examining feature importances and testing on balanced batches.
Apply ComBat batch correction to the synthetic data. Retrain the model. Show: (1) PCA before and after correction colored by batch and by condition, (2) model performance before/after, (3) feature importances before/after. Does the model now learn biology instead of batch?
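To see what batch correction does before reaching for ComBat, it helps to try the simplest possible version: subtracting each batch's per-gene mean. The sketch below is that simplified stand-in, not ComBat itself, which additionally shrinks batch estimates with empirical Bayes, adjusts variance, and can protect known covariates. It uses the 30/20 and 20/30 design from the prompt above; the gene counts, effect sizes, and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_de = 100, 10  # illustrative sizes (assumed)

# Batch 0: 30 treated + 20 untreated; batch 1: 20 treated + 30 untreated
condition = np.array([1] * 30 + [0] * 20 + [1] * 20 + [0] * 30)
batch = np.array([0] * 50 + [1] * 50)

X = rng.normal(0.0, 1.0, size=(100, n_genes))
X[:, :n_de] += 2.0 * condition[:, None]          # biological effect
batch_shift = rng.normal(0.0, 2.0, size=n_genes)  # per-gene technical offset
X += batch[:, None] * batch_shift


def center_per_batch(X, batch):
    """Subtract each batch's per-gene mean (location-only correction)."""
    Xc = X.copy()
    for b in np.unique(batch):
        mask = batch == b
        Xc[mask] -= Xc[mask].mean(axis=0)
    return Xc


def mean_gap(X, groups):
    """Mean absolute per-gene difference between the two group means."""
    diff = X[groups == 1].mean(axis=0) - X[groups == 0].mean(axis=0)
    return np.abs(diff).mean()


X_corr = center_per_batch(X, batch)

batch_gap_before = mean_gap(X, batch)
batch_gap_after = mean_gap(X_corr, batch)
de_gap_after = mean_gap(X_corr[:, :n_de], condition)

print(f"batch gap before: {batch_gap_before:.2f}")
print(f"batch gap after:  {batch_gap_after:.2e}")
print(f"condition gap on DE genes after: {de_gap_after:.2f}")
```

Because the design is slightly unbalanced, centering also removes a little of the biology: the expected condition gap on the DE genes drops from 2.0 to about 1.92. In a fully confounded design the same operation would erase the treatment effect entirely, which is why correction cannot rescue a bad experimental layout.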
ML Concepts You'll Pick Up
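  • Data-centric AI — Improving data quality often yields larger gains than tuning the model. Andrew Ng popularized the term.
  • Data leakage — When information that won't be available at prediction time (the label, the test set, or a proxy for either) slips into training, so validation scores overstate real-world performance. Batch confounded with condition is the common bio version.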
  • FAIR principles — Findable, Accessible, Interoperable, Reusable. The standard for shareable data.
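A quick way to check a design for batch confounding before any data is generated is to cross-tabulate condition against batch and compute an association score. The sketch below uses the phi coefficient for two binary labels, with only the standard library; the function name is hypothetical, and the counts are the 30/20 and 20/30 layout from this week's prompt.

```python
import math
from collections import Counter


def phi_coefficient(conditions, batches):
    """Phi for two binary labels: 0 = balanced, +/-1 = fully confounded."""
    c_levels = sorted(set(conditions))
    b_levels = sorted(set(batches))
    if len(c_levels) != 2 or len(b_levels) != 2:
        raise ValueError("both labels must have exactly two levels")
    counts = Counter(zip(conditions, batches))
    a = counts[(c_levels[0], b_levels[0])]
    b = counts[(c_levels[0], b_levels[1])]
    c = counts[(c_levels[1], b_levels[0])]
    d = counts[(c_levels[1], b_levels[1])]
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


# Design from the prompt: batch 1 = 30 treated / 20 untreated,
# batch 2 = 20 treated / 30 untreated
conditions = ["treated"] * 30 + ["untreated"] * 20 + ["treated"] * 20 + ["untreated"] * 30
batches = [1] * 50 + [2] * 50
print(phi_coefficient(conditions, batches))   # 0.2

# Worst case: every treated sample in batch 1, every untreated in batch 2
conditions_bad = ["treated"] * 50 + ["untreated"] * 50
print(phi_coefficient(conditions_bad, batches))   # 1.0
```

A value of 0.2 indicates mild confounding, while 1.0 means batch and condition are indistinguishable and no downstream correction can separate them. Running a check like this on the sample sheet is a natural item for the wet-lab checklist.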
Week 12

Capstone Pipeline & Portfolio Launch

Goal: Build a capstone project that ties everything together, polish your GitHub portfolio, write a blog post, and plan your next move.
Project: End-to-End Compound ML Pipeline

GitHub repo: compound-ml-pipeline

Dataset: A single ChEMBL target's bioactivity table (e.g., CDK2, JAK1, HSP90 — pick a well-studied public target). Or a TDC ADMET regression set. Avoid targets that overlap with your day-job work.

What to build: A pipeline combining skills from every module:

  1. Data ingestion (Module 1 — Claude automation): Pull bioactivity data from ChEMBL or TDC
  2. EDA + cleaning (Module 2): Structured exploration, quality checks, deduplication on canonical SMILES
  3. Feature engineering (Module 2-3): Both RDKit descriptors (interpretable) AND Morgan fingerprints (high-dim) — compare
  4. Model training (Module 2-3): Train and compare multiple models (Ridge, RF, XGBoost) with stratified or scaffold-based CV
  5. Interpretation (Module 3): SHAP values, chemistry validation
  6. Deployment (Module 4): Streamlit app where a user pastes a SMILES and gets a prediction + explanation
  7. Documentation (All): README that could serve as a mini-paper
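For step 4, the idea behind scaffold-based CV is that all compounds sharing a scaffold stay in the same fold, so the model is always tested on chemotypes it has not seen. A minimal standard-library sketch follows. It assumes the scaffold label for each compound has already been computed (for example with RDKit's Murcko scaffold utilities, not shown here); the function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def scaffold_kfold(scaffolds, n_splits=5):
    """Yield (train_idx, test_idx) with each scaffold confined to one fold."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    if n_splits > len(groups):
        raise ValueError("n_splits cannot exceed the number of scaffolds")

    # Greedy balancing: place the largest scaffold groups first,
    # each into whichever fold is currently smallest.
    folds = [[] for _ in range(n_splits)]
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Toy scaffold labels, one per compound (placeholders for scaffold SMILES)
scaffolds = ["A", "A", "A", "B", "B", "C", "D", "D", "E", "F"]

for train_idx, test_idx in scaffold_kfold(scaffolds, n_splits=3):
    test_scaffolds = sorted({scaffolds[i] for i in test_idx})
    print(f"test compounds {test_idx} -> scaffolds {test_scaffolds}")
```

scikit-learn's `GroupKFold` offers the same guarantee when the scaffold labels are passed as `groups`, and is probably the better choice inside the actual pipeline. Expect scaffold-split scores to come out lower than random-split scores; that drop is usually the more honest estimate for new chemistry.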
Portfolio Launch Checklist
  1. Polish all 12 GitHub repos
    • Clean, consistent READMEs with biological context
    • Each repo has: purpose, dataset, methods, results, figures
    • Requirements files and clear setup instructions
  2. Create a GitHub profile README
    • Pin your best 4-6 projects
    • Include a brief bio positioning you at the intersection
    • Link to your website and published articles
  3. Write a blog post
    • "What I Learned Building ML Projects as a Wet-Lab Scientist"
    • Publish on LinkedIn, your website, or Medium
    • Include specific insights, not generic advice
    • Link to your GitHub projects
  4. Internal impact at GSK
    • Draft a 1-pager proposing an ML-augmented approach to a current project
    • Identify 2-3 computational biology / AI people at GSK to connect with
    • Share your portfolio with your manager
Bio-Specific Prompts
Write a GitHub profile README for a molecular biology scientist building at the intersection of wet lab and ML. Include: (1) a 2-sentence bio, (2) a "Featured Projects" section with 4 cards showing project name, one-line description, and tech stack badges (Python, scikit-learn, Streamlit, PyTorch, etc.), (3) a "Skills" section split into Wet Lab / Computational / Tools columns. Keep it clean and professional.
Draft a 600-word LinkedIn article: "What I Learned Building ML Projects as a Wet-Lab Scientist." Structure: (1) why I did this — the gap I saw, (2) the biggest surprise — feature engineering matters more than model choice for biological data, (3) where domain expertise beats algorithms — interpreting feature importance, catching batch effects, knowing which questions matter, (4) what I'd tell other scientists. Make it personal, specific, and link to 2-3 GitHub projects.
Write a 1-page proposal for an ML project at a pharma company: "Predicting Compound Hepatotoxicity Risk from Molecular Descriptors and In Vitro Assay Data." Include: Business Case (why), Data Available (what we have), Approach (how, including baseline and ML models), Expected Impact (what changes if it works), Timeline (4-6 weeks), Risks (batch effects, sample size, label quality). Keep it concise enough for a director to read in 3 minutes.
What This Module Demonstrates
  • Systems thinking — Connecting data generation to modeling to deployment. This is what leaders do.
  • Communication — Explaining ML to biologists and biology to ML engineers. Your bridge role.
  • Strategic thinking — Knowing which problems are worth solving with ML and which aren't.
  • You already wrote "Scientists Must Leverage, Not Compete with, AI Systems" — now you have the portfolio to back it up.

Complete Resource Library

Biological Datasets

Python Libraries

  • pandas, numpy — Data manipulation
  • scikit-learn, xgboost — ML
  • matplotlib, seaborn, plotly — Viz
  • streamlit — Data apps
  • transformers — Bio foundation models
  • biopython — Sequence handling
  • pydeseq2 — Differential expression
  • umap-learn, shap — Viz + explainability
  • anthropic — Claude API

Your Project Portfolio

  • bench2bytes-workspace — Dev setup
  • lab-data-processor — Plate QC + dose-response
  • gene-expression-explorer — RNA-seq EDA
  • toxicity-classifier — First ML model
  • compound-activity-predictor — Regression
  • protein-embedding-explorer — ESM-2
  • immune-signature-classifier — Transcriptomics ML
  • cell-phenotype-classifier — Deep learning
  • drug-safety-dashboard — Streamlit app
  • biopaper-mining-tool — LLM extraction
  • ml-ready-bio-data — Data strategy
  • compound-ml-pipeline — Capstone

Congratulations. You've built 12 projects, deployed 2 apps, and positioned yourself at the intersection of biology and AI.