Module 5 — Weeks 11-12

Leadership & Portfolio Launch

From individual contributor to scientific data lead. Data strategy, a capstone pipeline, portfolio polish, and positioning yourself at the intersection of biology and AI.

Week 11

ML-Ready Data Strategy

Goal: Create a practical guide for generating ML-ready biological data. This knowledge makes you invaluable, because many ML projects in biotech fail on bad data rather than bad models.

Project: ML-Ready Data Playbook

GitHub repo: ml-ready-bio-data

What to build:

Written Guide — A document covering:
- Sample size planning for ML (not the same as traditional power analysis)
- Label quality — your assay readout IS the ML label
- Batch effects — the #1 confounder in biological ML
- Feature richness — multi-modal data beats more samples of one type
- Metadata standards — FAIR principles in practice
- Data format — tidy format, standard identifiers (Gene IDs not names)
Demo Notebook: How Batch Effects Corrupt ML Models
- Generate synthetic gene expression data with a real biological signal
- Add batch effects confounded with the experimental condition
- Show that the ML model learns batch, not biology
- Apply ComBat batch correction and show the difference
Checklist Template for wet-lab teams generating data for ML downstream
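
The core of the demo notebook can be sketched without any ML library. The version below is a minimal illustration under stated assumptions, not the notebook itself: it uses a nearest-centroid classifier as a stand-in for a real model, a fully confounded training design (every treated sample in batch 1), toy sizes (200 genes, 20 of them differentially expressed), and made-up effect sizes. It does not perform ComBat correction.

```python
import random

N_GENES, N_DE = 200, 20   # toy sizes chosen for speed (assumption)
BIO_SHIFT = 2.0           # log2 fold change on the DE genes
rng = random.Random(0)
# Per-gene batch offsets: individually modest, but spread across every gene.
BATCH_SHIFT = [rng.gauss(0.0, 1.5) for _ in range(N_GENES)]


def make_sample(treated, batch):
    """One log-expression profile: noise + biology (first N_DE genes) + batch offset."""
    profile = []
    for g in range(N_GENES):
        value = rng.gauss(0.0, 1.0)
        if treated and g < N_DE:
            value += BIO_SHIFT
        if batch == 1:
            value += BATCH_SHIFT[g]
        profile.append(value)
    return profile


def make_dataset(design):
    """design: list of (treated, batch, n_samples) tuples."""
    X, y = [], []
    for treated, batch, n in design:
        for _ in range(n):
            X.append(make_sample(treated, batch))
            y.append(int(treated))
    return X, y


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit(X, y):
    """Nearest-centroid 'model': one mean profile per class."""
    return {label: centroid([x for x, t in zip(X, y) if t == label]) for label in (0, 1)}


def predict(model, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


# Fully confounded: every treated sample is in batch 1, every control in batch 2.
confounded = [(True, 1, 50), (False, 2, 50)]
# Balanced: each condition is split evenly across both batches.
balanced = [(True, 1, 25), (True, 2, 25), (False, 1, 25), (False, 2, 25)]

model = fit(*make_dataset(confounded))
acc_confounded = accuracy(model, *make_dataset(confounded))
acc_balanced = accuracy(model, *make_dataset(balanced))
print(f"held-out, same confounded design: {acc_confounded:.2f}")
print(f"held-out, balanced batches:       {acc_balanced:.2f}")
```

Under these assumptions the classifier looks excellent on held-out data that shares the confounded layout and drops to roughly chance once batch no longer tracks condition, because the summed batch offsets outweigh the biological signal. In the notebook, the same comparison would be repeated with the real model and with ComBat-corrected data.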

Key Questions to Ask When Someone Proposes an ML Project

Question	Why It Matters
What exactly are we predicting?	Vague labels produce useless models
How much labeled data do we have?	100 samples won't train a neural network
What's the baseline?	If a simple rule gets 80%, you might not need ML
Are there batch effects?	Model might learn batch, not biology
What would we do differently if the model works?	No actionable outcome = no business value
How will we validate prospectively?	Retrospective CV is necessary but not sufficient
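
For the batch-effects question in the table above, one quick check is to cross-tabulate batch against condition and measure how strongly they are associated. The sketch below computes the phi coefficient for a 2x2 table of counts; the three example layouts are illustrative assumptions, not real study designs.

```python
import math


def phi_coefficient(table):
    """Association between two binary factors from a 2x2 table of counts.

    table = [[a, b], [c, d]]: rows are batches, columns are conditions.
    Returns 0.0 for no association and +/-1.0 for complete confounding.
    """
    (a, b), (c, d) = table
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else float("nan")


# Rows: batch 1, batch 2. Columns: treated, untreated.
designs = {
    "balanced": [[25, 25], [25, 25]],
    "partially confounded": [[30, 20], [20, 30]],
    "fully confounded": [[50, 0], [0, 50]],
}
for name, table in designs.items():
    print(f"{name:>22}: phi = {phi_coefficient(table):+.2f}")
```

A 30/20 versus 20/30 layout gives phi = 0.2, a weak association, while putting every treated sample in one batch gives phi = 1.0. The closer the value is to 1, the less any downstream correction can separate batch from biology.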

Bio-Specific Prompts

Generate synthetic RNA-seq data: 100 samples, 1000 genes, 2 conditions (treated/untreated, 50 each). Add 50 differentially expressed genes with log2FC ~2. Then add a batch effect: assign samples to batch 1 (30 treated, 20 untreated) and batch 2 (20 treated, 30 untreated), so that batch is partially confounded with condition. Train a Random Forest classifier and show it achieves >95% accuracy. Then show how much of that performance comes from batch rather than treatment by examining feature importances and testing on balanced batches.

Apply ComBat batch correction to the synthetic data. Retrain the model. Show: (1) PCA before and after correction colored by batch and by condition, (2) model performance before/after, (3) feature importances before/after. Does the model now learn biology instead of batch?
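
To build intuition for what a batch correction does before running ComBat, the sketch below applies the simplest possible adjustment: subtracting each batch's mean, one gene at a time. This is a location-only stand-in and is not ComBat, which also adjusts variance and shrinks its estimates across genes. The expression values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples (values for one gene)."""
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] for v, b in zip(values, batches)]


def group_gap(values, groups, hi, lo):
    """Difference in mean between two groups of samples."""
    hi_vals = [v for v, g in zip(values, groups) if g == hi]
    lo_vals = [v for v, g in zip(values, groups) if g == lo]
    return mean(hi_vals) - mean(lo_vals)


# One gene, 8 samples (toy values): treated adds about 2, batch 1 adds about 3.
values = [5.1, 4.9, 3.0, 3.2, 2.0, 2.2, 0.1, -0.1]
condition = ["T", "T", "U", "U", "T", "T", "U", "U"]
batch_balanced = [1, 1, 1, 1, 2, 2, 2, 2]    # each batch holds 2 treated, 2 untreated
batch_confounded = [1, 1, 2, 2, 1, 1, 2, 2]  # batch is identical to condition

corrected = center_by_batch(values, batch_balanced)
print("batch gap before:     ", round(group_gap(values, batch_balanced, 1, 2), 3))
print("batch gap after:      ", round(group_gap(corrected, batch_balanced, 1, 2), 3))
print("condition gap before: ", round(group_gap(values, condition, "T", "U"), 3))
print("condition gap after:  ", round(group_gap(corrected, condition, "T", "U"), 3))

# With a fully confounded layout, the same correction erases the biology too.
over_corrected = center_by_batch(values, batch_confounded)
print("confounded, condition gap after:",
      round(group_gap(over_corrected, condition, "T", "U"), 3))
```

In the balanced layout the batch gap goes to zero and the treated/untreated gap is preserved. In the confounded layout the condition gap is removed along with the batch gap, which is why correction cannot rescue a design where batch and condition coincide.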

ML Concepts You'll Pick Up

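Data-centric AI — Improving data quality often matters more than improving models, an idea popularized by Andrew Ng.
Data leakage — When information the model would not have at prediction time (test-set samples, the label itself, or a proxy for it) slips into training and inflates performance. Batch confounded with condition is the closest bio analogue: the model scores well by learning batch instead of biology.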
FAIR principles — Findable, Accessible, Interoperable, Reusable. The standard for shareable data.
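
A small part of FAIR in practice can be automated. The sketch below checks a sample sheet for required metadata fields and flags gene identifiers that are not human Ensembl gene IDs. The field names in REQUIRED_FIELDS are a hypothetical schema chosen for illustration; a real project would define its own.

```python
import re

# Hypothetical minimum metadata for each sample (assumption, adapt per project).
REQUIRED_FIELDS = ("sample_id", "condition", "batch", "assay_date")
# Human Ensembl gene IDs: "ENSG" followed by 11 digits (version suffix stripped).
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}$")


def check_sample_sheet(rows):
    """Return a list of human-readable problems found in the sample sheet."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for field in REQUIRED_FIELDS:
            if not str(row.get(field, "")).strip():
                problems.append(f"row {i}: missing {field}")
    return problems


def check_gene_ids(gene_ids):
    """Return the identifiers that are not Ensembl gene IDs (e.g. gene symbols)."""
    return [g for g in gene_ids if not ENSEMBL_GENE.match(g)]


sheet = [
    {"sample_id": "S1", "condition": "treated", "batch": "1", "assay_date": "2024-01-15"},
    {"sample_id": "S2", "condition": "untreated", "batch": "", "assay_date": "2024-01-15"},
]
genes = ["ENSG00000141510", "TP53", "ENSG00000012048"]

print(check_sample_sheet(sheet))
print(check_gene_ids(genes))
```

Running a check like this when data is handed over catches missing batch annotations and symbol-based gene names while they are still cheap to fix.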

Week 12

Capstone Pipeline & Portfolio Launch

Goal: Build a capstone project that ties everything together, polish your GitHub portfolio, write a blog post, and plan your next move.

Project: End-to-End Compound ML Pipeline

GitHub repo: compound-ml-pipeline

Dataset: A single ChEMBL target's bioactivity table (e.g., CDK2, JAK1, or HSP90; pick a well-studied public target), or a TDC ADMET regression set. Avoid targets that overlap with your day-job work.

What to build: A pipeline combining skills from every module:

Data ingestion (Module 1 — Claude automation): Pull bioactivity data from ChEMBL or TDC
EDA + cleaning (Module 2): Structured exploration, quality checks, deduplication on canonical SMILES
Feature engineering (Module 2-3): Both RDKit descriptors (interpretable) AND Morgan fingerprints (high-dim) — compare
Model training (Module 2-3): Train and compare multiple models (Ridge, RF, XGBoost) with scaffold-based CV (or stratified CV if you recast the task as classification)
Interpretation (Module 3): SHAP values, chemistry validation
Deployment (Module 4): Streamlit app where a user pastes a SMILES and gets a prediction + explanation
Documentation (All): README that could serve as a mini-paper
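
Two of the steps above, deduplication on canonical SMILES and scaffold-based CV, can be sketched in plain Python once the chemistry has been precomputed. The example assumes each record already carries a canonical SMILES and a scaffold string (in practice both would come from RDKit); the records, the greedy fold assignment, and the choice of the median for replicates are illustrative assumptions, not a prescribed method.

```python
import math
from collections import defaultdict
from statistics import median

# Toy records: (canonical_smiles, scaffold_smiles, ic50_nM). Strings are
# illustrative and have not been generated by RDKit.
records = [
    ("CCOc1ccccc1", "c1ccccc1", 120.0),
    ("CCOc1ccccc1", "c1ccccc1", 80.0),    # repeat measurement of the same compound
    ("CCNc1ccccc1", "c1ccccc1", 1500.0),
    ("CC1CCNCC1", "C1CCNCC1", 30.0),
    ("OC1CCNCC1", "C1CCNCC1", 45.0),
    ("Cc1ccncc1", "c1ccncc1", 900.0),
    ("CCn1cccc1", "c1cc[nH]c1", 10000.0),
]


def to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


def deduplicate(records):
    """Collapse repeat measurements of one structure to their median pIC50."""
    values, scaffold_of = defaultdict(list), {}
    for smiles, scaffold, ic50_nm in records:
        values[smiles].append(to_pic50(ic50_nm))
        scaffold_of[smiles] = scaffold
    return [(s, scaffold_of[s], median(v)) for s, v in values.items()]


def scaffold_folds(scaffolds, n_folds):
    """Assign whole scaffold groups to folds, largest group first,
    always filling the currently smallest fold."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


compounds = deduplicate(records)
folds = scaffold_folds([scaffold for _, scaffold, _ in compounds], n_folds=3)
for k, fold in enumerate(folds):
    print(f"fold {k}:", [compounds[i][0] for i in fold])
```

Because every scaffold lands in exactly one fold, each held-out fold contains only chemotypes the model has not seen, which gives a more honest estimate than a random split for new chemical series.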

Portfolio Launch Checklist

Polish all 12 GitHub repos
- Clean, consistent READMEs with biological context
- Each repo has: purpose, dataset, methods, results, figures
- Requirements files and clear setup instructions
Create a GitHub profile README
- Pin your best 4-6 projects
- Include a brief bio positioning you at the intersection
- Link to your website and published articles
Write a blog post
- "What I Learned Building ML Projects as a Wet-Lab Scientist"
- Publish on LinkedIn, your website, or Medium
- Include specific insights, not generic advice
- Link to your GitHub projects
Internal impact at GSK
- Draft a 1-pager proposing an ML-augmented approach to a current project
- Identify 2-3 computational biology / AI people at GSK to connect with
- Share your portfolio with your manager

Bio-Specific Prompts

Write a GitHub profile README for a molecular biology scientist building at the intersection of wet lab and ML. Include: (1) a 2-sentence bio, (2) a "Featured Projects" section with 4 cards showing project name, one-line description, and tech stack badges (Python, scikit-learn, Streamlit, PyTorch, etc.), (3) a "Skills" section split into Wet Lab / Computational / Tools columns. Keep it clean and professional.

Draft a 600-word LinkedIn article: "What I Learned Building ML Projects as a Wet-Lab Scientist." Structure: (1) why I did this — the gap I saw, (2) the biggest surprise — feature engineering matters more than model choice for biological data, (3) where domain expertise beats algorithms — interpreting feature importance, catching batch effects, knowing which questions matter, (4) what I'd tell other scientists. Make it personal, specific, and link to 2-3 GitHub projects.

Write a 1-page proposal for an ML project at a pharma company: "Predicting Compound Hepatotoxicity Risk from Molecular Descriptors and In Vitro Assay Data." Include: Business Case (why), Data Available (what we have), Approach (how, including baseline and ML models), Expected Impact (what changes if it works), Timeline (4-6 weeks), Risks (batch effects, sample size, label quality). Keep it concise enough for a director to read in 3 minutes.

What This Module Demonstrates

Systems thinking — Connecting data generation to modeling to deployment. This is what leaders do.
Communication — Explaining ML to biologists and biology to ML engineers. Your bridge role.
Strategic thinking — Knowing which problems are worth solving with ML and which aren't.
You already wrote "Scientists Must Leverage, Not Compete with, AI Systems" — now you have the portfolio to back it up.

Complete Resource Library

Learning (All Free)

Biological Datasets

Therapeutics Data Commons — ML-ready drug discovery
GEO — Gene expression
DEE2 — Processed RNA-seq
ChEMBL — Compound bioactivity
OpenFDA — Adverse events
BBBC — Cell images
UniProt — Protein sequences
Open TG-GATEs — Toxicogenomics

Python Libraries

pandas, numpy — Data manipulation
scikit-learn, xgboost — ML
matplotlib, seaborn, plotly — Viz
streamlit — Data apps
transformers — Bio foundation models
biopython — Sequence handling
pydeseq2 — Differential expression
umap-learn, shap — Viz + explainability
anthropic — Claude API

Your Project Portfolio

bench2bytes-workspace — Dev setup
lab-data-processor — Plate QC + dose-response
gene-expression-explorer — RNA-seq EDA
toxicity-classifier — First ML model
compound-activity-predictor — Regression
protein-embedding-explorer — ESM-2
immune-signature-classifier — Transcriptomics ML
cell-phenotype-classifier — Deep learning
drug-safety-dashboard — Streamlit app
biopaper-mining-tool — LLM extraction
ml-ready-bio-data — Data strategy
compound-ml-pipeline — Capstone

Congratulations. You've built 12 projects, deployed 2 apps, and positioned yourself at the intersection of biology and AI.