Module 2 — Weeks 3-5
Upgrade your Python from lab scripts to structured data science. Three projects covering EDA, classification, and regression — all with biological data and all vibe coded with Claude.
GitHub repo: gene-expression-explorer
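As a reference point for checking what Claude produces in the first step of the gene-expression-explorer project, here is a minimal sketch of the filter-then-normalize logic. The toy count matrix, the sample names, and the `gene_length_bp` values are all made up for illustration. The sketch also assumes gene lengths are available, since TPM cannot be derived from raw counts alone.

```python
import numpy as np
import pandas as pd

# Toy count matrix (genes x samples); real data would be loaded from your file.
counts = pd.DataFrame(
    {
        "ctrl_1": [500, 3, 120, 40],
        "ctrl_2": [450, 8, 100, 60],
        "treated_1": [900, 5, 110, 20],
        "treated_2": [1000, 2, 130, 30],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)
# Hypothetical gene lengths in base pairs (an assumption, not part of the dataset).
gene_length_bp = pd.Series([2000, 1500, 1000, 500], index=counts.index)

# 1. Keep genes whose mean raw count across all samples is at least 10.
keep = counts.mean(axis=1) >= 10
filtered = counts.loc[keep]

# 2. Convert counts to TPM: divide by gene length in kb, then scale each
#    sample (column) so its values sum to one million.
reads_per_kb = filtered.div(gene_length_bp.loc[filtered.index] / 1_000, axis=0)
tpm = reads_per_kb.div(reads_per_kb.sum(axis=0), axis=1) * 1_000_000

# 3. Log-transform with a pseudocount of 1 so zero expression maps to zero.
log_tpm = np.log2(tpm + 1)

print(f"Genes kept: {len(filtered)} of {len(counts)}")
print(log_tpm.round(2))
```

If the code you get back skips the gene-length step, or filters after normalizing instead of before, that is worth questioning in a follow-up prompt.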
What to build:
Load the RNA-seq count matrix from [file]. Filter out genes with mean counts < 10 across all samples. Apply log2(TPM+1) normalization. Show me the distribution of expression values before and after normalization as side-by-side histograms.
Create a volcano plot from this differential expression results table. Color genes with |log2FC| > 1 and padj < 0.05. Label the top 10 genes by significance. Use a clean publication-style theme with thin axes. Save as PDF at 300 DPI.
Run PCA on this normalized expression matrix. Color samples by treatment condition. Add 95% confidence ellipses for each group. Show the variance explained by PC1 and PC2 in the axis labels. Does PC1 separate treated from untreated?
GitHub repo: toxicity-classifier
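For the toxicity-classifier project, it helps to know what a correct stratified split looks like before asking Claude for one. The sketch below uses scikit-learn's `train_test_split` with `stratify=y` on a synthetic, imbalanced dataset. The data, the 10% toxic rate, and the `toxic_fraction` helper are assumptions made for illustration, not part of the course dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the toxicity dataset: 200 compounds, 5 descriptors,
# with only 10% labelled toxic (1) to mimic class imbalance.
X = rng.normal(size=(200, 5))
y = np.array([1] * 20 + [0] * 180)

# stratify=y keeps the toxic/non-toxic ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def toxic_fraction(labels: np.ndarray) -> float:
    """Fraction of samples labelled toxic (1)."""
    return float(labels.mean())


for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: n={len(labels):3d}, toxic fraction={toxic_fraction(labels):.3f}")
```

The toxic fraction should be close to identical across the three rows. If the train and test fractions differ noticeably, the split was probably not stratified.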
What to build:
Split this toxicity dataset into 80/20 train/test with stratification on the toxicity label (the classes are imbalanced). Show me the class distribution in both splits to verify the stratification worked.
Train a Random Forest classifier to predict hepatotoxicity from molecular descriptors. Use 5-fold stratified cross-validation. Report AUC-ROC, precision, recall, and F1 for each fold. Plot the ROC curves overlaid on one figure.
Extract SHAP values from the best model on the test set. Show: (1) SHAP beeswarm summary plot, (2) top 20 features bar plot, (3) SHAP dependence plot for the top 3 features. For each top feature, explain what it might mean biologically.
GitHub repo: compound-activity-predictor
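For the compound-activity-predictor project, the descriptor function is specified precisely enough that you can compare Claude's output against a reference. What follows is one possible sketch, not the expected solution. It assumes RDKit's `Lipinski.NumHDonors` and `Lipinski.NumHAcceptors` are acceptable donor and acceptor counts, and it reads the Rule of Five as MW <= 500, logP <= 5, donors <= 5, acceptors <= 10.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors


def compute_compound_descriptors(smiles: str) -> dict:
    """Compute basic physicochemical descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when parsing fails; an empty string parses
    # to a molecule with no atoms, which is treated as invalid here too.
    if mol is None or mol.GetNumAtoms() == 0:
        raise ValueError(f"Invalid SMILES: {smiles!r}")

    mw = Descriptors.MolWt(mol)
    logp = Crippen.MolLogP(mol)
    hbd = Lipinski.NumHDonors(mol)
    hba = Lipinski.NumHAcceptors(mol)

    return {
        "molecular_weight": mw,
        "logP": logp,
        "num_h_bond_donors": hbd,
        "num_h_bond_acceptors": hba,
        "num_rotatable_bonds": Lipinski.NumRotatableBonds(mol),
        "tpsa": rdMolDescriptors.CalcTPSA(mol),
        "num_aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "fraction_sp3": rdMolDescriptors.CalcFractionCSP3(mol),
        # Assumed Rule-of-Five thresholds; all four must hold.
        "passes_lipinski": bool(mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10),
    }


# Worked example: aspirin (acetylsalicylic acid).
aspirin = compute_compound_descriptors("CC(=O)OC1=CC=CC=C1C(=O)O")
for name, value in aspirin.items():
    print(f"{name}: {value}")
```

Aspirin is a useful sanity check because its molecular weight (about 180 g/mol) and single aromatic ring are easy to verify by hand.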
What to build:
Write `compute_compound_descriptors(smiles: str) -> dict` using RDKit. Return: molecular_weight, logP (Crippen), num_h_bond_donors, num_h_bond_acceptors, num_rotatable_bonds, tpsa, num_aromatic_rings, fraction_sp3, passes_lipinski (bool, all four Ro5 rules). Include type hints, raise ValueError for invalid SMILES, and provide a worked example using aspirin.
Train an XGBoost regressor to predict pIC50 from descriptors. Use 5-fold cross-validation with random_state=42. Plot predicted vs. actual with a diagonal reference line, R² and RMSE annotated. Color points by molecular_weight. Add marginal histograms on each axis.
Analyze model residuals (actual - predicted). Plot residuals vs. each top-5 feature. Are there systematic patterns? Highlight regions where the model consistently over- or under-predicts. What chemistry might explain those failures (e.g., out-of-distribution scaffolds, large macrocycles)?
You should have 3 new GitHub repos after this module.