Module 1 — Weeks 1-2
Before you learn ML, master the tool you'll use to build everything. Claude Code isn't just autocomplete — it's agents, automation, plugins, and a completely different way to write software. This module makes you dangerous with it.
Claude Code is a CLI tool that gives you an AI pair programmer in your terminal. It can read files, write code, run commands, search your codebase, and manage git — all through natural language.
The most important file for vibe coding. CLAUDE.md sits in your project root and tells Claude everything it needs to know about your project. Claude reads it automatically every session.
# Compound Tox Explorer
## What this project does
Analyzes synthetic small-molecule toxicogenomics data to explore compound toxicity patterns using EDA and classical ML. Personal portfolio project — all data is synthetic to keep this independent of any employer work.
## Tech stack
- Python 3.11, pandas, scikit-learn, seaborn, plotly
- Jupyter notebooks for exploration, .py files for reusable functions
- Data lives in /data (not committed to git)
## Conventions
- Use snake_case for all functions and variables
- Every notebook should have a markdown cell at the top explaining what it does
- Figures saved to /figures as both PNG (for README) and PDF (for publication)
- Use log2(TPM+1) normalization for gene expression data
## Biology context
- We're analyzing small-molecule compound toxicity patterns
- Key endpoints: hepatotoxicity (ALT/AST/GGT elevation), oxidative stress, apoptosis (caspase activation)
- Compound classes: four generic mechanism categories (Class A/B/C/D) — fully synthetic
- Gene panel: universal liver/stress markers plus a broader transcriptomic panel
## Current status
- Data downloaded and cleaned
- EDA notebook complete
- Working on classification model
Why this matters: A good CLAUDE.md means you spend less time re-explaining context every session. It's like a lab notebook for your AI assistant: the more context it has, the better its suggestions.
The difference between good and bad vibe coding is prompt quality. Here are patterns that work specifically for biological data science.
Pattern 1 (Context-first):
Bad: "Make a heatmap"
Good: "I have an RNA-seq count matrix (rows = genes, columns = samples) stored in a DataFrame called `expr_df`. The first 5 columns are IFN-treated macrophages, the last 5 are untreated controls. Create a clustermap of the top 50 most variable genes across all samples, with a column color bar showing treatment condition. Use a diverging colormap centered on 0 (blue-white-red) since this is z-score normalized expression."
Pattern 2 (Specify output format):
"Write a function called `compute_compound_descriptors(smiles: str) -> dict` that takes a SMILES string and returns a dictionary with: molecular_weight (float, Da), logP (float, RDKit Crippen), num_h_bond_donors (int), num_h_bond_acceptors (int), num_rotatable_bonds (int), tpsa (float, topological polar surface area), passes_lipinski (bool, all four Ro5 rules). Include type hints, a docstring with a worked example using aspirin (CC(=O)Oc1ccccc1C(=O)O), and raise ValueError for invalid SMILES."
Pattern 3 (Iterative refinement):
"The volcano plot looks good but: (1) increase font size to 12pt for publication, (2) add a dashed horizontal line at -log10(0.05), (3) only label genes from this list: [MX1, IFIT1, ISG15, OAS1, STAT1] since those are known ISGs, (4) save as both PNG 300dpi and PDF vector format."
Pattern 4 (Biology-first):
"I need to normalize my flow cytometry data. The raw data is MFI (median fluorescence intensity) values for 8 markers across 96 samples in a plate. Some wells are unstained controls (column 1), single-stain controls (column 2), and FMO controls (column 3). The rest are experimental samples. Apply compensation-like subtraction of FMO background, then normalize each marker to the 99th percentile across all experimental samples so values are roughly 0-1."
Bad: jumping straight to "build a classifier on data/compound_tox.csv"
Good: "Read data/compound_tox.csv and tell me: shape, dtypes, missing-value counts per column, summary statistics for numeric columns, value counts for categorical columns, and any obvious data-quality issues (constant columns, suspicious ranges, duplicate rows). Don't model anything yet — just describe what's in the file."
Then in your next prompt: "Given what you found, what's the cleanest way to set up a binary classifier for hepatotox_label?"
Why it matters: Skipping this step is how you end up with models trained on leaked features or NaN-filled garbage. Always force Claude to describe the data before transforming it.
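To make the "describe before you transform" step concrete, here is a minimal sketch of the kind of read-only profile the prompt asks for. It uses only the standard library so it runs anywhere; in the project itself you would more likely reach for pandas. The function name `describe_csv` and the list of strings treated as missing are assumptions for illustration.

```python
import csv
from collections import Counter


def describe_csv(path: str) -> dict:
    """Summarize a CSV without transforming it: shape, missing values, red flags."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        rows = list(reader)

    # Treat empty strings and common NA spellings as missing (an assumption).
    missing_tokens = {"", "na", "nan", "null", "none"}
    missing = {
        col: sum(1 for row in rows if (row[col] or "").strip().lower() in missing_tokens)
        for col in columns
    }

    # A column with a single distinct value carries no information for a model.
    constant = [col for col in columns if len({row[col] for row in rows}) <= 1]

    # Exact duplicate rows: count every repeat beyond the first occurrence.
    row_counts = Counter(tuple(row[col] for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())

    return {
        "shape": (len(rows), len(columns)),
        "missing_per_column": missing,
        "constant_columns": constant,
        "duplicate_rows": duplicates,
    }
```

Nothing here modifies the file or drops rows; it only reports, which is the point of the pattern.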
"Build a hepatotox classifier on data/compound_tox.csv (target = hepatotox_label).
Constraints — these matter:
- Do NOT use any of the *_fold_change columns as features. The label is derived from them, so they would leak it.
- Do NOT impute missing values with column means before splitting. Fit imputers on train only.
- Do NOT use accuracy as the headline metric — classes are imbalanced. Use ROC-AUC and PR-AUC.
- Do NOT use a stock train_test_split. Stratify on hepatotox_label so test set has the same prevalence.
- Do NOT install new packages without telling me first."
Why it matters: Claude will produce reasonable-looking code that does subtly wrong things (data leakage, wrong metric, unstratified splits) unless you forbid them. Stating constraints up front is faster than catching the bugs in code review.
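Two of the constraints above (stratify the split, fit imputers on train only) are easier to check in review if you know what correct code looks like. The sketch below shows both ideas in plain Python on toy data; the helper names `stratified_split`, `fit_mean_imputer`, and `apply_imputer` are hypothetical, and the toy values are made up.

```python
import random
from statistics import mean


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at the same fraction."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_mean_imputer(values):
    """Learn the fill value from the observed (non-None) training values only."""
    return mean(v for v in values if v is not None)


def apply_imputer(values, fill_value):
    """Replace None with a fill value that was learned elsewhere."""
    return [fill_value if v is None else v for v in values]


# Toy data (made up): 20% positives, two missing logP values.
labels = [1] * 4 + [0] * 16
logp = [2.0, None, 4.0, 3.0] + [1.0] * 15 + [None]

train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=42)
fill = fit_mean_imputer([logp[i] for i in train_idx])  # statistic comes from train rows only
train_x = apply_imputer([logp[i] for i in train_idx], fill)
test_x = apply_imputer([logp[i] for i in test_idx], fill)  # test reuses the train statistic
```

In the real project, scikit-learn's `train_test_split(..., stratify=y)` and a `Pipeline` containing `SimpleImputer` do the same job; the thing to look for in Claude's code is that nothing is fitted before the split.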
"After training the classifier, run these sanity checks and tell me if anything looks suspicious:
1. ROC-AUC on training set vs. test set — if train >> test, flag as overfitting.
2. Permutation importance for the top 5 features — if a single feature dominates with importance > 0.5, flag as possible leakage.
3. Prediction distribution on the test set — if >95% of predictions are one class, flag as a degenerate model.
4. Confusion matrix at the default 0.5 threshold and at the threshold that maximizes F1.
For any flag, propose what to investigate next. Don't fix anything yet, just diagnose."
Why it matters: Claude is good at finding problems if you ask it to. The same model that "looks great" in the first response often has obvious issues when you specifically request a critique.
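The first three sanity checks reduce to simple threshold rules, sketched below. The function name `sanity_flags` is hypothetical, and the 0.10 train-test gap is an assumed stand-in for "train >> test", which the prompt does not quantify; the 0.5 and 95% thresholds come from the prompt.

```python
from collections import Counter


def sanity_flags(train_auc, test_auc, importances, predictions,
                 gap_threshold=0.10, dominance_threshold=0.5, degenerate_share=0.95):
    """Return human-readable flags; gap_threshold is an illustrative default."""
    flags = []

    # Check 1: a large train-test gap suggests overfitting.
    if train_auc - test_auc > gap_threshold:
        flags.append(f"overfitting: train AUC {train_auc:.2f} vs test AUC {test_auc:.2f}")

    # Check 2: a single dominant feature is a common sign of leakage.
    top_feature, top_value = max(importances.items(), key=lambda item: item[1])
    if top_value > dominance_threshold:
        flags.append(f"possible leakage: {top_feature} importance {top_value:.2f}")

    # Check 3: nearly all predictions in one class means a degenerate model.
    most_common_class, count = Counter(predictions).most_common(1)[0]
    share = count / len(predictions)
    if share > degenerate_share:
        flags.append(f"degenerate: {share:.0%} predicted as class {most_common_class}")

    return flags
```

An empty list means none of the three rules fired, not that the model is sound; the confusion-matrix check still needs a human look.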
Claude Code has permission modes that control how much autonomy it has. Cycle through them in any session by pressing Shift+Tab — the active mode shows in the status line.
| Mode | What Happens | When to Use |
|---|---|---|
| default | Asks before running shell commands or editing files | Default for any work that matters |
| acceptEdits | Auto-accepts file edits, still asks for shell commands | Active coding sessions where you're reviewing diffs as they land |
| plan | Read-only: Claude can search and read but cannot edit or run anything | When you want a plan without touching the codebase |
| bypassPermissions | Auto-accepts everything, including shell commands ("YOLO mode") | Sandboxes and throwaway experiments only. Never on a real codebase. |
Set a default for new sessions: add "defaultMode": "acceptEdits" under the "permissions" key in ~/.claude/settings.json. Don't put bypassPermissions there; opt into it per session.
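If you would rather script that settings change than hand-edit JSON, a small helper along these lines would work. This is a sketch under one assumption worth checking against your installed version: that `defaultMode` belongs inside the `permissions` object of the settings file. The helper names are hypothetical, and it refuses to persist `bypassPermissions`, in line with the advice above.

```python
import json
from pathlib import Path

# bypassPermissions is deliberately excluded: opt into it per session instead.
ALLOWED_DEFAULTS = {"default", "acceptEdits", "plan"}


def with_default_mode(settings: dict, mode: str) -> dict:
    """Return a copy of the settings with permissions.defaultMode set."""
    if mode not in ALLOWED_DEFAULTS:
        raise ValueError(f"refusing to persist mode {mode!r}")
    updated = dict(settings)
    # Assumption: defaultMode sits under the "permissions" key.
    updated["permissions"] = {**settings.get("permissions", {}), "defaultMode": mode}
    return updated


def update_settings_file(path: Path, mode: str) -> None:
    """Merge the new default into an existing settings file, or create one."""
    current = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(with_default_mode(current, mode), indent=2) + "\n")
```

Calling `update_settings_file(Path.home() / ".claude" / "settings.json", "acceptEdits")` would apply it; existing keys in the file are preserved.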
Throughout Module 1 we use a small synthetic toxicology dataset. It's intentionally fake: no real compounds, no real biology — just realistic shape and distributions so you can practice prompting on something that looks like a real EDA problem.
Columns: compound_id (str), compound_class (A/B/C/D), molecular_weight (Da), logP, dose_uM, alt_fold_change, ast_fold_change, ggt_fold_change, hepatotox_label (0/1, positive if any FC > 3).
Generate it once at the start of the project. Paste this into Claude Code — this prompt itself is Pattern 4 (biology-first):
Generate a synthetic compound toxicity dataset and save as data/compound_tox.csv. 100 compounds. Columns: compound_id (str like "CMP001"), compound_class (one of: A, B, C, D; pick uniformly), molecular_weight (float, 200-800 Da), logP (float, -1 to 6), dose_uM (log-spaced from 0.1 to 100), alt_fold_change (1.0-15, log-normal noise; Class B and D should average ~3x higher than A and C), ast_fold_change (correlated with ALT, ratio noise ~10%), ggt_fold_change (similar pattern), hepatotox_label (binary: 1 if any of ALT/AST/GGT > 3, the standard fold-change threshold). Use a fixed random seed for reproducibility. Add a docstring at the top of the script explaining the schema.
Why synthetic: A toy dataset removes data wrangling as a confounder while you learn prompting. Once you're comfortable, every later module uses real public data (GEO, ChEMBL, BBBC, FAERS).
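For comparison with whatever Claude produces, here is one way a script meeting that prompt could look. It is a sketch, not the course's reference generator: the class baselines (1.5 vs. 4.5), the log-normal sigma of 0.4, and the seed are assumptions chosen to roughly match the stated schema, and it uses only the standard library.

```python
"""Generate data/compound_tox.csv: 100 synthetic compounds (no real chemistry or biology).

Columns: compound_id, compound_class, molecular_weight, logP, dose_uM,
alt_fold_change, ast_fold_change, ggt_fold_change, hepatotox_label.
"""
import csv
import random
from pathlib import Path

FC_MIN, FC_MAX = 1.0, 15.0
TOX_THRESHOLD = 3.0


def clip(value, low=FC_MIN, high=FC_MAX):
    """Keep fold changes inside the 1.0-15 range from the schema."""
    return max(low, min(high, value))


def make_rows(n=100, seed=42):
    """Build n compound rows (n >= 2) with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        compound_class = rng.choice("ABCD")
        # Assumed baselines: classes B and D sit about 3x above A and C.
        baseline = 4.5 if compound_class in "BD" else 1.5
        raw_alt = clip(baseline * rng.lognormvariate(0.0, 0.4))
        alt = round(raw_alt, 3)
        ast = round(clip(raw_alt * rng.gauss(1.0, 0.10)), 3)  # tracks ALT, ~10% ratio noise
        ggt = round(clip(raw_alt * rng.gauss(1.0, 0.10)), 3)
        rows.append({
            "compound_id": f"CMP{i + 1:03d}",
            "compound_class": compound_class,
            "molecular_weight": round(rng.uniform(200, 800), 1),
            "logP": round(rng.uniform(-1, 6), 2),
            # Log-spaced doses from 0.1 to 100 uM across the compound list.
            "dose_uM": round(0.1 * 10 ** (3 * i / (n - 1)), 4),
            "alt_fold_change": alt,
            "ast_fold_change": ast,
            "ggt_fold_change": ggt,
            # Label is computed from the stored (rounded) values so the CSV is self-consistent.
            "hepatotox_label": int(max(alt, ast, ggt) > TOX_THRESHOLD),
        })
    return rows


def write_csv(rows, path="data/compound_tox.csv"):
    """Write the rows to disk, creating the data directory if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Running `write_csv(make_rows())` once at the start of the project creates the file. Note that the label is a deterministic function of the three fold-change columns, which is exactly why they must not be used as model features later.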
GitHub repo: bench2bytes-workspace
What to do:
- claude in the directory)
- CLAUDE.md using the template above; rename the project to something that's not a conflict of interest with your day job
- requirements.txt, /src, /notebooks, /data, /figures, .gitignore
Prompt 1 — Pattern 1 (Context-first):
"I have a CSV at data/compound_tox.csv with 100 compounds. Columns: compound_id, compound_class (A/B/C/D), molecular_weight (Da), logP, dose_uM, alt_fold_change, ast_fold_change, ggt_fold_change, hepatotox_label (0/1). Make a strip plot of alt_fold_change on the y-axis grouped by compound_class on the x-axis, with hepatotox_label as the hue (0 = blue, 1 = red). Add a horizontal dashed line at fold-change = 3. Use a log scale on the y-axis. Save to figures/alt_by_class.png at 300dpi."
Prompt 2 — Pattern 2 (Specify output format):
"Write a function `compute_tox_features(row: pd.Series) -> dict` in src/features.py that takes one compound row and returns: max_fold_change (float, max of ALT/AST/GGT FCs), n_elevated_markers (int, count of markers with FC > 3), tox_severity (str: 'none' if max FC < 2, 'mild' if 2-3, 'moderate' if 3-5, 'severe' if >5), is_high_risk (bool: True if alt_fold_change > 3 AND ast_fold_change > 3). Include type hints, a docstring with a worked example, and raise ValueError if any required column is missing."
Prompt 3 — Pattern 3 (Iterative refinement, run after Prompt 1):
"The strip plot from figures/alt_by_class.png is close but: (1) make points size 8 with alpha 0.7, (2) increase font sizes to 12pt, (3) overlay the median ALT FC per class as a black horizontal bar, (4) order classes by ascending median ALT FC, (5) save as both PNG (300dpi) and PDF."
Prompt 4 — Pattern 4 (Biology-first):
"I want to know whether lipophilicity (logP) predicts hepatotoxicity. Textbook claim: highly lipophilic compounds (logP > 3) accumulate in hepatocytes and disrupt mitochondrial function. Compute Spearman correlation between logP and alt_fold_change overall and per compound_class. Make a scatter plot of logP vs alt_fold_change colored by class with a LOWESS trend per class. Tell me whether the data supports the claim, including class-specific exceptions."
Prompt 5 — Combination (Patterns 1 + 2):
"Write a notebook at notebooks/01_compound_eda.ipynb. Cell 1 (markdown): title + dataset description. Cell 2 (code): imports. Cell 3: load + shape/dtypes/head. Cell 4: missing-value check + summary stats. Cell 5: histogram of alt_fold_change (log scale) with FC=3 line. Cell 6: boxplot of alt_fold_change by class with sample sizes. Cell 7: correlation heatmap of numeric columns. Cell 8 (markdown): 3-5 bullet summary. Save figures to figures/eda/ as PNG."
Claude Code can spawn "sub-agents": separate Claude instances that handle specific sub-tasks. Think of it like delegating to a research assistant who goes away, does the work, and reports back.
When you give Claude Code a complex task, it can automatically:
1. Break it into sub-tasks
2. Spawn agents to handle each one in parallel
3. Combine the results
Example: "Analyze all CSV files in /data, create a summary statistics table for each, identify the 3 most interesting patterns, and write a combined report."
Claude Code will:
- Agent 1: Read and summarize file1.csv
- Agent 2: Read and summarize file2.csv
- Agent 3: Read and summarize file3.csv
- Main: Combine findings into a report
This happens automatically for complex tasks. You don't need to orchestrate it.
Bio application: Ask Claude to "read all the notebooks in this project, understand the analysis pipeline, and suggest what's missing". It will use agents to read each notebook in parallel and synthesize.
MCP (Model Context Protocol) lets Claude Code connect to external services as plugins. Think of MCP servers as lab instruments that Claude can now operate directly.
Useful MCP servers for bio-data work:
| MCP Server | What It Does | Bio Use Case |
|---|---|---|
| Filesystem | Read/write files outside project | Access data in shared drives |
| GitHub | Create issues, PRs, manage repos | Automate project management |
| Fetch/Web | Fetch web content and APIs | Pull data from PubMed, UniProt, GEO |
| SQLite | Query databases | Query local experiment databases |
| Google Drive | Access Google Workspace | Pull data from shared lab folders |
Hooks are shell commands that run automatically in response to Claude Code events. Like triggers on your automation platform.
Useful hook patterns:
- Run ruff or black after every Python file edit
Create reusable commands for tasks you do repeatedly. Save them as markdown files in .claude/commands/.
For programmatic use in your projects (like the literature mining tool in Module 4), you'll use the Anthropic Python SDK.
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Extract the target gene and compound type from this abstract: ..."
        }
    ]
)
print(message.content[0].text)
When to use CLI vs API: Use Claude Code CLI for interactive development and exploration. Use the API when you're building a tool that needs Claude inside a script (like extracting data from 100 papers programmatically).
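The "100 papers" case is mostly a loop around the single call shown above. One possible shape, sketched here, separates building the request from sending it, so the loop can be tested without network access. The names `build_extraction_request`, `extract_all`, and `send` are hypothetical; the model string and prompt are copied from the snippet above.

```python
EXTRACTION_PROMPT = "Extract the target gene and compound type from this abstract: "


def build_extraction_request(abstract, model="claude-sonnet-4-20250514", max_tokens=1024):
    """Build the keyword arguments for one messages.create call."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": EXTRACTION_PROMPT + abstract}],
    }


def extract_all(abstracts, send):
    """Run the same extraction over many abstracts.

    `send` is any callable that takes the request dict and returns the reply text.
    """
    results = []
    for abstract in abstracts:
        try:
            results.append(send(build_extraction_request(abstract)))
        except Exception as error:  # keep going so one bad paper does not stop the batch
            results.append(f"ERROR: {error}")
    return results
```

With the `client` from the snippet above, `send` could be `lambda request: client.messages.create(**request).content[0].text`. For a real batch you would also want rate limiting and retries, which this sketch leaves out.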
GitHub repo: lab-data-processor
What to build:
- /plate-qc: Run QC checks on plate reader data (CV of replicates, Z-factor, edge effects)
- /dose-response: Fit dose-response curves and calculate IC50
Generate synthetic 96-well plate reader data for a dose-response experiment. 8 doses (3-fold dilution from 100 uM) in columns 2-9, each in triplicate (rows A-C for compound 1, rows D-F for compound 2). Column 1 = vehicle controls, columns 10-12 = positive controls. Add realistic noise (CV ~10%) and a sigmoidal dose-response with IC50 around 5 uM for compound 1 and 15 uM for compound 2. Save as CSV with well position, absorbance, timepoint columns.
Build a plate QC module that takes raw plate reader data and calculates: (1) Z-factor from positive and negative controls, (2) CV% of replicates for each condition, (3) edge effect analysis (compare perimeter wells vs. interior), (4) flag any wells with >3 SD deviation from replicate mean. Return a QC summary dict and a visual plate heatmap.
Write a dose-response fitting function that takes concentrations and responses, fits a 4-parameter logistic curve (Hill equation), and returns IC50, Hill slope, top, bottom, R-squared, and 95% CI for IC50. Plot the fitted curve with data points and error bars. Handle cases where the curve doesn't converge.
Make sure you've pushed both projects to GitHub before completing.
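When you review what Claude builds for the plate QC and dose-response prompts above, it helps to have the core formulas written down. The sketch below covers the Z-factor, replicate CV%, the 4-parameter logistic model, and the dilution series; the function names are hypothetical, and the 4PL is written in the inhibition orientation (response falls from top to bottom as dose rises), which is an assumption about how the absorbance data is set up.

```python
from statistics import mean, stdev


def z_factor(positive_controls, negative_controls):
    """Z-factor from control wells: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive_controls) - mean(negative_controls))
    return 1 - 3 * (stdev(positive_controls) + stdev(negative_controls)) / separation


def cv_percent(replicates):
    """Coefficient of variation of replicate wells, in percent."""
    return 100 * stdev(replicates) / mean(replicates)


def four_parameter_logistic(dose, bottom, top, ic50, hill):
    """4PL (Hill) model; at dose == ic50 the response is halfway between top and bottom."""
    return bottom + (top - bottom) / (1 + (dose / ic50) ** hill)


# 8 doses, 3-fold dilution starting at 100 uM, as in the data-generation prompt.
doses = [100 / 3 ** k for k in range(8)]

# Noise-free responses for a compound with IC50 = 5 uM (illustrative parameters).
ideal_curve = [four_parameter_logistic(d, bottom=0.1, top=1.0, ic50=5.0, hill=1.0) for d in doses]
```

Fitting the four parameters to noisy data is usually done with `scipy.optimize.curve_fit`, which raises `RuntimeError` when the fit fails to converge. One thing to watch for in the QC prompt: with only three replicates, no well can sit more than about 1.15 sample standard deviations from its own replicate mean, so a 3 SD rule computed per triplicate will never fire; that check needs a pooled SD or more replicates.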