Module 4 — Weeks 9-10

Building Data Products

Stop making notebooks that live on your laptop. Build tools other scientists can actually use — dashboards, apps, and LLM-powered data extraction. This is what scientific data leads DO.

Week 9

Drug Safety Dashboard

Goal: Build a real data application using regulatory data. Learn the end-to-end workflow: data source → processing → analysis → visualization → deployment.
Project: Interactive Drug Safety Explorer

GitHub repo: drug-safety-dashboard

Dataset: FDA FAERS API — Real adverse event reporting data. Or ChEMBL for compound bioactivity data.

What to build: A Streamlit dashboard that:

  1. Lets users search for a drug or compound class
  2. Shows adverse event profiles organized by organ system
  3. Compares safety profiles across drug modalities (small molecules, biologics, etc.)
  4. ML clustering of drugs by adverse event profile (unsupervised)
  5. Interactive plotly visualizations
  6. Deploy on Streamlit Cloud (free tier)
Bio-Specific Prompts
Write a Python function that queries the FDA FAERS API for adverse events associated with a given drug name. Parse the JSON response to extract: reaction names, outcome counts, seriousness level, and reporting quarter. Return as a clean DataFrame. Handle pagination for drugs with many reports.
Build a Streamlit dashboard with: (1) search bar for drug name, (2) bar chart of top 20 adverse events by frequency, (3) treemap showing events organized by MedDRA organ system class, (4) comparison tab that overlays AE profiles of two drugs side by side using plotly. Use st.tabs for organization. Cache API calls with @st.cache_data.
Create a drug clustering analysis: build a feature matrix (rows = drugs, columns = adverse event types, values = normalized frequency). Run UMAP + KMeans(k=5). Visualize with plotly scatter, color by cluster, hover shows drug name. Add a sidebar to adjust k. Do drugs in the same therapeutic class cluster together?
ML Concepts You'll Pick Up
  • Unsupervised learning — Clustering without labels. Discovery-driven, not prediction-driven.
  • Data product development — The full loop from raw data to a tool someone else can use.
  • API integration — Pulling live data from external sources programmatically.
  • Deployment — Making your work accessible. Streamlit Cloud = free hosting for data apps.
Week 10

LLM-Powered Literature Mining

Goal: Build a tool that uses Claude's API to extract structured data from scientific papers. This is the kind of tool that could seed a startup.
Project: BioPaper Mining Tool

GitHub repo: biopaper-mining-tool

Dataset: PubMed abstracts via NCBI E-utilities API. Pick a focused, public-knowledge topic (e.g., "CRISPR base editing", "kinase inhibitor selectivity", "drug-induced liver injury"). Avoid topics that overlap with your day-job work if portfolio independence matters.

What to build:

  1. Fetch abstracts from PubMed by search query
  2. Use Claude API to extract structured data:
    • Drug/compound name + modality (small molecule, antibody, gene therapy, etc.)
    • Target gene/pathway
    • Model system (cell type, animal model, clinical)
    • Key findings (efficacy, toxicity, mechanism)
    • Dose/concentration information
  3. Output as structured CSV/database
  4. Streamlit interface: search, browse, filter, download
  5. Analytics tab: most-studied targets, trends over time, common model systems
Bio-Specific Prompts
Write a function using the NCBI E-utilities API that searches PubMed for a query string and returns the top N abstracts with PubMed IDs, titles, authors, publication date, journal, and full abstract text as a DataFrame. Include rate limiting (max 3 requests/second per NCBI guidelines).
Using the Anthropic Python SDK, write a function that sends a PubMed abstract to Claude and extracts structured data. The system prompt should instruct Claude to return JSON with these fields: compound_name, modality (one of: small_molecule, antibody, peptide, cell_therapy, gene_therapy, other), target_gene, model_system, species, key_finding, toxicity_mentioned (bool), doses_tested. Handle cases where information isn't stated (return null, not a guess). Use Claude Haiku for cost efficiency since we're processing many abstracts.
Build a Streamlit app for this literature mining tool. Layout: (1) sidebar with PubMed search query input, number of papers slider (10-100), and a "Mine Papers" button with a progress bar, (2) main area with tabs: "Results Table" (sortable, filterable DataFrame), "Analytics" (bar charts of top targets, compound types, model systems), "Download" (CSV export button). Cache results so re-running doesn't re-fetch.
ML Concepts You'll Pick Up
  • LLM APIs — Using language models programmatically. The most in-demand AI skill in 2026.
  • Structured extraction (NER) — Getting reliable, typed output from unstructured text. Massive bottleneck in biotech.
  • Prompt engineering for extraction — System prompts, output schemas, handling edge cases.
  • RAG (conceptual) — You're building the "retrieval" part. This prepares you for knowledge systems and chatbots over scientific literature.

Startup potential: Automated literature mining + structured databases is a real market need. Pharma companies spend massive resources on manual literature review. A tool that reliably extracts and structures data from thousands of papers is genuinely valuable. Deploy this on Streamlit Cloud and link from your portfolio.

You should have 2 deployed Streamlit apps after this module.