
BCH709 Introduction to Bioinformatics: Vibe Coding in Life Sciences

Source Attribution

This content is adapted from “When Vibe Coding Meets Life Science” by Gozde Eskici, Ph.D. (The Second Translation newsletter, April 14, 2025)


Table of Contents

  1. What is Vibe Coding?
  2. Vibe Coding in Biotech
  3. 8 Tools Rewriting the Rules
  4. BCH709 Lab Materials
  5. Appendix: Prompt Templates

What is Vibe Coding?

On February 2nd, 2025, Andrej Karpathy, one of the most influential voices in AI, introduced a new term:

“There’s a new kind of coding I call ‘vibe coding’, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It’s possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard… I’m building a project or webapp, but it’s not really coding – I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”

IBM followed with a formal definition:

“Vibe coding is a fresh take in coding where users express their intention using plain speech and the AI transforms that thinking into executable code.”


Vibe Coding in Biotech

In biotech, coding isn’t about building websites—it’s about running genome pipelines, training disease models, or scripting CRISPR screens. Historically, that’s meant technical depth, time, and a dedicated bioinformatics team.

But what if a scientist could simply describe the analysis in plain English and the AI handled the rest?

Thanks to LLMs, Biopython, and Colab-powered interfaces, we’re now close. The act of building has become more conversational, more iterative—more “vibey.”

Why This Matters

Biotech has long been bottlenecked by translation—the gap between idea and execution. Vibe coding changes that by:

  • Faster iteration: rapid prototyping of experiments and product ideas
  • Lower barriers: scientists can code without deep programming expertise
  • Broader access: more people can prototype, test, and scale ideas
  • Leaner teams: smaller teams can accomplish more, especially at early stages

8 Tools Rewriting the Rules of Life Sciences

1. Superbio.ai – No-Code AI Marketplace

Founded by Berke Buyukkucak and Ronjon Nag. Run cutting-edge AI tools for drug discovery, protein design, and literature review—no code needed.

Link: superbio.ai

2. Recursion’s LOWE – LLM-Orchestrated Wet Lab

Recursion’s internal tool (unveiled by Chris Gibson): describe an assay, LOWE designs and executes it via robotics using their proprietary phenomics and chemistry stack.

Link: recursion.com

3. DrBioRight 2.0 – Cancer Proteomics Chatbot

Built at MD Anderson by the Han Liang Lab. Ask questions like “Which proteins in pathway X are altered in this tumor?” and get real answers with plots.

Publication: Nature Communications (2025)

Link: drbioright.org

4. BioChatter – Open Source Bio-AI Toolkit

From EMBL-EBI. Build custom AI assistants that connect to APIs, databases, and bio tools. Fully open-source and on-prem ready.

Link: biochatter.org

5. OLAF – Conversational Bioinformatics OS

From Weill Cornell (Dylan Riffle et al.). Say “Analyze this RNA-seq file” and OLAF writes the code, runs it, and returns transparent, inspectable results.

Publication: arXiv preprint

6. TinyBio – ChatGPT for Scientists

Acquired by Seqera. Started by Sasha Dagayev and Vishal Patel in 2022. Real-time code execution supporting 50+ bio libraries with self-healing error correction.

Link: tinybio.cloud

7. Scispot (Scibot) – Lab AI Analyst

YC-backed. Their AI assistant Scibot makes lab data conversational: “Summarize this week’s PCR results” produces instant dashboards.

Link: scispot.com

8. Synthace – Conversational Wet Lab Automation

Describe experiments in plain English; AI generates protocols and sends them directly to lab robots.

Link: synthace.com

Key Takeaways

  • Vibe coding lets users build through intent, not syntax
  • In biotech, that means less friction, faster feedback, and broader access
  • These tools don’t just “assist” scientists—they enable more with less code and more creativity

BCH709 Bioinformatics Vibe Coding Lab Materials

Lab Overview

  • Audience: BCH709 Genome Informatics graduate students
  • Goal: Experience how prompt specificity transforms code quality and output
  • Core Lesson: The more specific your prompt, the closer the result to what you actually need
  • Structure: 2 Examples (Python, R) + 2 Homework Assignments (Python, R)

The Vibe Coding Workflow

Natural-language prompt → AI generates code → Execute → Inspect results → Revise prompt → Repeat

The Key Insight

“Saying exactly what you want” is the core skill. A prompt controls not only the code but also the execution environment—without that, reproducibility breaks down.


Step 0: Brainstorming and Environment Setup

Learning Objective

Before writing any code, ask the AI to brainstorm possible approaches and the tools each would require.

Step 0A. Brainstorming Prompt

Copy and paste this prompt into the AI first. This is a strategy question, not a code request.

I am a beginner student in BCH709.
Before writing any code, brainstorm the approaches and libraries I need for the following analyses.

Analysis 1 (Python, GFF3 analysis):
- Input: MGI.gff3.gz (GFF3, gzip), chrom.sizes (TSV: chrom, length_bp)
- Goal: Count genes, exons (preventing isoform overcounting), snRNAs, and lncRNAs per chromosome; compute density
- Output: TSV table + dropped_seqids.txt (QC artifact)

Analysis 2 (R, RNA-Seq TPM analysis):
- Input: iceplant_TPM_DT_ZT.tab.gz (gzip TSV, gene_id + sample TPM values)
- Goal: Select top 200 genes by CV (with mean >= 1 filter), save log2(TPM+1) heatmap
- Output: TSV + heatmap PNG

Requirements:
1) Break each analysis into functional units (input parsing, statistical computation, visualization, file output, QC).
2) For each functional unit, suggest 1–2 candidate libraries.
3) Pick one recommended combination for beginners and explain why.
4) List the exact conda-forge package names for that combination.

Why this works: the prompt forces the AI to expose its plan (functional units, candidate libraries, exact package names) before any code exists, so wrong assumptions get caught while they are still cheap to fix.

Step 0B. Environment Setup Prompt

Once brainstorming is complete, use this prompt:

Using the library combination you just recommended, generate conda environment creation commands.

Conditions:
- Python environment name: bch709-python
- R environment name: bch709-R
- Pin Python 3.11, R 4.3
- Include import/library verification tests after installation
- Present the commands in copy-paste order so a beginner can just run them one by one

Step 0C. Environment Creation Commands

Python Environment: bch709-python

# Create environment
conda create -n bch709-python -y python=3.11
conda activate bch709-python

# Install packages
conda install -c conda-forge -y \
  pandas numpy matplotlib seaborn biopython tqdm

# Verify installation
python -c "import pandas, numpy, matplotlib, Bio; print('bch709-python OK')"

# Create working directories
mkdir -p data results

R Environment: bch709-R

# Create environment
conda create -n bch709-R -y -c conda-forge \
  r-base=4.3 r-data.table r-ggplot2 r-pheatmap r-viridislite r-scales
conda activate bch709-R

# Verify installation
R -q -e 'library(data.table); library(ggplot2); library(pheatmap); cat("bch709-R OK\n")'

# Create working directories
mkdir -p results

Data Downloads

# GFF3 (Example 1, Homework 1)
curl -L -o data/MGI.gff3.gz http://www.informatics.jax.org/downloads/mgigff3/MGI.gff3.gz

# mRNA FASTA (Homework 1)
curl -L -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/mrna.fa.gz

# Ice plant TPM (Example 2, Homework 2)
git clone https://github.com/plantgenomicslab/Ice-plant-transcriptome-profiling

Including Environment in Every Prompt

From this point forward, start every prompt with environment context:

For Python:

Write Python code that runs in the bch709-python conda environment.
Assume pandas, numpy, matplotlib, and biopython are installed.

For R:

Write R code that runs in the bch709-R conda environment.
You may use data.table, ggplot2, and pheatmap.

Warning

If you don’t specify the environment, the AI will assume an arbitrary one, and its code may import packages you never installed or target versions you aren’t running.


Part 1: Vibe Coding Examples


Example 1 (Python): Per-Chromosome Feature Counts from MGI GFF3 + QC

Background

Extract chromosome-level feature counts (genes, exons, snRNAs, lncRNAs) from the MGI GFF3 file and cross-reference against an external chrom.sizes file to verify data integrity.

GFF3 file structure (9 tab-separated columns):

chr1  MGI  gene  3214482  3671498  .  -  .  ID=MGI:MGI:1918911;Name=Xkr4;biotype=protein_coding

Column 1: chromosome (seqid)
Column 2: source
Column 3: feature type (gene, mRNA, exon, snRNA, lnc_RNA, etc.)
Column 4: start position
Column 5: end position
Column 6: score
Column 7: strand (+/−)
Column 8: phase
Column 9: attributes (key=value pairs, semicolon-delimited)

chrom.sizes file structure:

chr1    195465000
chr2    182105000
chrX    171031299
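The download list earlier does not include data/chrom.sizes. One option (an assumption, not part of the course materials) is UCSC's mm39.chrom.sizes file (https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chrom.sizes), filtered down to primary chromosomes so that scaffolds and alt contigs end up in dropped_seqids.txt. A minimal sketch of such a filter (filter_primary_chroms is an illustrative name, not a required one):

```python
import io

def filter_primary_chroms(chrom_sizes_text):
    """Keep primary chromosomes (chr1..chrN, chrX, chrY) from a UCSC-style
    chrom.sizes table; scaffolds and alt contigs are dropped so the GFF3
    lines on them land in dropped_seqids.txt later."""
    rows = []
    for line in io.StringIO(chrom_sizes_text):
        chrom, length = line.split()[:2]
        suffix = chrom.removeprefix("chr")
        if suffix.isdigit() or suffix in ("X", "Y"):
            rows.append((chrom, int(length)))
    return rows

# Writing the filtered table out (paths are illustrative):
# with open("data/chrom.sizes", "w") as out:
#     for chrom, n in filter_primary_chroms(open("mm39.chrom.sizes").read()):
#         out.write(f"{chrom}\t{n}\n")
```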

Critical Design Decisions

Definitions That Must Appear in Your Prompt

  1. Exon overcounting prevention: Count unique (start, end, strand) intervals per chromosome
  2. Chromosome length source: Use external chrom.sizes file; log excluded seqids to dropped_seqids.txt
  3. snRNA/lncRNA definition: Count lines where type is snRNA, lnc_RNA, or lncRNA
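A toy illustration of decision 1: isoforms of the same gene often repeat identical exon lines in a GFF3, so counting raw exon lines inflates the total, while the unique (start, end, strand) definition does not (coordinates are made up):

```python
# Two isoforms of one gene sharing an exon (hypothetical coordinates)
exon_lines = [
    ("chr1", 100, 200, "+"),  # isoform A, exon 1
    ("chr1", 300, 400, "+"),  # isoform A, exon 2
    ("chr1", 100, 200, "+"),  # isoform B, exon 1 (same interval as A's)
    ("chr1", 300, 450, "+"),  # isoform B, exon 2 (longer 3' end)
]

naive = len(exon_lines)  # counts every line, double-counting shared exons
unique = len({(start, end, strand) for _, start, end, strand in exon_lines})

print(naive, unique)  # naive overcounts: 4 lines, 3 unique intervals
```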

Stage 1: Vague Prompt

Prompt:

“Extract the gene, exon, snRNA, and lncRNA counts per chromosome from the GFF3 file.”

AI-generated code:

import gzip
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))

with gzip.open("data/MGI.gff3.gz", "rt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        fields = line.strip().split("\t")
        if len(fields) < 9:
            continue
        chrom = fields[0]
        ftype = fields[2]
        if ftype in ("gene", "exon", "snRNA", "lncRNA"):
            counts[chrom][ftype] += 1

for chrom in sorted(counts):
    print(chrom, dict(counts[chrom]))

Problems with Stage 1

  • No chromosome length reference—scaffolds included with primary chromosomes
  • Exons double-counted across isoforms
  • Misses lnc_RNA (standard GFF3 spelling)
  • No QC tracking of dropped seqids
  • No file output

Stage 2: Specifying Output Format

Prompt:

Write Python code that runs in the bch709-python conda environment.
pandas and numpy are installed.

Input:
- GFF3 gzip: data/MGI.gff3.gz (9 columns, tab-separated)
- Chrom sizes: data/chrom.sizes (TSV: chrom, length_bp)

Task:
- Only include chromosomes present in chrom.sizes
- Save GFF seqids not found in chrom.sizes to results/dropped_seqids.txt
- Count genes per chromosome (type == "gene")
- Count exons per chromosome: to prevent isoform overcounting, define count as number of unique (start, end, strand) intervals
- Count snRNA/lncRNA: lines where type is "snRNA", "lnc_RNA", or "lncRNA"

Output:
- results/chr_feature_counts.tsv
- Columns: chrom, chrom_length_bp, n_gene, n_exon_unique, n_snRNA, n_lncRNA
- Include header
- Print top 5 rows to console

AI-generated code:

from pathlib import Path
import gzip
import pandas as pd
from collections import defaultdict

DATA_GFF  = Path("data/MGI.gff3.gz")
DATA_SIZES = Path("data/chrom.sizes")
OUT_TSV   = Path("results/chr_feature_counts.tsv")
OUT_DROP  = Path("results/dropped_seqids.txt")
OUT_TSV.parent.mkdir(parents=True, exist_ok=True)

# Load chrom sizes
sizes = pd.read_csv(DATA_SIZES, sep="\t", header=None, names=["chrom","length_bp"])
chrom_set = set(sizes["chrom"])
chrom_len = dict(zip(sizes["chrom"], sizes["length_bp"]))

n_gene = defaultdict(int)
exon_intervals = defaultdict(set)
n_snRNA = defaultdict(int)
n_lncRNA = defaultdict(int)
dropped_seqids = set()

SNRNA_TYPES  = {"snRNA"}
LNCRNA_TYPES = {"lnc_RNA", "lncRNA"}

with gzip.open(DATA_GFF, "rt") as f:
    for line in f:
        if not line or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        seqid, _, ftype, start, end, _, strand, _, _ = fields

        if seqid not in chrom_set:
            dropped_seqids.add(seqid)
            continue

        if ftype == "gene":
            n_gene[seqid] += 1
        elif ftype == "exon":
            exon_intervals[seqid].add((int(start), int(end), strand))
        elif ftype in SNRNA_TYPES:
            n_snRNA[seqid] += 1
        elif ftype in LNCRNA_TYPES:
            n_lncRNA[seqid] += 1

rows = []
for chrom in sizes["chrom"]:
    rows.append({
        "chrom": chrom,
        "chrom_length_bp": int(chrom_len[chrom]),
        "n_gene": n_gene.get(chrom, 0),
        "n_exon_unique": len(exon_intervals.get(chrom, set())),
        "n_snRNA": n_snRNA.get(chrom, 0),
        "n_lncRNA": n_lncRNA.get(chrom, 0),
    })

df = pd.DataFrame(rows)
df.to_csv(OUT_TSV, sep="\t", index=False)
OUT_DROP.write_text("\n".join(sorted(dropped_seqids)) + "\n")

print("Saved:", OUT_TSV)
print("Dropped seqids:", len(dropped_seqids))
print(df.head(5).to_string(index=False))

Improvements: TSV output, unique exon definition, dropped seqid QC

Still missing: Density not computed, no sorting, unclear zero-feature handling


Stage 3: Adding Detailed Constraints

Prompt:

Extend the Stage 2 results.

Additional output requirements:
- Add density columns:
  - gene_per_Mb = n_gene / (chrom_length_bp / 1e6)
  - exon_unique_per_Mb, snRNA_per_Mb, lncRNA_per_Mb (same formula)
- Sort by gene_per_Mb descending
- Round to 4 decimal places
- Include all chromosomes from chrom.sizes, filling counts with 0 where no features exist
- Save unique dropped seqids sorted alphabetically to dropped_seqids.txt
- Print to console: number of dropped seqids, number of excluded feature lines, top 5 rows of the result table

AI-generated code (final):

from pathlib import Path
import gzip
import pandas as pd
from collections import defaultdict

DATA_GFF  = Path("data/MGI.gff3.gz")
DATA_SIZES = Path("data/chrom.sizes")
OUT_TSV   = Path("results/chr_feature_counts.tsv")
OUT_DROP  = Path("results/dropped_seqids.txt")
OUT_TSV.parent.mkdir(parents=True, exist_ok=True)

# 1) Chrom sizes
sizes = pd.read_csv(DATA_SIZES, sep="\t", header=None, names=["chrom","length_bp"])
chrom_set = set(sizes["chrom"])
chrom_len = dict(zip(sizes["chrom"], sizes["length_bp"]))

# 2) Counters
n_gene = defaultdict(int)
exon_intervals = defaultdict(set)
n_snRNA = defaultdict(int)
n_lncRNA = defaultdict(int)
dropped_seqids = set()
dropped_lines = 0

SNRNA_TYPES  = {"snRNA"}
LNCRNA_TYPES = {"lnc_RNA", "lncRNA"}

with gzip.open(DATA_GFF, "rt") as f:
    for line in f:
        if not line or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        seqid, _, ftype, start, end, _, strand, _, _ = fields

        if seqid not in chrom_set:
            dropped_seqids.add(seqid)
            dropped_lines += 1
            continue

        if ftype == "gene":
            n_gene[seqid] += 1
        elif ftype == "exon":
            exon_intervals[seqid].add((int(start), int(end), strand))
        elif ftype in SNRNA_TYPES:
            n_snRNA[seqid] += 1
        elif ftype in LNCRNA_TYPES:
            n_lncRNA[seqid] += 1

# 3) Build result table (include all chroms from chrom.sizes; fill 0 where no features)
rows = []
for chrom in sizes["chrom"]:
    L = float(chrom_len[chrom])
    g  = n_gene.get(chrom, 0)
    ex = len(exon_intervals.get(chrom, set()))
    sn = n_snRNA.get(chrom, 0)
    ln = n_lncRNA.get(chrom, 0)
    Mb = L / 1e6 if L > 0 else 1

    rows.append({
        "chrom": chrom,
        "chrom_length_bp": int(L),
        "n_gene": g,
        "n_exon_unique": ex,
        "n_snRNA": sn,
        "n_lncRNA": ln,
        "gene_per_Mb":        round(g  / Mb, 4),
        "exon_unique_per_Mb": round(ex / Mb, 4),
        "snRNA_per_Mb":       round(sn / Mb, 4),
        "lncRNA_per_Mb":      round(ln / Mb, 4),
    })

df = pd.DataFrame(rows).sort_values("gene_per_Mb", ascending=False)
df.to_csv(OUT_TSV, sep="\t", index=False)
OUT_DROP.write_text("\n".join(sorted(dropped_seqids)) + "\n")

print(f"Saved: {OUT_TSV}")
print(f"Saved: {OUT_DROP}")
print(f"Dropped seqids: {len(dropped_seqids)}")
print(f"Dropped feature lines: {dropped_lines}")
print(df.head(5).to_string(index=False))

Sample output:

Saved: results/chr_feature_counts.tsv
Saved: results/dropped_seqids.txt
Dropped seqids: 47
Dropped feature lines: 1823

 chrom  chrom_length_bp  n_gene  n_exon_unique  n_snRNA  n_lncRNA  gene_per_Mb  ...
 chr11      122082543     2847        42156        12        45      23.3224  ...
 chr19       61431566     1892        31245         8        32      30.8024  ...
 chr17       94987271     2456        38912        10        38      25.8563  ...
  chr1      195465000     3215        52341        18        67      16.4488  ...
  chr2      182105000     2987        48562        15        58      16.4027  ...

Example 1: Comparison Summary

Aspect            | Stage 1 (Vague)    | Stage 2 (Format)    | Stage 3 (Detailed)
Chromosome scope  | everything         | chrom.sizes only    | chrom.sizes + zero-fill
Exon definition   | duplicate-counted  | unique interval     | unique interval
QC artifact       | none               | dropped_seqids.txt  | count + line count + file
Density           | none               | none                | 4 per_Mb columns
Sorting           | none               | none                | gene_per_Mb descending
Reusability       | low                | medium              | high (publication-ready)

QC Interpretation Questions

Questions Students Must Answer

  1. What seqids ended up in dropped_seqids.txt? (Alternative contigs? Unplaced scaffolds? Mitochondrial?)
  2. What fraction of total genes were dropped? Could this affect conclusions?
  3. If the prompt had NOT specified using chrom.sizes, what errors could have occurred silently?

Key Message

dropped_seqids.txt is not a failure log—it is a declaration of what this analysis accepts as data.
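One way to start answering question 1 is to bin the dropped seqids by name pattern. The patterns below assume UCSC-style naming and are only a starting point (MGI's actual seqids may differ; inspect dropped_seqids.txt first):

```python
from collections import Counter

def classify_seqid(seqid):
    """Rough bins for common non-primary seqid naming schemes (UCSC-style;
    adjust the patterns to match what actually appears in the file)."""
    if seqid in ("chrM", "MT"):
        return "mitochondrial"
    if seqid.startswith("chrUn") or seqid.endswith("_random"):
        return "unplaced/unlocalized scaffold"
    if seqid.endswith("_alt"):
        return "alternate contig"
    return "other"

# Hypothetical dropped seqids, for illustration only
dropped = ["chrM", "chrUn_GL456239", "chr4_JH584292_random", "chr11_KZ289073_alt"]
print(Counter(classify_seqid(s) for s in dropped))
```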


Example 2 (R): Top 200 Variable Genes from Ice Plant TPM + Heatmap

Background

Analyze time-course (Zeitgeber Time) TPM expression data from ice plant (Mesembryanthemum crystallinum). Extract the top 200 genes by coefficient of variation (CV) and visualize their expression patterns.

Data source: Ice-plant-transcriptome-profiling

Data structure:

  • gene_id: gene identifier (e.g., Mc01g0010)
  • DT_ZT2_rep1 through DT_ZT22_rep3: 6 time points × 3 replicates = 18 sample columns

Critical Design Decisions

  • mean >= 1 filter: genes with near-zero TPM produce infinite or extreme CV values
  • CV = sd / mean: measures relative variability independent of expression level
  • log2(TPM + 1): reduces the influence of extreme values in the heatmap
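These decisions can be checked numerically on a toy matrix (Python here for brevity; the R code below applies the same definitions with data.table):

```python
import numpy as np

tpm = np.array([
    [0.01, 0.00, 0.05],    # near-zero gene: tiny absolute spike, huge CV
    [10.0, 50.0, 90.0],    # genuinely variable gene
    [100.0, 101.0, 99.0],  # highly expressed but flat gene
])

mean = tpm.mean(axis=1)
sd = tpm.std(axis=1, ddof=1)   # sample sd, matching R's sd()
cv = sd / mean

keep = mean >= 1               # the filter removes the noise gene before ranking
print(cv.round(3), keep)
```

Without the filter, the near-zero gene ranks first by CV even though its variation is pure noise.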

Stage 1: Vague Prompt

Prompt:

“Pick the 200 most variable genes from the TPM data and draw a heatmap.”

AI-generated code:

data <- read.table("Ice-plant-transcriptome-profiling/iceplant_TPM_DT_ZT.tab.gz", header=TRUE, sep="\t", row.names=1)
vars <- apply(data, 1, var)
top200 <- head(sort(vars, decreasing=TRUE), 200)
heatmap(as.matrix(data[names(top200), ]))

Problems with Stage 1

  • Uses variance instead of CV → biased toward highly expressed genes
  • No mean >= 1 filter → noise genes in top ranks
  • No file output, no image size/resolution control
  • No log2 transformation → color scale dominated by extremes

Stage 2: Specifying Output Format

Prompt:

Write R code that runs in the bch709-R conda environment.
data.table and pheatmap are installed.

Input:
- Ice-plant-transcriptome-profiling/iceplant_TPM_DT_ZT.tab.gz (gzip TSV)
- First column: gene_id; remaining columns: numeric TPM

Task:
- Compute mean_tpm and sd_tpm for each gene
- CV = sd_tpm / mean_tpm
- Keep only genes with mean_tpm >= 1
- Select top 200 by CV descending

Output:
- Save to results/iceplant_cv_top200.tsv
- Columns: gene_id, mean_tpm, sd_tpm, cv
- Round to 4 decimal places
- Print top 10 to console

AI-generated code:

library(data.table)

dir.create("results", showWarnings = FALSE, recursive = TRUE)

f <- "Ice-plant-transcriptome-profiling/iceplant_TPM_DT_ZT.tab.gz"
dt <- fread(f)
stopifnot("gene_id" %in% names(dt))

sample_cols <- setdiff(names(dt), "gene_id")
mat <- as.matrix(dt[, ..sample_cols])
mode(mat) <- "numeric"

mean_tpm <- rowMeans(mat, na.rm = TRUE)
sd_tpm   <- apply(mat, 1, sd, na.rm = TRUE)

keep <- mean_tpm >= 1
res <- data.table(
  gene_id  = dt$gene_id[keep],
  mean_tpm = mean_tpm[keep],
  sd_tpm   = sd_tpm[keep]
)
res[, cv := sd_tpm / mean_tpm]
setorder(res, -cv)

top200 <- res[1:min(200, .N)]
top200[, `:=`(
  mean_tpm = round(mean_tpm, 4),
  sd_tpm   = round(sd_tpm, 4),
  cv       = round(cv, 4)
)]

fwrite(top200, "results/iceplant_cv_top200.tsv", sep = "\t")
print(top200[1:min(10, .N)])
cat("Saved: results/iceplant_cv_top200.tsv\n")

Improvements: Explicit CV definition, mean >= 1 filter, TSV output

Still missing: No heatmap generated, no plot specifications


Stage 3: Adding Detailed Constraints

Prompt:

Using the top 200 genes from Stage 2, generate a heatmap.

Heatmap data:
- Values: log2(TPM + 1)
- Rows: gene_id (maintain CV descending order, cluster_rows = FALSE)
- Columns: original sample column order (cluster_cols = FALSE)

Plot specifications:
- Save to: results/iceplant_cv_top200_heatmap.png
- Size: 1800 × 1200 px, dpi 200
- Rotate x-axis labels 90 degrees
- Title: "Ice plant log2(TPM+1), CV top200 (mean>=1)"
- Also print top 10 summary table to console

AI-generated code (final):

library(data.table)
library(pheatmap)

dir.create("results", showWarnings = FALSE, recursive = TRUE)

f <- "Ice-plant-transcriptome-profiling/iceplant_TPM_DT_ZT.tab.gz"
dt <- fread(f)
stopifnot("gene_id" %in% names(dt))

sample_cols <- setdiff(names(dt), "gene_id")
mat <- as.matrix(dt[, ..sample_cols])
mode(mat) <- "numeric"

mean_tpm <- rowMeans(mat, na.rm = TRUE)
sd_tpm   <- apply(mat, 1, sd, na.rm = TRUE)

keep <- mean_tpm >= 1
res <- data.table(
  gene_id  = dt$gene_id[keep],
  mean_tpm = mean_tpm[keep],
  sd_tpm   = sd_tpm[keep]
)
res[, cv := sd_tpm / mean_tpm]
setorder(res, -cv)
top200 <- res[1:min(200, .N)]

# Summary TSV
top200_out <- copy(top200)
top200_out[, `:=`(
  mean_tpm = round(mean_tpm, 4),
  sd_tpm   = round(sd_tpm, 4),
  cv       = round(cv, 4)
)]
fwrite(top200_out, "results/iceplant_cv_top200.tsv", sep = "\t")

# Heatmap matrix
idx <- match(top200$gene_id, dt$gene_id)
submat <- mat[idx, , drop = FALSE]
submat <- log2(submat + 1)
rownames(submat) <- top200$gene_id

# Save (1800×1200 px, dpi 200)
png("results/iceplant_cv_top200_heatmap.png",
    width = 1800, height = 1200, res = 200)
pheatmap(
  submat,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  fontsize_col = 6,
  fontsize_row = 3,
  angle_col = 90,
  main = "Ice plant log2(TPM+1), CV top200 (mean>=1)"
)
dev.off()

print(top200_out[1:min(10, .N)])
cat("Saved: results/iceplant_cv_top200.tsv\n")
cat("Saved: results/iceplant_cv_top200_heatmap.png\n")

Example 2: Comparison Summary

Aspect              | Stage 1 (Vague) | Stage 2 (Format) | Stage 3 (Detailed)
Variability metric  | variance        | CV (sd/mean)     | CV (sd/mean)
Filtering           | none            | mean >= 1        | mean >= 1
Data transformation | none            | none             | log2(TPM+1)
File output         | none            | TSV              | TSV + PNG (size/dpi specified)
Reusability         | low             | medium           | high

Interpretation Points

Key Insights

  • Without mean >= 1 filter, genes with near-zero expression but a single spike dominate top ranks
  • CV captures relative variability independent of absolute expression → fair comparison across expression levels
  • log2(TPM+1) transformation equalizes heatmap color distribution, making patterns visible

Part 2: Homework Assignments


Homework 1 (Python): mRNA FASTA Analysis + GC Distribution Graph

Problem Description

Extract sequence information from the UCSC human mRNA FASTA file (mrna.fa.gz), analyze GC content distribution, and produce a graph and an HTML report page.

Objective

Write a single prompt using vibe coding that produces the desired result in one shot. The goal is to get all outputs correct from a single, well-crafted prompt.

Input Data

curl -L -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/mrna.fa.gz

FASTA file structure:

>BC032353 /gb=BC032353 /gi=34783011 /ug=Hs.444613 /len=1873
CGGCAGCGGCTGCGGGGAGGATGGCGGCGACGGCGACTTTTAAATATATTGG...
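A minimal sketch of the gzip streaming parse that the grading criteria reward, assuming headers shaped like the example above (gc_metrics and its output tuple layout are illustrative choices, not required names):

```python
import gzip

def gc_metrics(lines):
    """Stream FASTA records, yielding (accession, length, gc_fraction).
    The accession is the first whitespace-delimited token of the header,
    e.g. 'BC032353' from '>BC032353 /gb=BC032353 ...'."""
    def finish(acc, chunks):
        seq = "".join(chunks)
        gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        return acc, len(seq), round(gc, 4)

    acc, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if acc is not None:
                yield finish(acc, chunks)
            acc, chunks = line[1:].split()[0], []
        elif acc is not None:
            chunks.append(line.upper())
    if acc is not None:
        yield finish(acc, chunks)

# Streaming over the real file (never loads it fully into memory):
# with gzip.open("mrna.fa.gz", "rt") as fh:
#     records = list(gc_metrics(fh))
```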

Expected Output

  • results/mrna_metrics.tsv: accession, length, gc_content (4 decimals, sorted by gc_content descending, top 20)
  • results/gc_content_distribution.png: histogram + density curve (1600×900 px, dpi 200)
  • results/gc_content_distribution.svg: same graph in SVG format
  • docs/gc_content_distribution.html: title + summary statistics + embedded PNG

Plot specifications:

Chart Selection Discussion (In-Class Activity)

I want to show the distribution of gc_content values.
Compare histogram, density curve, and ECDF — which is most appropriate?

My situation:
- Continuous values between 0 and 1, very large sample size (tens of thousands)
- I want to convey distribution shape, central tendency (mean, median), and tails
- Must be reproducible at publication-quality level

For each option, give pros and cons, then recommend your top pick with justification.

Reference: R Graph Gallery — Distribution section

Grading Criteria

  • Prompt quality (40%): input format, accession parsing, gzip handling, output specs (filename, columns, decimals, sorting, graph size/color/font)
  • Code correctness (40%): correct gzip streaming parse, accurate computation, all 4 files generated
  • Result interpretation (20%): explain 3 possible biological/technical reasons for high top-20 GC content values

Homework 2 (R): Z-Score Clustering of CV Top 200 Genes + Pattern Visualization

Problem Description

Using the 200 genes from results/iceplant_cv_top200.tsv (from Example 2):

  1. Z-score normalize (row-wise)
  2. Hierarchical clustering (ward.D2 method, euclidean distance)
  3. Cut tree at k=4 to assign clusters
  4. Save cluster mean expression pattern line plots
  5. Save cluster assignment table as TSV

Objective

Practice writing prompts that precisely control R visualization output.

Input Data

# TSV from Example 2 + original TPM data
# results/iceplant_cv_top200.tsv (gene_id list)
# Ice-plant-transcriptome-profiling/iceplant_TPM_DT_ZT.tab.gz (original TPM)

Expected Output

1. Clustered Heatmap: results/cv_top200_cluster_heatmap.pdf

  • Data: mean TPM per ZT → row-wise Z-score
  • Z-score: (value − row_mean) / row_sd
  • Rows: gene_id (hierarchical clustering, ward.D2, euclidean)
  • Columns: ZT2, ZT6, ZT10, ZT14, ZT18, ZT22 (chronological, cluster_cols = FALSE)
  • Colors: blue (low) → white (0) → red (high)
  • Annotation: k=4 cutree result as a color bar
  • Size: 8 × 12 inches

2. Cluster Mean Pattern Line Plot: results/cluster_patterns.pdf

  • Layout: 2×2 panels (facet_wrap)
  • Y-axis: mean TPM ± SD
  • X-axis: ZT (2, 6, 10, 14, 18, 22)
  • Panel titles: “Cluster 1 (n=XX genes)”
  • Size: 10 × 8 inches

3. Assignment Table: results/cluster_assignment.tsv

  • gene_id: gene identifier
  • cluster: 1–4
  • peak_ZT: ZT with highest mean TPM
  • trough_ZT: ZT with lowest mean TPM
  • amplitude: max − min (2 decimal places)

Sort by cluster ascending, then amplitude descending within cluster.
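The peak_ZT, trough_ZT, and amplitude columns follow mechanically from the gene × ZT mean matrix; a sketch with hypothetical values:

```python
import numpy as np

zts = [2, 6, 10, 14, 18, 22]
mean_tpm = np.array([
    [5.0, 20.0, 80.0, 40.0, 10.0, 5.0],   # hypothetical day-peaking gene
    [60.0, 30.0, 10.0, 8.0, 25.0, 55.0],  # hypothetical night-peaking gene
])

rows = []
for gene_row in mean_tpm:
    rows.append((
        zts[int(gene_row.argmax())],                        # peak_ZT
        zts[int(gene_row.argmin())],                        # trough_ZT
        round(float(gene_row.max() - gene_row.min()), 2),   # amplitude
    ))
print(rows)
```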

Prompt-Writing Hints

Analysis procedure:
1. From iceplant_TPM_DT_ZT.tab.gz, extract only the top 200 genes (by gene_id list)
2. Parse ZT time point from sample names (regex: digits after "ZT")
3. Average the 3 replicates per ZT → gene × ZT matrix (200 rows × 6 columns)
4. Z-score normalize: for each row (gene), compute (value - mean) / sd
5. Hierarchical clustering: dist(euclidean) → hclust(ward.D2)
6. cutree(k=4) to assign 4 clusters

For the heatmap:
- Use pheatmap or ComplexHeatmap
- Show cluster assignment as annotation_row color bar
- Columns (ZT) in chronological order (cluster_cols = FALSE)

For the line plot:
- Use original TPM values (no log transformation) for mean ± SD
- ggplot2 facet_wrap(~cluster, ncol=2)
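For orientation only, here is how the hint steps map onto Python's scipy (the assignment itself asks for R; scipy's "ward" linkage on euclidean distances corresponds to R's hclust(method = "ward.D2"), and the matrix below is random placeholder data):

```python
import re
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hint step 2: parse the ZT time point from a sample name
zt = int(re.search(r"ZT(\d+)", "DT_ZT10_rep2").group(1))  # 10

rng = np.random.default_rng(0)
mat = rng.gamma(2.0, 10.0, size=(200, 6))  # placeholder 200-gene x 6-ZT means

# Hint step 4: row-wise Z-score, (value - row_mean) / row_sd
z = (mat - mat.mean(axis=1, keepdims=True)) / mat.std(axis=1, ddof=1, keepdims=True)

# Hint steps 5-6: euclidean distance + Ward linkage, then cut into k=4 clusters
link = linkage(z, method="ward", metric="euclidean")
clusters = fcluster(link, t=4, criterion="maxclust")  # ~ cutree(k=4)

print(sorted(set(clusters)))
```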

Grading Criteria

  • Prompt quality (40%): Z-score definition, clustering method (ward.D2, euclidean), k=4, replicate-to-mean procedure, output specs
  • Code correctness (40%): accurate normalization, clustering, cutree, summary statistics, all 3 files generated
  • Result interpretation (20%): interpret 4 cluster patterns in context of CAM photosynthesis (2 sentences per cluster)

Appendix: Effective Vibe Coding Prompt Template

[Environment]: Write [language] code that runs in the [env name] conda environment.
[Installed packages] are available.

**Input specification:**
- File: [filename] ([format], [delimiter], [gzip?])
- Structure: [column descriptions, special parsing rules]
- Additional inputs: [chrom.sizes, gene lists, or other reference files]

**Analysis conditions:**
- [Filter criteria (e.g., mean >= 1)]
- [Computation method (e.g., CV = sd/mean)]
- [Definitions (e.g., exon count = unique intervals only)]

**Output 1 — Table:**
- Filename: [filename]
- Columns: [list column names]
- Decimal places: [number]
- Sorting: [criterion, direction]
- Filter: [top N]

**Output 2 — Plot:**
- Filename: [filename]
- Size: [px or inches], dpi: [value]
- Colors: [specify explicitly]
- Axes/labels/legend: [specify explicitly]

**Output 3 — QC:**
- [Dropped items filename]
- [Summary information to print to console]

Input/Output Prompt Checklist

When Specifying Input

  • Filename and path
  • File format (GFF3, TSV, FASTA, etc.)
  • Compression (gzip or not)
  • Delimiter (tab, comma, space)
  • Header presence
  • Data structure (column names, what rows represent)
  • Special structures (e.g., GFF3 attribute parsing rules)
  • External reference files (chrom.sizes, etc.)

When Specifying Output

  • File format (TSV, CSV, PDF, PNG, SVG, HTML)
  • Filename
  • Column names and order
  • Decimal places
  • Sorting criterion (ascending/descending)
  • Filter conditions (top N, minimum threshold, etc.)
  • Plot: size, resolution, colors, font, legend position, axis range
  • QC artifacts (dropped items, summary statistics)

When Specifying Analysis Definitions

  • Metric definitions (CV = sd/mean, Z-score = (x−mean)/sd)
  • Filter rules (mean >= 1)
  • Deduplication handling (unique intervals, etc.)
  • Transformation methods (log2(TPM+1))
  • Clustering parameters (method, distance metric, k)
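Each of these definitions is a one-liner; spelling them out in the prompt is what prevented the variance-for-CV substitution seen in Example 2's Stage 1. Worked on a tiny example:

```python
import math

values = [2.0, 4.0, 6.0]
mean = sum(values) / len(values)
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))  # sample sd

cv = sd / mean                                  # CV = sd / mean
z = [(v - mean) / sd for v in values]           # Z-score = (x - mean) / sd
log2p1 = [math.log2(v + 1) for v in values]     # log2(TPM + 1) transform

print(cv, z)  # 0.5 [-1.0, 0.0, 1.0]
```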

Data Sources