Source Attribution
This content is adapted from "When Vibe Coding Meets Life Science" by Gozde Eskici, Ph.D. (The Second Translation newsletter, April 14, 2025)
Table of Contents
- What is Vibe Coding?
- Vibe Coding in Bioinformatics
- 8 Tools Rewriting the Rules
- Setting Up VS Code with AI Assistants
- BCH709 Lab Materials
- Step 0: Project Setup (GitHub Repo)
- Step 0A: Brainstorming (Python)
- Step 0A: Brainstorming (R)
- Step 0B: How to Ask AI to Set Up Your Environment
- Step 0C: Environment Creation Commands
- Step 0D: Research Project Design (Advanced)
- Telling AI About Your Environment
- How to Write Effective Prompts
- Part 1: Vibe Coding Examples
- Part 2: Homework Assignments
- Appendix: Prompt Templates
- Input/Output Prompt Checklist
What is Vibe Coding?
On February 2nd, 2025, Andrej Karpathy, one of the most influential voices in AI, introduced a new term:
"There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard... I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works."
IBM followed with a formal definition:
"Vibe coding is a fresh take in coding where users express their intention using plain speech and the AI transforms that thinking into executable code."
Vibe Coding in Bioinformatics
In bioinformatics, coding isn't about building websites; it's about running genome pipelines, analyzing RNA-seq data, or scripting variant-calling workflows. Historically, that's meant technical depth, time, and a dedicated computational team.
But what if a scientist could just say:
- "Compare these RNA-seq datasets."
- "Predict disease progression from this clinical data."
- "Simulate how this protein binds ligands."
...and the AI handles the rest?
Thanks to LLMs, Biopython, and Colab-powered interfaces, we're now close. The act of building has become more conversational, more iterative, more "vibey."
Why This Matters
Bioinformatics has long been bottlenecked by translation: the gap between biological question and computational answer. Vibe coding changes that by:
| Benefit | Description |
|---|---|
| Faster iteration | Rapid prototyping of experiments and product ideas |
| Lower barriers | Scientists can code without deep programming expertise |
| Broader access | More people can prototype, test, and scale ideas |
| Leaner teams | Smaller teams can accomplish more, especially at early stages |
8 Tools Rewriting the Rules of Life Sciences
1. Superbio.ai – No-Code AI Marketplace
Founded by Berke Buyukkucak and Ronjon Nag. Run cutting-edge AI tools for drug discovery, protein design, and literature review: no code needed.
Link: superbio.ai
2. Recursion's LOWE – LLM-Orchestrated Wet Lab
Recursion's internal tool (unveiled by Chris Gibson): describe an assay, and LOWE designs and executes it via robotics using their proprietary phenomics and chemistry stack.
Link: recursion.com
3. DrBioRight 2.0 – Cancer Proteomics Chatbot
Built at MD Anderson by the Han Liang Lab. Ask questions like "Which proteins in pathway X are altered in this tumor?" and get real answers with plots.
Publication: Nature Communications (2025)
Link: drbioright.org
4. BioChatter – Open Source Bio-AI Toolkit
From EMBL-EBI. Build custom AI assistants that connect to APIs, databases, and bio tools. Fully open-source and on-prem ready.
Link: biochatter.org
5. OLAF – Conversational Bioinformatics OS
From Weill Cornell (Dylan Riffle et al.). Say "Analyze this RNA-seq file" and OLAF writes the code, runs it, and returns transparent, inspectable results.
Publication: arXiv
6. TinyBio – ChatGPT for Scientists
Acquired by Seqera. Started by Sasha Dagayev and Vishal Patel in 2022. Real-time code execution supporting 50+ bio libraries with self-healing error correction.
Link: tinybio.cloud
7. Scispot (Scibot) – Lab AI Analyst
YC-backed. Their AI assistant Scibot makes lab data conversational: "Summarize this week's PCR results" produces instant dashboards.
Link: scispot.com
8. Synthace – Conversational Wet Lab Automation
Describe experiments in plain English; AI generates protocols and sends them directly to lab robots.
Link: synthace.com
Key Takeaways
- Vibe coding lets users build through intent, not syntax
- In bioinformatics, that means less friction, faster feedback, and broader access
- These tools don't just "assist" scientists; they enable more with less code and more creativity
Setting Up VS Code with AI Coding Assistants
VS Code is the recommended editor for vibe coding. By installing AI extensions, you turn it into a conversational coding environment where you can write prompts, generate code, and iterate, all in one place.
AI Coding Assistants

VS Code extensions:
- Claude (Anthropic)
- GitHub Copilot (OpenAI)
- Gemini Code Assist (Google)

Web-based:
- ChatGPT / Codex (OpenAI)

All of them follow the same loop: your prompt in plain English → AI-generated Python / R code → execute & inspect results.
Install Visual Studio Code
Windows (WSL)
- Download VS Code from https://code.visualstudio.com/
- Install on Windows (not inside WSL)
- Install the WSL extension in VS Code
- Open a WSL terminal and type code . to launch VS Code connected to WSL
macOS
- Download VS Code from https://code.visualstudio.com/
- Move to Applications folder
- Open VS Code, press Cmd+Shift+P, and type "Shell Command: Install 'code' command in PATH"
- Now you can use code . from Terminal
For detailed VS Code configuration with conda environments, see the Software Installation lesson.
Extension 1: Claude (Anthropic)
Claude provides a chat panel and inline code generation powered by Anthropic's Claude models. It excels at understanding large code contexts and following detailed instructions.
Install:
$ code --install-extension anthropic.claude-code
Setup:
- Open VS Code and click the Claude icon in the sidebar
- Sign in with your Anthropic account or enter an API key from console.anthropic.com
- Start a chat and paste your prompt
Claude Code (CLI Alternative)
Claude is also available as a command-line tool for terminal-based workflows:
$ npm install -g @anthropic-ai/claude-code
$ claude

This is useful for working directly in the terminal without VS Code.
Extension 2: GitHub Copilot (OpenAI Codex)
GitHub Copilot provides real-time inline autocomplete suggestions as you type. Copilot Chat adds a conversational chat panel for writing prompts.
Install:
$ code --install-extension GitHub.copilot
$ code --install-extension GitHub.copilot-chat
Setup:
- You need a GitHub account
- Open VS Code and sign in to GitHub when prompted
- Copilot starts suggesting code automatically as you type; press Tab to accept
Free for Students
GitHub Copilot is free for verified students through GitHub Education. Apply with your university email (.edu) to get access.
Extension 3: Gemini Code Assist (Google)
Gemini Code Assist provides AI-powered code generation and a chat panel, backed by Googleβs Gemini models.
Install:
$ code --install-extension Google.geminicodeassist
Setup:
- Open VS Code and click the Gemini icon in the sidebar
- Sign in with your Google account
- Start a chat; a free usage tier is available for individual developers
ChatGPT and Codex (OpenAI) – Web-Based Alternative
You don't need VS Code to do vibe coding. ChatGPT and Codex are web-based tools by OpenAI that let you write prompts and generate code directly in the browser.
ChatGPT:
- Go to chatgpt.com and sign in with an OpenAI account
- Paste your prompt; ChatGPT generates code you can copy into your editor or terminal
- Free tier available; Plus subscription unlocks GPT-4o and longer context
Codex (OpenAI):
- Available at chatgpt.com/codex
- Specialized for code generation tasks
- Can execute code in a sandboxed environment and return results
- Requires ChatGPT Plus or Pro subscription
When to Use Web-Based Tools vs. VS Code Extensions
| Use Case | Recommended Tool |
|---|---|
| Quick one-off code generation | ChatGPT (web) |
| Iterating on code in a project | VS Code + Claude / Copilot / Gemini |
| Running code in a sandboxed cloud environment | Codex (web) |
| Working on HPC cluster via terminal | Claude Code (CLI) |
How AI Assistants Fit into the Vibe Coding Workflow
Step 1: Write prompt in chat panel
  → Step 2: AI generates code in editor
  → Step 3: Run code in terminal or notebook
  → Step 4: Check output and revise prompt
  → back to Step 1: iterate until correct
Typical Session (VS Code)
- Open VS Code with your conda environment active
- Open the AI chat panel (Claude, Copilot, or Gemini)
- Paste your structured prompt (environment + input + task + output specs)
- Review the generated code, then click "Insert at Cursor" or copy it into a .py/.R file
- Run the script in the integrated terminal
- Inspect results; refine the prompt if needed
Typical Session (Web-Based: ChatGPT / Codex)
- Open chatgpt.com or chatgpt.com/codex in your browser
- Paste your structured prompt
- Copy the generated code into your local editor or terminal
- Run the script in your conda environment: conda activate bch709_vibe_coding && python script.py
- Inspect results; return to ChatGPT and refine the prompt if needed
Comparison: AI Coding Assistants
| Feature | Claude | GitHub Copilot | Gemini Code Assist | ChatGPT / Codex |
|---|---|---|---|---|
| Provider | Anthropic | GitHub / OpenAI | Google | OpenAI |
| Type | VS Code extension + CLI | VS Code extension | VS Code extension | Web-based |
| Authentication | API key or Anthropic account | GitHub account | Google account | OpenAI account |
| Free for students | Usage-based pricing | Free via GitHub Education | Free tier available | Free tier (GPT-4o mini) |
| Inline autocomplete | Yes | Yes | Yes | N/A (web) |
| Chat panel | Yes | Yes | Yes | Yes (browser) |
| Code execution | Via terminal | Via terminal | Via terminal | Codex sandbox |
| Best for | Detailed prompts; multi-file context | Real-time autocomplete | Google Cloud integration | Quick generation; no setup |
Quick Install: All VS Code Extensions
Install all extensions in one command:
$ code --install-extension anthropic.claude-code && \
code --install-extension GitHub.copilot && \
code --install-extension GitHub.copilot-chat && \
code --install-extension Google.geminicodeassist
Which One Should I Use for BCH709?
You can use any of these AI assistants for the lab exercises and homework. The prompts in this lesson are written in plain English and work with all AI coding tools, VS Code extensions and web-based tools alike.
Recommendation: Try multiple tools during the semester and compare the results. Different AI models produce different code for the same prompt; that's part of the learning experience.
BCH709 Bioinformatics Vibe Coding Lab Materials
Lab Overview
- Audience: BCH709 Genome Informatics graduate students
- Goal: Experience how prompt specificity transforms code quality and output
- Core Lesson: The more specific your prompt, the closer the result is to what you actually need
- Structure: 2 Examples (Python, R) + 2 Homework Assignments (Python, R)
The Vibe Coding Workflow
Natural-language prompt → AI generates code → Execute → Inspect results → Revise prompt → Repeat
The Key Insight
"Saying exactly what you want" is the core skill. A prompt controls not only the code but also the execution environment; without that, reproducibility breaks down.
Step 0: Project Setup
Learning Objective
Before writing any code, create a GitHub repository for your project, then ask the AI about possible approaches and required tools.
Create a GitHub Repository
Start by creating a new repository on GitHub to keep your work organized and version-controlled.
1. Log in to GitHub CLI
First, make sure you are logged in to GitHub from the command line. You only need to do this once.
$ gh auth login
Follow the prompts:
- Select GitHub.com
- Select HTTPS as the protocol
- Select Login with a web browser
- Copy the one-time code, press Enter, and paste it in the browser window that opens
- Authorize the GitHub CLI
To verify you are logged in:
$ gh auth status
github.com
✓ Logged in to github.com account your-username
2. Create a Project Directory
# Create a new directory for your vibe coding project
$ mkdir ~/bch709_vibe_coding
# Move into the directory
$ cd ~/bch709_vibe_coding
3. Initialize Git and Create the GitHub Repo
# Initialize a git repository
$ git init
# Create a README file so the repo is not empty
$ echo "# BCH709 Vibe Coding" > README.md
# Stage and make the first commit
$ git add README.md
$ git commit -m "Initial commit"
# Create the repo on GitHub and push
$ gh repo create bch709_vibe_coding --public --source=. --remote=origin --push
✓ Created repository your-username/bch709_vibe_coding on GitHub
✓ Added remote origin
✓ Pushed commits to origin/main
4. Verify Everything Worked
# Check that the remote is set up
$ git remote -v
origin https://github.com/your-username/bch709_vibe_coding.git (fetch)
origin https://github.com/your-username/bch709_vibe_coding.git (push)
You can also visit https://github.com/your-username/bch709_vibe_coding in your browser to see the repo.
5. Create Project Folders
Set up a directory structure for your data, scripts, and results:
$ mkdir -p data results scripts
~/bch709_vibe_coding/
├── README.md
├── data/      ← input files (GFF3, chrom.sizes, expression data, etc.)
├── results/   ← output files (TSV, PNG, PDF)
└── scripts/   ← your Python and R scripts
6. Save the Project Structure to GitHub
# Stage the new folders
$ git add -A
# Commit
$ git commit -m "Add project folder structure"
# Push to GitHub
$ git push
Why Start with a GitHub Repo?
- All your code, data, and results stay in one place
- You can track changes and revert mistakes with git log and git diff
- AI assistants like Claude Code can read your project structure via CLAUDE.md
- You can submit your homework by sharing the repo link
Saving Your Work
After making changes, save them to GitHub:
$ git add -A
$ git commit -m "Describe what you changed"
$ git push

Do this regularly: after finishing each analysis step or before closing your terminal.
From here, all commands assume you are working inside ~/bch709_vibe_coding.
Step 0A. Brainstorming Prompt (Analysis 1 – Python)
Copy and paste this prompt into the AI first. This is a strategy question, not a code request.
I am a beginner student in BCH709.
Before writing any code, brainstorm the approaches and libraries I need for the following analysis.
Analysis (Python, GFF3 analysis):
- Input: saccharomyces_cerevisiae.gff.gz (GFF3, gzip), chrom.sizes (TSV: chrom, length_bp)
- Goal: Count genes, exons (preventing isoform overcounting), tRNAs, and snoRNAs per chromosome; compute density
- Output: TSV table + dropped_seqids.txt (QC artifact)
Requirements:
1) Break the analysis into functional units (input parsing, statistical computation, visualization, file output, QC).
2) For each functional unit, suggest 1β2 candidate libraries.
3) Pick one recommended combination for beginners and explain why.
4) List the exact conda-forge package names for that combination.
5) Provide the conda environment creation command (environment name, Python version, and all packages in one command).
6) (Optional) Provide import verification commands to confirm all packages installed correctly.
Step 0A. Brainstorming Prompt (Analysis 2 – R)
Now do the same for the R analysis. Copy and paste this prompt into a new conversation (or continue the same one).
I am a beginner student in BCH709.
Before writing any code, brainstorm the approaches and libraries I need for the following analysis.
Analysis (R, Yeast stress response expression analysis):
- Input: gasch2000.txt (TSV, gene_id + log2 expression ratios across ~170 stress conditions)
- Goal: Select top 200 genes by CV, generate a heatmap of stress response patterns
- Output: TSV + heatmap PNG
Requirements:
1) Break the analysis into functional units (input parsing, statistical computation, visualization, file output, QC).
2) For each functional unit, suggest 1β2 candidate libraries.
3) Pick one recommended combination for beginners and explain why.
4) List the exact conda-forge package names for that combination.
5) Provide the conda environment creation command (environment name, R version, and all packages in one command).
6) (Optional) Provide library() verification commands to confirm all packages installed correctly.
Why Brainstorming First?
- You don't need to memorize package names; the AI suggests them
- The AI produces a structured "function → package" mapping you can review
- The conda install commands follow naturally in the next step (Step 0B)
Step 0B. How to Ask AI to Set Up Your Environment
Learning Objective
Instead of memorizing conda commands, learn to describe what you need and let the AI generate the installation plan for you.
Setting up a conda environment is a three-step process:
Step 1: Tell the AI what you want to do (Step 0A – Brainstorming)
  ↓
Step 2: Ask the AI to generate install commands (Step 0B – Environment Prompt)
  ↓
Step 3: Copy-paste and run the commands (Step 0C – Install)
Key Idea
You already told the AI what analysis you want to do in Step 0A (brainstorming). The AI knows which libraries it recommended; you don't need to list them again. Simply ask: "Based on what you recommended, give me the install commands."
Step 0B. Environment Setup Prompt (Analysis 1 – Python)
Once brainstorming is complete, use this prompt:
Using the library combination you just recommended, generate conda environment creation commands.
Conditions:
- Conda environment name: bch709_vibe_coding
- Pin Python 3.11
- Include import/library verification tests after installation
- Present the commands in copy-paste order so a beginner can just run them one by one
Step 0B. Environment Setup Prompt (Analysis 2 – R)
Once brainstorming is complete, use this prompt:
Using the library combination you just recommended, generate conda environment creation commands.
Conditions:
- Conda environment name: bch709_vibe_coding
- Pin R 4.3
- Install R packages into the SAME environment (bch709_vibe_coding) that already has Python
- Include library() verification tests after installation
- Present the commands in copy-paste order so a beginner can just run them one by one
One Environment for Everything
We use a single conda environment (bch709_vibe_coding) that matches the GitHub repo name. This keeps things simple: one project, one environment, one name.
Step 0C. Environment Creation Commands
Steps 0A and 0B were exercises in learning to ask the AI for environment recommendations. The commands below are a reference in case the AI output differs from what you need.
Important
The commands below are what the AI typically produces. Your results may vary slightly depending on which AI assistant you use; that's fine as long as the verification step passes.
Create the Environment (Python + R)
# Create environment with Python and R together
conda create -n bch709_vibe_coding -y -c conda-forge \
python=3.11 r-base=4.3 \
pandas numpy matplotlib seaborn biopython tqdm \
r-data.table r-ggplot2 r-pheatmap r-viridislite r-scales
# Activate
conda activate bch709_vibe_coding
# Verify Python packages
python -c "import pandas, numpy, matplotlib, seaborn, Bio, tqdm; print('Python OK')"
# Verify R packages
R -q -e 'library(data.table); library(ggplot2); library(pheatmap); library(viridisLite); library(scales); cat("R OK\n")'
Troubleshooting
If the verification step fails:
- Check that you activated the correct environment: conda activate bch709_vibe_coding
- Re-run the install command; packages sometimes fail to download on the first try
- Ask the AI: "I got this error when verifying: [paste error]. How do I fix it?"
Export Your Environment to YAML
After creating and verifying your environment, export it:
conda env export -n bch709_vibe_coding > environment.yml
This creates an environment.yml file listing every package and its exact version:
name: bch709_vibe_coding
channels:
- conda-forge
dependencies:
- python=3.11.x
- pandas=2.x.x
- numpy=1.x.x
- r-base=4.3.x
- r-data.table=1.x.x
# ... all packages with pinned versions
Why Export to YAML?
| Reason | Explanation |
|---|---|
| Reproducibility | Anyone can recreate your exact environment with conda env create -f environment.yml |
| Version control | Commit environment.yml to Git; if a package update breaks your code, you can roll back |
| AI context | AI assistants can read this file to know exactly what packages (and versions) you have |
| Collaboration | Share with classmates or your instructor so everyone runs the same software |
| Recovery | If your environment gets corrupted, rebuild it in one command instead of reinstalling packages one by one |

Update the YAML whenever you install new packages:
conda env export -n bch709_vibe_coding > environment.yml
git add environment.yml && git commit -m "Update environment" && git push
Data Downloads
# Yeast GFF3 from SGD (Example 1, Homework 1)
curl -L -o data/saccharomyces_cerevisiae.gff.gz http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff.gz
# Yeast chromosome sizes from UCSC sacCer3 (Example 1)
curl -L -o data/chrom.sizes https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes
# Yeast mRNA FASTA (Homework 1)
curl -L -o data/mrna.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/mrna.fa.gz
# Yeast genome FASTA (optional reference; not required for examples/homework)
curl -L -o data/sacCer3.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz
# Yeast stress response expression data - Gasch et al. (2000) (Example 2, Homework 2)
curl -L -o data/gasch2000.txt https://www.shackett.org/files/gasch2000.txt
Set Up AI Configuration Files
Now that your environment is installed, have the AI create configuration files so every AI assistant knows your setup before you ask it to write code.
# Make sure you're in your project directory with the environment active
cd ~/bch709_vibe_coding
conda activate bch709_vibe_coding
# Open VS Code in the current directory
code .
In VS Code, open the Claude extension chat panel and type /init. This generates a CLAUDE.md file describing your project.
Then ask Claude:
Look at my project and conda environment (bch709_vibe_coding).
Create the following configuration files for my project:
1. .github/copilot-instructions.md – for GitHub Copilot
2. GEMINI.md – for Gemini Code Assist (unofficial, for copy-paste)
3. CODEX.md – for ChatGPT/Codex (unofficial, for copy-paste)
Each file should describe my conda environment, installed packages,
project structure, and coding constraints.
Why Do This Now?
From this point on, every time you ask an AI to write code, it will read these files first and know exactly what packages you have. No more "ModuleNotFoundError" surprises.
Step 0D. Research Project Design Prompt (Advanced)
When starting a new bioinformatics research project, use this prompt to systematically explore analytical directions before writing any code. This is especially useful for thesis projects, grant proposals, or novel research questions.
Learning Objective
Design a comprehensive analysis framework by exploring multiple analytical directions grounded in existing literature.
The Research Project Design Prompt
Copy and customize the following prompt. Replace [Research Question] with your specific question.
Design a bioinformatics analysis framework to address the following research question.
[Research Question]
[YOUR RESEARCH QUESTION HERE - e.g., "How do circadian-regulated genes in CAM plants differ from C3 plants at the regulatory level?"]
Your task is NOT to propose a finalized pipeline, but to explore and structure multiple analytical directions, explicitly grounded in existing literature.
Instructions:
A. Distinct Analytical Directions
1. Propose at least FIVE analysis directions that are clearly differentiated from commonly used or expected approaches in this field.
2. Each direction should represent a distinct analytical framing or inferential perspective, not a minor methodological variation.
B. Structured Evaluation of Each Direction
For EACH proposed analysis direction, provide the following in a clearly labeled structure:
1. Core idea
- What is the central analytical concept?
2. Why it is interesting
- What biological or conceptual insight could this reveal that standard analyses typically miss?
3. Relationship to prior work
- Cite 1–3 representative references (author–year format is sufficient).
- Explicitly state whether this direction:
a) Extends existing approaches,
b) Reinterprets prior findings, or
c) Challenges an implicit assumption in the literature.
- Avoid citing review articles unless they are used specifically to define or question a dominant paradigm.
- If direct primary literature is sparse or absent, explicitly state this limitation and explain how the proposed analysis explores underexamined or emerging conceptual space rather than reiterating established findings.
4. Additional data needs
- What new or orthogonal data, if any, would strengthen or enable this analysis?
5. Assumptions
- What biological, evolutionary, or statistical assumptions does this analysis rely on?
6. Analysis difficulty
- Rate as Low, Medium, or High, and briefly justify the rating.
C. Hypothesis Scope
- Include speculative or not-yet-validated hypotheses where appropriate.
- Do NOT exclude an analysis direction solely because it lacks direct experimental validation.
- Clearly distinguish between evidence-supported claims and conjectural interpretations.
D. Evidence Integration and Conflict Resolution
1. Identify at least three independent axes of evidence across the proposed analyses.
2. Describe how conclusions would be interpreted if these evidence axes yield conflicting or partially inconsistent results.
3. Specify how such inconsistencies would guide follow-up analyses, reframing of hypotheses, or narrowing of scope.
E. Critical Self-Assessment
- Identify where a skeptical reviewer is most likely to push back.
- Discuss risks related to reproducibility, overinterpretation, and literature bias.
- Explicitly distinguish what the data would demonstrate versus what would remain inferential or model-dependent.
Emphasize analytical reasoning, interpretive logic, and literature positioning over tool selection.
Example Research Questions
Here are example research questions you can adapt:
| Domain | Example Research Question |
|---|---|
| Transcriptomics | How do salt stress response genes in halophytes differ from glycophytes at the regulatory network level? |
| Genomics | What genomic signatures distinguish drought-tolerant crop varieties from susceptible ones? |
| Metagenomics | How does rhizosphere microbiome composition correlate with plant disease resistance? |
| Comparative Genomics | What is the evolutionary origin of C4 photosynthesis based on gene family expansion patterns? |
| Single-cell | How do cell-type-specific expression patterns change during plant development under stress? |
When to Use This Prompt
| Situation | Use This Prompt? |
|---|---|
| Starting a thesis project | Yes - explore directions before committing |
| Writing a grant proposal | Yes - identify novel angles |
| Class homework assignment | No - use simpler brainstorming prompts |
| Replicating a published analysis | No - follow the original methods |
| Exploring a new dataset | Yes - discover unexpected patterns |
Key Insight
This prompt forces you to think beyond the obvious analysis. Instead of jumping to "run DESeq2," you first ask: "What are five fundamentally different ways to approach this question?"
Warning: AI Limitations
AI assistants may:
- Cite papers that don't exist (hallucination); always verify references
- Miss recent publications (knowledge cutoff)
- Oversimplify domain-specific nuances
Use this prompt as a starting point for exploration, not as a definitive literature review.
Telling AI Assistants About Your Conda Environment
If you followed Step 0C, your AI already knows your environment through configuration files (CLAUDE.md, copilot-instructions.md, etc.). This section explains why that matters and what to do if you don't have config files.
Why This Matters
- Without config files: the AI assumes random packages, and the code fails with ImportError.
- With config files: the AI reads your setup and generates working code.
Persistent Configuration Reference
Here's what each configuration file does:
Configuration Files Summary
| AI Assistant | File / Setting | What It Does |
|---|---|---|
| Claude Code | CLAUDE.md (created by /init) | Auto-read every session; no prompting needed |
| GitHub Copilot | .github/copilot-instructions.md | Auto-read for autocomplete and chat |
| Gemini | GEMINI.md (copy-paste) or VS Code settings | Paste at start of conversation |
| ChatGPT/Codex | CODEX.md (copy-paste) or Custom Instructions | Paste at start of conversation |
If You Don't Have Config Files
If you skipped Set Up AI Configuration Files, include environment info at the start of every prompt:

Write [Python/R] code that runs in the bch709_vibe_coding conda environment. Installed packages: [list your packages]. Do NOT use packages outside this list.

Without this, the AI will assume arbitrary packages and your code will fail with ImportError.
How to Write Effective Vibe Coding Prompts
Writing a good prompt is like writing a recipe: the more specific your instructions, the better the result. Here's a step-by-step guide to crafting prompts that produce working code on the first try.
The 5-Part Prompt Structure
Effective Prompt = 5 Essential Parts

1. Tool → "Write Python code" (AI reads your config files)
2. Input → "Read data/file.gz (gzip TSV, columns: a, b, c)"
3. Task → "Compute X using formula Y, filter by Z"
4. Output → "Save to results/out.tsv (cols, decimals, sorting)"
5. QC/Console → "Print top 10 rows, save dropped items to log.txt"
Step-by-Step Prompt Construction
Step 1: Tool (Which Language Do We Need?)
If you set up AI configuration files, your AI already knows your environment. Just tell it which language to use:
Bad:
Write Python code to analyze my data.
Good (with config files):
Write Python code for Analysis 1.
Good (without config files):
Write Python code that runs in the bch709_vibe_coding conda environment.
Installed packages: pandas, numpy, matplotlib, seaborn, biopython, tqdm.
Step 2: Input (What Are You Reading?)
Describe the input file precisely.
Bad:
Read the GFF file.
Good:
Input: data/saccharomyces_cerevisiae.gff.gz
- Format: GFF3 (9 tab-separated columns), gzip compressed
- Columns: seqid, source, type, start, end, score, strand, phase, attributes
- seqid = chromosome name (e.g., chrI, chrII, chrXVI)
- type = feature type (gene, exon, mRNA, tRNA, snoRNA, etc.)
Step 3: Task (What Should You Do?)
Define computations with explicit formulas.
Bad:
Find the most variable genes.
Good:
Task:
1. Compute mean_expr = row-wise mean across all condition columns
2. Compute sd_expr = row-wise standard deviation
3. Compute CV = sd_expr / abs(mean_expr)
4. Filter: remove genes with all NA or zero variance
5. Select: top 200 genes by CV (descending)
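A task spec like this maps directly onto pandas operations. Here is a minimal sketch on a made-up toy table; the gene names, values, and top-2 cutoff are illustrative placeholders, not course data:

```python
import numpy as np
import pandas as pd

# Toy expression table: rows = genes, columns = conditions (illustrative only)
df = pd.DataFrame(
    {"cond1": [1.0, -2.0, np.nan, 0.5],
     "cond2": [1.2, -1.5, np.nan, 0.5],
     "cond3": [0.8, -2.5, np.nan, 0.5]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

mean_expr = df.mean(axis=1)        # 1. row-wise mean
sd_expr = df.std(axis=1)           # 2. row-wise standard deviation
cv = sd_expr / mean_expr.abs()     # 3. CV = sd / abs(mean)

# 4. filter: geneC (all NA) and geneD (zero variance) are removed
keep = sd_expr.notna() & (sd_expr > 0)

# 5. top N by CV, descending (N = 2 here instead of 200)
top = cv[keep].sort_values(ascending=False).head(2)
```

Because each numbered step in the prompt becomes one line of code, an AI given this spec has almost no room to improvise.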
Step 4: Output (What Files Should You Create?)
Specify exact filenames, formats, columns, and formatting.
Bad:
Save the results.
Good:
Output: results/cv_top200.tsv
- Format: TSV with header
- Columns: gene_id, mean_expr, sd_expr, cv
- Round numeric values to 4 decimal places
- Sort by cv descending
Step 5: QC/Console (What Should You Print?)
Tell the AI what to display for verification.
Bad:
Print something.
Good:
Console output:
- Print number of genes after filtering (removed NA/zero-variance)
- Print number of genes that were removed
- Print top 10 rows of the result table
- Print "Saved: [filename]" for each output file
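Output and QC specs like the ones above translate into just a few lines. A minimal sketch, assuming a hypothetical result table and output path (not the course files):

```python
from pathlib import Path
import pandas as pd

# Hypothetical result table (column names from the spec above)
res = pd.DataFrame({
    "gene_id": ["geneB", "geneA"],
    "mean_expr": [-2.0, 1.0],
    "sd_expr": [0.5, 0.2],
    "cv": [0.25, 0.2],
})

out = Path("results/out.tsv")
out.parent.mkdir(parents=True, exist_ok=True)

# Output spec: 4 decimal places, sorted by cv descending, TSV with header
res = res.round(4).sort_values("cv", ascending=False)
res.to_csv(out, sep="\t", index=False)

# QC/console spec: counts, preview, and a saved-file confirmation
print(f"Genes after filtering: {len(res)}")
print(res.head(10).to_string(index=False))
print(f"Saved: {out}")
```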
Complete Prompt Examples
Example 1: Python GFF3 Analysis (Complete Prompt)
Write Python code that runs in the bch709_vibe_coding conda environment.
Installed packages: pandas, numpy, matplotlib, seaborn, biopython, tqdm.

**Input:**
- GFF3 file: data/saccharomyces_cerevisiae.gff.gz (gzip, 9 tab-separated columns)
- seqid = chromosome (chrI, chrII, ..., chrXVI, chrM)
- type = feature type (gene, exon, mRNA, tRNA, snoRNA, etc.)
- Chromosome sizes: data/chrom.sizes (TSV: chrom, length_bp)

**Task:**
1. Only include seqids that exist in chrom.sizes
2. Log seqids NOT in chrom.sizes to a QC file
3. Count genes per chromosome (type == "gene")
4. Count unique exons per chromosome (unique start, end, strand tuples to prevent isoform overcounting)
5. Count tRNA (type == "tRNA")
6. Count snoRNA (type == "snoRNA")
7. Compute density: gene_per_Mb = n_gene / (chrom_length_bp / 1e6)

**Output 1:** results/chr_feature_counts.tsv
- Columns: chrom, chrom_length_bp, n_gene, n_exon_unique, n_tRNA, n_snoRNA, gene_per_Mb
- Round densities to 4 decimal places
- Sort by gene_per_Mb descending

**Output 2:** results/dropped_seqids.txt
- One seqid per line, sorted alphabetically

**Console:**
- Print number of dropped seqids and number of dropped feature lines
- Print top 5 rows of the result table
Example 2: R Heatmap Analysis (Complete Prompt)
Write R code that runs in the bch709_vibe_coding conda environment.
Installed packages: data.table, ggplot2, pheatmap, viridisLite, scales.

**Input:**
- Expression file: data/gasch2000.txt (TSV, log2 ratios)
- First column: UID (systematic gene name, e.g., YAL001C)
- Skip columns: NAME, description, GWEIGHT
- Remaining columns: ~170 stress condition columns (log2 expression ratios)

**Task:**
1. Parse gene_id from UID column, skip metadata columns
2. Compute row-wise sd across all condition columns
3. Compute CV = sd / abs(mean) for each gene
4. Filter: remove genes with all NA or zero variance
5. Select: top 200 genes by CV descending

**Output 1:** results/yeast_stress_cv_top200.tsv
- Columns: gene_id, mean_expr, sd_expr, cv
- Round to 4 decimal places

**Output 2:** results/yeast_stress_cv_top200_heatmap.png
- Size: 1800 × 1200 pixels, dpi 200
- Data: log2 expression ratios (already log-transformed)
- Rows: gene_id (maintain CV descending order, cluster_rows = FALSE)
- Columns: original condition order (cluster_cols = FALSE)
- X-axis labels: rotated 90 degrees
- Title: "Yeast stress response, CV top200 (Gasch et al. 2000)"

**Console:**
- Print top 10 rows of the CV table
- Print "Saved: [filename]" for each output
Prompt Writing Checklist
Use this checklist before sending your prompt:
Before You Send Your Prompt
Environment:
- Specified language (Python/R)
- Config files set up (or manually listed conda environment and packages)
Input:
- Specified filename and path
- Specified format (TSV, CSV, GFF3, FASTA, etc.)
- Specified if gzip compressed
- Described column structure
Task:
- Defined formulas (CV = sd/abs(mean), etc.)
- Specified filter criteria (remove NA/zero-variance, top N, etc.)
- Explained any deduplication logic
Output:
- Specified filename and path
- Listed column names
- Specified decimal places
- Specified sorting order
- Specified plot dimensions and format (if applicable)
QC:
- Specified what to print to console
- Specified any QC files to save
Common Prompt Mistakes and Fixes
| Mistake | Problem | Fix |
|---|---|---|
| "Analyze the data" | AI doesn't know which analysis to run | Specify the exact computation: "Compute CV = sd/abs(mean)" |
| "Save the results" | AI chooses an arbitrary filename | Specify: "Save to results/output.tsv" |
| "Make a nice plot" | AI chooses arbitrary colors/size | Specify: "1800x1200 px, dpi 200, blue-white-red colors" |
| "Filter low genes" | AI doesn't know the threshold | Specify: "Filter: remove genes with all NA or zero variance" |
| "Count exons" | AI may double-count isoforms | Specify: "Count unique (start, end, strand) tuples" |
Iteration Strategy
If the first prompt doesn't work perfectly, follow this pattern:
Write prompt → AI generates code
        ↓
Ask AI: "Does this need packages I don't have?" → install if needed
        ↓
Run code → check output
        ↓
If error → paste error message, ask AI to fix
If wrong output → describe expected vs actual, refine prompt
After AI Generates Code: Check for Missing Packages
Before running the code, ask the AI:
Does this code require any packages that are not in my bch709_vibe_coding environment?
If yes, give me the conda install commands to add them.
If you have AI configuration files set up, the AI already knows your packages; just ask the question above. If not, add your package list to the prompt (e.g., "My current packages: pandas, numpy, matplotlib, seaborn, biopython, tqdm, data.table, ggplot2, pheatmap, viridisLite, scales.").
Install any missing packages before running the code.
Example: AI Code Uses a Package You Don't Have
If you skip the check above (or the AI misses one), here's what happens:
Step 1. You ask the AI to generate code, and it produces:
import scipy.stats as stats  # <-- not in your environment!
import pandas as pd
z_scores = stats.zscore(df.values, axis=1)
Step 2. You run the code and get:
ModuleNotFoundError: No module named 'scipy'
Step 3. Ask the AI:
I got "ModuleNotFoundError: No module named 'scipy'" when running your code.
Give me the conda install command to add scipy to my bch709_vibe_coding environment.
Step 4. The AI responds:
conda install -c conda-forge -n bch709_vibe_coding scipy
Step 5. Run the install command, then re-run your script:
$ conda install -c conda-forge -n bch709_vibe_coding scipy
$ python scripts/analysis.py # now it works
Step 6. Update your environment YAML and config files:
# Re-export so environment.yml stays current
conda env export -n bch709_vibe_coding > environment.yml
git add environment.yml && git commit -m "Add scipy" && git push
Then in VS Code Claude chat, type:
I just installed scipy into bch709_vibe_coding.
Update CLAUDE.md, copilot-instructions.md, GEMINI.md, and CODEX.md
to reflect the current conda environment.
Why Update environment.yml and Config Files?
- environment.yml: locks the exact package versions so you (or anyone else) can recreate the environment later
- Config files: tell the AI what packages are available so it doesn't suggest missing libraries or redundant installs
Always update both after installing new packages.
Part 1: Vibe Coding Examples
Example 1 (Python): Per-Chromosome Feature Counts from Yeast GFF3 + QC
Research Question
How are genomic features (genes, exons, tRNAs, snoRNAs) distributed across yeast chromosomes, and does feature density correlate with chromosome size?
Background
Extract chromosome-level feature counts (genes, exons, tRNAs, snoRNAs) from the Saccharomyces cerevisiae (yeast) GFF3 file and cross-reference against an external chrom.sizes file to verify data integrity.
GFF3 file structure (9 tab-separated columns):
chrI SGD gene 335 649 . + . ID=YAL069W;Name=YAL069W;gene=YAL069W
| Column | Content |
|---|---|
| 1 | chromosome (seqid) |
| 2 | source |
| 3 | feature type (gene, mRNA, exon, tRNA, snoRNA, etc.) |
| 4 | start position |
| 5 | end position |
| 6 | score |
| 7 | strand (+/−) |
| 8 | phase |
| 9 | attributes (key=value pairs, semicolon-delimited) |
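Column 9 packs key=value pairs separated by semicolons, so prompts should spell out how to parse it. A minimal parsing sketch (ignoring the URL-escaping that real GFF3 values may use):

```python
def parse_gff3_attributes(attr_field: str) -> dict:
    """Split a GFF3 column-9 string into a {key: value} dict."""
    pairs = (item.split("=", 1) for item in attr_field.strip().split(";") if item)
    return {key: value for key, value in pairs}

attrs = parse_gff3_attributes("ID=YAL069W;Name=YAL069W;gene=YAL069W")
print(attrs["ID"])  # YAL069W
```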
chrom.sizes file structure:
chrI 230218
chrII 813184
chrIV 1531933
Critical Design Decisions
Definitions That Must Appear in Your Prompt
- Exon overcounting prevention: Count unique (start, end, strand) intervals per chromosome
- Chromosome length source: Use external chrom.sizes file; log excluded seqids to dropped_seqids.txt
- tRNA/snoRNA definition: Count lines where type is tRNA or snoRNA
Stage 1: Vague Prompt
Prompt:
"Extract the gene, exon, tRNA, and snoRNA counts per chromosome from the GFF3 file."
AI-generated code:
import gzip, re
from collections import defaultdict
counts = defaultdict(lambda: defaultdict(int))
with gzip.open("data/saccharomyces_cerevisiae.gff.gz", "rt") as f:
for line in f:
if line.startswith("#"):
continue
fields = line.strip().split("\t")
if len(fields) < 9:
continue
chrom = fields[0]
ftype = fields[2]
if ftype in ("gene", "exon", "tRNA", "snoRNA"):
counts[chrom][ftype] += 1
for chrom in sorted(counts):
print(chrom, dict(counts[chrom]))
Problems with Stage 1
- No environment specified → AI doesn't know which packages are available
- No chromosome length reference → all seqids included without filtering
- Exons double-counted across isoforms
- No QC tracking of dropped seqids
- No file output
Stage 2: Specifying Output Format
Prompt:
Write Python code that runs in the bch709_vibe_coding conda environment.
Installed packages: pandas, numpy, matplotlib, seaborn, biopython, tqdm.
Input:
- GFF3 gzip: data/saccharomyces_cerevisiae.gff.gz (9 columns, tab-separated)
- Chrom sizes: data/chrom.sizes (TSV: chrom, length_bp)
Task:
- Only include chromosomes present in chrom.sizes
- Save GFF seqids not found in chrom.sizes to results/dropped_seqids.txt
- Count genes per chromosome (type == "gene")
- Count exons per chromosome: to prevent isoform overcounting, define count as number of unique (start, end, strand) intervals
- Count tRNA and snoRNA: lines where type is "tRNA" or "snoRNA"
Output:
- results/chr_feature_counts.tsv
- Columns: chrom, chrom_length_bp, n_gene, n_exon_unique, n_tRNA, n_snoRNA
- Include header
- Print top 5 rows to console
AI-generated code:
from pathlib import Path
import gzip
import pandas as pd
from collections import defaultdict
DATA_GFF = Path("data/saccharomyces_cerevisiae.gff.gz")
DATA_SIZES = Path("data/chrom.sizes")
OUT_TSV = Path("results/chr_feature_counts.tsv")
OUT_DROP = Path("results/dropped_seqids.txt")
OUT_TSV.parent.mkdir(parents=True, exist_ok=True)
# Load chrom sizes
sizes = pd.read_csv(DATA_SIZES, sep="\t", header=None, names=["chrom","length_bp"])
chrom_set = set(sizes["chrom"])
chrom_len = dict(zip(sizes["chrom"], sizes["length_bp"]))
n_gene = defaultdict(int)
exon_intervals = defaultdict(set)
n_tRNA = defaultdict(int)
n_snoRNA = defaultdict(int)
dropped_seqids = set()
with gzip.open(DATA_GFF, "rt") as f:
for line in f:
if not line or line.startswith("#"):
continue
fields = line.rstrip("\n").split("\t")
if len(fields) < 9:
continue
seqid, _, ftype, start, end, _, strand, _, _ = fields
if seqid not in chrom_set:
dropped_seqids.add(seqid)
continue
if ftype == "gene":
n_gene[seqid] += 1
elif ftype == "exon":
exon_intervals[seqid].add((int(start), int(end), strand))
elif ftype == "tRNA":
n_tRNA[seqid] += 1
elif ftype == "snoRNA":
n_snoRNA[seqid] += 1
rows = []
for chrom in sizes["chrom"]:
rows.append({
"chrom": chrom,
"chrom_length_bp": int(chrom_len[chrom]),
"n_gene": n_gene.get(chrom, 0),
"n_exon_unique": len(exon_intervals.get(chrom, set())),
"n_tRNA": n_tRNA.get(chrom, 0),
"n_snoRNA": n_snoRNA.get(chrom, 0),
})
df = pd.DataFrame(rows)
df.to_csv(OUT_TSV, sep="\t", index=False)
OUT_DROP.write_text("\n".join(sorted(dropped_seqids)) + "\n")
print("Saved:", OUT_TSV)
print("Dropped seqids:", len(dropped_seqids))
print(df.head(5).to_string(index=False))
Improvements: TSV output, unique exon definition, dropped seqid QC
Still missing: Density not computed, no sorting, unclear zero-feature handling
Stage 3: Adding Detailed Constraints
Prompt:
Extend the Stage 2 results.
Additional output requirements:
- Add density columns:
- gene_per_Mb = n_gene / (chrom_length_bp / 1e6)
- exon_unique_per_Mb, tRNA_per_Mb, snoRNA_per_Mb (same formula)
- Sort by gene_per_Mb descending
- Round to 4 decimal places
- Include all chromosomes from chrom.sizes, filling counts with 0 where no features exist
- Save unique dropped seqids sorted alphabetically to dropped_seqids.txt
- Print to console: number of dropped seqids, number of excluded feature lines, top 5 rows of the result table
AI-generated code (final):
from pathlib import Path
import gzip
import pandas as pd
from collections import defaultdict
DATA_GFF = Path("data/saccharomyces_cerevisiae.gff.gz")
DATA_SIZES = Path("data/chrom.sizes")
OUT_TSV = Path("results/chr_feature_counts.tsv")
OUT_DROP = Path("results/dropped_seqids.txt")
OUT_TSV.parent.mkdir(parents=True, exist_ok=True)
# 1) Chrom sizes
sizes = pd.read_csv(DATA_SIZES, sep="\t", header=None, names=["chrom","length_bp"])
chrom_set = set(sizes["chrom"])
chrom_len = dict(zip(sizes["chrom"], sizes["length_bp"]))
# 2) Counters
n_gene = defaultdict(int)
exon_intervals = defaultdict(set)
n_tRNA = defaultdict(int)
n_snoRNA = defaultdict(int)
dropped_seqids = set()
dropped_lines = 0
with gzip.open(DATA_GFF, "rt") as f:
for line in f:
if not line or line.startswith("#"):
continue
fields = line.rstrip("\n").split("\t")
if len(fields) < 9:
continue
seqid, _, ftype, start, end, _, strand, _, _ = fields
if seqid not in chrom_set:
dropped_seqids.add(seqid)
dropped_lines += 1
continue
if ftype == "gene":
n_gene[seqid] += 1
elif ftype == "exon":
exon_intervals[seqid].add((int(start), int(end), strand))
elif ftype == "tRNA":
n_tRNA[seqid] += 1
elif ftype == "snoRNA":
n_snoRNA[seqid] += 1
# 3) Build result table (include all chroms from chrom.sizes; fill 0 where no features)
rows = []
for chrom in sizes["chrom"]:
L = float(chrom_len[chrom])
g = n_gene.get(chrom, 0)
ex = len(exon_intervals.get(chrom, set()))
tr = n_tRNA.get(chrom, 0)
sn = n_snoRNA.get(chrom, 0)
Mb = L / 1e6 if L > 0 else 1
rows.append({
"chrom": chrom,
"chrom_length_bp": int(L),
"n_gene": g,
"n_exon_unique": ex,
"n_tRNA": tr,
"n_snoRNA": sn,
"gene_per_Mb": round(g / Mb, 4),
"exon_unique_per_Mb": round(ex / Mb, 4),
"tRNA_per_Mb": round(tr / Mb, 4),
"snoRNA_per_Mb": round(sn / Mb, 4),
})
df = pd.DataFrame(rows).sort_values("gene_per_Mb", ascending=False)
df.to_csv(OUT_TSV, sep="\t", index=False)
OUT_DROP.write_text("\n".join(sorted(dropped_seqids)) + "\n")
print(f"Saved: {OUT_TSV}")
print(f"Saved: {OUT_DROP}")
print(f"Dropped seqids: {len(dropped_seqids)}")
print(f"Dropped feature lines: {dropped_lines}")
print(df.head(5).to_string(index=False))
Sample output:
Saved: results/chr_feature_counts.tsv
Saved: results/dropped_seqids.txt
Dropped seqids: 1
Dropped feature lines: 42
chrom chrom_length_bp n_gene n_exon_unique n_tRNA n_snoRNA gene_per_Mb ...
chrIII 316620 174 210 10 5 549.5146 ...
chrI 230218 117 136 4 3 508.2190 ...
chrVI 270161 136 170 7 4 503.2285 ...
chrIX 439888 218 275 10 4 495.5724 ...
chrV 576874 282 356 17 6 488.8408 ...
Example 1: Comparison Summary
| Aspect | Stage 1 (Vague) | Stage 2 (Format) | Stage 3 (Detailed) |
|---|---|---|---|
| Chromosome scope | Everything | chrom.sizes only | chrom.sizes + zero-fill |
| Exon definition | Duplicate-counted | Unique interval | Unique interval |
| QC artifact | None | dropped_seqids.txt | Count + line count + file |
| Density | None | None | 4 per_Mb columns |
| Sorting | None | None | gene_per_Mb descending |
| Reusability | Low | Medium | High (publication-ready) |
QC Interpretation Questions
Questions Students Must Answer
- What seqids ended up in dropped_seqids.txt? (Mitochondrial genome? 2-micron plasmid?)
- What fraction of total genes were dropped? Could this affect conclusions?
- If the prompt had NOT specified using chrom.sizes, what errors could have occurred silently?
Key Message
dropped_seqids.txt is not a failure log; it is a declaration of what this analysis accepts as data.
Example 2 (R): Top 200 Variable Genes from Yeast Stress Data + Heatmap
Research Question
Which yeast genes show the most variable expression across environmental stress conditions, and do they reveal a coordinated stress response program?
Background
Analyze the classic yeast stress response microarray dataset from Gasch et al. (2000). Extract the top 200 genes by coefficient of variation (CV) across ~170 environmental stress conditions and visualize their expression patterns.
Data source: Gasch et al. (2000), "Genomic expression programs in the response of yeast cells to environmental changes." Mol Biol Cell 11(12):4241-4257.
Data structure:
| Column | Description |
|---|---|
| UID | Systematic gene name (e.g., YAL001C) |
| NAME | Gene common name (e.g., TFC3) |
| (description) | Functional description |
| GWEIGHT | Gene weight (skip) |
| Remaining columns | ~170 stress conditions (log2 expression ratios) |
- Values are log2 ratios (positive = upregulated, negative = downregulated)
- Conditions include: heat shock, oxidative stress, osmotic shock, amino acid starvation, nitrogen depletion, stationary phase
Critical Design Decisions
| Decision | Rationale |
|---|---|
| Remove all-NA rows | Some genes have missing data across conditions |
| CV = sd / abs(mean) | Measures relative variability; use absolute mean since values can be negative (log2 ratios) |
| Data already log-transformed | No additional log transformation needed for heatmap |
Stage 1: Vague Prompt
Prompt:
"Pick the 200 most variable genes from the stress data and draw a heatmap."
AI-generated code:
data <- read.table("data/gasch2000.txt", header=TRUE, sep="\t", row.names=1)
vars <- apply(data, 1, var, na.rm=TRUE)
top200 <- head(sort(vars, decreasing=TRUE), 200)
heatmap(as.matrix(data[names(top200), ]))
Problems with Stage 1
- No environment specified → AI doesn't know which packages are available
- Uses variance instead of CV → biased toward genes with large absolute changes
- No NA handling → may crash or produce misleading results
- No file output, no image size/resolution control
- Metadata columns (NAME, GWEIGHT) not removed → treated as data
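The variance-versus-CV problem is easy to see on toy numbers (made up for illustration, not from the Gasch data): a gene swinging around a large mean wins on variance, while a relatively noisier low-mean gene wins on CV:

```python
import statistics

# Toy log2 ratios (made up): gene1 swings widely around a large mean,
# gene2 swings modestly around a small mean
gene1 = [8.0, 10.0, 12.0]   # mean 10, sd 2.0 -> CV = 0.2
gene2 = [0.5, 1.0, 1.5]     # mean 1,  sd 0.5 -> CV = 0.5

def cv(values):
    return statistics.stdev(values) / abs(statistics.mean(values))

print(statistics.variance(gene1) > statistics.variance(gene2))  # True: variance favors gene1
print(cv(gene2) > cv(gene1))                                    # True: CV favors gene2
```

This is why the refined prompts below name the metric explicitly instead of saying "most variable".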
Stage 2: Specifying Output Format
Prompt:
Write R code that runs in the bch709_vibe_coding conda environment.
Installed packages: data.table, ggplot2, pheatmap, viridisLite, scales.
Input:
- data/gasch2000.txt (TSV, log2 expression ratios)
- First column: UID (gene_id); skip NAME, description, GWEIGHT columns
- Remaining columns: ~170 stress conditions (numeric)
Task:
- Compute mean_expr and sd_expr for each gene across all conditions
- CV = sd_expr / abs(mean_expr)
- Remove genes with all NA or zero variance
- Select top 200 by CV descending
Output:
- Save to results/yeast_stress_cv_top200.tsv
- Columns: gene_id, mean_expr, sd_expr, cv
- Round to 4 decimal places
- Print top 10 to console
AI-generated code:
library(data.table)
dir.create("results", showWarnings = FALSE, recursive = TRUE)
f <- "data/gasch2000.txt"
dt <- fread(f, header = TRUE)
# Extract gene_id and numeric condition columns (skip NAME, description, GWEIGHT)
gene_ids <- dt[[1]]
meta_cols <- c("UID", "NAME", "GWEIGHT")
desc_col <- names(dt)[3] # description column
skip_cols <- c(meta_cols, desc_col)
cond_cols <- setdiff(names(dt), skip_cols)
mat <- as.matrix(dt[, ..cond_cols])
mode(mat) <- "numeric"
mean_expr <- rowMeans(mat, na.rm = TRUE)
sd_expr <- apply(mat, 1, sd, na.rm = TRUE)
keep <- !is.na(sd_expr) & sd_expr > 0
res <- data.table(
gene_id = gene_ids[keep],
mean_expr = mean_expr[keep],
sd_expr = sd_expr[keep]
)
res[, cv := sd_expr / abs(mean_expr)]
res <- res[is.finite(cv)]
setorder(res, -cv)
top200 <- res[1:min(200, .N)]
top200[, `:=`(
mean_expr = round(mean_expr, 4),
sd_expr = round(sd_expr, 4),
cv = round(cv, 4)
)]
fwrite(top200, "results/yeast_stress_cv_top200.tsv", sep = "\t")
print(top200[1:min(10, .N)])
cat("Saved: results/yeast_stress_cv_top200.tsv\n")
Improvements: Explicit CV definition, NA handling, TSV output
Still missing: No heatmap generated, no plot specifications
Stage 3: Adding Detailed Constraints
Prompt:
Using the top 200 genes from Stage 2, generate a heatmap.
Heatmap data:
- Values: log2 expression ratios (already log-transformed in the data)
- Rows: gene_id (maintain CV descending order, cluster_rows = FALSE)
- Columns: original condition order (cluster_cols = FALSE)
Plot specifications:
- Save to: results/yeast_stress_cv_top200_heatmap.png
- Size: 1800 × 1200 px, dpi 200
- Rotate x-axis labels 90 degrees
- Title: "Yeast stress response, CV top200 (Gasch et al. 2000)"
- Also print top 10 summary table to console
AI-generated code (final):
library(data.table)
library(pheatmap)
dir.create("results", showWarnings = FALSE, recursive = TRUE)
f <- "data/gasch2000.txt"
dt <- fread(f, header = TRUE)
# Extract gene_id and numeric condition columns
gene_ids <- dt[[1]]
meta_cols <- c("UID", "NAME", "GWEIGHT")
desc_col <- names(dt)[3]
skip_cols <- c(meta_cols, desc_col)
cond_cols <- setdiff(names(dt), skip_cols)
mat <- as.matrix(dt[, ..cond_cols])
mode(mat) <- "numeric"
mean_expr <- rowMeans(mat, na.rm = TRUE)
sd_expr <- apply(mat, 1, sd, na.rm = TRUE)
keep <- !is.na(sd_expr) & sd_expr > 0
res <- data.table(
gene_id = gene_ids[keep],
mean_expr = mean_expr[keep],
sd_expr = sd_expr[keep]
)
res[, cv := sd_expr / abs(mean_expr)]
res <- res[is.finite(cv)]
setorder(res, -cv)
top200 <- res[1:min(200, .N)]
# Summary TSV
top200_out <- copy(top200)
top200_out[, `:=`(
mean_expr = round(mean_expr, 4),
sd_expr = round(sd_expr, 4),
cv = round(cv, 4)
)]
fwrite(top200_out, "results/yeast_stress_cv_top200.tsv", sep = "\t")
# Heatmap matrix
idx <- match(top200$gene_id, gene_ids)
submat <- mat[idx, , drop = FALSE]
rownames(submat) <- top200$gene_id
# Save (1800 × 1200 px, dpi 200)
png("results/yeast_stress_cv_top200_heatmap.png",
width = 1800, height = 1200, res = 200)
pheatmap(
submat,
cluster_rows = FALSE,
cluster_cols = FALSE,
fontsize_col = 4,
fontsize_row = 3,
angle_col = 90,
main = "Yeast stress response, CV top200 (Gasch et al. 2000)"
)
dev.off()
print(top200_out[1:min(10, .N)])
cat("Saved: results/yeast_stress_cv_top200.tsv\n")
cat("Saved: results/yeast_stress_cv_top200_heatmap.png\n")
Example 2: Comparison Summary
| Aspect | Stage 1 (Vague) | Stage 2 (Format) | Stage 3 (Detailed) |
|---|---|---|---|
| Variability metric | Variance | CV (sd/abs(mean)) | CV (sd/abs(mean)) |
| Filtering | None | Remove NA/zero-variance | Remove NA/zero-variance |
| Data transformation | None | None | Already log2 (no extra transform) |
| File output | None | TSV | TSV + PNG (size/dpi specified) |
| Reusability | Low | Medium | High |
Interpretation Points
Key Insights
- Removing NA and zero-variance genes prevents infinite/undefined CV values from dominating results
- CV captures relative variability independent of absolute expression, giving a fair comparison across expression levels
- The Gasch 2000 data is already log2-transformed, so no additional transformation is needed for the heatmap
- Genes with high CV across stress conditions are likely part of the Environmental Stress Response (ESR), a conserved transcriptional program in yeast
Part 2: Homework Assignments
Homework 1 (Python): Yeast mRNA FASTA Analysis + GC Distribution Graph
Research Question
What is the GC content distribution of yeast mRNA sequences, and are there distinct GC-content subpopulations?
Problem Description
Extract sequence information from the UCSC yeast (Saccharomyces cerevisiae) mRNA FASTA file (mrna.fa.gz), analyze GC content distribution, and produce a summary table and distribution graph.
Objective
Write a prompt detailed enough to produce the desired result in one shot. You may iterate to fix errors, but your goal is a single, well-crafted prompt that works on the first try.
Input Data
# Yeast mRNA FASTA
curl -L -o data/mrna.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/mrna.fa.gz
# Yeast genome FASTA (for reference)
curl -L -o data/sacCer3.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz
FASTA file structure:
>BC001547 /gb=BC001547 /gi=12654078 /ug=Sc.3456 /len=1254
ATGTCTGCTCCAGCTAGCAGTGAAACTTTATTCAGAAACTGCTTAG...
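GC content is just the fraction of G and C bases in a sequence, so your prompt should still pin down the details (case handling, decimals). A minimal sketch (the input string is a made-up fragment, not taken from mrna.fa.gz):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases, case-insensitive; 0.0 for empty input."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0

print(round(gc_content("ATGTCTGCTCCA"), 4))  # 0.5
```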
Expected Output
| Output | Specification |
|---|---|
| results/mrna_metrics.tsv | accession, length, gc_content (4 decimals, sorted by gc_content desc) |
| results/gc_content_distribution.png | Histogram (1600 × 900 px, dpi 200) |
Plot specifications:
- X-axis: GC content (0β1)
- Histogram with density curve overlay
- Mean and median as vertical dashed lines
- Caption showing n, mean, median, sd
Grading Criteria
| Criterion | Weight | Description |
|---|---|---|
| Prompt quality | 40% | Input format, accession parsing, gzip handling, output specs (filename, columns, decimals, graph size) |
| Code correctness | 40% | Correct gzip parse, accurate GC computation, both files generated |
| Result interpretation | 20% | Explain 2 possible biological reasons for the GC content distribution pattern in yeast mRNA |
Submission
Push all files to your bch709_vibe_coding GitHub repo. Include your prompt (as a comment at the top of your script or in a separate prompt.txt), the generated script, and the output files. Write your interpretation as a comment in the script or in the README.md.
Homework 2 (R): Z-Score Clustering of CV Top 200 Genes + Pattern Visualization
Research Question
Do the top 200 most variable yeast stress-response genes cluster into distinct expression patterns, and what biological processes characterize each cluster?
Problem Description
Prerequisite
You must complete Example 2 first: this homework uses results/yeast_stress_cv_top200.tsv generated by that example.
Using the 200 genes from results/yeast_stress_cv_top200.tsv (from Example 2):
- Z-score normalize (row-wise)
- Hierarchical clustering (ward.D2 method, euclidean distance)
- Cut tree at k=4 to assign clusters
- Save clustered heatmap as PDF
- Save cluster assignment table as TSV
Objective
Practice writing prompts that precisely control R visualization output.
Input Data
- results/yeast_stress_cv_top200.tsv: gene_id list from Example 2
- data/gasch2000.txt: original log2 expression ratios
Expected Output
1. Clustered Heatmap: results/cv_top200_cluster_heatmap.pdf
| Specification | Value |
|---|---|
| Data | Log2 expression ratios → row-wise Z-score |
| Z-score | (value - row_mean) / row_sd |
| Rows | gene_id (hierarchical clustering, ward.D2, euclidean) |
| Columns | First 30 stress condition columns only (original order, cluster_cols = FALSE) |
| Annotation | k=4 cutree as color bar |
| Size | 8 × 12 inches |
2. Assignment Table: results/cluster_assignment.tsv
| Column | Description |
|---|---|
| gene_id | Gene identifier (e.g., YAL001C) |
| cluster | 1-4 |
Sort by cluster ascending.
Prompt-Writing Hints
Analysis procedure:
1. From gasch2000.txt, extract only the top 200 genes (by gene_id list from results/yeast_stress_cv_top200.tsv)
2. Skip metadata columns (NAME, description, GWEIGHT), keep only numeric condition columns
3. Subset to the first 30 condition columns only
4. Z-score normalize: for each row (gene), compute (value - mean) / sd
5. Hierarchical clustering: dist(euclidean) → hclust(ward.D2)
6. cutree(k=4) to assign 4 clusters
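The row-wise Z-score in step 4 can be illustrated with numpy; the homework itself should be written in R, and this toy matrix just demonstrates the formula:

```python
import numpy as np

# Toy matrix: 2 genes x 3 conditions (illustrative values)
mat = np.array([[1.0, 2.0, 3.0],
                [10.0, 10.0, 13.0]])

row_mean = mat.mean(axis=1, keepdims=True)
row_sd = mat.std(axis=1, ddof=1, keepdims=True)  # sample sd, matching R's sd()
z = (mat - row_mean) / row_sd                    # Z-score = (value - row_mean) / row_sd

# After normalization each row has mean 0 and sd 1
print(np.allclose(z.mean(axis=1), 0.0))          # True
print(np.allclose(z.std(axis=1, ddof=1), 1.0))   # True
```

Note the ddof=1: R's sd() uses the sample standard deviation, so a prompt that says only "divide by sd" leaves this choice to the AI.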
For the heatmap:
- Use pheatmap
- Show cluster assignment as annotation_row color bar
- Use only the first 30 condition columns (original order, cluster_cols = FALSE)
Grading Criteria
| Criterion | Weight | Description |
|---|---|---|
| Prompt quality | 40% | Z-score definition, clustering method (ward.D2, euclidean), k=4, metadata column handling, output specs |
| Code correctness | 40% | Accurate normalization, clustering, cutree, both files generated |
| Result interpretation | 20% | Describe what each of the 4 clusters represents in terms of stress response (1 sentence per cluster) |
Submission
Push all files to your bch709_vibe_coding GitHub repo. Include your prompt (as a comment at the top of your script or in a separate prompt.txt), the generated script, and the output files. Write your interpretation as a comment in the script or in the README.md.
Appendix: Effective Vibe Coding Prompt Template
[Environment]: Write [language] code that runs in the [env name] conda environment.
[Installed packages] are available.
**Input specification:**
- File: [filename] ([format], [delimiter], [gzip?])
- Structure: [column descriptions, special parsing rules]
- Additional inputs: [chrom.sizes, gene lists, or other reference files]
**Analysis conditions:**
- [Filter criteria (e.g., remove NA/zero-variance rows)]
- [Computation method (e.g., CV = sd/abs(mean))]
- [Definitions (e.g., exon count = unique intervals only)]
**Output 1 β Table:**
- Filename: [filename]
- Columns: [list column names]
- Decimal places: [number]
- Sorting: [criterion, direction]
- Filter: [top N]
**Output 2 β Plot:**
- Filename: [filename]
- Size: [px or inches], dpi: [value]
- Colors: [specify explicitly]
- Axes/labels/legend: [specify explicitly]
**Output 3 β QC:**
- [Dropped items filename]
- [Summary information to print to console]
Input/Output Prompt Checklist
When Specifying Input
- Filename and path
- File format (GFF3, TSV, FASTA, etc.)
- Compression (gzip or not)
- Delimiter (tab, comma, space)
- Header presence
- Data structure (column names, what rows represent)
- Special structures (e.g., GFF3 attribute parsing rules)
- External reference files (chrom.sizes, etc.)
When Specifying Output
- File format (TSV, CSV, PDF, PNG, SVG, HTML)
- Filename
- Column names and order
- Decimal places
- Sorting criterion (ascending/descending)
- Filter conditions (top N, minimum threshold, etc.)
- Plot: size, resolution, colors, font, legend position, axis range
- QC artifacts (dropped items, summary statistics)
When Specifying Analysis Definitions
- Metric definitions (CV = sd/abs(mean), Z-score = (x - mean)/sd)
- Filter rules (remove NA/zero-variance, top N)
- Deduplication handling (unique intervals, etc.)
- Transformation methods (log2(x+1), Z-score normalization)
- Clustering parameters (method, distance metric, k)
Data Sources
- Yeast GFF3 (SGD): http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff.gz
- Yeast chrom.sizes (UCSC sacCer3): https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes
- Yeast mRNA FASTA (UCSC sacCer3): https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/mrna.fa.gz
- Yeast genome FASTA (UCSC sacCer3): https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz
- Yeast stress response (Gasch 2000): https://www.shackett.org/files/gasch2000.txt (original paper: Gasch et al. (2000) Mol Biol Cell)