πŸ€– BCH709 AI Assistant: Ask questions about this class using NotebookLM

BCH709 Introduction to Bioinformatics: Git and GitHub

Introduction to Git and GitHub

Git is a distributed version control system that allows you to track changes in your code and collaborate with others. GitHub is a cloud-based platform that hosts Git repositories, making it easier to share and collaborate on projects. Version control using Git is the most reasonable way to keep track of changes in code, manuscripts, presentations, and data analysis projects.

Why Use GitHub in Bioinformatics?

Why Version Control?

Version control is essential in creating any project that takes longer than 5 minutes to complete. Even if your memory is longer than 5 minutes, next month you are not likely to be able to retrace your steps.

github-workflow

github

How Git Works: The Big Picture

Git manages your project across four areas: your Working Directory (where you edit files), the Staging Area (where you prepare changes), the Local Repository (your committed history), and the Remote Repository (GitHub).

Git Workflow Overview

The basic flow is: edit β†’ add β†’ commit β†’ push. Here is this workflow in action:

Git Basic Workflow Animation

1. Create a GitHub Account

Step 1: Sign Up for GitHub

  1. Go to https://github.com
  2. Click Sign up
  3. Enter your email, create a password, and choose a username
  4. Verify your email address
  5. Complete your profile (optional)

GitHub Education (Free Pro Features for Students)

Students can get free access to GitHub Pro features:

  1. Go to https://education.github.com/
  2. Click Get benefits
  3. Verify your student status with your university email
  4. Get free GitHub Pro, Copilot, and other developer tools!

2. Setting Up Git

macOS

$ brew install git
$ git --version
git version 2.43.0

Linux (WSL)

$ sudo apt-get update
$ sudo apt-get install git
$ git --version

3. Configure Git

Set your Git username and email (used for commit history):

$ git config --global user.name "Your Name"
$ git config --global user.email "youremail@example.com"

Verify your configuration:

$ git config --list
user.name=Your Name
user.email=youremail@example.com

Additional Useful Configurations

# Set default branch name to 'main'
$ git config --global init.defaultBranch main

# Enable colored output
$ git config --global color.ui auto

# Set default editor (choose one)
$ git config --global core.editor "nano"    # or "vim" or "code --wait"

4. SSH Key Setup (Required for GitHub)

SSH keys allow secure communication with GitHub without entering your password each time.

What are SSH Keys?

Step 1: Check for Existing SSH Keys

$ ls -la ~/.ssh

If you see id_ed25519 and id_ed25519.pub, you already have keys. Skip to Step 3.

Step 2: Generate a New SSH Key

For Linux/WSL/macOS:

$ ssh-keygen -t ed25519 -C "youremail@example.com"
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/user/.ssh/id_ed25519):

Press Enter to accept the default location.

Enter passphrase (empty for no passphrase):
Enter same passphrase again:

Enter a passphrase (recommended) or press Enter for no passphrase.

Your identification has been saved in /home/user/.ssh/id_ed25519
Your public key has been saved in /home/user/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:abc123... youremail@example.com

Alternative: RSA Key (for older systems)

If ed25519 is not supported:

$ ssh-keygen -t rsa -b 4096 -C "youremail@example.com"

Step 3: Start SSH Agent and Add Key

# Start the SSH agent
$ eval "$(ssh-agent -s)"
Agent pid 12345
# Add your SSH key to the agent
$ ssh-add ~/.ssh/id_ed25519
Identity added: /home/user/.ssh/id_ed25519 (youremail@example.com)

Step 4: Copy Your Public Key

$ cat ~/.ssh/id_ed25519.pub
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx youremail@example.com

Copy the entire output (starting with ssh-ed25519 and ending with your email).

Copy to Clipboard (Alternative Methods)

Linux/WSL:

$ cat ~/.ssh/id_ed25519.pub | clip.exe    # WSL
$ cat ~/.ssh/id_ed25519.pub | xclip       # Linux with xclip

macOS:

$ pbcopy < ~/.ssh/id_ed25519.pub

Step 5: Add SSH Key to GitHub

  1. Go to GitHub.com and log in
  2. Click your profile picture (top-right) β†’ Settings
  3. In the left sidebar, click SSH and GPG keys
  4. Click New SSH key (green button)
  5. Fill in the form:
    • Title: Give it a descriptive name (e.g., β€œMy Laptop WSL”, β€œLab Computer”)
    • Key type: Authentication Key
    • Key: Paste your public key (from Step 4)
  6. Click Add SSH key
  7. Confirm with your GitHub password if prompted

Step 6: Test Your SSH Connection

$ ssh -T git@github.com
Hi username! You've successfully authenticated, but GitHub does not provide shell access.

Troubleshooting SSH Connection

If you see β€œPermission denied”:

# Check if SSH agent has your key
$ ssh-add -l

# If empty, add your key again
$ ssh-add ~/.ssh/id_ed25519

# Test with verbose output
$ ssh -vT git@github.com

5. Creating a Repository

  1. Go to GitHub.com and log in
  2. Click the + icon (top-right) β†’ New repository
  3. Fill in the repository details:
    • Repository name: e.g., my-rnaseq-project
    • Description: Brief description (optional)
    • Public/Private: Choose visibility
    • Initialize with README: Check this box
    • Add .gitignore: Select a template (e.g., Python, R)
    • Choose a license: MIT is common for open source
  4. Click Create repository

Method 2: Clone the Repository to Your Computer

After creating the repository on GitHub, you clone it to create a local copy. Cloning copies the entire repository – all files and the full commit history – to your machine.

Git Clone

# Using SSH (recommended)
$ git clone git@github.com:yourusername/my-rnaseq-project.git

# Or using HTTPS
$ git clone https://github.com/yourusername/my-rnaseq-project.git

$ cd my-rnaseq-project
$ ls -la
total 8
drwxr-xr-x 3 user user 4096 Jan 20 10:00 .
drwxr-xr-x 5 user user 4096 Jan 20 10:00 ..
drwxr-xr-x 8 user user 4096 Jan 20 10:00 .git
-rw-r--r-- 1 user user   50 Jan 20 10:00 README.md

Method 3: Create Local Repository First (Terminal)

# Create project directory
$ mkdir my-project
$ cd my-project

# Initialize git repository
$ git init
Initialized empty Git repository in /home/user/my-project/.git/
# Create initial files
$ echo "# My Project" > README.md
$ echo "*.log" > .gitignore

# Stage and commit
$ git add .
$ git commit -m "Initial commit"

# Create repository on GitHub, then connect:
$ git remote add origin git@github.com:yourusername/my-project.git
$ git branch -M main
$ git push -u origin main

6. Basic Git Commands

Git Workflow Overview

Git uses a three-step process to save changes. Understanding this flow is the key to using Git effectively.

Working Directory  β†’  Staging Area  β†’  Local Repository  β†’  Remote Repository
      (edit)           (git add)        (git commit)          (git push)

Step 1: git add – Stage Your Changes

The staging area (also called the β€œindex”) is a holding zone where you prepare which changes to include in your next commit. This lets you selectively commit changes rather than committing everything at once.

Git Add - Staging Area

You can stage files individually, allowing you to organize your commits by feature or logical group:

Staging Area Concept

# Add specific file
$ git add filename.txt

# Add all changed files
$ git add .

# Add all files matching pattern
$ git add *.py

Step 2: git commit – Save a Snapshot

A commit is a snapshot of your entire repository at a specific point in time. Each commit is identified by a unique SHA hash and includes a message describing the changes. Commits form a linked list, building your project’s complete history.

Git Commit

$ git commit -m "Add analysis script for RNA-seq data"
[main 1a2b3c4] Add analysis script for RNA-seq data
 1 file changed, 50 insertions(+)

Step 3: git push – Upload to GitHub

Push sends your committed changes from your local repository to the remote repository on GitHub, making them available to collaborators.

Git Push

$ git push origin main

git status – Check Repository Status

$ git status
On branch main
Your branch is up to date with 'origin/main'.
nothing to commit, working tree clean

git pull – Download Updates from GitHub

Pull downloads changes from GitHub and merges them into your local branch. It combines two operations: fetch (download changes) and merge (integrate them).

Git Fetch

Git Pull

$ git pull origin main

git log / git diff – Viewing History

# View commit history
$ git log --oneline
1a2b3c4 Add analysis script for RNA-seq data
5e6f7g8 Initial commit
# View changes in a file
$ git diff filename.txt

# View who changed each line
$ git blame filename.txt

7. Working with Branches

A branch is a separate version of your code within a repository. Branching allows you to work on different features or bug fixes in isolation without affecting the default codebase (often called main). Once satisfied, you can merge changes back into the main branch.

Why Branching Matters

Creating a Branch

When you create a new branch, Git creates a new pointer that starts at the same commit as the branch you branched from. As you make new commits, the branch pointer advances, while the original branch stays where it was.

Git Branch Create

# Create and switch to a new branch
$ git checkout -b feature-branch

Modern Alternative: git switch

In newer versions of Git (2.23+), git switch replaces git checkout for branch operations:

# Create and switch to a new branch
$ git switch -c feature-branch

# Switch to an existing branch
$ git switch main

Both git checkout and git switch work – use whichever you prefer.

Switching Between Branches

Checking out a branch moves the HEAD pointer to point at the branch you are switching to. Your working directory is updated to reflect the state of that branch.

Git Checkout

# Switch to an existing branch
$ git checkout main

Making Commits on a Branch

Each commit on a branch advances that branch’s pointer forward, creating a separate line of development.

Git Branch and Commit Animation

Merging Branches

After completing work on a branch, merge it back into the main branch.

Fast-Forward Merge

When the main branch has not diverged (no new commits since the branch was created), Git can simply move the main pointer forward. This is called a fast-forward merge.

Git Fast-Forward Merge

Git Fast-Forward Merge Animation

$ git checkout main
$ git merge feature-branch

No-Fast-Forward Merge

When both branches have new commits (the history has diverged), Git creates a merge commit that combines the two lines of development.

Git No-Fast-Forward Merge

Git No-Fast-Forward Merge Animation

Handling Merge Conflicts

When the same lines of a file are changed differently in two branches, Git cannot automatically merge. You must resolve the conflict manually by editing the file, then staging and committing the result.

Git Merge Conflict Resolution Animation

Git marks the conflicting sections in the file with special markers:

<<<<<<< HEAD
print("Hello from main branch")
=======
print("Hello from feature branch")
>>>>>>> feature-branch

To resolve the conflict:

  1. Open the file and find the conflict markers (<<<<<<<, =======, >>>>>>>)
  2. Choose which version to keep (or combine both)
  3. Remove all conflict markers
  4. Stage and commit:
$ git add resolved-file.txt
$ git commit -m "Resolve merge conflict"

Deleting Branches

After merging, you can delete the branch:

$ git branch -d feature-branch

Pushing a Branch to GitHub

$ git push -u origin feature-branch

8. Using Git with VS Code

VS Code has excellent built-in Git support with a graphical interface.

Setup VS Code for Git

  1. Install VS Code: https://code.visualstudio.com/
  2. Open your project folder: File β†’ Open Folder
  3. Install Git Graph extension (optional but recommended):
    • Press Ctrl+Shift+X (Extensions)
    • Search β€œGit Graph”
    • Install

VS Code Git Interface

The Source Control panel (Ctrl+Shift+G) shows:

Common Git Operations in VS Code

Clone a Repository

  1. Press Ctrl+Shift+P β†’ Type β€œGit: Clone”
  2. Paste the repository URL
  3. Choose a folder location
  4. Open the cloned repository

Stage Changes

Commit Changes

  1. Enter a commit message in the text box
  2. Click the checkmark (or Ctrl+Enter)

Push/Pull

View Diff

Create/Switch Branches

VS Code Git Keyboard Shortcuts

Shortcut Action
Ctrl+Shift+G Open Source Control panel
Ctrl+Enter Commit (when in commit message box)
Ctrl+Shift+P β†’ β€œGit” Access all Git commands
  • GitLens: Advanced Git features, blame annotations
  • Git Graph: Visual commit history
  • GitHub Pull Requests: Manage PRs directly in VS Code

9. Complete Git Workflow Example

Step-by-Step Terminal Workflow

# 1. Clone repository
$ git clone git@github.com:yourusername/my-project.git
$ cd my-project

# 2. Create a new branch for your work
$ git checkout -b feature/add-qc-script

# 3. Make changes (create/edit files)
$ nano qc_script.py

# 4. Check what changed
$ git status
On branch feature/add-qc-script
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        qc_script.py
# 5. Stage your changes
$ git add qc_script.py

# 6. Commit with a descriptive message
$ git commit -m "Add quality control script for FASTQ files"
[feature/add-qc-script 1a2b3c4] Add quality control script for FASTQ files
 1 file changed, 50 insertions(+)
 create mode 100644 qc_script.py
# 7. Push to GitHub
$ git push -u origin feature/add-qc-script
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 1.20 KiB | 1.20 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To github.com:yourusername/my-project.git
 * [new branch]      feature/add-qc-script -> feature/add-qc-script
Branch 'feature/add-qc-script' set up to track remote branch 'feature/add-qc-script' from 'origin'.
# 8. Create Pull Request on GitHub website
# 9. After merge, update local main branch
$ git checkout main
$ git pull origin main

# 10. Delete the feature branch (optional)
$ git branch -d feature/add-qc-script

Daily Git Workflow Summary

# Start of day: Get latest changes
$ git pull origin main

# Create branch for new work
$ git checkout -b feature/my-feature

# Work on your code...
# Then stage and commit frequently
$ git add .
$ git commit -m "Descriptive message"

# Push when ready for review
$ git push -u origin feature/my-feature

# Create Pull Request on GitHub
# After merge, clean up
$ git checkout main
$ git pull origin main
$ git branch -d feature/my-feature

10. Collaborating with Others

GitHub facilitates collaboration through features like pull requests, issues, and code reviews.

Understanding Remote and Local Repositories

When working with others, understanding the relationship between remote and local repositories is essential. Your local repository has remote tracking branches that monitor the state of the remote repository.

Remote and Local Repositories

Forking a Repository

If you want to contribute to someone else’s project, you can create a fork:

Clone your forked repository to work on it locally:

$ git clone git@github.com:your-username/repository-name.git

Creating a Pull Request

A pull request (PR) is a way to propose changes you’ve made on a branch to be merged back into the main branch. Other collaborators can review, discuss, and approve the changes before merging.

Pull Request Review Process

After making changes in your forked repository:

Managing Issues

Issues are used to track tasks, enhancements, and bugs.

Code Reviews

Collaborators can review pull requests, suggest changes, and approve merges.

GitHub Flow

GitHub Flow is a simplified branching model ideal for most projects. It uses a single main branch with short-lived feature branches.

GitHub Flow

  1. Create a branch from main
  2. Make changes and commit
  3. Open a Pull Request
  4. Review and discuss the changes
  5. Merge back into main

11. Best Practices for Bioinformatics Projects

Organize Your Repository

Structure your repository to make it easy to navigate. A typical bioinformatics repo might include:

Write Clear Commit Messages

Use descriptive commit messages to explain what changes were made and why.

Good Example:

Add script for RNA-seq data normalization

Bad Example:

Update stuff

Use .gitignore Files

Exclude unnecessary files (e.g., large datasets, temporary files) from your repository by creating a .gitignore file.

Example .gitignore:

# Ignore data files
/data/raw/
/data/processed/

# Ignore temporary files
*.tmp
*.log

Document Your Work

Maintain clear documentation to help others understand and reproduce your analyses.

Version Control for Data

While Git handles code effectively, managing large datasets can be challenging. Consider using Git LFS (Large File Storage) for large files.

Installing Git LFS:

$ git lfs install

Tracking Large Files:

$ git lfs track "*.csv"
$ git lfs track "*.bam"
$ git lfs track "*.fastq.gz"

12. Advanced Git Commands

Git Stash - Save Work Temporarily

When you need to switch branches but have uncommitted changes, git stash temporarily saves your work without committing it.

Git Stash

# Save current changes temporarily
$ git stash

# List stashed changes
$ git stash list
stash@{0}: WIP on main: 1a2b3c4 Add analysis script
# Apply stashed changes
$ git stash pop

# Apply specific stash
$ git stash apply stash@{0}

Git Rebase - Clean History

Rebase replays your commits on top of another branch’s latest commit, creating a cleaner, linear project history.

Git Rebase

# Rebase your branch onto main
$ git checkout feature-branch
$ git rebase main

Warning: Rebase vs Merge

  • Merge preserves history as it happened (safe for shared branches)
  • Rebase creates linear history (use only for local/unshared branches)

Never rebase commits that have been pushed to a shared repository!

Git Cherry-pick - Select Specific Commits

Apply a specific commit from another branch:

# Get the commit hash from git log
$ git log --oneline other-branch
a1b2c3d Fix critical bug in alignment script
# Apply that commit to current branch
$ git cherry-pick a1b2c3d

Undoing Changes

Git provides several ways to undo changes depending on how far along you are in the workflow.

Git Reset

# Undo changes in working directory (before staging)
$ git checkout -- filename.txt

# Unstage a file (keep changes)
$ git reset HEAD filename.txt

# Undo last commit (keep changes)
$ git reset --soft HEAD~1

# Undo last commit (discard changes) - DANGEROUS!
$ git reset --hard HEAD~1

To safely undo a commit that has already been pushed, use git revert which creates a new commit that undoes the changes:

Git Revert

# Create a new commit that undoes a previous commit
$ git revert <commit-hash>

Git Tags - Mark Important Points

Tags are useful for marking releases:

# Create a tag
$ git tag -a v1.0 -m "Version 1.0 - Initial release"

# List tags
$ git tag

# Push tags to remote
$ git push origin --tags

13. Git Workflows for Bioinformatics

Feature Branch Workflow

main ─────●─────●─────●─────●─────●
           \         /
feature     ●───●───●
# Create feature branch
$ git checkout -b feature/add-qc-script

# Make changes and commit
$ git add qc_script.py
$ git commit -m "Add quality control script for FASTQ files"

# Push to GitHub
$ git push -u origin feature/add-qc-script

# Create Pull Request on GitHub, then merge
# After merge, clean up
$ git checkout main
$ git pull origin main
$ git branch -d feature/add-qc-script

Bioinformatics Project Structure

my_rnaseq_project/
β”œβ”€β”€ README.md
β”œβ”€β”€ .gitignore
β”œβ”€β”€ environment.yml          # Conda environment
β”œβ”€β”€ config/
β”‚   └── config.yaml          # Pipeline configuration
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/                  # Git LFS or .gitignore
β”‚   └── processed/
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ 01_qc.sh
β”‚   β”œβ”€β”€ 02_align.sh
β”‚   └── 03_count.sh
β”œβ”€β”€ notebooks/
β”‚   └── analysis.ipynb
└── results/
    β”œβ”€β”€ figures/
    └── tables/

Example .gitignore for Bioinformatics

# Large data files
*.fastq
*.fastq.gz
*.bam
*.sam
*.vcf
*.bed

# Results (regeneratable)
results/

# Temporary files
*.tmp
*.log
*.swp
*~

# OS files
.DS_Store
Thumbs.db

# Python
__pycache__/
*.pyc
.ipynb_checkpoints/

# R
.Rhistory
.RData

14. Quick Reference

Command Description
git init Initialize repository
git clone <url> Clone repository
git status Check status
git add <file> Stage changes
git commit -m "msg" Commit changes
git push origin <branch> Push to remote
git pull origin <branch> Pull from remote
git branch <name> Create branch
git checkout <branch> Switch branch
git checkout -b <name> Create & switch branch
git switch -c <name> Create & switch branch (modern)
git merge <branch> Merge branch
git log --oneline View history
git diff View changes
git stash Stash changes
git revert <hash> Undo a commit safely
git reset --hard HEAD~1 Undo last commit

15. Example: Cloning a Bioinformatics Tool from GitHub

Many bioinformatics tools are available on GitHub:

$ git clone https://github.com/DaehwanKimLab/hisat2.git
$ cd hisat2
$ ls -la
$ less README.md

For compilation instructions, see the Compile and Software Installation lesson.