Compile and Software installation – BCH709 Introduction to Bioinformatics


macOS

Install Homebrew

Homebrew is a package manager for macOS. Install it with:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
After installation, update Homebrew:
$ brew update

Install Prerequisite Software

$ brew install openssl readline sqlite3 xz wget

Ubuntu on Windows (WSL)

Install essential build tools and libraries:

$ sudo apt update
$ sudo apt install -y build-essential git curl wget libssl-dev libbz2-dev libreadline-dev libsqlite3-dev llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev zlib1g-dev

Install a test package

On Ubuntu systems:

$ sudo apt install screenfetch
$ screenfetch

On macOS systems:

$ brew install screenfetch
$ screenfetch

Programming Languages

Type	Languages
Compiled	FORTRAN, C, C++, Java, Rust, Go
Interpreted	Unix-Shell, awk, Perl, Ruby, Python, R, JavaScript

Perl

Perl is flexible and has a global repository (CPAN), which makes it easy to install new modules. It also has BioPerl, one of the first repositories of biological modules, which supports tasks such as phylogenetic analysis. Although Perl was once widely used in bioinformatics, Python has become more popular due to its ease of use, especially for beginners.

R & Python

R is excellent for statistical analysis, but if you prefer general-purpose programming, Python might suit you more. Python’s syntax rules are easier to follow, making it more beginner-friendly. It’s also easier to develop command-line tools in Python, and there are useful bioinformatics packages available in Python.

Bash

Bash (or shell scripting) is essential for bioinformaticians. It’s a powerful tool for data manipulation (sorting, filtering, etc.) and is often used on institutional clusters. It may seem intimidating at first, but with time, you will find it very efficient for repetitive tasks and system administration.

Python, Perl, R and bash

For wet-lab researchers starting with bioinformatics, R is a good choice to learn first. If you aim for a bioinformatics career, knowing R, Python, and Bash is recommended. For beginners, focusing on either R or Python, while learning Bash, can still be effective.

Other programming languages

C and C++

C and C++ are great for high-performance tools like aligners, but they are harder to learn and take more code to accomplish tasks that can be done more simply in Python.

Ruby

Ruby is popular for web applications but lacks the package support for bioinformatics that Python and R have.

JavaScript or PHP

These languages are better suited for web applications. Beginners in bioinformatics should start with Python or R before considering web development languages.

Java

Java has some uses in bioinformatics (e.g., the IGV genome browser), but it’s not beginner-friendly, especially when compared to Python or R.


Package, Library, and Module

Library

A library is a collection of related packages or modules. The term is often used for the Python Standard Library, which contains many modules that provide additional functionality.

Module

A module is a single file containing Python code. When you import a module, Python executes the code inside that file. Modules make code more reusable and manageable.

Package

A package is a collection of related modules that work together. It usually includes several files and folders organized in a specific structure.
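
As a rough illustration of the difference, the following shell sketch lays out a hypothetical Python package called `seqtools` (the package and file names are made up for this example). Each `.py` file is a module, and the directory that holds them, marked by `__init__.py`, is the package.

```shell
# Build the layout in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# The package is a directory; __init__.py marks it as a regular Python package
mkdir -p seqtools/io
touch seqtools/__init__.py seqtools/align.py          # align.py is one module
touch seqtools/io/__init__.py seqtools/io/fasta.py    # io is a sub-package

# Show the resulting structure
find seqtools | sort

# Count the module files that make up the package
module_count=$(find seqtools -name '*.py' | wc -l | tr -d ' ')
echo "$module_count"   # 4
```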

Module and library managers by programming language

Python - pip
Perl - cpan
R - built-in install.packages() function

File Permission


Understanding the attributes shown by `ls -l`:
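
A small sketch of what the long listing contains, using a sample file created in a temporary directory (the file name `reads.fastq` is only an example):

```shell
# Create a sample file in a throwaway directory
workdir=$(mktemp -d)
touch "$workdir/reads.fastq"
chmod u=rw,g=r,o=r "$workdir/reads.fastq"

# Long listing: one line per file
ls -l "$workdir/reads.fastq"

# Reading the line from left to right:
#   -rw-r--r--   file type (- file, d directory, l link), then owner/group/other permissions
#   1            number of hard links
#   user group   owning user and owning group
#   0            size in bytes
#   date, time   last modification time
#   name         file name
perms=$(ls -l "$workdir/reads.fastq" | cut -c1-10)
echo "$perms"   # -rw-r--r--
```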


Chmod (Change mode)

chmod is the command and system call which is used to change the access permissions of file system objects (files and directories). It is also used to change special mode flags. The request is filtered by the umask. The name is an abbreviation of change mode.

Applying Permission:
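
A minimal sketch of symbolic permission changes, run on a throwaway file (the file name `run_analysis.sh` is hypothetical):

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch run_analysis.sh

# Start from a known state: owner read/write, group and others read-only
chmod u=rw,g=r,o=r run_analysis.sh

# u = user (owner), g = group, o = others, a = all
# + adds a permission, - removes it, = sets it exactly
chmod u+x run_analysis.sh      # owner may now execute the script
chmod go-r run_analysis.sh     # group and others can no longer read it

perms=$(ls -l run_analysis.sh | cut -c1-10)
echo "$perms"   # -rwx------
```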


Using Octal number for Permissions:
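
In octal notation each digit is the sum of read (4), write (2), and execute (1), given in the order owner, group, others. A short sketch on a throwaway file (the file name is hypothetical):

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch run_analysis.sh

# Each digit is a sum: read = 4, write = 2, execute = 1
#   7 = 4+2+1 = rwx   (owner)
#   5 = 4+0+1 = r-x   (group)
#   4 = 4+0+0 = r--   (others)
chmod 754 run_analysis.sh

perms=$(ls -l run_analysis.sh | cut -c1-10)
echo "$perms"   # -rwxr-xr--
```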


Check your CPUs and Memory

Understanding your system resources is important for running bioinformatics tools efficiently.

# Check CPU information
$ lscpu

# Check memory usage
$ free -h

# Interactive process viewer (press 'q' to quit)
$ htop

# Check number of CPU cores
$ nproc

Quick System Info

# One-liner to show cores and memory
$ echo "CPUs: $(nproc), Memory: $(free -h | awk '/^Mem:/ {print $2}')"
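
Many tools accept a thread count (for example, the `-p` option of HISAT2). The sketch below derives one from the number of cores. Leaving one core free is only a rule of thumb assumed here, not a requirement, and the variable names are arbitrary.

```shell
# nproc is part of GNU coreutils; macOS reports the core count via sysctl instead
if command -v nproc >/dev/null 2>&1; then
    cores=$(nproc)
else
    cores=$(sysctl -n hw.ncpu)
fi

# Leave one core free for the system, but never go below one thread
if [ "$cores" -gt 1 ]; then
    threads=$((cores - 1))
else
    threads=1
fi

echo "Detected $cores cores; using $threads threads"
```

The value can then be passed to a tool, e.g. `hisat2 -p "$threads" ...`.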

RC files such as .bashrc and .zshrc

RC files configure the environment and prepare the system to run specific software. These are commonly used in Unix-like systems to automate shell configurations.
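
Commands of the form `echo '...' >> ~/.bashrc` add the same line again every time they are run. One possible way to avoid duplicates is to check for the line first. The sketch below uses a temporary file as a stand-in for `~/.bashrc`, and the helper name `add_once` is made up for this example.

```shell
# Stand-in for ~/.bashrc so the example does not touch your real configuration
rc_file=$(mktemp)
line='export PATH="$HOME/bin:$PATH"'

# Append the line only if an identical line is not already present
add_once() {
    grep -qxF "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}

add_once
add_once   # the second call changes nothing

count=$(grep -cxF "$line" "$rc_file")
echo "$count"   # 1
```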

Shell Customization

Prompt Customization for Linux/WSL

$ echo 'export PS1="\[\033[38;5;164m\]\u\[\033[0m\]@\[\033[38;5;2m\]\h\[\033[0m\] \[\033[38;5;172m\]\t\[\033[0m\] \[\033[38;5;2m\]\w\[\033[0m\]\n$ "' >> ~/.bashrc
$ echo "alias ls='ls --color=auto'" >> ~/.bashrc
$ source ~/.bashrc

Prompt Customization for macOS (Oh My Zsh)

$ sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
$ source ~/.zshrc

Connecting to HPC Cluster

For information on connecting to Pronghorn HPC cluster, see the HPC Cluster lesson.

Linux family tree

Package Management Concepts

Package management in Linux allows for easier installation and updating of software. It handles dependencies and ensures proper installation across systems. Popular tools include APT (Debian/Ubuntu), YUM and DNF (CentOS/Fedora), and Homebrew (macOS).

Without package management, users must ensure that all of the required dependencies for a piece of software are installed and up-to-date, compile the software from the source code (which takes time and introduces compiler-based variations from system to system), and manage configuration for each piece of software. Without package management, application files are located in the standard locations for the system to which the developers are accustomed, regardless of which system they’re using.

Package management systems attempt to solve these problems and are the tools through which developers attempt to increase the overall quality and coherence of a Linux-based operating system. The features that most package management applications provide are:

Package downloading: Operating-system projects provide package repositories which allow users to download their packages from a single, trusted provider. When you download from a package manager, the software can be authenticated and will remain in the repository even if the original source becomes unreliable.

Dependency resolution: Packages contain metadata which provides information about what other files are required by each respective package. This allows applications and their dependencies to be installed with one command, and for programs to rely on common, shared libraries, reducing bulk and allowing the operating system to manage updates to the packages.

A standard binary package format: Packages are uniformly prepared across the system to make installation easier. While some distributions share formats, compatibility issues between similarly formatted packages for different operating systems can occur.

Common installation and configuration locations: Linux distribution developers often have conventions for how applications are configured and the layout of files in the /etc/ and /etc/init.d/ directories; by using packages, distributions are able to enforce a single standard.

Additional system-related configuration and functionality: Occasionally, operating system developers will develop patches and helper scripts for their software which get distributed within the packages. These modifications can have a significant impact on user experience.

Quality control: Operating-system developers use the packaging process to test and ensure that the software is stable and free of bugs that might affect product quality and that the software doesn’t cause the system to become unstable. The subjective judgments and community standards that guide packaging and package management also guide the “feel” and “stability” of a given system.

In general, we recommend that you install the versions of software available in your distribution’s repository and packaged for your operating system. If packages for the application or software that you need to install aren’t available, we recommend that you find packages for your operating system, when available, before installing from source code.

The remainder of this guide will cover how to use specific package management systems and how to compile and package software yourself.

Advanced Packaging Tool (APT)

You may already be familiar with apt-get, a command which uses the advanced packaging tool to interact with the operating system’s package system. The most relevant and useful commands are (to be run with root privileges):

`apt-get install package-name(s)` - Installs the package(s) specified, along with any dependencies.

`apt-get remove package-name(s)` - Removes the package(s) specified, but does not remove dependencies.

`apt-get autoremove` - Removes any orphaned dependencies, meaning those that remain installed but are no longer required.

`apt-get clean` - Removes downloaded package files (.deb) for software that is already installed.

`apt-get purge package-name(s)` - Removes the package(s) specified together with their configuration files.

`apt-get update` - Reads the /etc/apt/sources.list file and updates the system’s database of packages available for installation. Run this after changing sources.list.

`apt-get upgrade` - Upgrades all packages if there are updates available. Run this after running apt-get update.

While apt-get provides the most often-used functionality, APT provides additional information through the apt-cache command.

`apt-cache search package-name(s)` - If you know the name of a piece of software but apt-get install fails or points to the wrong software, this looks for other possible names.

`apt-cache show package-name(s)` - Shows dependency information, version numbers and a basic description of the package.

`apt-cache depends package-name(s)` - Lists the packages that the specified package(s) depend upon in a tree. These are the packages that will be installed with the apt-get install command.

`apt-cache rdepends package-name(s)` - Outputs a list of packages that depend upon the specified package. This list can often be rather long, so it is best to pipe its output through a command, like less.

`apt-cache pkgnames` - Prints the name of every package APT knows about, whether or not it is installed. This list is often rather long, so it is best to pipe its output through a program, like less, or direct the output to a text file.

Combining most of these commands with apt-cache show can provide you with a lot of useful information about your system, the software that you might want to install, and the software that you have already installed.

Aptitude

Aptitude is another front-end interface for APT. In addition to a text-based interactive interface, Aptitude provides a combined command-line interface for most APT functionality. Its subcommands largely mirror those of apt-get, for example `aptitude install` and `aptitude remove`.

Using dpkg

Apt-get and apt-cache are merely frontend programs that provide a more usable interface and connections to repositories for the underlying package management tools called dpkg and debconf. These tools are quite powerful, and fully explaining their functionality is beyond the scope of this document. However, a basic understanding of how to use these tools is useful. Some important commands are:

`dpkg -i package-file-name.deb` - Installs a .deb file.

`dpkg --list search-pattern` - Lists packages currently installed on the system.

`dpkg --configure package-name(s)` - Runs a configuration interface to set up a package.

`dpkg-reconfigure package-name(s)` - Runs a configuration interface on an already installed package.

Fedora and CentOS Package Management

Fedora and CentOS are closely related distributions, being upstream and downstream (respectively) from Red Hat Enterprise Linux (RHEL). Their main differences stem from how packages are chosen for inclusion in their repositories.

CentOS uses yum (Yellowdog Updater, Modified) as a front end to interact with system repositories and install dependencies. It also includes a lower-level tool called rpm, which allows you to interact with individual packages.

Starting with version 22, Fedora uses the dnf package manager instead of YUM to interact with rpm. DNF supports many of the same commands as YUM, with some slight changes.

Note: Many operating systems aside from Red Hat use rpm packages. These include openSUSE, AIX, and Mandriva. While it may be possible to install an RPM packaged for one operating system on another, this is not supported or recommended, and the results of this action can vary greatly.

How about macOS?

Homebrew is a package manager for macOS that simplifies installing software such as Git, Ruby, and Node. Homebrew lets you avoid possible security problems associated with using the sudo command to install software like Node.

Homebrew has made extensive use of GitHub to expand the support of several packages through user contributions. In 2010, Homebrew was the third-most-forked repository on GitHub. In 2012, Homebrew had the largest number of new contributors on GitHub. In 2013, Homebrew had both the largest number of contributors and issues closed of any project on GitHub.

Homebrew has spawned several sub-projects, such as Linuxbrew, a Linux port now officially merged into Homebrew; Homebrew Cask, which builds upon Homebrew and focuses on the installation of GUI applications; and “taps” dedicated to specific areas or programming languages such as PHP.

macOS Requirements

- A 64-bit Intel CPU
- macOS 10.12 (or higher)
- Command Line Tools (CLT) for Xcode
- A Bourne-compatible shell for installation (e.g. bash or zsh)

Homebrew Commands

Update Homebrew:

$ brew update

Uninstall Homebrew:

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/uninstall.sh)"

List installed packages

We can get a list of all the installed packages on a Debian / Ubuntu server by issuing:

$ sudo dpkg --get-selections

On Ubuntu, you can also use:

$ apt list --installed

On macOS:

$ brew list

On RPM systems:

$ yum list installed

On BSD systems:

$ pkg_version

It is good practice to save this list, as it can be useful when migrating to a new system, so we redirect the output to a file:

$ dpkg --get-selections > ~/package_list
# On RPM systems: yum list installed > ~/package_list
# On BSD systems: pkg_version > ~/package_list
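
If the same script has to run on several systems, one option is to check which package manager is present before saving the list. This is only a sketch: the temporary output file stands in for `~/package_list`, and the order of the checks is an arbitrary choice.

```shell
out=$(mktemp)   # stand-in for ~/package_list

if command -v dpkg >/dev/null 2>&1; then
    dpkg --get-selections > "$out"          # Debian / Ubuntu
elif command -v brew >/dev/null 2>&1; then
    brew list > "$out"                      # macOS with Homebrew
elif command -v yum >/dev/null 2>&1; then
    yum list installed > "$out"             # RPM-based systems
else
    echo "no supported package manager found" > "$out"
fi

# Number of lines written to the list
wc -l < "$out"
```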

To search for a specific package run:

$ dpkg --get-selections | grep <package>
$ yum list installed "package_name"

Search packages

On Ubuntu systems:

$ apt search <package-name>

$ apt search firefox
$ apt search '^firefox'

On macOS systems:

$ brew search <package-name>

$ brew search firefox
$ brew search /^firefox/

In a regular expression, `^` anchors the match to the start of the line, so only names that begin with the search term are returned.
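
The effect of the anchor can be tried without any package manager by running `grep` over a short list. The package names below are made up to stand in for real search results.

```shell
# Made-up package names standing in for real search results
names='firefox
firefox-locale-en
unity-scope-firefoxbookmarks'

# Without the anchor, "firefox" may match anywhere in the name
anywhere=$(printf '%s\n' "$names" | grep -c 'firefox')

# With ^, the name has to start with "firefox"
at_start=$(printf '%s\n' "$names" | grep -c '^firefox')

echo "$anywhere $at_start"   # 3 2
```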

Install packages

Install single packages:

On Ubuntu systems:

$ sudo apt install <package-name>

On macOS systems:

$ brew install <package-name>

Install multiple packages:

On Ubuntu systems:

$ sudo apt install <package-name> <package-name> ...

On macOS systems:

$ brew install <package-name> <package-name> ...

Install specific version

Search version

On Ubuntu systems:

$ apt-cache policy <package-name>

On macOS systems:

$ brew search <package-name>

Install specific version

On Ubuntu systems:

$ sudo apt install firefox=68.0.1+build1-0ubuntu0.18.04.1

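On macOS systems (Homebrew only offers versions that are published as separate `<name>@<version>` formulae, so check `brew search` first):

$ brew install python@3.11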

Software

Software	Version	Manual	Available for	Description
FastQC	0.11.7	Link	Linux, MacOS, Windows	Quality control tool for high throughput sequence data.
HISAT2	2.1.0	Link	Linux, MacOS, Windows	Mapping RNA sequences against genome
BWA	0.7.17	Link	Linux, MacOS	Mapping DNA sequences against reference genome.

Micromamba

Micromamba is a fast, lightweight package manager that uses the same package format and channels as conda. It helps manage package dependencies and environments, making it easier to install packages and maintain reproducibility.

Why Micromamba?

Fast: Micromamba is written in C++ and is significantly faster than conda

Lightweight: No base environment or Python required

Compatible: Uses the same package repositories as conda (conda-forge, bioconda)

Simple: Single binary with no dependencies

Why Use a Package Manager?

Dependencies: When you install a package like Matplotlib, the package manager automatically installs its dependencies (NumPy and others) so you don’t have to install them manually.
Environments: You can have multiple isolated environments for different projects. For example, Project A needs Python 2.7 and Biopython 1.60, while Project B needs Python 3.10 and Biopython 1.80. Micromamba lets you switch between them easily.

Install Micromamba

Linux / WSL
$ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
Follow the prompts, then restart your shell or run:
$ source ~/.bashrc

macOS (Intel and Apple Silicon)
$ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
Follow the prompts, then restart your shell or run:
$ source ~/.zshrc

Verify Installation

$ micromamba --version

1.5.6

Create Symbolic Link for Conda Command

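To use the familiar conda command with micromamba, create a symbolic link. The micromamba shell initialization exports the path of the binary as MAMBA_EXE, which is more reliable here than `which` (in zsh, `which` can print a shell function instead of a path):

$ mkdir -p ~/bin
$ ln -sf "${MAMBA_EXE:-$(which micromamba)}" ~/bin/conda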
$ echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc  # or ~/.zshrc for macOS
$ source ~/.bashrc  # or source ~/.zshrc for macOS

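Now you can use either micromamba or conda. Note that `activate` and `deactivate` rely on a shell function that the micromamba shell initialization defines under the name micromamba; if `conda activate` complains that it cannot modify the parent shell, run `micromamba activate` instead or add `alias conda=micromamba` to your rc file. Check that the link works: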

$ conda --version

1.5.6

Creating and Using Environments

To create a new environment with Python 3.10 and activate it:

$ conda create -n bch709 python=3.10
$ conda activate bch709

(bch709) $

You will see the environment name (bch709) in your prompt.

Installing Packages

Install packages in your active environment:

$ conda install <package-name>

Environments are stored in ~/micromamba/envs/<environment_name>.

Deactivating and Removing Environments

Deactivate the current environment:

$ conda deactivate

Remove an environment:

$ conda env remove --name bch709

Setting Up Channels for Bioinformatics

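Bioconda is a channel dedicated to bioinformatics software. Set up channels in the correct priority order. Each `--add` places the channel at the top of the list, so the channel added last (conda-forge) gets the highest priority: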

$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict

Installing Bioinformatics Packages

Search for a package:

$ conda search hisat2

Install from Bioconda:

$ conda install hisat2

Installing R and R Packages

Install R and popular R packages:

$ conda install -c conda-forge r-base r-essentials

Quick Reference: Common Commands

Command	Description
`conda create -n <env> python=3.10`	Create new environment
`conda activate <env>`	Activate environment
`conda deactivate`	Deactivate current environment
`conda env list`	List all environments
`conda list`	List installed packages
`conda install <package>`	Install a package
`conda update <package>`	Update a package
`conda remove <package>`	Remove a package
`conda env remove -n <env>`	Remove an environment
`conda search <package>`	Search for a package

Environment Management In-Depth

Listing Environments

View all your environments:

$ conda env list

  Name       Active  Path
──────────────────────────────────────────────────────────
  base               /home/user/micromamba
  bch709      *      /home/user/micromamba/envs/bch709
  rnaseq             /home/user/micromamba/envs/rnaseq

The * indicates the currently active environment.

Creating Environments with Specific Packages

Create an environment with multiple packages at once:

# Create environment with Python and packages
$ conda create -n rnaseq python=3.10 hisat2 samtools fastqc

# Create environment with specific versions
$ conda create -n legacy python=2.7 biopython=1.70

Cloning an Environment

Make a copy of an existing environment:

$ conda create --name rnaseq_backup --clone rnaseq

Installing Specific Package Versions

# Install specific version
$ conda install numpy=1.24.0

# Install minimum version
$ conda install "numpy>=1.20"

# Install within version range
$ conda install "numpy>=1.20,<1.25"

Searching for Packages

# Search for package
$ conda search biopython

# Search with channel
$ conda search -c bioconda hisat2

# Show detailed package info
$ conda search biopython --info

biopython 1.81 py310h5eee18b_0
────────────────────────────────
file name   : biopython-1.81-py310h5eee18b_0.conda
channel     : conda-forge
dependencies:
  - numpy >=1.22
  - python >=3.10,<3.11.0a0

Updating Packages

# Update specific package
$ conda update numpy

# Update all packages in environment
$ conda update --all

# Update conda/micromamba itself
$ micromamba self-update

Removing Packages

# Remove a package
$ conda remove numpy

# Remove multiple packages
$ conda remove numpy scipy pandas

Using pip Inside Conda Environments

Sometimes packages are only available via pip. Always install conda packages first, then pip packages.

# Activate your environment first
$ conda activate bch709

# Install pip packages
$ pip install some-package

# Best practice: create environment with pip included
$ conda create -n myenv python=3.10 pip

Warning: Mixing Conda and Pip

Always install as many packages as possible with conda first

Only use pip for packages not available in conda

After using pip, avoid running conda install (can cause conflicts)

If you must mix, reinstall pip packages after conda changes

Environment History and Reverting Changes

View environment change history:

$ conda list --revisions

2024-01-20 10:00:00  (rev 0)
    +python-3.10.0
    +pip-24.0

2024-01-20 10:05:00  (rev 1)
    +numpy-1.24.0
    +scipy-1.11.0

Revert to a previous revision:

$ conda install --revision 0

Exporting and Importing Environments

Export Full Environment (Exact Reproduction)

$ conda env export --name bch709 > bch709_env.yaml

This creates a file like:

name: bch709
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.10.13
  - numpy=1.24.0
  - pandas=2.0.3
  - hisat2=2.2.1

Export Cross-Platform Environment (Recommended)

For sharing with others on different systems:

$ conda env export --name bch709 --no-builds > bch709_env.yaml

Create Environment from File

$ conda env create --file bch709_env.yaml

Update Existing Environment from File

$ conda env update --name bch709 --file bch709_env.yaml

Environment Best Practices

1. One Project = One Environment

# Create separate environments for each project
$ conda create -n project_rnaseq python=3.10 hisat2 samtools
$ conda create -n project_variant python=3.10 bwa gatk4

2. Document Your Environment

Always save your environment specification:

# After installing all packages
$ conda env export --no-builds > environment.yaml

# Add to your project's git repository
$ git add environment.yaml
$ git commit -m "Add conda environment specification"

3. Use Environment Files for Reproducibility

Create environment.yaml manually for your project:

name: my_rnaseq_project
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.10
  - hisat2=2.2.1
  - samtools=1.17
  - fastqc=0.12.1
  - multiqc=1.14
  - pandas
  - matplotlib
  - pip:
    - some-pip-only-package

Then create the environment:

$ conda env create -f environment.yaml

4. Naming Conventions

Use descriptive names:

# Good names
$ conda create -n rnaseq_2024
$ conda create -n chipseq_analysis
$ conda create -n python27_legacy

# Avoid generic names
# Bad: env1, test, myenv

Troubleshooting Common Issues

Environment Activation Not Working

# Initialize shell (run once after installation)
$ micromamba shell init --shell bash --root-prefix ~/micromamba

# Restart your terminal or source the rc file
$ source ~/.bashrc

Solving Package Conflicts

If conda is slow or fails to solve:

# Create minimal environment first
$ conda create -n myenv python=3.10

# Then install packages one by one
$ conda activate myenv
$ conda install numpy
$ conda install pandas

Disk Space Issues

Conda environments can grow large. Clean up unused packages:

# Remove unused packages and cache
$ conda clean --all

# Check environment size
$ du -sh ~/micromamba/envs/*

2.1G    /home/user/micromamba/envs/bch709
1.5G    /home/user/micromamba/envs/rnaseq
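
To find which environment takes the most space, the sizes can be sorted numerically. The sketch below builds two made-up environment directories in a temporary location instead of reading `~/micromamba/envs`, so the directory names and sizes are purely illustrative.

```shell
# Stand-in for ~/micromamba/envs with two made-up environments
envs_dir=$(mktemp -d)
mkdir "$envs_dir/small_env" "$envs_dir/large_env"
head -c 4096 /dev/urandom > "$envs_dir/small_env/data"
head -c 2097152 /dev/urandom > "$envs_dir/large_env/data"

# -k reports sizes in kilobytes so they sort numerically; largest first
du -sk "$envs_dir"/* | sort -rn

# Keep only the path of the largest entry
largest=$(du -sk "$envs_dir"/* | sort -rn | head -n 1 | awk '{print $2}')
basename "$largest"   # large_env
```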

Using Environments Without Activation

You can run commands from a specific environment without activating it first. This is useful for scripts, automation, and one-off commands.

Method 1: Using Full Path to Executable

Run programs directly using their full path:

WSL/Linux:

# Run Python from a specific environment
$ ~/micromamba/envs/bch709/bin/python script.py

# Run hisat2 from a specific environment
$ ~/micromamba/envs/bch709/bin/hisat2 --version

# Run any tool
$ ~/micromamba/envs/bch709/bin/fastqc reads.fastq.gz

macOS:

# Run Python from a specific environment
$ ~/micromamba/envs/bch709/bin/python script.py

# Run hisat2 from a specific environment
$ ~/micromamba/envs/bch709/bin/hisat2 --version

Method 2: Using `conda run` (Recommended)

The conda run command executes a command in an environment without activation:

# Basic syntax
$ conda run -n <env_name> <command>

# Examples
$ conda run -n bch709 python --version

Python 3.10.13

$ conda run -n bch709 hisat2 --version

hisat2-align-s version 2.2.1

# Run a Python script
$ conda run -n bch709 python my_analysis.py

# Run with arguments
$ conda run -n bch709 fastqc -o results/ reads.fastq.gz

# Run multiple commands (use quotes)
$ conda run -n bch709 bash -c "hisat2 --version && samtools --version"

Method 3: Using `micromamba run`

If using micromamba directly:

$ micromamba run -n bch709 python script.py
$ micromamba run -n bch709 hisat2 --version

Use Cases for Running Without Activation

1. Shell Scripts:

#!/bin/bash
# No need to activate - just use conda run
conda run -n bch709 fastqc raw_reads/*.fastq.gz
conda run -n bch709 multiqc .

2. Cron Jobs / Scheduled Tasks:

# In crontab - run daily at midnight
0 0 * * * /home/user/micromamba/bin/micromamba run -n bch709 python /home/user/scripts/backup.py

3. One-off Commands:

# Quick check without changing your current environment
$ conda run -n rnaseq samtools --version
$ conda run -n variant bwa

4. Comparing Tool Versions Across Environments:

$ conda run -n env1 python --version
$ conda run -n env2 python --version

Setting PATH Temporarily

You can also prepend the environment’s bin directory to PATH:

# Temporarily use environment's tools (single command)
$ PATH=~/micromamba/envs/bch709/bin:$PATH hisat2 --version

# For a subshell session
$ (export PATH=~/micromamba/envs/bch709/bin:$PATH; hisat2 --version; samtools --version)
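
The following sketch shows that a `PATH` assignment placed in front of a command affects only that command. It uses a temporary directory with a made-up `demo-tool` script in place of a real environment's bin directory.

```shell
# Stand-in for ~/micromamba/envs/bch709/bin, holding one made-up tool
envbin=$(mktemp -d)
printf '#!/bin/sh\necho "demo-tool 1.0"\n' > "$envbin/demo-tool"
chmod u+x "$envbin/demo-tool"

# The assignment in front of the command applies to that command only
out=$(PATH="$envbin:$PATH" demo-tool)
echo "$out"   # demo-tool 1.0

# Afterwards the shell's own PATH is unchanged, so the tool is not found
if command -v demo-tool >/dev/null 2>&1; then found=yes; else found=no; fi
echo "$found"   # no
```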

Quick Reference: Environment Commands

Command	Description
`conda env list`	List all environments
`conda create -n <name>`	Create environment
`conda create -n <name> --clone <source>`	Clone environment
`conda activate <name>`	Activate environment
`conda deactivate`	Deactivate environment
`conda run -n <name> <cmd>`	Run command without activation
`conda env remove -n <name>`	Remove environment
`conda env export > env.yaml`	Export environment
`conda env create -f env.yaml`	Create from file
`conda env update -f env.yaml`	Update from file
`conda list --revisions`	Show history
`conda install --revision N`	Revert to revision
`conda clean --all`	Clean cache

Using Conda Environments in VS Code

VS Code integrates well with conda/micromamba environments, making it easy to develop and run code in isolated environments.

Step 1: Install VS Code

Windows (WSL)

Download VS Code from https://code.visualstudio.com/

Install on Windows (not inside WSL)

Install the WSL extension in VS Code

Open WSL terminal and type code . to launch VS Code connected to WSL

macOS

Download VS Code from https://code.visualstudio.com/

Move to Applications folder

Open VS Code, press Cmd+Shift+P, type “Shell Command: Install ‘code’ command in PATH”

Now you can use code . from Terminal

Step 2: Install Required Extensions

Open VS Code and open the Extensions view:

Platform	Open Extensions
WSL/Linux	`Ctrl+Shift+X`
macOS	`Cmd+Shift+X`

Install these extensions:

Python (by Microsoft) - Required for Python development
Pylance (by Microsoft) - Enhanced Python language support
WSL (by Microsoft) - Required for Windows/WSL users

Or install from command line:

$ code --install-extension ms-python.python
$ code --install-extension ms-python.vscode-pylance
$ code --install-extension ms-vscode-remote.remote-wsl  # WSL only

Step 3: Select Python Interpreter (Conda Environment)

Open VS Code in your project folder:
```
$ cd ~/my_project
$ code .
```
Open Command Palette:
- WSL/Linux: Ctrl+Shift+P
- macOS: Cmd+Shift+P
Type “Python: Select Interpreter” and press Enter

You’ll see a list of available environments:

WSL/Linux:

Python 3.10.13 ('bch709')    ~/micromamba/envs/bch709/bin/python
Python 3.10.13 ('rnaseq')    ~/micromamba/envs/rnaseq/bin/python

macOS:

Python 3.10.13 ('bch709')    ~/micromamba/envs/bch709/bin/python
Python 3.10.13 ('rnaseq')    ~/micromamba/envs/rnaseq/bin/python

Select your desired environment (e.g., bch709)
The selected environment appears in the bottom status bar

Can’t Find Your Environment? (WSL/Linux)

If your conda environment doesn’t appear:
# Make sure conda is initialized
$ micromamba shell init --shell bash --root-prefix ~/micromamba
$ source ~/.bashrc

# Verify environment exists
$ conda env list
Then restart VS Code and try again.

Can’t Find Your Environment? (macOS)

If your conda environment doesn’t appear:
# Make sure conda is initialized
$ micromamba shell init --shell zsh --root-prefix ~/micromamba
$ source ~/.zshrc

# Verify environment exists
$ conda env list
Then restart VS Code and try again.

Step 4: Understanding VS Code Settings

VS Code has two types of settings:

Type	Location	Scope
User Settings	`settings.json`	Applies to ALL projects
Workspace Settings	`.vscode/settings.json`	Applies to ONE project only

Workspace settings override User settings for that specific project.

Step 5: Configure User Settings (Global)

User settings apply to all your VS Code projects.

How to open User settings.json:

Platform	Method 1: Command Palette	Method 2: File Location
WSL/Linux	`Ctrl+Shift+P` → “Preferences: Open User Settings (JSON)”	`~/.config/Code/User/settings.json`
macOS	`Cmd+Shift+P` → “Preferences: Open User Settings (JSON)”	`~/Library/Application Support/Code/User/settings.json`

Complete User settings.json for WSL/Linux

{
    // Python and Conda Settings
    "python.condaPath": "/home/YOURUSERNAME/micromamba/bin/micromamba",
    "python.defaultInterpreterPath": "/home/YOURUSERNAME/micromamba/envs/bch709/bin/python",
    "python.terminal.activateEnvironment": true,
    "python.terminal.activateEnvInCurrentTerminal": true,

    // Terminal Settings
    "terminal.integrated.env.linux": {
        "PATH": "/home/YOURUSERNAME/micromamba/bin:/home/YOURUSERNAME/micromamba/condabin:${env:PATH}"
    },
    "terminal.integrated.defaultProfile.linux": "bash",

    // Editor Settings (optional but recommended)
    "editor.fontSize": 14,
    "editor.tabSize": 4,
    "editor.insertSpaces": true,
    "files.autoSave": "afterDelay"
}

Important: Replace YOURUSERNAME with your actual username (use whoami command to check).
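
Instead of editing each path by hand, the placeholder can be substituted with `sed`. This is a sketch under assumptions: the file names `settings.template.json` and `settings.json` in a temporary directory are made up for the example, and only one setting is shown.

```shell
# A shortened copy of the settings template, written to a temporary directory
workdir=$(mktemp -d)
cat > "$workdir/settings.template.json" <<'EOF'
{
    "python.defaultInterpreterPath": "/home/YOURUSERNAME/micromamba/envs/bch709/bin/python"
}
EOF

# Replace every YOURUSERNAME with the current user name
user=$(whoami)
sed "s/YOURUSERNAME/$user/g" "$workdir/settings.template.json" > "$workdir/settings.json"

cat "$workdir/settings.json"
```

Review the result before copying it into your real settings file.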

Complete User settings.json for macOS

{
    // Python and Conda Settings
    "python.condaPath": "/Users/YOURUSERNAME/micromamba/bin/micromamba",
    "python.defaultInterpreterPath": "/Users/YOURUSERNAME/micromamba/envs/bch709/bin/python",
    "python.terminal.activateEnvironment": true,
    "python.terminal.activateEnvInCurrentTerminal": true,

    // Terminal Settings
    "terminal.integrated.env.osx": {
        "PATH": "/Users/YOURUSERNAME/micromamba/bin:/Users/YOURUSERNAME/micromamba/condabin:${env:PATH}"
    },
    "terminal.integrated.defaultProfile.osx": "zsh",

    // Editor Settings (optional but recommended)
    "editor.fontSize": 14,
    "editor.tabSize": 4,
    "editor.insertSpaces": true,
    "files.autoSave": "afterDelay"
}

Important: Replace YOURUSERNAME with your actual username (use whoami command to check).

Step 6: Configure Workspace Settings (Project-Specific)

Workspace settings apply only to a specific project. This is useful when different projects need different Python environments.

How to create .vscode/settings.json:

# Navigate to your project folder
$ cd ~/my_project

# Create .vscode directory
$ mkdir -p .vscode

# Create settings.json file
$ nano .vscode/settings.json

Or in VS Code:

Open Command Palette (Ctrl+Shift+P / Cmd+Shift+P)
Type “Preferences: Open Workspace Settings (JSON)”
This creates .vscode/settings.json automatically

Complete .vscode/settings.json for WSL/Linux

{
    // Project-specific Python environment
    "python.defaultInterpreterPath": "/home/YOURUSERNAME/micromamba/envs/bch709/bin/python",
    "python.terminal.activateEnvironment": true,

    // Terminal will use this environment
    "terminal.integrated.env.linux": {
        "PATH": "/home/YOURUSERNAME/micromamba/envs/bch709/bin:${env:PATH}",
        "CONDA_DEFAULT_ENV": "bch709",
        "CONDA_PREFIX": "/home/YOURUSERNAME/micromamba/envs/bch709"
    },

    // Python analysis settings
    "python.analysis.extraPaths": [
        "${workspaceFolder}/src",
        "${workspaceFolder}/lib"
    ],

    // File associations (optional)
    "files.associations": {
        "*.fasta": "plaintext",
        "*.fastq": "plaintext",
        "*.fa": "plaintext",
        "*.fq": "plaintext",
        "*.gff": "plaintext",
        "*.gtf": "plaintext",
        "*.bed": "plaintext",
        "*.sam": "plaintext",
        "*.vcf": "plaintext"
    }
}

Complete .vscode/settings.json for macOS

{
    // Project-specific Python environment
    "python.defaultInterpreterPath": "/Users/YOURUSERNAME/micromamba/envs/bch709/bin/python",
    "python.terminal.activateEnvironment": true,

    // Terminal will use this environment
    "terminal.integrated.env.osx": {
        "PATH": "/Users/YOURUSERNAME/micromamba/envs/bch709/bin:${env:PATH}",
        "CONDA_DEFAULT_ENV": "bch709",
        "CONDA_PREFIX": "/Users/YOURUSERNAME/micromamba/envs/bch709"
    },
    "terminal.integrated.defaultProfile.osx": "zsh",

    // Python analysis settings
    "python.analysis.extraPaths": [
        "${workspaceFolder}/src",
        "${workspaceFolder}/lib"
    ],

    // File associations (optional)
    "files.associations": {
        "*.fasta": "plaintext",
        "*.fastq": "plaintext",
        "*.fa": "plaintext",
        "*.fq": "plaintext",
        "*.gff": "plaintext",
        "*.gtf": "plaintext",
        "*.bed": "plaintext",
        "*.sam": "plaintext",
        "*.vcf": "plaintext"
    }
}

Settings Reference

Setting	Description
`python.condaPath`	Path to conda/micromamba executable
`python.defaultInterpreterPath`	Default Python interpreter for the project
`python.terminal.activateEnvironment`	Auto-activate environment in terminal
`python.terminal.activateEnvInCurrentTerminal`	Activate in existing terminal
`terminal.integrated.env.linux`	Environment variables for Linux terminal
`terminal.integrated.env.osx`	Environment variables for macOS terminal
`terminal.integrated.defaultProfile.linux`	Default shell (bash)
`terminal.integrated.defaultProfile.osx`	Default shell (zsh)
`python.analysis.extraPaths`	Additional paths for Python imports
`files.associations`	Associate file extensions with languages

Finding Your Username and Paths

# Find your username
$ whoami

john

# Find micromamba path
$ which micromamba

/home/john/micromamba/bin/micromamba

# Find Python path in environment
$ conda activate bch709
$ which python

/home/john/micromamba/envs/bch709/bin/python
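
If you prefer one command that prints the interpreter path for settings.json, something like the following may help. It is a minimal sketch: env_python is a hypothetical helper, and it assumes environments live under $MAMBA_ROOT_PREFIX/envs or, if that variable is unset, under ~/micromamba/envs as in the examples above.

```shell
# env_python ENV_NAME [ROOT_PREFIX]
# Print the interpreter path for a micromamba environment and check it exists.
# ROOT_PREFIX defaults to $MAMBA_ROOT_PREFIX, then to ~/micromamba (the
# location used throughout this guide; yours may differ).
env_python() (
    env_name=$1
    root=${2:-${MAMBA_ROOT_PREFIX:-$HOME/micromamba}}
    python_path="$root/envs/$env_name/bin/python"

    if [ -x "$python_path" ]; then
        echo "$python_path"
    else
        echo "no executable Python at $python_path" >&2
        exit 1
    fi
)
```

Usage: env_python bch709 prints the value to paste into "python.defaultInterpreterPath", or an error if the environment has not been created yet.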

Example Project Structure with .vscode

my_rnaseq_project/
├── .vscode/
│   └── settings.json      # Project-specific VS Code settings
├── data/
│   ├── raw/
│   └── processed/
├── scripts/
│   ├── qc.py
│   └── analysis.py
├── results/
├── environment.yaml       # Conda environment file
└── README.md
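
The layout above can be created in one step. This is only a sketch: make_project is a hypothetical helper, the file names are taken from the tree above, and the files it creates are empty placeholders (settings.json starts as an empty JSON object).

```shell
# make_project NAME
# Create the directory layout shown above under ./NAME.
# Existing files are left untouched.
make_project() (
    name=$1
    mkdir -p "$name/.vscode" "$name/data/raw" "$name/data/processed" \
             "$name/scripts" "$name/results" || exit 1

    # Start settings.json as an empty JSON object so it is valid JSON.
    [ -e "$name/.vscode/settings.json" ] || echo '{}' > "$name/.vscode/settings.json"

    # Create empty placeholder files only if they do not exist yet.
    for f in scripts/qc.py scripts/analysis.py environment.yaml README.md; do
        [ -e "$name/$f" ] || : > "$name/$f"
    done
)
```

Usage: make_project my_rnaseq_project, then open the new folder in VS Code.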

Quick Setup Script

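Create your .vscode/settings.json quickly. Because the heredoc delimiter (EOF) is unquoted, the shell expands ${USERNAME}, while the backslash in \${env:PATH} keeps that placeholder literal for VS Code: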

WSL/Linux:

$ cd ~/my_project
$ mkdir -p .vscode
$ USERNAME=$(whoami)
$ cat > .vscode/settings.json << EOF
{
    "python.defaultInterpreterPath": "/home/${USERNAME}/micromamba/envs/bch709/bin/python",
    "python.terminal.activateEnvironment": true,
    "terminal.integrated.env.linux": {
        "PATH": "/home/${USERNAME}/micromamba/envs/bch709/bin:\${env:PATH}"
    }
}
EOF

macOS:

$ cd ~/my_project
$ mkdir -p .vscode
$ USERNAME=$(whoami)
$ cat > .vscode/settings.json << EOF
{
    "python.defaultInterpreterPath": "/Users/${USERNAME}/micromamba/envs/bch709/bin/python",
    "python.terminal.activateEnvironment": true,
    "terminal.integrated.env.osx": {
        "PATH": "/Users/${USERNAME}/micromamba/envs/bch709/bin:\${env:PATH}"
    },
    "terminal.integrated.defaultProfile.osx": "zsh"
}
EOF

Now VS Code will use this environment whenever you open this project.
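
The two variants differ mainly in the home directory and in the linux/osx suffix of the terminal setting, so they can be folded into one function. The following is a sketch under the same assumption as the rest of this guide (environments under ~/micromamba/envs); write_workspace_settings is a hypothetical name, and the function overwrites any existing .vscode/settings.json in the project.

```shell
# write_workspace_settings PROJECT_DIR [ENV_NAME]
# Write PROJECT_DIR/.vscode/settings.json for the current OS.
# Assumes environments live in ~/micromamba/envs, as elsewhere in this guide.
# Overwrites an existing settings.json.
write_workspace_settings() (
    project=$1
    env_name=${2:-bch709}
    env_bin="$HOME/micromamba/envs/$env_name/bin"

    # VS Code uses "osx" for macOS and "linux" for Linux/WSL in setting names.
    case "$(uname -s)" in
        Darwin) platform=osx ;;
        *)      platform=linux ;;
    esac

    mkdir -p "$project/.vscode" || exit 1
    # Unquoted EOF: the shell expands ${env_bin} and ${platform}, while the
    # backslash keeps \${env:PATH} literal so VS Code can expand it later.
    cat > "$project/.vscode/settings.json" << EOF
{
    "python.defaultInterpreterPath": "${env_bin}/python",
    "python.terminal.activateEnvironment": true,
    "terminal.integrated.env.${platform}": {
        "PATH": "${env_bin}:\${env:PATH}"
    }
}
EOF
)
```

Usage: write_workspace_settings ~/my_project bch709. Using $HOME avoids hard-coding /home or /Users.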

Running Python Scripts

Method	How
Run Button	Click ▶️ in top-right corner of `.py` file
Terminal	Ctrl+` → `python my_script.py`
Run Selection	Select code → `Shift+Enter`

Running Bioinformatics Tools

When your conda environment is active in the VS Code terminal, the installed tools are on your PATH:

# Check that tools are available
(bch709) $ which hisat2
(bch709) $ which samtools

# Run tools directly
(bch709) $ fastqc reads.fastq.gz
(bch709) $ hisat2 --version
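
To check several tools at once, a small loop around command -v is enough. This is an illustrative sketch; check_tools is a hypothetical helper, and the tool names in the usage line are the ones used above.

```shell
# check_tools TOOL...
# Report which of the named commands are on PATH.
# Returns non-zero if at least one is missing.
check_tools() (
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "ok       $tool -> $path"
        else
            echo "MISSING  $tool" >&2
            missing=1
        fi
    done
    exit "$missing"
)
```

Usage: check_tools hisat2 samtools fastqc. A MISSING line usually means the environment is not active or the package is not installed in it.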

VS Code Keyboard Shortcuts

Note: On macOS, replace Ctrl with Cmd, except for Toggle Terminal, which is Ctrl+` on both platforms.

Action	Shortcut
Command Palette	`Ctrl+Shift+P`
Open Settings	`Ctrl+,`
Toggle Terminal	Ctrl+`
Run Python File	`F5` or Click ▶️
Run Selection	`Shift+Enter`
Save File	`Ctrl+S`
Find in Files	`Ctrl+Shift+F`
Go to File	`Ctrl+P`

Troubleshooting VS Code + Conda

Issue	Solution
Environment not listed	Restart VS Code, run `source ~/.bashrc` (Linux) or `source ~/.zshrc` (macOS)
Terminal not activating	Add `"python.terminal.activateEnvironment": true` to settings.json
Import errors	Verify package installed: `conda list`
WSL not connecting	Install WSL extension, reopen folder in WSL
Settings not applying	Check for JSON syntax errors in settings.json
Path not found	Use absolute paths, verify with `which python`

References

Micromamba documentation: https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html
Conda-forge: https://conda-forge.org/
BioConda: https://bioconda.github.io/
Conda cheat sheet: https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html
VS Code Python: https://code.visualstudio.com/docs/python/environments

Compiling Software from Source

Sometimes you need to compile software from source code when:

The software isn’t available in package managers
You need a specific version or custom options
You want the latest development version

Prerequisites for Compiling

Make sure you have build tools installed:

Linux/WSL:

$ sudo apt update
$ sudo apt install build-essential git curl wget

macOS:

$ xcode-select --install
$ brew install gcc make

Example 1: Compiling HISAT2 from GitHub

HISAT2 is a fast aligner for RNA-seq data.

# Create a directory for bioinformatics tools
$ mkdir -p ~/bch709/bin
$ cd ~/bch709/bin

# Clone the repository
$ git clone https://github.com/DaehwanKimLab/hisat2.git
$ cd hisat2

# Check the documentation
$ less README.md

# Compile (-j sets the number of parallel jobs; 4 is used here, adjust it to your number of CPU cores)
$ make -j 4

After compilation, add the HISAT2 directory to your PATH (on macOS with zsh, use ~/.zshrc instead of ~/.bashrc):

$ echo 'export PATH="$HOME/bch709/bin/hisat2:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

# Test installation
$ hisat2 --version
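
Running the echo ... >> ~/.bashrc line more than once appends a duplicate entry each time. One way to avoid that is to append only when the line is missing. The helper below is a hypothetical sketch, not a standard command.

```shell
# add_path_once DIR [RC_FILE]
# Append an `export PATH=...` line for DIR to RC_FILE unless that exact line
# is already there. RC_FILE defaults to ~/.bashrc (use ~/.zshrc for zsh).
add_path_once() (
    dir=$1
    rc_file=${2:-$HOME/.bashrc}
    line="export PATH=\"$dir:\$PATH\""

    # -F: fixed string, -x: match the whole line, -q: quiet.
    if [ -f "$rc_file" ] && grep -Fxq "$line" "$rc_file"; then
        echo "already present in $rc_file"
    else
        printf '%s\n' "$line" >> "$rc_file"
        echo "added to $rc_file"
    fi
)
```

Usage: add_path_once "$HOME/bch709/bin/hisat2", followed by source ~/.bashrc.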

Example 2: Compiling BWA from Source

BWA is a DNA sequence aligner.

$ cd ~/bch709/bin

# Download source code
$ curl -OL http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2
$ tar xvf bwa-0.7.17.tar.bz2
$ cd bwa-0.7.17

# Compile
$ make

Troubleshooting: Missing zlib

If you see an error about zlib.h, install the zlib development files:

# Linux/WSL
$ sudo apt install zlib1g-dev

# macOS
$ brew install zlib

Then run make again.

Test the installation:

$ ./bwa

# Add to PATH
$ echo 'export PATH="$HOME/bch709/bin/bwa-0.7.17:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

Basic BWA Usage

# Index a reference genome
$ bwa index reference.fasta

# Align reads to reference
$ bwa mem reference.fasta reads.fastq > aligned.sam
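
Indexing only needs to happen once per reference, so a wrapper can skip it when the index already exists. The sketch below assumes that bwa index writes a file named after the reference with a .bwt suffix (it is one of several index files). align_reads is a hypothetical helper name.

```shell
# align_reads REFERENCE READS OUTPUT_SAM
# Build the BWA index only if it is missing, then align the reads.
align_reads() (
    ref=$1
    reads=$2
    out=$3

    # `bwa index` writes several files next to the reference; .bwt is one of them.
    if [ ! -f "$ref.bwt" ]; then
        bwa index "$ref" || exit 1
    fi

    bwa mem "$ref" "$reads" > "$out"
)
```

Usage: align_reads reference.fasta reads.fastq aligned.sam.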

Conda vs. Compiling from Source

Method	Pros	Cons
Conda	Easy, handles dependencies	May not have latest version
Source	Latest version, customizable	More complex, manual dependencies

Recommendation: Use conda when possible. Compile from source only when needed.

Advanced: Understanding Build Systems

Makefile Basics

Most bioinformatics tools use make for compilation. Understanding Makefiles helps troubleshoot build errors.

Basic Makefile structure:

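# Target: dependencies
#     commands (must use TAB, not spaces)

CC = gcc
CFLAGS = -O3 -Wall

# "all" and "clean" are command names, not files
.PHONY: all clean

all: my_program

# Link the object files into the final program
my_program: main.o utils.o
	$(CC) $(CFLAGS) -o my_program main.o utils.o

main.o: main.c
	$(CC) $(CFLAGS) -c main.c

utils.o: utils.c
	$(CC) $(CFLAGS) -c utils.c

clean:
	rm -f *.o my_program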

Common make commands:

# Compile with default target
$ make

# Compile with multiple CPU cores (faster)
$ make -j $(nproc)

# Clean compiled files
$ make clean

# Install to system (usually requires sudo)
$ sudo make install

# Specify installation directory
$ make install PREFIX=$HOME/local

CMake for Complex Projects

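Some modern tools use CMake, which generates the Makefiles for you:

# Typical CMake workflow
$ mkdir build
$ cd build
# Install under your home directory so that "make install" needs no sudo
$ cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/local
$ make -j $(nproc)
$ make install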

Example: Compiling samtools

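$ cd ~/bch709/bin

# A git checkout of samtools builds against an HTSlib source tree in ../htslib (see the samtools README)
$ git clone --recurse-submodules https://github.com/samtools/htslib.git
$ git clone https://github.com/samtools/samtools.git
$ cd samtools

# autoheader and autoconf come from the autoconf package
$ autoheader
$ autoconf -Wno-syntax
$ ./configure --prefix=$HOME/local
$ make -j $(nproc)
$ make install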

Common Compilation Errors and Solutions

Error	Cause	Solution
`zlib.h: No such file`	Missing zlib	`sudo apt install zlib1g-dev`
`curses.h: No such file`	Missing ncurses	`sudo apt install libncurses5-dev`
`openssl/ssl.h: No such file`	Missing OpenSSL	`sudo apt install libssl-dev`
`bzlib.h: No such file`	Missing bzip2	`sudo apt install libbz2-dev`
`lzma.h: No such file`	Missing LZMA	`sudo apt install liblzma-dev`
`Permission denied`	No write access	Use `PREFIX=$HOME/local`
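
The header-to-package rows of this table can be turned into a small lookup. This sketch covers only the pairs listed above (the bzip2 header is named bzlib.h; bz2.h is accepted as an alias). apt_package_for is a hypothetical helper.

```shell
# apt_package_for HEADER
# Print the Ubuntu package that provides a missing header, using only the
# pairs from the table above. Unknown headers return non-zero.
apt_package_for() {
    case "$1" in
        zlib.h)          echo zlib1g-dev ;;
        curses.h)        echo libncurses5-dev ;;
        openssl/ssl.h)   echo libssl-dev ;;
        bzlib.h|bz2.h)   echo libbz2-dev ;;
        lzma.h)          echo liblzma-dev ;;
        *)               echo "no known package for $1" >&2; return 1 ;;
    esac
}
```

Usage: sudo apt install "$(apt_package_for zlib.h)".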

Setting Up Local Installation Directory

Install software to your home directory (no sudo required):

# Create local directories
$ mkdir -p ~/local/bin ~/local/lib ~/local/include

# Add to PATH permanently
$ echo 'export PATH="$HOME/local/bin:$PATH"' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH="$HOME/local/lib:$LD_LIBRARY_PATH"' >> ~/.bashrc
$ source ~/.bashrc
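
After reloading the shell, you may want to confirm that the directory really is on your PATH. A minimal check, with on_path as a hypothetical helper name:

```shell
# on_path DIR
# Succeed if DIR is one of the colon-separated entries in $PATH.
on_path() {
    # Wrapping both sides in colons makes the first and last entries
    # match the same way as the middle ones.
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

Usage: on_path "$HOME/local/bin" && echo "local bin is on PATH".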

Parallel Compilation

Speed up compilation with multiple CPU cores:

# Check available CPU cores (Linux/WSL; on macOS use: sysctl -n hw.ncpu)
$ nproc

# Compile using all cores
$ make -j $(nproc)

# Or specify number of cores
$ make -j 4
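
nproc is part of GNU coreutils and is not installed on macOS by default, so a script meant for both systems needs a fallback. The function below is a sketch; cpu_count is a hypothetical name, and it falls back to 1 if no method works.

```shell
# cpu_count
# Print the number of CPU cores, trying tools in order of availability.
cpu_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc                          # GNU coreutils (Linux, WSL)
    elif command -v sysctl >/dev/null 2>&1 && sysctl -n hw.ncpu >/dev/null 2>&1; then
        sysctl -n hw.ncpu              # macOS and BSD
    else
        # POSIX-style fallback; print 1 if even this is unavailable.
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}
```

Usage: make -j "$(cpu_count)".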

Pro Tip: Compilation Flags

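Many bioinformatics tools can be optimized for your CPU:

# Enable CPU-specific optimizations (a variable given on the make command line overrides the Makefile's own CFLAGS)
$ make CFLAGS="-O3 -march=native"

This can improve performance for computationally intensive tools. Binaries built with -march=native may not run on a different CPU model, so rebuild on each machine (for example, on the HPC cluster) instead of copying them.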
