Bioinformatics File Formats

Understanding common bioinformatics file formats is essential for working with genomic data.

SAM/BAM Format

SAM (Sequence Alignment/Map) is a tab-delimited text format for storing sequence alignments against a reference. BAM is its compressed binary equivalent.

SAM Format Structure

A SAM file consists of:

Header section (lines starting with @)

Alignment section (tab-delimited fields)

Header lines:

@HD - File-level header (format version, sort order)

@SQ - Reference sequence dictionary

@RG - Read group

@PG - Program used

Alignment fields (11 mandatory):

Col	Field	Description
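1	QNAME	Query (read) name
2	FLAG	Bitwise flag
3	RNAME	Reference sequence name
4	POS	1-based leftmost mapping position
5	MAPQ	Mapping quality (Phred-scaled)
6	CIGAR	CIGAR string
7	RNEXT	Reference name of the mate/next read
8	PNEXT	Position of the mate/next read
9	TLEN	Observed template length
10	SEQ	Read sequence
11	QUAL	Base qualities (ASCII, Phred+33)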
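
As a minimal sketch of how these columns appear in a record, the snippet below builds one made-up alignment line and prints a few of its fields with awk. The read name, coordinates, and sequence are illustrative assumptions, not data from a real file.

```shell
# One hypothetical alignment record with the 11 mandatory tab-delimited fields.
sam_line=$(printf 'read1\t99\tchr1\t1000\t60\t4M\t=\t1100\t104\tACGT\tIIII')

# Prepend a header line, then skip headers (they start with "@") and
# print QNAME, RNAME, POS, MAPQ, and CIGAR from each alignment line.
summary=$(printf '@HD\tVN:1.6\tSO:coordinate\n%s\n' "$sam_line" |
  awk -F'\t' '!/^@/ { print $1, $3, $4, $5, $6 }')

echo "$summary"
```

The same awk filter can be fed from `samtools view`, which prints alignment lines in this text layout.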

Working with SAM/BAM

# View SAM file
$ less alignment.sam

# Convert SAM to BAM (samtools 1.x auto-detects the input format, so -S is not needed)
$ samtools view -b alignment.sam > alignment.bam

# Sort BAM file
$ samtools sort alignment.bam -o alignment.sorted.bam

# Index BAM file
$ samtools index alignment.sorted.bam

# View BAM file
$ samtools view alignment.bam | head

# View specific region
$ samtools view alignment.sorted.bam chr1:1000-2000

# Get statistics
$ samtools flagstat alignment.bam
$ samtools stats alignment.bam

BED Format

BED (Browser Extensible Data) format is used to define genomic regions.

BED Format Structure

BED files are tab-delimited with at least 3 columns.

Column	Name	Description
1	chrom	Chromosome
2	chromStart	Start position (0-based, inclusive)
3	chromEnd	End position (exclusive)
4	name	Feature name (optional)
5	score	Score (optional)
6	strand	+ or - (optional)

Example:

chr1    1000    2000    gene1    100    +
chr1    3000    4000    gene2    200    -
chr2    5000    6000    gene3    150    +

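Important: BED uses 0-based, half-open coordinates: chromStart is included and chromEnd is excluded. The first example line (chr1, 1000, 2000) therefore spans 1000 bases, which are positions 1001-2000 in 1-based formats such as SAM and VCF.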
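
The coordinate convention can be checked with a little arithmetic. This sketch takes the first record from the example above and prints its length and its 1-based, closed equivalent; the output layout is an arbitrary choice for illustration.

```shell
# First record from the BED example (tab-delimited).
bed=$(printf 'chr1\t1000\t2000\tgene1\t100\t+')

# Length is end - start; the 1-based closed interval is (start + 1) to end.
converted=$(printf '%s\n' "$bed" |
  awk -F'\t' '{ printf "%s:%d-%d length=%d\n", $1, $2 + 1, $3, $3 - $2 }')

echo "$converted"
```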

Working with BED Files

# Sort BED file
$ sort -k1,1 -k2,2n input.bed > sorted.bed

# Merge overlapping intervals
$ bedtools merge -i sorted.bed > merged.bed

# Find intersections
$ bedtools intersect -a file1.bed -b file2.bed > common.bed

# Subtract regions
$ bedtools subtract -a file1.bed -b file2.bed > unique.bed

# Get flanking regions
$ bedtools flank -i genes.bed -g genome.txt -b 1000 > flanks.bed
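
To make the merge step less of a black box, here is a rough awk sketch of the idea behind interval merging on a sorted BED stream. It is a simplified illustration, not the bedtools implementation: it assumes input sorted by chromosome and start, keeps only the first three columns, and treats touching intervals as mergeable. The intervals are made up.

```shell
# Hypothetical, already sorted intervals: two overlap on chr1.
sorted=$(printf 'chr1\t1000\t2000\nchr1\t1500\t2500\nchr1\t3000\t4000\nchr2\t100\t200')

merged=$(printf '%s\n' "$sorted" | awk -F'\t' -v OFS='\t' '
  # First record: open a new interval.
  NR == 1 { chrom = $1; start = $2 + 0; end = $3 + 0; next }
  # Same chromosome and starts at or before the open end: extend it.
  $1 == chrom && $2 + 0 <= end { if ($3 + 0 > end) end = $3 + 0; next }
  # Otherwise emit the open interval and start a new one.
  { print chrom, start, end; chrom = $1; start = $2 + 0; end = $3 + 0 }
  END { if (NR > 0) print chrom, start, end }
')

echo "$merged"
```

The single pass only works because the input is sorted, which is why the sort step comes before the merge.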

VCF Format

VCF (Variant Call Format) stores genetic variation data.

VCF Format Structure

VCF files have:

Meta-information lines (starting with ##)

Header line (starting with #CHROM)

Data lines (one per variant)

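Columns (the first 8 are fixed and mandatory; FORMAT and the sample columns appear only when genotype data is present):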

Col	Field	Description
1	CHROM	Chromosome
2	POS	Position (1-based)
3	ID	Variant ID
4	REF	Reference allele
5	ALT	Alternate allele(s)
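6	QUAL	Phred-scaled quality score
7	FILTER	Filter status (PASS or names of failed filters)
8	INFO	Additional info (semicolon-separated key=value pairs)
9	FORMAT	Genotype format (colon-separated keys)
10+	SAMPLE	Sample genotypes (values in FORMAT order)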

Example:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1
chr1    100     rs123   A       G       30      PASS    DP=50   GT:DP   0/1:50
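
The FORMAT column tells you how to read each sample column. As a small sketch, the awk snippet below looks up the position of the GT key in FORMAT and prints the matching value from the first sample, using the example record above. The output layout is an assumption made for illustration.

```shell
# The data line from the example above (tab-delimited).
vcf_line=$(printf 'chr1\t100\trs123\tA\tG\t30\tPASS\tDP=50\tGT:DP\t0/1:50')

# Split FORMAT and the first sample on ":", then report the value
# that sits at the same index as the GT key.
genotype=$(printf '%s\n' "$vcf_line" | awk -F'\t' '!/^#/ {
  n = split($9, keys, ":")
  split($10, vals, ":")
  for (i = 1; i <= n; i++)
    if (keys[i] == "GT")
      printf "%s:%s %s>%s %s\n", $1, $2, $4, $5, vals[i]
}')

echo "$genotype"
```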

Working with VCF Files

# View VCF file
$ bcftools view variants.vcf | head

# Compress and index
$ bgzip variants.vcf
$ tabix -p vcf variants.vcf.gz

# Filter variants
$ bcftools filter -i 'QUAL>30' variants.vcf.gz > filtered.vcf

# Extract specific region
$ bcftools view variants.vcf.gz chr1:1000-2000 > region.vcf

# Statistics
$ bcftools stats variants.vcf.gz > stats.txt

# Convert to table
$ bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' variants.vcf.gz
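
The effect of a QUAL filter can be sketched in awk on a tiny, made-up VCF. This is only an approximation of the filter expression used above: it keeps header lines, drops records with a missing QUAL, and keeps records whose QUAL is strictly greater than 30, so a record with QUAL exactly 30 is removed.

```shell
# A hypothetical two-record VCF without sample columns.
vcf=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\trs123\tA\tG\t30\tPASS\tDP=50\nchr1\t200\t.\tC\tT\t45\tPASS\tDP=60')

# Keep header lines, plus data lines with a numeric QUAL above 30.
filtered=$(printf '%s\n' "$vcf" |
  awk -F'\t' '/^#/ || ($6 != "." && $6 + 0 > 30)')

# Count the data lines that survived the filter.
kept=$(printf '%s\n' "$filtered" | grep -vc '^#')

echo "$filtered"
echo "records kept: $kept"
```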

Common Bioinformatics Tools

samtools

samtools is a suite of programs for interacting with SAM/BAM files.

Essential samtools Commands

# Convert SAM to BAM (samtools 1.x auto-detects the input format, so -S is not needed)
$ samtools view -b input.sam > output.bam

# Sort BAM file
$ samtools sort input.bam -o sorted.bam

# Index BAM file (required for many operations)
$ samtools index sorted.bam

# View alignment statistics
$ samtools flagstat sorted.bam

# Calculate depth
$ samtools depth sorted.bam > depth.txt

# Extract reads from region
$ samtools view sorted.bam chr1:1000-2000 > region.sam

# Extract unmapped reads (-f 4 keeps reads with FLAG bit 4 set)
$ samtools view -f 4 sorted.bam > unmapped.sam

# Extract properly paired reads (-f 2 keeps reads with FLAG bit 2 set)
$ samtools view -f 2 sorted.bam > proper_pairs.sam

# Merge multiple BAM files
$ samtools merge merged.bam file1.bam file2.bam file3.bam

# Create FASTA index
$ samtools faidx reference.fa

# Extract sequence from FASTA
$ samtools faidx reference.fa chr1:1000-2000
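
The -f and -F options above test individual bits of the FLAG field. The sketch below decodes four of those bits with shell arithmetic. `describe_flag` is a hypothetical helper written for this illustration, not a samtools command, and it covers only a subset of the defined bits.

```shell
# describe_flag: hypothetical helper that names a few FLAG bits.
describe_flag() {
  flag=$1
  desc=""
  if [ $(( flag & 1 )) -ne 0 ]; then desc="$desc paired"; fi
  if [ $(( flag & 2 )) -ne 0 ]; then desc="$desc proper_pair"; fi
  if [ $(( flag & 4 )) -ne 0 ]; then desc="$desc unmapped"; fi
  if [ $(( flag & 16 )) -ne 0 ]; then desc="$desc reverse"; fi
  # Strip the leading space before printing.
  echo "${desc# }"
}

describe_flag 99   # bits 1 and 2 are set (plus 32 and 64, not decoded here)
describe_flag 4    # only the "unmapped" bit is set
describe_flag 83   # bits 1, 2, and 16 are set (plus 64)
```

A read with FLAG 99 would pass `-f 2` and also pass `-F 4`, since bit 2 is set and bit 4 is not. For a full decoding, `samtools flags` prints the names of all bits in a given value.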

bedtools

bedtools is a suite of utilities for genome arithmetic: intersecting, merging, counting, and otherwise comparing sets of genomic intervals.

Essential bedtools Commands

# Find overlapping features
$ bedtools intersect -a genes.bed -b peaks.bed > overlaps.bed

# Count overlaps
$ bedtools intersect -a genes.bed -b peaks.bed -c > counts.bed

# Find features NOT overlapping
$ bedtools intersect -a genes.bed -b peaks.bed -v > no_overlap.bed

# Merge overlapping intervals
$ bedtools merge -i sorted.bed > merged.bed

# Calculate coverage
$ bedtools coverage -a genes.bed -b reads.bam > coverage.bed

# Get closest feature
$ bedtools closest -a query.bed -b reference.bed > closest.bed

# Generate genome windows
$ bedtools makewindows -g genome.txt -w 1000 > windows.bed

# Get FASTA sequences for BED regions
$ bedtools getfasta -fi genome.fa -bed regions.bed > sequences.fa

# Shuffle features randomly
$ bedtools shuffle -i features.bed -g genome.txt > shuffled.bed

# Compute Jaccard statistic (both inputs must be sorted by chromosome and start)
$ bedtools jaccard -a file1.bed -b file2.bed
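
For intuition about the Jaccard statistic, here is a worked example on two made-up intervals on the same chromosome. It shows the textbook definition (intersecting bases divided by bases in the union) for a single pair of intervals; it is a sketch of the arithmetic, not a reimplementation of the tool.

```shell
# Hypothetical intervals in BED coordinates: A = 1000-2000, B = 1500-2500.
jaccard=$(awk 'BEGIN {
  a_start = 1000; a_end = 2000
  b_start = 1500; b_end = 2500

  # Overlap runs from the larger start to the smaller end.
  lo = (a_start > b_start) ? a_start : b_start
  hi = (a_end < b_end) ? a_end : b_end
  inter = (hi > lo) ? hi - lo : 0

  # Union is the sum of both lengths minus the shared bases.
  uni = (a_end - a_start) + (b_end - b_start) - inter

  printf "%d %d %.4f\n", inter, uni, inter / uni
}')

echo "$jaccard"   # intersection, union, jaccard
```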

Combining Tools in Pipelines

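# Example: Count reads per gene and keep genes with more than 10 reads
# (-c appends the count as the last column; column 4 assumes a 3-column genes.bed)
$ bedtools intersect -a genes.bed -b aligned.bam -c | \
    awk '$4 > 10' | \
    sort -k4,4rn > highly_expressed.bed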

# Example: Get sequences of peaks
$ bedtools sort -i peaks.bed | \
    bedtools merge -i - | \
    bedtools getfasta -fi genome.fa -bed - > peak_sequences.fa

# Example: Count mapped reads per gene
# (-b keeps the stream in BAM format, which bamtobed expects)
$ samtools view -b -F 4 aligned.bam | \
    bedtools bamtobed -i stdin | \
    bedtools intersect -a genes.bed -b stdin -c > gene_counts.bed
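
The filter-and-rank step of the first pipeline can be tried without any alignment data. This sketch runs it on a made-up count table with three BED columns plus a count in column 4; the values are assumptions chosen so that one gene falls below the threshold.

```shell
# Hypothetical per-gene counts: chrom, start, end, read count.
counts=$(printf 'chr1\t1000\t2000\t25\nchr1\t3000\t4000\t3\nchr2\t5000\t6000\t40')

# Keep genes with more than 10 reads, then sort by count, highest first.
tab=$(printf '\t')
top=$(printf '%s\n' "$counts" |
  awk -F'\t' '$4 > 10' |
  sort -t "$tab" -k4,4rn)

echo "$top"
```

If genes.bed carries more columns, the count moves further right, and both the awk test and the sort key need to point at that column instead.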

BCH709 Introduction to Bioinformatics: Bioinformatics File Formats