BCH709 Introduction to Bioinformatics: awk - Pattern Scanning and Processing

awk - Pattern Scanning and Processing

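awk processes structured, column-based text one line at a time, splitting each line into fields ($1, $2, ...). This makes it essential for the tab-delimited files that are common in bioinformatics.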

Setup: Create Sample Data

cat > genes.txt << 'EOF'
chr1	100	500	geneA	45.2
chr1	600	900	geneB	78.9
chr2	200	400	geneC	23.1
chr2	800	1200	geneD	92.5
chr3	150	350	geneE	15.8
EOF
cat genes.txt
chr1	100	500	geneA	45.2
chr1	600	900	geneB	78.9
chr2	200	400	geneC	23.1
chr2	800	1200	geneD	92.5
chr3	150	350	geneE	15.8

Step 1: Print Columns

# Print first column (chromosome)
awk '{print $1}' genes.txt
chr1
chr1
chr2
chr2
chr3
# Print columns 1 and 4 (chromosome and gene name)
awk '{print $1, $4}' genes.txt
chr1 geneA
chr1 geneB
chr2 geneC
chr2 geneD
chr3 geneE
# Print last column
awk '{print $NF}' genes.txt
45.2
78.9
23.1
92.5
15.8
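
Note that the comma in print joins fields with a single space, so the two-column output above is no longer tab-separated. A minimal sketch, assuming tab-separated output is wanted for downstream tools: set the output field separator OFS in a BEGIN block. The printf line only recreates the sample data so the block runs on its own, and the variable name tabbed is just for illustration.

```shell
# Recreate the sample data (tab-separated) so this block is self-contained
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 500 geneA 45.2 \
  chr1 600 900 geneB 78.9 \
  chr2 200 400 geneC 23.1 \
  chr2 800 1200 geneD 92.5 \
  chr3 150 350 geneE 15.8 > genes.txt

# OFS is awk's output field separator; setting it in BEGIN affects every print
tabbed=$(awk 'BEGIN {OFS="\t"} {print $1, $4}' genes.txt)
printf '%s\n' "$tabbed"
```

Each output line now has a tab between the chromosome and the gene name instead of a space.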

Step 2: Filter with Conditions

# Print only chr1 genes
awk '$1 == "chr1"' genes.txt
chr1	100	500	geneA	45.2
chr1	600	900	geneB	78.9
# Print genes with score > 50
awk '$5 > 50' genes.txt
chr1	600	900	geneB	78.9
chr2	800	1200	geneD	92.5
# Combine conditions (chr2 AND score > 50)
awk '$1 == "chr2" && $5 > 50' genes.txt
chr2	800	1200	geneD	92.5
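
Conditions can also be regular-expression matches, and a condition can be followed by an action so that only selected columns of the matching lines are printed. The following is a small sketch under the assumption that genes.txt has the layout shown in the setup; the variable names ab_genes and high are only for illustration.

```shell
# Recreate the sample data (tab-separated) so this block is self-contained
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 500 geneA 45.2 \
  chr1 600 900 geneB 78.9 \
  chr2 200 400 geneC 23.1 \
  chr2 800 1200 geneD 92.5 \
  chr3 150 350 geneE 15.8 > genes.txt

# The ~ operator matches a field against a regular expression:
# keep lines whose gene name (column 4) ends in A or B
ab_genes=$(awk '$4 ~ /[AB]$/ {print $4}' genes.txt)
printf '%s\n' "$ab_genes"

# Condition plus action: print only the name of genes with score > 50
high=$(awk '$5 > 50 {print $4}' genes.txt)
printf '%s\n' "$high"
```

With no action, awk prints the whole matching line, which is what the filters above rely on; adding {print $4} narrows the output to one column.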

Step 3: Calculations

# Sum of scores (column 5)
awk '{sum += $5} END {print "Total:", sum}' genes.txt
Total: 255.5
# Average score
awk '{sum += $5; count++} END {print "Average:", sum/count}' genes.txt
Average: 51.1
# Calculate gene length (end - start)
awk '{print $4, $3 - $2}' genes.txt
geneA 400
geneB 300
geneC 200
geneD 400
geneE 200
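
The same accumulate-then-report pattern can be extended to one total per group. As a hedged sketch, not part of the original exercise: awk arrays indexed by the chromosome name collect a sum and a count per chromosome, and the END block prints the mean of each. The order of a for (key in array) loop is unspecified in awk, so the output is piped through sort. The variable name per_chr is only for illustration.

```shell
# Recreate the sample data (tab-separated) so this block is self-contained
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 500 geneA 45.2 \
  chr1 600 900 geneB 78.9 \
  chr2 200 400 geneC 23.1 \
  chr2 800 1200 geneD 92.5 \
  chr3 150 350 geneE 15.8 > genes.txt

# sum[] and n[] are arrays keyed by chromosome (column 1)
per_chr=$(awk '{sum[$1] += $5; n[$1]++}
  END {for (c in sum) printf "%s\t%.2f\n", c, sum[c] / n[c]}' genes.txt | sort)
printf '%s\n' "$per_chr"
```

printf with %.2f rounds each mean to two decimal places, which keeps floating-point noise out of the report.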

awk Built-in Variables

Variable    Description                                        Example
$0          Entire line                                        awk '{print $0}'
$1, $2...   Column 1, 2, etc.                                  awk '{print $1}'
NF          Number of columns in the current line              awk '{print NF}'
NR          Number of the current line                         awk '{print NR, $0}'
-F          Command-line option that sets the input delimiter  awk -F',' '{print $1}'
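
To see these working together, here is a short sketch on a hypothetical comma-separated file, scores.csv, which is created only for this illustration and is not part of the course data. It uses -F to set the delimiter, NR to skip the header line, and NF to report the number of columns.

```shell
# Hypothetical CSV file with a header line, created only for this example
printf '%s\n' 'gene,score' 'geneA,45.2' 'geneB,78.9' > scores.csv

# -F',' splits on commas; NR > 1 skips line 1 (the header);
# each output line shows the line number, the column count, and column 1
numbered=$(awk -F',' 'NR > 1 {print NR, NF, $1}' scores.csv)
printf '%s\n' "$numbered"
```

The first output line is "2 2 geneA": line 2 of the file, 2 columns, gene name geneA.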

Challenge: Analyze Gene Data

Using genes.txt, find:

  1. All genes on chr2
  2. The gene with the highest score
  3. Total length of all genes

Solutions

# 1. Genes on chr2
awk '$1 == "chr2" {print $4}' genes.txt
geneC
geneD
# 2. Gene with highest score (NR==1 seeds max from the first line, so the comparison never uses an unset max)
awk 'NR==1 || $5 > max {max=$5; gene=$4} END {print gene, max}' genes.txt
geneD 92.5
# 3. Total gene length
awk '{len += $3 - $2} END {print "Total length:", len}' genes.txt
Total length: 1500