Paper Reading
Please read these papers before class:
- Lieberman-Aiden et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326:289β293
- Rao et al. (2014) A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159:1665β1680
- Dixon et al. (2012) Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485:376β380
Overview: Hi-C and 3D Genome Organization
Hi-C (High-throughput Chromosome Conformation Capture) is a genome-wide method for mapping chromatin contacts β which genomic regions are physically close to each other in the nucleus, even if they are far apart on the linear chromosome.
The Central Dogma of 3D Genomics
Linear DNA is hierarchically folded in the nucleus:
DNA double helix (2 nm)
βββ Nucleosome fiber (10 nm)
βββ Chromatin fiber (30 nm)
βββ Loops (kbβMb scale)
βββ TADs / Compartments (100 kbβMb scale)
βββ Chromosomal territories
Key 3D Genome Features
| Feature | Size | Description |
|---|---|---|
| Loops | 10β300 kb | Direct chromatin contacts, often anchored by CTCF |
| TADs (Topologically Associating Domains) | 100 kbβ2 Mb | Insulated domains of preferential interaction |
| Compartments | 1β10 Mb | A (active/euchromatin) and B (inactive/heterochromatin) |
| Chromosomal territories | Chromosome scale | Distinct nuclear positions per chromosome |
How Hi-C Works
- Crosslink β Fix chromatin contacts with formaldehyde
- Digest β Cut DNA with a restriction enzyme (e.g., HindIII, MboI)
- Ligation β Fill ends with biotinylated nucleotides and ligate nearby fragments
- Shear β Fragment ligated DNA
- Pull down β Enrich biotinylated ligation junctions with streptavidin beads
- Sequence β Paired-end sequencing; each read pair represents a chromatin contact
Hi-C Read Pairs
Each paired-end read represents a chimeric ligation junction: one read from each of two genomic loci that were spatially close in the nucleus.
The Hi-C Analysis Workflow
Raw paired-end FASTQ (Hi-C reads)
β
βββ FastQC β MultiQC (QC)
β
βββ HiC-Pro (full pipeline):
β βββ Trim and align (Bowtie2, used internally by HiC-Pro)
β βββ Parse valid ligation junctions
β βββ Remove duplicates
β βββ Build contact matrices (multiple resolutions)
β βββ ICE normalization
β
βββ Convert to .cool / .hic format
β
βββ Visualization (HiGlass, Juicebox, cooltools)
β
βββ A/B Compartment analysis (PCA on contact matrix)
β
βββ TAD calling (Insulation Score or Arrowhead)
β
βββ Loop calling (HiCCUPS / MUSTACHE)
1. Environment Setup
Create Conda Environment
conda create -n hic python=3.9
conda activate hic
Install Tools
# Core Hi-C processing
# HiC-Pro uses Bowtie2 internally β install both together
conda install -c bioconda -c conda-forge \
hic-pro bowtie2 samtools fastqc trim-galore multiqc
# Analysis and visualization tools
conda install -c bioconda -c conda-forge \
cooler cooltools hicexplorer pyBigWig
# Optional: Juicer tools for .hic format and HiCCUPS loop calling (JAR-based)
wget https://github.com/aidenlab/juicer/releases/latest/download/juicer_tools.jar
Verify Installations
HiC-Pro --version
cooler --help
Create Working Directory
mkdir -p ~/bch709/hic
cd ~/bch709/hic
2. Experimental Design Considerations
Library Preparation Methods
| Method | Enzyme | Fragment Size | Notes |
|---|---|---|---|
| Hi-C (original) | HindIII | ~500 bp | Standard |
| in situ Hi-C | MboI / DpnII | ~100β300 bp | Higher resolution, standard now |
| Micro-C | MNase | Nucleosome-level | Single nucleosome resolution |
| SPRITE | No ligation | β | Multi-way contacts |
| HiChIP | MboI + ChIP | β | Protein-centric contacts |
Sequencing Depth
| Resolution | Recommended Depth | Notes |
|---|---|---|
| 1 Mb compartments | 50β100 M read pairs | Shallow |
| 100 kb TADs | 200β500 M read pairs | Standard |
| 10 kb loops | 1β3 B read pairs | Deep Hi-C |
| Micro-C loops | > 2 B read pairs | Very deep |
Biological Replicates
- Minimum 2 replicates
- Correlation of contact matrices between replicates should be > 0.95 (genome-wide insulation)
3. Quality Control
Download Example Data
cd ~/bch709/hic
# Example: in situ Hi-C of human IMR90 fibroblasts (subsampled)
# (Replace with actual SRA accession for your data)
fasterq-dump --split-files SRR1658693 -O .
mv SRR1658693_1.fastq hic_R1.fastq
mv SRR1658693_2.fastq hic_R2.fastq
gzip hic_R1.fastq hic_R2.fastq
Run FastQC
fastqc -t 4 hic_R1.fastq.gz hic_R2.fastq.gz
multiqc .
Hi-C Specific QC Metrics
| Metric | Good Range | Description |
|---|---|---|
| Valid pairs | > 30β40% of total | Properly ligated read pairs |
| Cis/Trans ratio | Cis > 70β80% | Most contacts should be within chromosomes |
| Short-range vs. long-range | Varies | Long-range (> 20 kb) enrichment indicates good data |
| Duplication rate | < 30% | Higher may indicate low complexity |
4. Hi-C Data Processing with HiC-Pro
HiC-Pro is a comprehensive pipeline for Hi-C data processing, from raw reads to contact matrices.
Prepare Reference Genome
# Download hg38 (or your organism's genome)
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
# Build Bowtie2 index (HiC-Pro uses Bowtie2 internally)
bowtie2-build --threads 8 hg38.fa hg38
# Download chromosome size file (required for contact matrix)
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
Generate Restriction Fragment File
HiC-Pro needs to know where restriction enzyme sites are in the reference:
# For MboI (GATC sites) β in situ Hi-C
digest_genome.py \
-r GATC \
-o hg38_MboI.bed \
hg38.fa
# For HindIII (AAGCTT sites) β original Hi-C
digest_genome.py \
-r AAGCTT \
-o hg38_HindIII.bed \
hg38.fa
Configure HiC-Pro
Create a configuration file hicpro_config.txt:
# Genome
GENOME_SIZE = /path/to/hg38.chrom.sizes
BOWTIE2_IDX_PATH = /path/to/bowtie2_index/
REFERENCE_GENOME = hg38
# Reads
PAIR1_EXT = _R1
PAIR2_EXT = _R2
# Trimming (optional)
MIN_MAPQ = 10
# Restriction enzyme
GENOME_FRAGMENT = /path/to/hg38_MboI.bed
LIGATION_SITE = GATCGATC
MIN_FRAG_SIZE = 100
MAX_FRAG_SIZE = 100000
MIN_INSERT_SIZE = 100
MAX_INSERT_SIZE = 600
# Contact matrix
BIN_SIZE = 1000000 500000 250000 100000 40000
# Output
N_CPU = 8
SORT_RAM = 768000
Run HiC-Pro
# Organize input files
mkdir -p rawdata/sample1
cp hic_R1.fastq.gz hic_R2.fastq.gz rawdata/sample1/
# Run HiC-Pro
HiC-Pro \
-i rawdata \
-o hicpro_output \
-c hicpro_config.txt
HiC-Pro Output
hicpro_output/
βββ logs/ # Log files for each step
βββ stat/ # QC statistics
β βββ sample1/
β βββ sample1_allValidPairs.mergestat
β βββ *.stat
βββ hic_results/
βββ data/sample1/ # Valid pairs file
βββ matrix/sample1/ # Contact matrices
βββ raw/ # Raw counts at each resolution
βββ iced/ # ICE-normalized matrices
Check HiC-Pro Statistics
cat hicpro_output/stat/sample1/*mergestat
# Key metrics to check:
# - Total pairs
# - Valid pairs (ligation junctions)
# - Cis pairs (same chromosome)
# - Trans pairs (different chromosomes)
# - Intra/inter-chromosomal split
5. Convert to .cool Format
.cool is a modern, efficient format for Hi-C contact matrices (HDF5-based).
# Convert HiC-Pro output to .cool
hicpro2higlass.sh \
-i hicpro_output/hic_results/data/sample1/sample1.allValidPairs \
-r 40000 \
-c /path/to/hg38.chrom.sizes \
-p 8 \
-o sample1.cool
# Or convert from HiC-Pro matrix format
cooler load \
--format hicpro \
/path/to/hg38.chrom.sizes:40000 \
hicpro_output/hic_results/matrix/sample1/raw/40000/sample1_40000.matrix \
sample1_40000.cool
Create Multi-Resolution .mcool
cooler zoomify \
--nproc 8 \
--resolutions 1000,5000,10000,25000,50000,100000,250000,500000,1000000 \
sample1_40000.cool \
-o sample1.mcool
6. Visualization
HiGlass (Web-based, interactive)
HiGlass is a web tool for interactive Hi-C visualization. Load your .mcool file at app.higlass.io.
Juicebox
Juicebox is the most commonly used desktop viewer. It requires .hic format.
Convert to .hic format
# Using Juicer tools
java -jar juicer_tools.jar pre \
hicpro_output/hic_results/data/sample1/sample1.allValidPairs \
sample1.hic \
hg38
Quick Plot with cooltools
# Python visualization with cooltools
import cooler
import cooltools
import numpy as np
import matplotlib.pyplot as plt
clr = cooler.Cooler("sample1.mcool::resolutions/100000")
# Plot a region
region = ("chr1", 0, 50_000_000)
mat = clr.matrix(balance=True).fetch(region)
plt.figure(figsize=(8, 8))
plt.imshow(np.log1p(mat), cmap="YlOrRd", vmax=2)
plt.colorbar(label="log1p(balanced contacts)")
plt.title("Hi-C contact map: chr1 0β50 Mb (100 kb resolution)")
plt.savefig("hic_map.png", dpi=150, bbox_inches="tight")
7. A/B Compartment Analysis
A compartments = transcriptionally active, euchromatin B compartments = transcriptionally inactive, heterochromatin
A/B compartments are identified by PCA on the Hi-C contact matrix (or on the correlation matrix of contact patterns).
Compute Compartments with cooltools
# Compute eigenvectors at 100 kb resolution
cooltools eigs-cis \
sample1.mcool::resolutions/100000 \
--phasing-track H3K27ac_chip.bw \
--out-prefix sample1_compartments \
--n-eigs 3
# Output:
# sample1_compartments.cis.vecs.tsv β eigenvectors per bin
# sample1_compartments.cis.lam.txt β eigenvalues
Interpret Compartments
# The sign of the first eigenvector (E1) defines A/B:
# Positive E1 β A compartment (active)
# Negative E1 β B compartment (inactive)
# Sign must be oriented by a mark (H3K27ac, GC content, gene density)
# Summarize compartment genome coverage
awk '$6 > 0 {sum += $3-$2} END {print "A:", sum}' sample1_compartments.cis.vecs.tsv
awk '$6 < 0 {sum += $3-$2} END {print "B:", sum}' sample1_compartments.cis.vecs.tsv
8. TAD Calling
TADs (Topologically Associating Domains) are regions of preferential self-interaction, separated by insulating boundaries.
Method 1: Insulation Score (cooltools)
The insulation score measures local self-interaction; dips in the score correspond to TAD boundaries.
cooltools insulation \
sample1.mcool::resolutions/40000 \
--window-bp 500000 \
--out sample1_insulation.tsv
# Output columns: chr, start, end, ..., insulation_score, is_boundary
Method 2: HiCExplorer TAD Calling
# Find TADs using the insulation score approach
hicFindTADs \
--matrix sample1.mcool::resolutions/40000 \
--outPrefix sample1_TADs \
--correctForMultipleTesting fdr \
--threshold 0.05 \
--minDepth 120000 \
--maxDepth 2000000 \
--step 40000
# Output:
# sample1_TADs_domains.bed β TAD coordinates
# sample1_TADs_boundaries.bed β Boundary coordinates
# sample1_TADs_score.bw β TAD boundary score (bigWig)
Visualize TADs
# Plot contact matrix with TAD boundaries
hicPlotMatrix \
--matrix sample1.mcool::resolutions/40000 \
--outFileName hic_chr1_TADs.png \
--region chr1:10000000-50000000 \
--TAD sample1_TADs_domains.bed \
--colorMap YlOrRd \
--title "Hi-C chr1 with TADs" \
--dpi 300
TAD Statistics
# Count TADs
wc -l sample1_TADs_domains.bed
# TAD size distribution
awk '{print $3-$2}' sample1_TADs_domains.bed | sort -n | \
awk 'BEGIN{sum=0; n=0} {sum+=$1; n++} END{print "Mean TAD size:", sum/n, "bp"}'
9. Loop Calling
Chromatin loops are specific point-to-point contacts, often anchored by CTCF/cohesin.
HiCCUPS (Juicer Tools)
java -Xmx16g -jar juicer_tools.jar hiccups \
--norms KR \
--cpu \
-r 5000,10000,25000 \
sample1.hic \
sample1_loops/
# Output:
# merged_loops.bedpe β Paired genomic coordinates of loops
MUSTACHE (Alternative Loop Caller)
python -m mustache \
-f sample1.mcool \
-r 10000 \
-o sample1_mustache_loops.tsv \
-p 8
Analyze Loops
# Count loops
wc -l sample1_loops/merged_loops.bedpe
# Loop size distribution
awk '{print $5-$2}' sample1_loops/merged_loops.bedpe | sort -n | \
awk 'BEGIN{sum=0; n=0} {sum+=$1; n++} END{print "Mean loop size:", sum/n, "bp"}'
# Loops near CTCF peaks?
bedtools pairtopair \
-a sample1_loops/merged_loops.bedpe \
-b ctcf_peaks.bed \
-type both
10. Differential Hi-C Analysis
Compare 3D genome organization between two conditions (e.g., WT vs. KO, treated vs. untreated).
Using multiHiCcompare (R)
library(multiHiCcompare)
# Load contact matrices
# hic1 <- read_table("condition1.txt")
# hic2 <- read_table("condition2.txt")
# Create HICEXP object
hicexp <- make_hicexp(
data_list = list(cond1_rep1, cond1_rep2, cond2_rep1, cond2_rep2),
groups = c(1, 1, 2, 2)
)
# Normalize
hicexp <- fastlo(hicexp)
# Test for differential interactions
hicexp <- hic_exactTest(hicexp)
# Get significant differential interactions
results <- results(hicexp, alpha=0.05, logFC_cutoff=1)
11. Integrating Hi-C with Other Data
Correlate TADs with Gene Expression (RNA-Seq)
# Find genes within each TAD
bedtools intersect \
-a sample1_TADs_domains.bed \
-b genes.bed \
-wa -wb > genes_in_TADs.txt
Correlate Compartments with ChIP-Seq
import cooltools
import pyBigWig
# Load compartment calls
import pandas as pd
comp = pd.read_table("sample1_compartments.cis.vecs.tsv")
# Load H3K27ac signal
bw = pyBigWig.open("H3K27ac.bw")
comp["H3K27ac"] = comp.apply(
lambda row: bw.stats(row["chrom"], row["start"], row["end"], type="mean")[0],
axis=1
)
# Correlate eigenvector with H3K27ac
from scipy import stats
r, p = stats.pearsonr(comp["E1"].dropna(), comp["H3K27ac"].dropna())
print(f"Pearson r = {r:.3f}, p = {p:.2e}")
Overlap Loops with CTCF ChIP-Seq
# Extract loop anchors
awk '{print $1"\t"$2"\t"$3}' sample1_loops/merged_loops.bedpe > anchors_left.bed
awk '{print $4"\t"$5"\t"$6}' sample1_loops/merged_loops.bedpe > anchors_right.bed
cat anchors_left.bed anchors_right.bed | sort -k1,1 -k2,2n > all_anchors.bed
# Overlap with CTCF peaks
bedtools intersect \
-a all_anchors.bed \
-b ctcf_peaks.narrowPeak \
-u > anchors_with_ctcf.bed
echo "Anchors total: $(wc -l < all_anchors.bed)"
echo "Anchors with CTCF: $(wc -l < anchors_with_ctcf.bed)"
12. Full Workflow Summary
| Step | Tool | Input | Output |
|---|---|---|---|
| QC | FastQC + MultiQC | FASTQ | HTML report |
| Full pipeline | HiC-Pro | FASTQ | Valid pairs, matrices |
| Format conversion | cooler | HiC-Pro output | .cool / .mcool |
| Format conversion | Juicer tools | Valid pairs | .hic |
| Visualization | HiGlass / Juicebox | .mcool / .hic | Interactive contact map |
| Compartments | cooltools eigs-cis | .cool | A/B compartment calls |
| TADs | HiCExplorer / cooltools | .cool | TAD boundaries |
| Loops | HiCCUPS / MUSTACHE | .hic / .cool | Loop anchors |
| Differential | multiHiCcompare | Count matrices | Differential interactions |
| Integration | BEDtools | BED + ChIP/RNA | Overlap tables |
13. Cleanup
conda deactivate
conda env remove --name hic
Glossary
| Term | Definition |
|---|---|
| Contact matrix | Matrix where entry (i,j) = number of read pairs linking bins i and j |
| ICE normalization | Iterative Correction and Eigenvector decomposition β balances matrix biases |
| TAD | Topologically Associating Domain β genomic region of preferential self-interaction |
| Loop | Direct chromatin contact between two genomic loci (point-to-point) |
| Compartment A | Active, gene-rich, open chromatin |
| Compartment B | Inactive, gene-poor, heterochromatin |
| Insulation score | Local measure of cross-boundary interaction β dips mark TAD boundaries |
| CTCF | CCCTC-binding factor β key insulator/loop anchor protein |
| Cohesin | Protein complex that extrudes chromatin loops |
References
| Resource | Link |
|---|---|
| Lieberman-Aiden et al. (2009) Original Hi-C paper | Science |
| Rao et al. (2014) In situ Hi-C / loop calling | Cell |
| Dixon et al. (2012) TADs | Nature |
| HiC-Pro paper | Servant et al. 2015, Genome Biology |
| cooler / cooltools | Abdennur & Mirny 2020, Bioinformatics |
| HiCExplorer | RamΓrez et al. 2018, Nature Communications |
| Juicer / HiCCUPS | Rao et al. 2014, Cell |
| MUSTACHE loop caller | Roayaei Ardakany et al. 2020, Genome Biology |
| HiGlass visualization | Kerpedjiev et al. 2018, Genome Biology |