Tutorial
seqkit
Create and Installation
conda create -n seqkit -c bioconda seqkit bedops csvtk
Environment activate
conda activate seqkit
Export environment
conda env export > seqkit.yml
Conda deactivate
conda deactivate
Conda environment remove
conda remove --name seqkit --all
Reinstall environment
conda env create -f seqkit.yml
Environment activate
conda activate seqkit
use seqkit
seqkit --help
make homework folder
mkdir /data/gpfs/assoc/bch709-4/${USER}
rm -rf ~/bch709
ln -s /data/gpfs/assoc/bch709-4/${USER} ~/bch709
cd ~/bch709
Usage
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 2.0.0
Author: Wei Shen <shenwei356@gmail.com>
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962
Available Commands:
amplicon extract amplicon (or specific region around it) via primer(s)
bam monitoring and online histograms of BAM record features
common find common sequences of multiple files by id/name/sequence
completion generate the autocompletion script for the specified shell
concat concatenate sequences with same ID from multiple files
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
duplicate duplicate sequences N times
faidx create FASTA index file and extract subsequence
fish look for short sequences in larger sequences using local alignment
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (and length, GC content, average quality...)
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
head print first N FASTA/Q records
head-genome print sequences of the first genome with common prefixes in name
help Help about any command
locate locate subsequences/motifs, mismatch allowed
mutate edit sequence (point mutation, insertion, deletion)
pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
rmdup remove duplicated sequences by ID/name/sequence
sample sample sequences by number or proportion
sana sanitize broken single line FASTQ files
scat real time recursive concatenation and streaming of fastx files
seq transform sequences (extract ID, filter by length, remove gaps...)
shuffle shuffle sequences
sliding extract subsequences in sliding windows
sort sort sequences by id/name/sequence/length
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
tab2fx convert tabular format to FASTA/Q format
translate translate DNA/RNA to protein sequence (supporting ambiguous bases)
version print version information and check for update
watch monitoring and online histograms of sequence features
Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
-h, --help help for seqkit
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Current folder
cd ~/bch709
Datasets
Test fastq file
wget https://raw.githubusercontent.com/plantgenomicslab/BCH709/gh-pages/data/test.fastq.gz
ll -tr
Datasets from The miRBase Sequence Database – Release 21
hairpin.fa
wget --no-check-certificate https://mirbase.org/download/hairpin.fa ll -tr
mature.fa.gz
wget --no-check-certificate https://mirbase.org/download/mature.fa ll -tr
miRNA.diff
wget --no-check-certificate https://mirbase.org/download/miRNA.diff ll -tr
Human genome from NCBI (For
seqkit subseq
)
https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/
GRCh38_latest_genomic.fna.gz
wget --no-check-certificate https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz ls -algh
GRCh38_latest_genomic.gtf.gz
wget --no-check-certificate https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz ls -algh
GRCh38_latest_genomic.bed.gz
create with commandzcat GRCh38_latest_genomic.gtf.gz | gtf2bed --do-not-sort | gzip -c > GRCh38_latest_genomic.bed.gz
Only DNA and gtf/bed data of Chr1 were used:
chr1.fa.gz
seqkit grep -p NC_000001.11 GRCh38_latest_genomic.fna.gz -o chr1.fa.gz ls -algh
chr1.gtf.gz
zcat GRCh38_latest_genomic.gtf.gz | grep "NC_000001.11" | gzip -c > chr1.gtf.gz ls -algh
chr1.bed.gz
zcat GRCh38_latest_genomic.bed.gz| grep "NC_000001.11" | gzip -c > chr1.bed.gz ls -algh
seq
Usage
transform sequences (revserse, complement, extract ID...)
Usage:
seqkit seq [flags]
Flags:
-p, --complement complement sequence (blank for Protein sequence)
--dna2rna DNA to RNA
-G, --gap-letter string gap letters (default "- ")
-l, --lower-case print sequences in lower case
-n, --name only print names
-i, --only-id print ID instead of full head
-q, --qual only print qualities
-g, --remove-gaps remove gaps
-r, --reverse reverse sequence)
--rna2dna RNA to DNA
-s, --seq only print sequences
-u, --upper-case print sequences in upper case
-v, --validate-seq validate bases according to the alphabet
-V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)
Examples
- Read and print
- From file:
cat hairpin.fa seqkit seq hairpin.fa
zcat test.fastq.gz
seqkit seq test.fastq.gz
- From stdin:
zcat hairpin.fa.gz | seqkit seq
- Sequence types
- By default,
seqkit seq
automatically detect the sequence typeecho -e ">seq\nacgtryswkmbdhvACGTRYSWKMBDHV" | seqkit stat
echo -e ">seq\nACGUN ACGUN" | seqkit stat
echo -e ">seq\nabcdefghijklmnpqrstvwyz" | seqkit stat
- Only print names
- Full name:
seqkit seq hairpin.fa -n
- Only ID:
seqkit seq hairpin.fa -n -i
- Custom ID region by regular expression (this could be applied to all subcommands):
seqkit seq hairpin.fa -n -i --id-regexp "^[^\s]+\s([^\s]+)\s"
Please check regular expression in Regex
- Only print seq (global flag
-w
defines the output line width, 0 for no wrap)
seqkit seq hairpin.fa -s -w 0
- Reverse comlement sequence
seqkit seq hairpin.fa.gz -r -p
- Remove gaps and to lower/upper case
echo -e ">seq\nACGT-ACTGC-ACC" | seqkit seq -g -u
- RNA to DNA
echo -e ">seq\nUCAUAUGCUUGUCUCAAAGAUUA" | seqkit seq --rna2dna
subseq
Usage
get subsequences by region/gtf/bed, including flanking sequences.
Recommendation: use plain FASTA file, so seqkit could utilize FASTA index.
The definition of region is 1-based and with some custom design.
Examples:
1-based index 1 2 3 4 5 6 7 8 9 10
negative index 0-9-8-7-6-5-4-3-2-1
seq A C G T N a c g t n
1:1 A
2:4 C G T
-4:-2 c g t
-4:-1 c g t n
-1:-1 n
2:-2 C G T N a c g t
1:-1 A C G T N a c g t n
Usage:
seqkit subseq [flags]
Flags:
--bed string by BED file
--chr value select limited sequence with sequence IDs (multiple value supported, case ignored) (default [])
-d, --down-stream int down stream length
--feature value select limited feature types (multiple value supported, case ignored, only works with GTF) (default [])
--gtf string by GTF (version 2.2) file
-f, --only-flank only return up/down stream sequence
-r, --region string by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
-u, --up-stream int up stream length
Examples
Recommendation: use plain FASTA file, so seqkit could utilize FASTA index.
- First 12 bases
cat hairpin.fa | seqkit subseq -r 1:12
- Last 12 bases
cat hairpin.fa | seqkit subseq -r -12:-1
- Subsequences without first and last 12 bases
cat hairpin.fa | seqkit subseq -r 13:-13
- Get subsequence by GTF file
Human genome example:
AVOID loading all data from GRCh38_latest_genomic.gtf.gz, the uncompressed data are so big and may exhaust your RAM.
We could specify chromesomes and features.
seqkit subseq --gtf GRCh38_latest_genomic.gtf.gz --chr NC_000001.11 --feature CDS GRCh38_latest_genomic.fna.gz > chr1.gtf.cds.fa
seqkit stat chr1.gtf.cds.fa
- Get subsequences by BED file.
seqkit subseq --bed GRCh38_latest_genomic.bed.gz --chr NC_000001.11 GRCh38_latest_genomic.fna.gz > chr1.bed.gz.fa
We may need to remove duplicated sequences
seqkit subseq --bed GRCh38_latest_genomic.bed.gz --chr NC_000001.11 GRCh38_latest_genomic.fna.gz | seqkit rmdup > chr1.bed.rmdup.fa
seqkit stat chr1.fa.gz
sliding
- sliding sequences, circular genome supported
Usage:
seqkit sliding [flags]
Flags:
-C, --circular-genome circular genome
-s, --step int step size
-W, --window int window size
Examples
- General use
echo -e ">seq\nACGTacgtNN" | seqkit sliding -s 3 -W 6
- Circular genome
echo -e ">seq\nACGTacgtNN" | seqkit sliding -s 3 -W 6 -C
- Generate GC content for ploting
cat hairpin.fa | seqkit fx2tab | head -n 1 | seqkit tab2fx | seqkit sliding -s 5 -W 30 | seqkit fx2tab -n -g
stat
Usage simple statistics of FASTA files
Usage:
seqkit stat [flags]
Examples
- General use
seqkit stat *.f*{a,q}.gz
fq2fa
covert FASTQ to FASTA
Usage: seqkit fq2fa [flags]
seqkit fq2fa test.fastq.gz -o test_.fa.gz
zcat test_.fa.gz
zcat test.fastq.gz
fx2tab & tab2fx
Usage (fx2tab)
covert FASTA/Q to tabular format, and provide various information,
like sequence length, GC content/GC skew.
Usage:
seqkit fx2tab [flags]
Flags:
-B, --base-content value print base content. (case ignored, multiple values supported) e.g. -B AT -B N (default [])
-g, --gc print GC content
-G, --gc-skew print GC-Skew
-H, --header-line print header line
-l, --length print sequence length
-n, --name only print names (no sequences and qualities)
-i, --only-id print ID instead of full head
Usage (tab2fx)
covert tabular format (first two/three columns) to FASTA/Q format
Usage:
seqkit tab2fx [flags]
Flags:
-p, --comment-line-prefix value comment line prefix (default [#,//])
Examples
- Default output
seqkit fx2tab hairpin.fa| head -n 2
- Print sequence length, GC content, and only print names (no sequences),
we could also print title line by flag
-T
.
seqkit fx2tab hairpin.fa -l -g -n -i -H | head -n 4 | csvtk -t -C '&' pretty
- Use fx2tab and tab2fx in pipe
cat hairpin.fa | seqkit fx2tab | seqkit tab2fx zcat test.fastq.gz | seqkit fx2tab | seqkit tab2fx
- Sort sequences by length (use
seqkit sort -l
)cat hairpin.fa | seqkit fx2tab -l | sort -t"`echo -e '\t'`" -n -k4,4 | seqkit tab2fx
seqkit sort -l hairpin.fa
-
Sorting or filtering by GC (or other base by -flag
-B
) content could also achieved in similar way. -
Get first 1000 sequences
seqkit fx2tab hairpin.fa | head -n 1000 | seqkit tab2fx
seqkit fx2tab test.fastq.gz | head -n 1000 | seqkit tab2fx
Extension
After converting FASTA to tabular format with seqkit fx2tab
,
it could be handled with CSV/TSV tools,
e.g. csvtk, a cross-platform, efficient and practical CSV/TSV toolkit
csvtk grep
could be used to filter sequences (similar withseqkit grep
)csvtk inter
computates intersection of multiple files. It could achieve similar function asseqkit common -n
along with shell.csvtk join
joins multiple CSV/TSV files by multiple IDs.- csv_melt provides melt function, could be used in preparation of data for ploting.
grep
Usage
search sequences by pattern(s) of name or sequence motifs
Usage:
seqkit grep [flags]
Flags:
-n, --by-name match by full name instead of just id
-s, --by-seq match by seq
-d, --degenerate pattern/motif contains degenerate base
--delete-matched delete matched pattern to speedup
-i, --ignore-case ignore case
-v, --invert-match invert the sense of matching, to select non-matching records
-p, --pattern value search pattern (multiple values supported) (default [])
-f, --pattern-file string pattern file
-r, --use-regexp patterns are regular expression
Examples
- Extract human hairpins (i.e. sequences with name starting with
hsa
)cat hairpin.fa | seqkit grep -r -p ^hsa
- Remove human and mice hairpins.
cat hairpin.fa| seqkit grep -r -p ^hsa -p ^mmu -v
-
Extract new entries by information from miRNA.diff
- Get IDs of new entries.
cat miRNA.diff | grep ^# -v | grep NEW | cut -f 2 > list more list ##q to out
- Get IDs of new entries.
- Extract by ID list file
cat hairpin.fa | seqkit grep -f list > new.fa
- Extract sequences starting with AGGCG
cat hairpin.fa | seqkit grep -s -r -i -p ^aggcg
- Extract sequences with TTSAA (AgsI digest site) in SEQUENCE. Base S stands for C or G.
cat hairpin.fa | seqkit grep -s -d -i -p UUSAA
It’s equal to but simpler than:
cat hairpin.fa| seqkit grep -s -r -i -p UU[CG]AA
locate
Usage: locate subsequences/motifs
Motifs could be EITHER plain sequence containing "ACTGN" OR regular
expression like "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)" for ORFs.
Degenerate bases like "RYMM.." are also supported by flag -d.
By default, motifs are treated as regular expression.
When flag -d given, regular expression may be wrong.
For example: "\w" will be wrongly converted to "\[AT]".
Usage:
seqkit locate [flags]
Flags:
-d, --degenerate pattern/motif contains degenerate base
-i, --ignore-case ignore case
-P, --only-positive-strand only search at positive strand
-p, --pattern value search pattern/motif (multiple values supported) (default [])
-f, --pattern-file string pattern/motif file (FASTA format)
-V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)
Examples
- Locate ORFs.
cat hairpin.fa | seqkit locate -i -r -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)"
- Locate Motif.
cat hairpin.fa | seqkit locate -i -d -p AUGGACUN
Notice that seqkit grep
only searches in positive strand, but seqkit loate
could recognize both strand
rmdup
Usage
remove duplicated sequences by id/name/sequence
Usage:
seqkit rmdup [flags]
Flags:
-n, --by-name by full name instead of just id
-s, --by-seq by seq
-D, --dup-num-file string file to save number and list of duplicated seqs
-d, --dup-seqs-file string file to save duplicated seqs
-i, --ignore-case ignore case
-m, --md5 use MD5 instead of original seqs to reduce memory usage when comparing by seqs
Examples
Similar to common
.
- General use
cat hairpin.fa | seqkit rmdup -s -o clean.fa.gz zcat test.fastq.gz | seqkit rmdup -s -o clean.fa.gz
- Save duplicated sequences to file
cat hairpin.fa | seqkit rmdup -s -i -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt cat duplicated.detail.txt
common
Usage
find common sequences of multiple files by id/name/sequence
Usage:
seqkit common [flags]
Flags:
-n, --by-name match by full name instead of just id
-s, --by-seq match by sequence
-i, --ignore-case ignore case
-m, --md5 use MD5 instead of original seqs to reduce memory usage when comparing by seqs
Examples
- Create file
echo -e ">1234\nATGGTATTAGATA" >> file1.fa
echo -e ">2234\nATGGTATTTGATA" >> file1.fa
echo -e ">3234_A\nATGGTATTCGATA" >> file1.fa
echo -e ">1234_B\nATGGTATTCGATA" >> file2.fa
echo -e ">4234\nATGGTATTAGATA" >> file2.fa
echo -e ">2234\nATGGTATTGGGATA" >> file2.fa
cat file1.fa
cat file2.fa
- By ID (default)
seqkit common file*.fa -o common.fasta
- By full name
seqkit common file*.fa -n -o common.fasta
- By sequence
seqkit common file*.fa -s -i -o common.fasta
split
Usage
split sequences into files by name ID, subsequence of given region,
part size or number of parts.
The definition of region is 1-based and with some custom design.
Examples:
1-based index 1 2 3 4 5 6 7 8 9 10
negative index 0-9-8-7-6-5-4-3-2-1
seq A C G T N a c g t n
1:1 A
2:4 C G T
-4:-2 c g t
-4:-1 c g t n
-1:-1 n
2:-2 C G T N a c g t
1:-1 A C G T N a c g t n
Usage:
seqkit split [flags]
Flags:
Flags:
-i, --by-id split squences according to sequence ID
-p, --by-part int split squences into N parts
-r, --by-region string split squences according to subsequence of given region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for more examples
-s, --by-size int split squences into multi parts with N sequences
-d, --dry-run dry run, just print message and no files will be created.
-f, --force overwrite output directory
-k, --keep-temp keep tempory FASTA and .fai file when using 2-pass mode
-m, --md5 use MD5 instead of region sequence in output file when using flag -r (--by-region)
-O, --out-dir string output directory (default value is infile.split)
-2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
Examples
- Split sequences into parts with at most 10000 sequences
seqkit split hairpin.fa -s 10000
- Split sequences into 4 parts
seqkit split hairpin.fa -p 4
To reduce memory usage when spliting big file, we should alwasy use flag --two-pass
seqkit split hairpin.fa -p 4 -2
- Split sequences by species. i.e. by custom IDs (first three letters)
seqkit split hairpin.fa -i --id-regexp "^([\w]+)\-" -2
- Split sequences by sequence region (for example, sequence barcode)
seqkit split hairpin.fa -r 1:3 -2
[INFO] split by region: 1:3 [INFO] read and write sequences to tempory file: hairpin.fa.gz.fa ... [INFO] read sequence IDs and sequence region from FASTA file ... [INFO] create and read FASTA index ... [INFO] write 463 sequences to file: hairpin.region_1:3_AUG.fa.gz [INFO] write 349 sequences to file: hairpin.region_1:3_ACU.fa.gz [INFO] write 311 sequences to file: hairpin.region_1:3_CGG.fa.gz
Sequence suffix could be defined as
-r -12:-1
sample
Usage
sample sequences by number or proportion.
Usage:
seqkit sample [flags]
Flags:
-n, --number int sample by number (result may not exactly match)
-p, --proportion float sample by proportion
-s, --rand-seed int rand seed for shuffle (default 11)
-2, --two-pass 2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
Examples
- Sample by proportion
cat hairpin.fa | seqkit sample -p 0.1 -o sample.fa.gz
- Sample by number
cat hairpin.fa | seqkit sample -n 1000 -o sample.fa.gz
To reduce memory usage when spliting big file, we could use flag
--two-pass
We can also use
seqkit sample -p
followed withseqkit head -n
:cat hairpin.fa | seqkit sample -p 0.1 | seqkit head -n 1000 -o sample.fa.gz
- Set rand seed to reproduce the result
cat hairpin.fa | seqkit sample -p 0.1 -s 11
- Most of the time, we could shuffle after sampling
cat hairpin.fa | seqkit sample -p 0.1 | seqkit shuffle -o sample.fa.gz
Note that when sampling on FASTQ files, make sure using same random seed by
flag -s
(--rand-seed
)
head
Usage
print first N FASTA/Q records
Usage:
seqkit head [flags]
Flags:
-n, --number int print first N FASTA/Q records (default 10)
Examples
- FASTA
seqkit head -n 1 hairpin.fa
- FASTQ
seqkit head -n 1 test.fastq.gz
replace
Usage replace name/sequence/by regular expression.
Note that the replacement supports capture variables. e.g. $1 represents the text of the first submatch. ATTENTION: use SINGLE quote NOT double quotes in *nix OS.
Examples: Adding space to all bases.
seqkit head -n 1 hairpin.fa | seqkit replace -p "(.)" -r '$1 ' -s
more on: http://shenwei356.github.io/seqkit/usage/#replace
Usage: seqkit replace [flags]
Flags:
-s, --by-seq replace seq
-i, --ignore-case ignore case
-p, --pattern string search regular expression
-r, --replacement string replacement. supporting capture variables. e.g. $1 represents the text of the first submatch. ATTENTION: use SINGLE quote NOT double quotes in *nix OS or use the \ escape character. record number is also supported by "{NR}"
Examples
- Remove descriptions
echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p " .+"
- Replace “-“ with “=”
echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p "\-" -r '='
- Remove gaps in sequences.
echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p " |-" -s
- Add space to every base. ATTENTION: use SINGLE quote NOT double quotes in *nix OS
echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p "(.)" -r '$1 ' -s
- Transpose sequence with csvtk
echo -e ">seq1\nACTGACGT\n>seq2\nactgccgt" | seqkit replace -p "(.)" -r "\$1 " -s | seqkit seq -s -u | csvtk space2tab | csvtk -t transpose
- Rename with number of record
echo -e ">abc\nACTG\n>123\nATTT" | seqkit replace -p .+ -r "seq_{NR}"
shuffle
Usage
shuffle sequences.
By default, all records will be readed into memory.
For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
supported.
Firstly, seqkit reads the sequence IDs. If the file is not plain FASTA file,
seqkit will write the sequences to tempory files, and create FASTA index.
Secondly, seqkit shuffles sequence IDs and extract sequences by FASTA index.
Usage:
seqkit shuffle [flags]
Flags:
-k, --keep-temp keep tempory FASTA and .fai file when using 2-pass mode
-s, --rand-seed int rand seed for shuffle (default 23)
-2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
Examples
- General use.
seqkit shuffle hairpin.fa > shuffled.fa
- For big genome, you’d better use two-pass mode so seqkit could use
FASTA index to reduce memory usage
time seqkit shuffle -2 GRCh38_latest_genomic.fna.gz > shuffle.fa
sort
Usage
sort sequences by id/name/sequence/length.
By default, all records will be readed into memory.
For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
supported.
Firstly, seqkit reads the sequence head and length information.
If the file is not plain FASTA file,
seqkit will write the sequences to tempory files, and create FASTA index.
Secondly, seqkit sort sequence by head and length information
and extract sequences by FASTA index.
Usage:
seqkit sort [flags]
Flags:
-l, --by-length by sequence length
-n, --by-name by full name instead of just id
-s, --by-seq by sequence
-i, --ignore-case ignore case
-k, --keep-temp keep tempory FASTA and .fai file when using 2-pass mode
-r, --reverse reverse the result
-L, --seq-prefix-length int length of sequence prefix on which seqkit sorts by sequences (0 for whole sequence) (default 10000)
-2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
Examples
For FASTA format, use flag -2 (–two-pass) to reduce memory usage
- sort by ID
echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet
- sort by ID, ignoring case.
echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet -i
- sort by seq, ignoring case.
echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet -s -i
- sort by sequence length
echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAAnnn\n>seq3\nacgt" | seqkit sort --quiet -l
Quick glance
- Sequence number
seqkit stat hairpin.fa
- First 10 bases
cat hairpin.fa | seqkit subseq -r 1:10 | seqkit sort -s | seqkit seq -s | head -n 10
Repeated hairpin sequences
We may want to check how may identical hairpins among different species there are.
seqkit rmdup
could remove duplicated sequences by sequence content,
and save the replicates to another file (here is duplicated.fa.gz
),
as well as replicating details (duplicated.detail.txt
,
1th column is the repeated number,
2nd column contains sequence IDs seperated by comma).
seqkit rmdup -s -i hairpin.fa -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt
The result shows the most conserved miRNAs among different species,
mir-29b
, mir-125
, mir-19b-1
and mir-20a
.
And the dre-miR-430c
has the most multicopies in Danio rerio.
Hairpins in different species
- Before spliting by species, let’s take a look at the sequence names.
seqkit seq hairpin.fa.gz -n | head -n 3
The first three letters (e.g. cel
) are the abbreviation of species names. So we could split hairpins by the first letters by defining custom sequence ID parsing regular expression ^([\w]+)\-
.
By default, seqkit
takes the first non-space letters as sequence ID.
- Split sequences by species.
A custom ID parsing regular expression is used,
^([\w]+)\-
.seqkit split hairpin.fa -i --id-regexp "^([\w]+)\-" --two-pass
To reduce memory usage when splitting big file, we should always use flag
--two-pass
- Species with most miRNA hairpins. Third column is the sequences number.
cd hairpin.fa.split/; seqkit stat hairpin.part_* | csvtk space2tab | csvtk -t sort -k num_seqs:nr | csvtk -t pretty| more
Here, a CSV/TSV tool csvtk is used to sort and view the result.
Softwares
- seqkit. (Go). Version v2. Compiled with Go 1.7rc5.
- fasta_utilities. (Perl). Version 3dcc0bc. Lots of dependencies to install.
- fastx_toolkit. (Perl). Version 0.0.13. Can’t handle multi-line FASTA files.
- seqmagick. (Python). Version 0.6.1
- seqtk. (C). Version 1.1-r92-dirty.
Not used:
- pyfaidx. (Python). Version 0.4.7.1. Not used, because it exhausted my memory (10G) when computing reverse-complement on a 5GB fasta file of 250 bp.
A Python script memusg was used to compute running time and peak memory usage of a process.
Features
Categories | Features | seqkit | fasta_utilities | fastx_toolkit | pyfaidx | seqmagick | seqtk |
---|---|---|---|---|---|---|---|
Formats support | Multi-line FASTA | Yes | Yes | – | Yes | Yes | Yes |
FASTQ | Yes | Yes | Yes | – | Yes | Yes | |
Multi-line FASTQ | Yes | Yes | – | – | Yes | Yes | |
Validating sequences | Yes | – | Yes | Yes | – | – | |
Supporting RNA | Yes | Yes | – | – | Yes | Yes | |
Functions | Searching by motifs | Yes | Yes | – | – | Yes | – |
Sampling | Yes | – | – | – | Yes | Yes | |
Extracting sub-sequence | Yes | Yes | – | Yes | Yes | Yes | |
Removing duplicates | Yes | – | – | – | Partly | – | |
Splitting | Yes | Yes | – | Partly | – | – | |
Splitting by seq | Yes | – | Yes | Yes | – | – | |
Shuffling | Yes | – | – | – | – | – | |
Sorting | Yes | Yes | – | – | Yes | – | |
Locating motifs | Yes | – | – | – | – | – | |
Common sequences | Yes | – | – | – | – | – | |
Cleaning bases | Yes | Yes | Yes | Yes | – | – | |
Transcription | Yes | Yes | Yes | Yes | Yes | Yes | |
Translation | – | Yes | Yes | Yes | Yes | – | |
Filtering by size | Indirect | Yes | – | Yes | Yes | – | |
Renaming header | Yes | Yes | – | – | Yes | Yes | |
Other features | Cross-platform | Yes | Partly | Partly | Yes | Yes | Yes |
Reading STDIN | Yes | Yes | Yes | – | Yes | Yes | |
Reading gzipped file | Yes | Yes | – | – | Yes | Yes | |
Writing gzip file | Yes | – | – | – | Yes | – |
Note 2: See usage for detailed options of seqkit.