BCH709 Introduction to Bioinformatics: seqkit_tutorial

Tutorial

seqkit

Create and Installation

conda create -n seqkit -c bioconda seqkit bedops csvtk

Environment activate

conda activate seqkit

Export environment

conda env export > seqkit.yml

Conda deactivate

conda deactivate

Conda environment remove

conda remove --name seqkit --all

Reinstall environment

conda env create -f seqkit.yml

Environment activate

conda activate seqkit

use seqkit

seqkit --help

make homework folder

mkdir /data/gpfs/assoc/bch709-4/${USER}
rm -rf ~/bch709
ln -s /data/gpfs/assoc/bch709-4/${USER} ~/bch709
cd ~/bch709

Usage

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 2.0.0

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Available Commands:
  amplicon        extract amplicon (or specific region around it) via primer(s)
  bam             monitoring and online histograms of BAM record features
  common          find common sequences of multiple files by id/name/sequence
  completion      generate the autocompletion script for the specified shell
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fish            look for short sequences in larger sequences using local alignment
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (and length, GC content, average quality...)
  genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  head-genome     print sequences of the first genome with common prefixes in name
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by ID/name/sequence
  sample          sample sequences by number or proportion
  sana            sanitize broken single line FASTQ files
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (extract ID, filter by length, remove gaps...)
  shuffle         shuffle sequences
  sliding         extract subsequences in sliding windows
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update
  watch           monitoring and online histograms of sequence features

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Current folder

cd ~/bch709

Datasets

Test fastq file

wget https://raw.githubusercontent.com/plantgenomicslab/BCH709/gh-pages/data/test.fastq.gz
ll -tr

Datasets from The miRBase Sequence Database – Release 21

https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/

Only DNA and gtf/bed data of Chr1 were used:

seq

Usage

transform sequences (revserse, complement, extract ID...)

Usage:
  seqkit seq [flags]

Flags:
  -p, --complement                complement sequence (blank for Protein sequence)
      --dna2rna                   DNA to RNA
  -G, --gap-letter string         gap letters (default "- ")
  -l, --lower-case                print sequences in lower case
  -n, --name                      only print names
  -i, --only-id                   print ID instead of full head
  -q, --qual                      only print qualities
  -g, --remove-gaps               remove gaps
  -r, --reverse                   reverse sequence)
      --rna2dna                   RNA to DNA
  -s, --seq                       only print sequences
  -u, --upper-case                print sequences in upper case
  -v, --validate-seq              validate bases according to the alphabet
  -V, --validate-seq-length int   length of sequence to validate (0 for whole seq) (default 10000)

Examples

  1. Read and print
zcat test.fastq.gz
seqkit seq test.fastq.gz
  1. Sequence types
  1. Only print names

Please check regular expression in Regex

  1. Only print seq (global flag -w defines the output line width, 0 for no wrap)
seqkit seq hairpin.fa -s -w 0
  1. Reverse comlement sequence
    seqkit seq hairpin.fa.gz -r -p
    
  2. Remove gaps and to lower/upper case
    echo -e ">seq\nACGT-ACTGC-ACC" | seqkit seq -g -u
    
  3. RNA to DNA
    echo -e ">seq\nUCAUAUGCUUGUCUCAAAGAUUA" | seqkit seq --rna2dna
    

subseq

Usage

get subsequences by region/gtf/bed, including flanking sequences.

Recommendation: use plain FASTA file, so seqkit could utilize FASTA index.

The definition of region is 1-based and with some custom design.

Examples:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n

Usage:
  seqkit subseq [flags]

Flags:
      --bed string        by BED file
      --chr value         select limited sequence with sequence IDs (multiple value supported, case ignored) (default [])
  -d, --down-stream int   down stream length
      --feature value     select limited feature types (multiple value supported, case ignored, only works with GTF) (default [])
      --gtf string        by GTF (version 2.2) file
  -f, --only-flank        only return up/down stream sequence
  -r, --region string     by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
  -u, --up-stream int     up stream length

Examples

Recommendation: use plain FASTA file, so seqkit could utilize FASTA index.

  1. First 12 bases
    cat hairpin.fa | seqkit subseq -r 1:12
    
  2. Last 12 bases
    cat hairpin.fa | seqkit subseq -r -12:-1
    
  3. Subsequences without first and last 12 bases
    cat hairpin.fa | seqkit subseq -r 13:-13
    
  4. Get subsequence by GTF file

Human genome example:

AVOID loading all data from GRCh38_latest_genomic.gtf.gz, the uncompressed data are so big and may exhaust your RAM.

We could specify chromesomes and features.

seqkit subseq --gtf GRCh38_latest_genomic.gtf.gz --chr NC_000001.11 --feature CDS  GRCh38_latest_genomic.fna.gz > chr1.gtf.cds.fa
seqkit stat chr1.gtf.cds.fa
  1. Get subsequences by BED file.
    seqkit subseq --bed GRCh38_latest_genomic.bed.gz --chr NC_000001.11 GRCh38_latest_genomic.fna.gz >  chr1.bed.gz.fa
    

We may need to remove duplicated sequences

seqkit subseq --bed GRCh38_latest_genomic.bed.gz --chr NC_000001.11  GRCh38_latest_genomic.fna.gz | seqkit rmdup > chr1.bed.rmdup.fa
seqkit stat chr1.fa.gz

sliding

Usage:
  seqkit sliding [flags]

Flags:
  -C, --circular-genome   circular genome
  -s, --step int        step size
  -W, --window int      window size

Examples

  1. General use
echo -e ">seq\nACGTacgtNN" | seqkit sliding -s 3 -W 6
  1. Circular genome
    echo -e ">seq\nACGTacgtNN" | seqkit sliding -s 3 -W 6 -C
    
  2. Generate GC content for ploting
    cat hairpin.fa | seqkit fx2tab | head -n 1 | seqkit tab2fx | seqkit sliding -s 5 -W 30 | seqkit fx2tab -n -g
    

stat

Usage simple statistics of FASTA files

Usage:
  seqkit stat [flags]

Examples

  1. General use
    seqkit stat *.f*{a,q}.gz
    

fq2fa

covert FASTQ to FASTA

Usage: seqkit fq2fa [flags]

seqkit fq2fa test.fastq.gz -o test_.fa.gz
zcat test_.fa.gz
zcat test.fastq.gz

fx2tab & tab2fx

Usage (fx2tab)

covert FASTA/Q to tabular format, and provide various information,
like sequence length, GC content/GC skew.

Usage:
  seqkit fx2tab [flags]

Flags:
  -B, --base-content value   print base content. (case ignored, multiple values supported) e.g. -B AT -B N (default [])
  -g, --gc                   print GC content
  -G, --gc-skew              print GC-Skew
  -H, --header-line          print header line
  -l, --length               print sequence length
  -n, --name                 only print names (no sequences and qualities)
  -i, --only-id              print ID instead of full head

Usage (tab2fx)
covert tabular format (first two/three columns) to FASTA/Q format

Usage:
  seqkit tab2fx [flags]

Flags:
  -p, --comment-line-prefix value   comment line prefix (default [#,//])

Examples

  1. Default output
seqkit fx2tab hairpin.fa| head -n 2
  1. Print sequence length, GC content, and only print names (no sequences), we could also print title line by flag -T.
seqkit fx2tab hairpin.fa -l -g -n -i -H | head -n 4 | csvtk -t -C '&' pretty
  1. Use fx2tab and tab2fx in pipe
    cat hairpin.fa | seqkit fx2tab | seqkit tab2fx
    zcat test.fastq.gz | seqkit fx2tab | seqkit tab2fx
    
  2. Sort sequences by length (use seqkit sort -l)
    cat hairpin.fa | seqkit fx2tab -l | sort -t"`echo -e '\t'`" -n -k4,4 | seqkit tab2fx
    
seqkit sort -l hairpin.fa
  1. Sorting or filtering by GC (or other base by -flag -B) content could also achieved in similar way.

  2. Get first 1000 sequences

seqkit fx2tab hairpin.fa | head -n 1000 | seqkit tab2fx

seqkit fx2tab test.fastq.gz | head -n 1000 | seqkit tab2fx

Extension

After converting FASTA to tabular format with seqkit fx2tab, it could be handled with CSV/TSV tools, e.g. csvtk, a cross-platform, efficient and practical CSV/TSV toolkit

grep

Usage

search sequences by pattern(s) of name or sequence motifs

Usage:
  seqkit grep [flags]

Flags:
  -n, --by-name               match by full name instead of just id
  -s, --by-seq                match by seq
  -d, --degenerate            pattern/motif contains degenerate base
      --delete-matched        delete matched pattern to speedup
  -i, --ignore-case           ignore case
  -v, --invert-match          invert the sense of matching, to select non-matching records
  -p, --pattern value         search pattern (multiple values supported) (default [])
  -f, --pattern-file string   pattern file
  -r, --use-regexp            patterns are regular expression

Examples

  1. Extract human hairpins (i.e. sequences with name starting with hsa)
    cat hairpin.fa | seqkit grep -r -p ^hsa
    
  2. Remove human and mice hairpins.
    cat hairpin.fa| seqkit grep -r -p ^hsa -p ^mmu -v
    
  3. Extract new entries by information from miRNA.diff

    1. Get IDs of new entries.
      cat miRNA.diff | grep ^# -v | grep NEW | cut -f 2 > list
      more  list ##q to out
      
  4. Extract by ID list file
    cat hairpin.fa | seqkit grep -f list > new.fa
    
  5. Extract sequences starting with AGGCG
    cat hairpin.fa | seqkit grep -s -r -i -p ^aggcg
    
  6. Extract sequences with TTSAA (AgsI digest site) in SEQUENCE. Base S stands for C or G.
    cat hairpin.fa | seqkit grep -s -d -i -p UUSAA
    

    It’s equal to but simpler than:

    cat hairpin.fa| seqkit grep -s -r -i -p UU[CG]AA
    

locate

Usage: locate subsequences/motifs

Motifs could be EITHER plain sequence containing "ACTGN" OR regular
expression like "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)" for ORFs.
Degenerate bases like "RYMM.." are also supported by flag -d.

By default, motifs are treated as regular expression.
When flag -d given, regular expression may be wrong.
For example: "\w" will be wrongly converted to "\[AT]".

Usage:
  seqkit locate [flags]

Flags:
  -d, --degenerate                pattern/motif contains degenerate base
  -i, --ignore-case               ignore case
  -P, --only-positive-strand      only search at positive strand
  -p, --pattern value             search pattern/motif (multiple values supported) (default [])
  -f, --pattern-file string       pattern/motif file (FASTA format)
  -V, --validate-seq-length int   length of sequence to validate (0 for whole seq) (default 10000)

Examples

  1. Locate ORFs.
    cat hairpin.fa | seqkit locate -i -r -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)"
    
  2. Locate Motif.
    cat hairpin.fa | seqkit locate -i -d -p AUGGACUN
    

Notice that seqkit grep only searches in positive strand, but seqkit loate could recognize both strand

rmdup

Usage

remove duplicated sequences by id/name/sequence

Usage:
  seqkit rmdup [flags]

Flags:
    -n, --by-name                by full name instead of just id
    -s, --by-seq                 by seq
    -D, --dup-num-file string    file to save number and list of duplicated seqs
    -d, --dup-seqs-file string   file to save duplicated seqs
    -i, --ignore-case            ignore case
    -m, --md5                    use MD5 instead of original seqs to reduce memory usage when comparing by seqs

Examples

Similar to common.

  1. General use
    cat hairpin.fa | seqkit rmdup -s -o clean.fa.gz
    zcat test.fastq.gz | seqkit rmdup -s -o clean.fa.gz
    
  2. Save duplicated sequences to file
    cat hairpin.fa | seqkit rmdup -s -i -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt
    cat duplicated.detail.txt 
    

common

Usage

find common sequences of multiple files by id/name/sequence

Usage:
  seqkit common [flags]

Flags:
    -n, --by-name       match by full name instead of just id
    -s, --by-seq        match by sequence
    -i, --ignore-case   ignore case
    -m, --md5           use MD5 instead of original seqs to reduce memory usage when comparing by seqs

Examples

  1. Create file
echo -e ">1234\nATGGTATTAGATA" >> file1.fa
echo -e ">2234\nATGGTATTTGATA" >> file1.fa
echo -e ">3234_A\nATGGTATTCGATA" >> file1.fa
echo -e ">1234_B\nATGGTATTCGATA" >> file2.fa
echo -e ">4234\nATGGTATTAGATA" >> file2.fa
echo -e ">2234\nATGGTATTGGGATA" >> file2.fa
cat file1.fa
cat file2.fa
  1. By ID (default)
    seqkit common file*.fa -o common.fasta
    
  2. By full name
    seqkit common file*.fa -n -o common.fasta
    
  3. By sequence
    seqkit common file*.fa -s -i -o common.fasta
    

split

Usage

split sequences into files by name ID, subsequence of given region,
part size or number of parts.

The definition of region is 1-based and with some custom design.

Examples:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n

Usage:
  seqkit split [flags]

Flags:
Flags:
  -i, --by-id              split squences according to sequence ID
  -p, --by-part int        split squences into N parts
  -r, --by-region string   split squences according to subsequence of given region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for more examples
  -s, --by-size int        split squences into multi parts with N sequences
  -d, --dry-run            dry run, just print message and no files will be created.
  -f, --force              overwrite output directory
  -k, --keep-temp          keep tempory FASTA and .fai file when using 2-pass mode
  -m, --md5                use MD5 instead of region sequence in output file when using flag -r (--by-region)
  -O, --out-dir string     output directory (default value is infile.split)
  -2, --two-pass           two-pass mode read files twice to lower memory usage. (only for FASTA format)

Examples

  1. Split sequences into parts with at most 10000 sequences
seqkit split hairpin.fa -s 10000
  1. Split sequences into 4 parts
seqkit split hairpin.fa -p 4

To reduce memory usage when spliting big file, we should alwasy use flag --two-pass

seqkit split hairpin.fa -p 4 -2
  1. Split sequences by species. i.e. by custom IDs (first three letters)
    seqkit split hairpin.fa -i --id-regexp "^([\w]+)\-" -2
    
  2. Split sequences by sequence region (for example, sequence barcode)
    seqkit split hairpin.fa -r 1:3 -2
    
     [INFO] split by region: 1:3
     [INFO] read and write sequences to tempory file: hairpin.fa.gz.fa ...
     [INFO] read sequence IDs and sequence region from FASTA file ...
     [INFO] create and read FASTA index ...
     [INFO] write 463 sequences to file: hairpin.region_1:3_AUG.fa.gz
     [INFO] write 349 sequences to file: hairpin.region_1:3_ACU.fa.gz
     [INFO] write 311 sequences to file: hairpin.region_1:3_CGG.fa.gz
    

    Sequence suffix could be defined as -r -12:-1

sample

Usage

sample sequences by number or proportion.

Usage:
  seqkit sample [flags]

Flags:
  -n, --number int         sample by number (result may not exactly match)
  -p, --proportion float   sample by proportion
  -s, --rand-seed int      rand seed for shuffle (default 11)
  -2, --two-pass           2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin

Examples

  1. Sample by proportion
    cat hairpin.fa | seqkit sample -p 0.1 -o sample.fa.gz
    
  2. Sample by number
    cat hairpin.fa | seqkit sample -n 1000 -o sample.fa.gz
    

    To reduce memory usage when spliting big file, we could use flag --two-pass

    We can also use seqkit sample -p followed with seqkit head -n:

    cat hairpin.fa | seqkit sample -p 0.1 | seqkit head -n 1000 -o sample.fa.gz
    
  3. Set rand seed to reproduce the result
    cat hairpin.fa | seqkit sample -p 0.1 -s 11
    
  4. Most of the time, we could shuffle after sampling
    cat hairpin.fa | seqkit sample -p 0.1 | seqkit shuffle -o sample.fa.gz
    

Note that when sampling on FASTQ files, make sure using same random seed by flag -s (--rand-seed)

Usage

print first N FASTA/Q records

Usage:
  seqkit head [flags]

Flags:
  -n, --number int   print first N FASTA/Q records (default 10)

Examples

  1. FASTA
    seqkit head -n 1 hairpin.fa
    
  2. FASTQ
    seqkit head -n 1 test.fastq.gz
    

replace

Usage replace name/sequence/by regular expression.

Note that the replacement supports capture variables. e.g. $1 represents the text of the first submatch. ATTENTION: use SINGLE quote NOT double quotes in *nix OS.

Examples: Adding space to all bases.

seqkit head -n 1 hairpin.fa | seqkit replace -p "(.)" -r '$1 ' -s

more on: http://shenwei356.github.io/seqkit/usage/#replace

Usage: seqkit replace [flags]

Flags:
  -s, --by-seq               replace seq
  -i, --ignore-case          ignore case
  -p, --pattern string       search regular expression
  -r, --replacement string   replacement. supporting capture variables.  e.g. $1 represents the text of the first submatch. ATTENTION: use SINGLE quote NOT double quotes in *nix OS or use the \ escape character. record number is also supported by "{NR}"

Examples

  1. Remove descriptions
    echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p " .+"
    
  2. Replace “-“ with “=”
    echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p "\-" -r '='
    
  3. Remove gaps in sequences.
    echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p " |-" -s
    
  4. Add space to every base. ATTENTION: use SINGLE quote NOT double quotes in *nix OS
    echo -e ">seq1 abc-123\nACGT-ACGT" | seqkit replace -p "(.)" -r '$1 ' -s
    
  5. Transpose sequence with csvtk
    echo -e ">seq1\nACTGACGT\n>seq2\nactgccgt" | seqkit replace -p "(.)" -r     "\$1 " -s | seqkit seq -s -u | csvtk space2tab | csvtk -t transpose
    
  6. Rename with number of record
    echo -e ">abc\nACTG\n>123\nATTT" |  seqkit replace -p .+ -r "seq_{NR}"
    

shuffle

Usage

shuffle sequences.

By default, all records will be readed into memory.
For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
supported.

Firstly, seqkit reads the sequence IDs. If the file is not plain FASTA file,
seqkit will write the sequences to tempory files, and create FASTA index.

Secondly, seqkit shuffles sequence IDs and extract sequences by FASTA index.

Usage:
  seqkit shuffle [flags]

Flags:
  -k, --keep-temp       keep tempory FASTA and .fai file when using 2-pass mode
  -s, --rand-seed int   rand seed for shuffle (default 23)
  -2, --two-pass        two-pass mode read files twice to lower memory usage. (only for FASTA format)

Examples

  1. General use.
    seqkit shuffle hairpin.fa > shuffled.fa
    
  2. For big genome, you’d better use two-pass mode so seqkit could use FASTA index to reduce memory usage
    time seqkit shuffle -2 GRCh38_latest_genomic.fna.gz > shuffle.fa
    

sort

Usage

sort sequences by id/name/sequence/length.

By default, all records will be readed into memory.
For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
supported.

Firstly, seqkit reads the sequence head and length information.
If the file is not plain FASTA file,
seqkit will write the sequences to tempory files, and create FASTA index.

Secondly, seqkit sort sequence by head and length information
and extract sequences by FASTA index.

Usage:
  seqkit sort [flags]

Flags:
  -l, --by-length               by sequence length
  -n, --by-name                 by full name instead of just id
  -s, --by-seq                  by sequence
  -i, --ignore-case             ignore case
  -k, --keep-temp               keep tempory FASTA and .fai file when using 2-pass mode
  -r, --reverse                 reverse the result
  -L, --seq-prefix-length int   length of sequence prefix on which seqkit sorts by sequences (0 for whole sequence) (default 10000)
  -2, --two-pass                two-pass mode read files twice to lower memory usage. (only for FASTA format)

Examples

For FASTA format, use flag -2 (–two-pass) to reduce memory usage

  1. sort by ID
    echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet
    
  2. sort by ID, ignoring case.
    echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet -i
    
  3. sort by seq, ignoring case.
    echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet -s -i
    
  4. sort by sequence length
    echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAAnnn\n>seq3\nacgt" | seqkit sort --quiet -l
    

Quick glance

  1. Sequence number
    seqkit stat hairpin.fa
    
  2. First 10 bases
cat hairpin.fa | seqkit subseq -r 1:10 | seqkit sort -s | seqkit seq -s | head -n 10

Repeated hairpin sequences

We may want to check how may identical hairpins among different species there are. seqkit rmdup could remove duplicated sequences by sequence content, and save the replicates to another file (here is duplicated.fa.gz), as well as replicating details (duplicated.detail.txt, 1th column is the repeated number, 2nd column contains sequence IDs seperated by comma).

seqkit rmdup -s -i hairpin.fa -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt

The result shows the most conserved miRNAs among different species, mir-29b, mir-125, mir-19b-1 and mir-20a. And the dre-miR-430c has the most multicopies in Danio rerio.

Hairpins in different species

  1. Before spliting by species, let’s take a look at the sequence names.
    seqkit seq hairpin.fa.gz -n | head -n 3
    

The first three letters (e.g. cel) are the abbreviation of species names. So we could split hairpins by the first letters by defining custom sequence ID parsing regular expression ^([\w]+)\-.

By default, seqkit takes the first non-space letters as sequence ID.

  1. Split sequences by species. A custom ID parsing regular expression is used, ^([\w]+)\-.
    seqkit split hairpin.fa -i --id-regexp "^([\w]+)\-" --two-pass
    

    To reduce memory usage when splitting big file, we should always use flag --two-pass

  2. Species with most miRNA hairpins. Third column is the sequences number.
    cd hairpin.fa.split/;
    seqkit stat hairpin.part_* | csvtk space2tab | csvtk -t sort -k num_seqs:nr | csvtk -t pretty| more
    

    Here, a CSV/TSV tool csvtk is used to sort and view the result.

Softwares

  1. seqkit. (Go). Version v2. Compiled with Go 1.7rc5.
  2. fasta_utilities. (Perl). Version 3dcc0bc. Lots of dependencies to install.
  3. fastx_toolkit. (Perl). Version 0.0.13. Can’t handle multi-line FASTA files.
  4. seqmagick. (Python). Version 0.6.1
  5. seqtk. (C). Version 1.1-r92-dirty.

Not used:

  1. pyfaidx. (Python). Version 0.4.7.1. Not used, because it exhausted my memory (10G) when computing reverse-complement on a 5GB fasta file of 250 bp.

A Python script memusg was used to compute running time and peak memory usage of a process.

Features

Categories Features seqkit fasta_utilities fastx_toolkit pyfaidx seqmagick seqtk
Formats support Multi-line FASTA Yes Yes Yes Yes Yes
  FASTQ Yes Yes Yes Yes Yes
  Multi-line FASTQ Yes Yes Yes Yes
  Validating sequences Yes Yes Yes
  Supporting RNA Yes Yes Yes Yes
Functions Searching by motifs Yes Yes Yes
  Sampling Yes Yes Yes
  Extracting sub-sequence Yes Yes Yes Yes Yes
  Removing duplicates Yes Partly
  Splitting Yes Yes Partly
  Splitting by seq Yes Yes Yes
  Shuffling Yes
  Sorting Yes Yes Yes
  Locating motifs Yes
  Common sequences Yes
  Cleaning bases Yes Yes Yes Yes
  Transcription Yes Yes Yes Yes Yes Yes
  Translation Yes Yes Yes Yes
  Filtering by size Indirect Yes Yes Yes
  Renaming header Yes Yes Yes Yes
Other features Cross-platform Yes Partly Partly Yes Yes Yes
  Reading STDIN Yes Yes Yes Yes Yes
  Reading gzipped file Yes Yes Yes Yes
  Writing gzip file Yes Yes

Note 2: See usage for detailed options of seqkit.