Count missing characters in FASTA files with a shell one-liner

One of the best things about working with FASTA and PHYLIP files is that they are relatively simple formats and are therefore easy to parse with command-line tools. These file types certainly have drawbacks, but their simplicity is handy for certain tasks.

I needed to find out how much missing data there was in our FASTA and PHYLIP multiple sequence alignments. While it would be straightforward to write a one-off Python or R script, the power of a full programming environment isn’t strictly necessary for such a simple task. Here I show how to build up a small shell pipeline to count missing characters.

Count the total number of characters (bytes) in a file, including header lines and newlines:

wc -c file.fasta

Count the total number of sequence characters in a file, excluding header lines and newlines:

grep -v "^>" file.fasta | tr -d '\n' | wc -c
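Note that wc -c counts bytes, so every newline adds one to the total unless it is removed first. A quick check on made-up input (the variable names here are only for illustration):

```shell
# Two made-up sequence lines of four characters each.
# tr -d ' ' strips the padding that some wc implementations add.
with_newlines=$(printf 'ACGT\nACGT\n' | wc -c | tr -d ' ')
without_newlines=$(printf 'ACGT\nACGT\n' | tr -d '\n' | wc -c | tr -d ' ')
echo "with newlines: $with_newlines, without: $without_newlines"
```

The first count is 10 and the second is 8, so for an alignment with many short lines the difference can be noticeable.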

Do this for all sequence files:

find . -name "*.fasta" -exec sh -c 'grep -v "^>" "$1" | tr -d "\n" | wc -c' -- {} \;

…and add them up:

find . -name "*.fasta" -exec sh -c 'grep -v "^>" "$1" | tr -d "\n" | wc -c' -- {} \; | paste -s -d+ - | bc

Count the number of gap characters (-) in a file (assumes no hyphens in sequence names). grep -o prints each match on its own line, so count lines rather than bytes:

grep -oF -e - file.fasta | wc -l
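Because grep -o prints each match on its own line, counting lines gives the number of matches, whereas counting bytes also counts the newline after each match. A small check on made-up input, using grep -F in place of fgrep (the variable names are hypothetical):

```shell
# Two made-up alignment rows containing four gap characters in total.
by_lines=$(printf 'A-C-\nAC--\n' | grep -oF -e - | wc -l | tr -d ' ')
by_bytes=$(printf 'A-C-\nAC--\n' | grep -oF -e - | wc -c | tr -d ' ')
echo "lines: $by_lines, bytes: $by_bytes"
```

The line count is 4, matching the number of gaps, while the byte count is 8.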

Get the proportion of missing data (gaps divided by the total number of sequence characters) in a file:

(grep -oF -e - file.fasta | wc -l && grep -v "^>" file.fasta | tr -d '\n' | wc -c) | paste -s -d/ - | bc -l
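As a sanity check, the same calculation can be worked through on a tiny made-up alignment where the answer is known in advance. This is a sketch under assumptions: the data and the variable names (tmp, gaps, total, prop) are hypothetical, gaps are counted on sequence lines only using tr, and awk does the division in place of bc so that the result has a fixed number of decimal places.

```shell
# Build a tiny two-sequence alignment in a temporary file (made-up data).
tmp=$(mktemp)
printf '>seq1\nAC-T\n>seq2\nA--T\n' > "$tmp"

# Gap characters on sequence lines only: drop headers, delete everything
# except '-', then count the remaining bytes.
gaps=$(grep -v '^>' "$tmp" | tr -cd '-' | wc -c | tr -d ' ')

# All sequence characters: drop headers and newlines, then count bytes.
total=$(grep -v '^>' "$tmp" | tr -d '\n' | wc -c | tr -d ' ')

# Proportion of missing data, printed with four decimal places.
prop=$(awk -v g="$gaps" -v t="$total" 'BEGIN { printf "%.4f", g / t }')

echo "$gaps gaps out of $total characters: $prop"
rm -f "$tmp"
```

With three gaps among eight sequence characters, this prints a proportion of 0.3750.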

Do the previous, but for every FASTA file under the current directory:

find . -name "*.fasta" -exec sh -c '(grep -oF -e - "$1" | wc -l && grep -v "^>" "$1" | tr -d "\n" | wc -c) | paste -s -d/ - | bc -l' -- {} \;

Add up the number of missing characters for all files in a certain directory:

find folder/ -name "*.fasta" -exec sh -c 'grep -oF -e - "$1" | wc -l' -- {} \; | paste -s -d+ - | bc

Add up the number of missing characters for each subdirectory of the current directory, in comma-separated format (requires bash for the process substitution):

while IFS= read -r d; do printf '%s,' "$d"; find "$d" -name "*.fasta" -exec sh -c 'grep -oF -e - "$1" | wc -l' -- {} \; | paste -s -d+ - | bc; done < <(find . -mindepth 1 -maxdepth 1 -type d)
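One possible variation, sketched with awk rather than the grep and wc pipeline: a single pass per file that prints the file name, gap count, number of sequence characters, and proportion as CSV. The function name missing_csv is hypothetical, and the sketch assumes that header lines start with ">" and that only "-" counts as missing.

```shell
# missing_csv DIR: print "file,missing,total,proportion" for each
# .fasta file under DIR. Hypothetical helper, not part of any tool.
missing_csv() {
    find "$1" -type f -name '*.fasta' -exec awk '
        /^>/ { next }                    # skip header lines
        {
            total += length($0)          # sequence characters on this line
            missing += gsub(/-/, "")     # gsub returns the number of replacements
        }
        END {
            # Files with no sequence characters are skipped to avoid dividing by zero.
            if (total > 0)
                printf "%s,%d,%d,%.4f\n", FILENAME, missing, total, missing / total
        }
    ' {} \;
}
```

Calling missing_csv . prints one line per file. Unlike searching the whole file, this ignores hyphens in sequence names, and widening the pattern to something like /[-?]/ would also count "?" if your alignments use it for missing data.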

Other variations are left as an exercise to the reader. I hope this has been an enjoyable journey through the wonderful world of shell pipelines.