Skip to main content

Some utilities for dealing with character data

Concatenation

SequenceMatrix

Concatenating character matrices seems like it ought to be easy, but I’ve had far too many issues with data formats to naively make that kind of statement. Mesquite never seemed to give output that could be consumed by other programs, Geneious had import issues, and all of my own hand-rolled techniques would always seem to hit edge cases.

SequenceMatrix is quite a robust little piece of software that will generally “just work”. It’s a cross-platform GUI program (anywhere that Java runs) and will fill your missing data with gaps, make sure that all taxa are represented in your concatenated matrix, fix species names by removing Genbank junk, and so on. Best of all it supports drag and drop.

Conversion

I haven’t yet found an equivalently easy GUI program that will convert character data from one format to another, so I wrote my own. This script converts character data (e.g., DNA, RNA, morphology) from one format to another. Currently the supported formats are:

The script uses DendroPy for most of its heavy lifting, and will also automatically parallelize the operation using Python multiprocessing. Here’s how you might use it:

# convert all fasta files in the current directory to nexus files
./convert_characters.py *.fasta --input-format=fasta --output-format=nexus --type=dna

# convert all fasta files in a subdirectory to nexus files in a different subdirectory
# this uses the short versions of the input and output commands

./convert_characters.py alignments/*.fasta -ifasta -onexus -tdna --prefix=nexus/; --basename

Download the script from GitHub. (gist link)

If you found this post useful, please consider supporting my work with an avocado toast 🥑.