FASTA format

Last updated
FASTA format
Filename extensions
.fasta, .fas, .fa, .fna, .ffn, .faa, .mpfa, .frn
Internet media type
text/x-fasta
Uniform Type Identifier (UTI) no
Developed by David J. Lipman
William R. Pearson [1] [2]
Initial release1985
Type of format Bioinformatics
Extended from ASCII for FASTA
Extended to FASTQ format [3]
Website www.ncbi.nlm.nih.gov/BLAST/fasta.shtml

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

Contents

The format allows for sequence names and comments to precede the sequences. It originated from the FASTA software package and has since become a near-universal standard in bioinformatics. [4]

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages.

Overview

A sequence begins with a greater-than character (">") followed by a description of the sequence (all in a single line). The lines immediately following the description line are the sequence representation, with one letter per amino acid or nucleic acid, and are typically no more than 80 characters in length.

For example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK* 

Original format

The original FASTA/Pearson format is described in the documentation for the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc, or fastaVN.me—where VN is the Version Number).

In the original format, a sequence was represented as a series of lines, each of which was no longer than 120 characters and usually did not exceed 80 characters. This probably was to allow for the preallocation of fixed line sizes in software: at the time most users relied on Digital Equipment Corporation (DEC) VT220 (or compatible) terminals which could display 80 or 132 characters per line.[ citation needed ] Most people preferred the bigger font in 80-character modes and so it became the recommended fashion to use 80 characters or less (often 70) in FASTA lines. Also, the width of a standard printed page is 70 to 80 characters (depending on the font). Hence, 80 characters became the norm.[ citation needed ]

The first line in a FASTA file started either with a ">" (greater-than) symbol or, less frequently, a ";"[ citation needed ] (semicolon) was taken as a comment. Subsequent lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly became used to hold a summary description of the sequence, often starting with a unique library accession number, and with time it has become commonplace to always use ">" for the first line and to not use ";" comments (which would otherwise be ignored).

Following the initial line (used for a unique description of the sequence) was the actual sequence itself in the standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...). It was also common to end the sequence with an "*" (asterisk) character (in analogy with use in PIR formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence. Below are a few sample sequences:

;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*  >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK*  >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY 

A multiple-sequence FASTA format, or multi-FASTA format, would be obtained by concatenating several single-sequence FASTA files in one file. This does not imply a contradiction with the format as only the first line in a FASTA file may start with a ";" or ">", forcing all subsequent sequences to start with a ">" in order to be taken as separate sequences (and further forcing the exclusive reservation of ">" for the sequence definition line). Thus, the examples above would be a multi-FASTA file if taken together.

Modern bioinformatics programs that rely on the FASTA format expect the sequence headers to be preceded by ">". The sequence is generally represented as "interleaved", or on multiple lines as in the above example, but may also be "sequential", or on a single line. Running different bioinformatics programs may require conversions between "sequential" and "interleaved" FASTA formats.

Description line

The description line (defline) or header/identifier line, which begins with ">", gives a name and/or a unique identifier for the sequence, and may also contain additional information. In a deprecated practice, the header line sometimes contained more than one header, separated by a ^A (Control-A) character. In the original Pearson FASTA format, one or more comments, distinguished by a semi-colon at the beginning of the line, may occur after the header. Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:

>SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL >SEQUENCE_2 SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH 

NCBI identifiers

The NCBI defined a standard for the unique identifier used for the sequence (SeqID) in the header line. This allows a sequence that was obtained from a database to be labelled with a reference to its database record. The database identifier format is understood by the NCBI tools like makeblastdb and table2asn. The following list describes the NCBI FASTA defined format for sequence identifiers. [5]

TypeFormat(s)Example(s)
local (i.e. no database reference)lcl|integer

lcl|string

lcl|123

lcl|hmm271

GenInfo backbone seqidbbs|integerbbs|123
GenInfo backbone moltypebbm|integerbbm|123
GenInfo import IDgim|integergim|123
GenBank gb|accession|locusgb|M73307|AGMA13GT
EMBL emb|accession|locusemb|CAM43271.1|
PIR pir|accession|namepir||G36364
SWISS-PROT sp|accession|namesp|P01013|OVAX_CHICK
patentpat|country|patent|sequence-numberpat|US|RE33188|1
pre-grant patentpgp|country|application-number|sequence-numberpgp|EP|0238993|7
RefSeq ref|accession|nameref|NM_010450.1|
general database reference
(a reference to a database that's not in this list)
gnl|database|integer

gnl|database|string

gnl|taxon|9606

gnl|PID|e1632

GenInfo integrated databasegi|integergi|21434723
DDBJ dbj|accession|locusdbj|BAC85684.1|
PRF prf|accession|nameprf||0806162C
PDB pdb|entry|chainpdb|1I4L|D
third-party GenBank tpg|accession|nametpg|BK003456|
third-party EMBL tpe|accession|nametpe|BN000123|
third-party DDBJ tpd|accession|nametpd|FAA00017|
TrEMBLtr|accession|nametr|Q90RT2|Q90RT2_9HIV1

The vertical bars ("|") in the above list are not separators in the sense of the Backus–Naur form but are part of the format. Multiple identifiers can be concatenated, also separated by vertical bars.

Sequence representation

Following the header line, the actual sequence is represented. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence. The nucleic acid codes supported are: [6] [7] [8]

Nucleic Acid CodeMeaningMnemonic
AA Adenine
CC Cytosine
GG Guanine
TT Thymine
UU Uracil
(i)i inosine (non-standard)
RA or G (I) puRine
YC, T or U pYrimidines
KG, T or Ubases which are Ketones
MA or Cbases with aMino groups
SC or GStrong interaction
WA, T or UWeak interaction
Bnot A (i.e. C, G, T or U)B comes after A
Dnot C (i.e. A, G, T or U)D comes after C
Hnot G (i.e., A, C, T or U)H comes after G
Vneither T nor U (i.e. A, C or G)V comes after U
NA C G T UNucleic acid
-gap of indeterminate length

The amino acid codes supported (22 amino acids and 3 special codes) are:

Amino Acid CodeMeaning
A Alanine
B Aspartic acid (D) or Asparagine (N)
C Cysteine
D Aspartic acid
E Glutamic acid
F Phenylalanine
G Glycine
H Histidine
I Isoleucine
J Leucine (L) or Isoleucine (I)
K Lysine
L Leucine
M Methionine/Start codon
N Asparagine
O Pyrrolysine (rare)
P Proline
Q Glutamine
R Arginine
S Serine
T Threonine
U Selenocysteine (rare)
V Valine
W Tryptophan
Y Tyrosine
Z Glutamic acid (E) or Glutamine (Q)
Xany
*translation stop
-gap of indeterminate length

FASTA file

Filename extension

There is no standard filename extension for a text file containing FASTA formatted sequences. The table below shows each extension and its respective meaning.

ExtensionMeaningNotes
fasta, fas, fa [9] generic FASTAAny generic FASTA file
fnaFASTA nucleic acidUsed generically to specify nucleic acids
ffnFASTA nucleotide of gene regionsContains coding regions for a genome
faaFASTA amino acidContains amino acid sequences
mpfaFASTA amino acidsContains multiple protein sequences
frnFASTA non-coding RNA Contains non-coding RNA regions for a genome, e.g. tRNA, rRNA

Compression

The compression of FASTA files requires a specific compressor to handle both channels of information: identifiers and sequence. For improved compression results, these are mainly divided into two streams where the compression is made assuming independence. For example, the algorithm MFCompress [10] performs lossless compression of these files using context modelling and arithmetic encoding. Genozip, [11] a software package for compressing genomic files, uses an extensible context-based model. Benchmarks of FASTA file compression algorithms have been reported by Hosseini et al. in 2016, [12] and Kryukov et al. in 2020. [13]

Encryption

The encryption of FASTA files can be performed with various tools, including Cryfa and Genozip. Cryfa uses AES encryption and also enables data compression. [14] [15] Similarly, Genozip can encrypt FASTA files with AES-256 during compression. [11]

Extensions

FASTQ format is a form of FASTA format extended to indicate information related to sequencing. It is created by the Sanger Centre in Cambridge. [3]

A2M/A3M are a family of FASTA-derived formats used for sequence alignments. In A2M/A3M sequences, lowercase characters are taken to mean insertions, which are then indicated in the other sequences as the dot (".") character. The dots can be discarded for compactness without loss of information. As with typical FASTA files used in alignments, the gap ("-") is taken to mean exactly one position. [16] A3M is similar to A2M, with the added rule that gaps aligned to insertions can too be discarded. [17]

Working with FASTA files

A plethora of user-friendly scripts are available from the community to perform FASTA file manipulations. Online toolboxes, such as FaBox [18] or the FASTX-Toolkit within Galaxy servers, are also available. [19] These can be used to segregate sequence headers/identifiers, rename them, shorten them, or extract sequences of interest from large FASTA files based on a list of wanted identifiers (among other available functions). A tree-based approach to sorting multi-FASTA files (TREE2FASTA [20] ) also exists based on the coloring and/or annotation of sequences of interest in the FigTree viewer. Additionally, the Bioconductor Biostrings package can be used to read and manipulate FASTA files in R. [21]

Several online format converters exist to rapidly reformat multi-FASTA files to different formats (e.g. NEXUS, PHYLIP) for use with different phylogenetic programs, such as the converter available on phylogeny.fr. [22]

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

<span class="mw-page-title-main">National Center for Biotechnology Information</span> Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.

<span class="mw-page-title-main">Nucleic acid sequence</span> Succession of nucleotides in a nucleic acid

A nucleic acid sequence is a succession of bases within the nucleotides forming alleles within a DNA or RNA (GACU) molecule. This succession is denoted by a series of a set of five different letters that indicate the order of the nucleotides. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, with its double helix, there are two possible directions for the notated sequence; of these two, the sense strand is used. Because nucleic acids are normally linear (unbranched) polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

BioJava is an open-source software project dedicated to provide Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a huge range of data, starting from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks such as to parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.

In molecular biology, open reading frames (ORFs) are defined as spans of DNA sequence between the start and stop codons. Usually, this is considered within a studied region of a prokaryotic DNA sequence, where only one of the six possible reading frames will be "open". Such an ORF may contain a start codon and by definition cannot extend beyond a stop codon. That start codon indicates where translation may start. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation.

<span class="mw-page-title-main">Ensembl genome database project</span> Scientific project at the European Bioinformatics Institute

Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

In bioinformatics, MAFFT is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the Fast Fourier Transform. Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, higher accuracy alignments, alignment of non-coding RNA sequences, and the addition of new sequences to existing alignments.

<span class="mw-page-title-main">Dot plot (bioinformatics)</span>

In bioinformatics a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity after sequence alignment. It is a type of recurrence plot.

<span class="mw-page-title-main">Phred quality score</span> Measurement in DNA sequencing

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for the computer program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.

Stockholm format is a multiple sequence alignment format used by Pfam, Rfam and Dfam, to disseminate protein, RNA and DNA sequence alignments. The alignment editors Ralee, Belvu and Jalview support Stockholm format as do the probabilistic database search tools, Infernal and HMMER, and the phylogenetic analysis tool Xrate. Stockholm format files often have the filename extension .sto or .stk.

<span class="mw-page-title-main">UGENE</span>

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

FASTQ format is a text-based format for storing both a biological sequence and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

<span class="mw-page-title-main">Sequence Read Archive</span>

The Sequence Read Archive is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC), and run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

High-throughput sequencing technologies have led to a dramatic decline of genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and 1001 Genomes Project. The storage and transfer of the tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in the development of novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods for genomic data compression.

<span class="mw-page-title-main">European Nucleotide Archive</span> Online database from the EBI on Nucleotides

The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.

Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence developed by Heng Li and Bob Handsaker et al. It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT’s PSL. The name of SAM came from Gabor Marth from University of Utah, who originally had a format under the same name but with a different syntax more similar to a BLAST output. It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project.

<span class="mw-page-title-main">Binary Alignment Map</span>

Binary Alignment Map (BAM) is the comprehensive raw data of genome sequencing; it consists of the lossless, compressed binary representation of the Sequence Alignment Map-files.

References

  1. Lipman DJ, Pearson WR (March 1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–41. Bibcode:1985Sci...227.1435L. doi:10.1126/science.2983426. PMID   2983426. Closed Access logo transparent.svg
  2. Pearson WR, Lipman DJ (April 1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America. 85 (8): 2444–8. Bibcode:1988PNAS...85.2444P. doi: 10.1073/pnas.85.8.2444 . PMC   280013 . PMID   3162770.
  3. 1 2 Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (April 2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research. 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC   2847217 . PMID   20015970.
  4. "What is FASTA format?". Zhang Lab. Archived from the original on 2022-12-04. Retrieved 2022-12-04.
  5. NCBI C++ Toolkit Book. National Center for Biotechnology Information. Retrieved 2018-12-19.
  6. Tao Tao (2011-08-24). "Single Letter Codes for Nucleotides". [NCBI Learning Center]. National Center for Biotechnology Information. Archived from the original on 2012-09-14. Retrieved 2012-03-15.
  7. "IUPAC code table". NIAS DNA Bank. Archived from the original on 2011-08-11.
  8. "anysymbol". MAFFT - a multiple sequence alignment program.
  9. "Alignment Fileformats". 22 May 2019. Retrieved 22 May 2019.
  10. Pinho AJ, Pratas D (January 2014). "MFCompress: a compression tool for FASTA and multi-FASTA data". Bioinformatics. 30 (1): 117–8. doi:10.1093/bioinformatics/btt594. PMC   3866555 . PMID   24132931.
  11. 1 2 Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (2021-02-15). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. ISSN   1367-4803. PMC   8388020 . PMID   33585897.
  12. Hosseini, Morteza; Pratas, Diogo; Pinho, Armando J. (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi: 10.3390/info7040056 . ISSN   2078-2489.
  13. Kryukov K, Ueda MT, Nakagawa S, Imanishi T (July 2020). "Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences". GigaScience. 9 (7): giaa072. doi:10.1093/gigascience/giaa072. PMC   7336184 . PMID   32627830.
  14. Pratas D, Hosseini M, Pinho A (2017). "Cryfa: a tool to compact and encrypt FASTA files". 11th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB). Advances in Intelligent Systems and Computing. Vol. 616. Springer. pp. 305–312. doi:10.1007/978-3-319-60816-7_37. ISBN   978-3-319-60815-0.
  15. Hosseini, Morteza; Pratas, Diogo; Pinho, Armando J (2019-01-01). Berger, Bonnie (ed.). "Cryfa: a secure encryption tool for genomic data". Bioinformatics. 35 (1): 146–148. doi:10.1093/bioinformatics/bty645. ISSN   1367-4803. PMC   6298042 . PMID   30020420.
  16. "Description of A2M alignment format". SAMtools . Archived from the original on 2022-08-15.
  17. "soedinglab/hh-suite: reformat.pl". GitHub. 20 November 2022.
  18. Villesen, P. (2007). "FaBox: an online toolbox for fasta sequences". Molecular Ecology Notes. 7 (6): 965–968. doi:10.1111/j.1471-8286.2007.01821.x. ISSN   1471-8278.
  19. Blankenberg D, Von Kuster G, Bouvier E, Baker D, Afgan E, Stoler N, Galaxy Team, Taylor J, Nekrutenko A (2014). "Dissemination of scientific software with Galaxy ToolShed". Genome Biology. 15 (2): 403. doi: 10.1186/gb4161 . PMC   4038738 . PMID   25001293.
  20. Sauvage T, Plouviez S, Schmidt WE, Fredericq S (March 2018). "TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees". BMC Research Notes. 11 (1): 403. doi: 10.1186/s13104-018-3268-y . PMC   5838971 . PMID   29506565.
  21. Pagès, H; Aboyoun, P; Gentleman, R; DebRoy, S (2018). "Biostrings: Efficient manipulation of biological strings". Bioconductor.org. R package version 2.48.0. doi:10.18129/B9.bioc.Biostrings.
  22. Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S, Lefort V, Lescot M, Claverie JM, Gascuel O (July 2008). "Phylogeny.fr: robust phylogenetic analysis for the non-specialist". Nucleic Acids Research. 36 (Web Server issue): W465–9. doi:10.1093/nar/gkn180. PMC   2447785 . PMID   18424797.