Staden Package

Original author(s) Rodger Staden
Developer(s) James Bonfield, et al.
Initial release 1977
Stable release 2.0.0b9 / 24 January 2012
Preview release 2.0.0b11 / 25 April 2016
Repository sourceforge.net/projects/staden
Written in C, C++, Fortran, Tcl
Operating system Unix, Linux, macOS, Windows
Platform IA-32, x86-64
Available in English
Type Bioinformatics
License BSD 3-clause
Website staden.sourceforge.net

The Staden Package is a set of software tools for DNA sequence assembly, editing, and sequence analysis. It is open-source software, released under a BSD 3-clause license.

Package components

The Staden Package consists of several different programs.

History

The Staden Package was developed from 1977 by Rodger Staden's group at the Medical Research Council (MRC) Laboratory of Molecular Biology, Cambridge, England. [2] [3] [4] The package was available free to academic users. In 2003, when funding for further development ended, 2,500 licenses had been issued and the package had an estimated 10,000 users. [5] The package was converted to open source in 2004, and several new versions have been released since.

During the years of active development, the Staden group published a number of widely used file formats and ideas, including the SCF file format, [6] the use of sequence quality scores to generate accurate consensus sequences, [7] and the ZTR file format. [8]
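The idea of using quality scores to call a consensus can be illustrated with a minimal Python sketch. It is not the algorithm used by the Staden Package or described in the cited paper; it assumes a simplified scheme in which each read contributes a (base, quality) pair at an aligned position and the base with the highest summed quality is chosen. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick a consensus base for one alignment column.

    `column` is a non-empty list of (base, quality) pairs, one per read
    covering the position. Each base is scored by the sum of the
    qualities supporting it. This weighting is an illustrative
    assumption, not the Staden Package's method.
    """
    totals = defaultdict(int)
    for base, quality in column:
        totals[base] += quality
    # Highest total wins; ties are broken alphabetically for determinism.
    return min(totals, key=lambda b: (-totals[b], b))


def consensus(columns):
    """Build a consensus string from a list of alignment columns."""
    return "".join(consensus_base(col) for col in columns)


# Hypothetical example data: three aligned positions.
columns = [
    [("A", 40), ("A", 35), ("C", 10)],
    [("G", 5), ("T", 30), ("G", 8)],  # one confident T outweighs two weak Gs
    [("C", 20), ("C", 20)],
]
print(consensus(columns))  # ATC
```

The second column shows why quality weighting matters: a simple majority vote would call G, whereas the single high-quality T carries more weight than the two low-quality Gs.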


References

  1. Bonfield JK, Whitwham A (2010). "Gap5: editing the billion fragment sequence assembly". Bioinformatics. 26 (14): 1699–1703. doi:10.1093/bioinformatics/btq268. PMC 2894512. PMID 20513662.
  2. Staden R (1979). "A strategy of DNA sequencing employing computer programs". Nucleic Acids Res. 6 (7): 2601–2610. doi:10.1093/nar/6.7.2601. PMC 327874. PMID 461197.
  3. Staden R (1984). "Computer methods to aid the determination and analysis of DNA sequences". Biochem Soc Trans. 12 (6): 1005–1008. doi:10.1042/bst0121005. PMID 6397374.
  4. Staden R, Beal KF, Bonfield JK (2000). "The Staden package, 1998". Methods Mol Biol. 132: 115–130. doi:10.1385/1-59259-192-2:115. PMID 10547834.
  5. "UK's MRC Ends Support for Staden Package: First Sign of Post-HGP Funding Priority Shift?". Genomeweb. Genomeweb LLC. 5 May 2003. Retrieved 15 November 2016.
  6. Dear S, Staden R (1992). "A standard file format for data from DNA sequencing instruments". DNA Seq. 3 (2): 107–110. doi:10.3109/10425179209034003. PMID 1457811.
  7. Bonfield JK, Staden R (1995). "The application of numerical estimates of base calling accuracy to DNA sequencing projects". Nucleic Acids Res. 23 (8): 1406–1410. doi:10.1093/nar/23.8.1406. PMC 306869. PMID 7753633.
  8. Bonfield JK, Staden R (2002). "ZTR: a new format for DNA sequence trace data". Bioinformatics. 18 (1): 3–10. doi:10.1093/bioinformatics/18.1.3. PMID 11836205.