Content | |
---|---|
Description | For automated bacterial genome annotation and chromosomal map generation |
Data types captured | Data input: Raw genome sequence (FASTA format), labeled genome sequence (FASTA format) or predicted/labeled proteome sequence (FASTA format); Data output: Fully annotated genome along with an interactive, annotated genome map |
Contact | |
Research center | University of Alberta |
Laboratory | David S. Wishart |
Primary citation | [1] |
Access | |
Website | https://www.basys.ca/ |
Download URL | https://www.basys.ca/ |
Miscellaneous | |
Data release frequency | Last update 2012 |
Curation policy | Manually curated |
BASys (Bacterial Annotation System) is a freely available web server that can be used to perform automated, comprehensive annotation of bacterial genomes. [2] With the advent of next-generation DNA sequencing, it is now possible to sequence the complete genome of a bacterium (typically ~4 million bases) within a single day. This has led to an explosion in the number of fully sequenced microbes: as of 2013, more than 2700 fully sequenced bacterial genomes had been deposited with GenBank. However, a continuing challenge in microbial genomics is finding the resources or tools to annotate the large number of newly sequenced genomes. BASys was developed in 2005 in anticipation of these needs and was the world’s first publicly accessible microbial genome annotation web server. Because of its widespread popularity, the BASys server was updated in 2011 through the addition of multiple server nodes to handle the large number of queries it was receiving.
The BASys server is designed to accept either assembled genome data (raw DNA sequence data) or complete proteome assignments as input. If raw DNA sequence is provided, BASys employs Glimmer (version 2.1.3) to identify the genes. [1] The output from BASys is a comprehensive genome-wide annotation (with ~60 annotation subfields for each gene) and a zoomable, hyperlinked genome map of the query genome. BASys uses nearly 30 different programs to determine and annotate gene/protein names, GO functions, COG functions, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. The full list of programs used by BASys is given below:
Name | Method |
---|---|
Glimmer 2.1.3 | Glimmer is a popular and very accurate ab initio gene finding program for microbial DNA. In a study of 31 complete bacterial and archaeal genomes, Glimmer achieved an average gene prediction accuracy of 99.36%. Glimmer uses Interpolated Markov Models to distinguish coding regions from noncoding DNA. Glimmer's performance decreases with increasing GC content. For genomes with high GC content (>60%), Glimmer may generate a high number of false positive predictions and therefore should be used with caution. |
HMMER 2.3.2 | Used for local Pfam searches. |
Homodeller 2.0 | Locally developed homology modelling program. |
SignalP 3.0 | Signal peptide prediction. |
TMHMM 2.0 | Prediction of transmembrane helices in proteins. |
PSIPRED 2.45 | Secondary structure prediction. PSIPRED achieves an average Q3 score of 80.6% for secondary structure prediction. |
PS_scan | Tool for local PROSITE scans. |
VADAR 1.4 | Locally developed protein structure analysis tool. BASys uses VADAR to analyze protein structures for secondary structural information. |
PSORT-B 2.0.4 | Used to predict subcellular location. PSORT-B attains a precision of 96% for Gram-positive and Gram-negative bacteria. |
ProteinNameExtractor 1.0 | BASys function prediction module. This module was validated against a set of expertly annotated proteins from C. trachomatis. |
FindParalogs 1.0 | BASys module for paralog identification. The paralogs database is created from the conceptual translations for the identified coding regions supplied to BASys by Glimmer or by the submitter. |
FindHomologs 1.0 | BASys module for homolog identification. Searches model organism databases for possible homologs. |
GOSearch 1.0 | BASys module for extracting Gene Ontology information from various sources. |
OperonFinder 1.0 | BASys module for identifying operons. |
StructureManager 1.0 | BASys module for manipulating protein structure files. |
StructureClassifier 1.0 | BASys module for determining structure class from secondary structure information. |
Structure Finder 1.0 | BASys module for generating protein structures from various sources. |
COG_Finder 1.0 | BASys module for identifying COG functional categories and accessions |
Secondary Structure Manager 1.0 | BASys module for generating secondary structure information from various sources. |
ECNumber_Finder | BASys module for mapping EC_number to and from various sources. |
SwissProt Annotation Manager 1.0 | BASys module for comparing and transitively applying annotations from SwissProt records. |
CCDB Annotation Manager 1.0 | BASys module for comparing and transitively applying annotations from CCDB records. |
Gene Identifier 1.0 | BASys module for coordinating gene identification information from Glimmer or user submissions. |
BASys Annotation Manager 1.0 | The BASys pipeline manager. |
KEGG Search Manager | BASys module for searching and extracting metabolic information from KEGG. |
SubCellLocalization Manager 1.0 | BASys module for generating subcellular location annotation from various sources. |
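The Glimmer entry in the table above advises caution for genomes with GC content above 60%. A submitter could check this before uploading a raw sequence. The following is a minimal sketch in Python, not part of BASys itself; the function names `read_fasta`, `gc_fraction` and `flag_high_gc` are hypothetical, and the 0.60 threshold is simply the figure quoted in the table.

```python
# Illustrative pre-submission check; not part of BASys or Glimmer.
HIGH_GC_THRESHOLD = 0.60  # threshold quoted in the Glimmer description


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the full header text (minus '>') as the record name.
            name = line[1:].strip() or "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gc_fraction(seq):
    """Return the G+C fraction, counting only unambiguous A/C/G/T bases."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def flag_high_gc(fasta_text, threshold=HIGH_GC_THRESHOLD):
    """Map each record name to (gc_fraction, exceeds_threshold)."""
    result = {}
    for name, seq in read_fasta(fasta_text).items():
        gc = gc_fraction(seq)
        result[name] = (gc, gc > threshold)
    return result


example = ">chr1\nGGCCGGCC\nAT\n>chr2\nATATATGC\n"
report = flag_high_gc(example)
for record_name, (gc, is_high) in report.items():
    note = "review gene calls carefully" if is_high else "ok"
    print(f"{record_name}: GC = {gc:.2f} ({note})")
```

Ambiguous bases such as N are excluded from the denominator here; that is a design choice of this sketch rather than anything specified by BASys.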
In addition to its extensive annotation for each gene/protein in the query genome, BASys also generates colorful, clickable and fully zoomable circular maps of each input chromosome. These bacterial genome maps are generated using a program called CGView (Circular Genome Viewer), which was developed in 2004. [3] The genome maps are designed to allow rapid navigation and detailed visualization of all the BASys-generated gene annotations. A complete BASys run takes approximately 16 h for an average bacterial chromosome (approximately 4 Megabases). BASys annotations may be viewed and downloaded anonymously or through a password-protected access system. BASys will store its bacterial genome annotations on the server for a maximum of 180 days. BASys handles approximately 1000 submissions a year. BASys is accessible at https://www.basys.ca/
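To illustrate the general idea behind a circular chromosome map, the sketch below converts a base position into an angle and then into drawing coordinates. It is written in Python for illustration only and is not CGView's actual implementation or API; the function names and the convention that position 1 sits at the top of the circle and positions increase clockwise are assumptions made for this example.

```python
import math


def position_to_angle(position, genome_length):
    """Angle in degrees, clockwise from 12 o'clock, for a 1-based position.

    Assumed convention for this sketch: position 1 maps to 0 degrees.
    """
    if not 1 <= position <= genome_length:
        raise ValueError("position must lie within 1..genome_length")
    return 360.0 * (position - 1) / genome_length


def angle_to_xy(angle_deg, radius, center=(0.0, 0.0)):
    """Convert a clockwise-from-top angle to (x, y), with y pointing up."""
    theta = math.radians(angle_deg)
    cx, cy = center
    return (cx + radius * math.sin(theta), cy + radius * math.cos(theta))


# A gene starting one quarter of the way around a 4 Mb chromosome.
genome_length = 4_000_000
angle = position_to_angle(1_000_001, genome_length)
x, y = angle_to_xy(angle, radius=100.0)
print(f"angle = {angle:.1f} degrees, x = {x:.1f}, y = {y:.1f}")
```

Zooming can then be thought of as drawing only the arc between two angles at a larger radius, which is why labels become legible only at higher magnification.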
All data in BacMap is non-proprietary or is derived from a non-proprietary source. It is freely accessible and available to anyone. In addition, nearly every data item is fully traceable and explicitly referenced to the original source. BacMap data is available through a public web interface and downloads.
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.
A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query such as a DNA, RNA, or protein sequence or ‘keyword’ and search one or more databases for information related to that sequence. Summaries and aggregate results are provided in standardized format describing the information that would otherwise have required visits to many smaller sites or direct literature searches to compile. Many sequence profiling tools are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases. The access to these kinds of tools is either web based or locally downloadable executables.
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast. Further information is located at the Yeastract curated repository.
MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.
The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. The Browser is a graphical viewer optimized to support fast interactive performance and is an open-source, web-based tool suite built on top of a MySQL database for rapid visualization, examination, and querying of the data at many levels. The Genome Browser Database, browsing tools, downloadable data files, and documentation can all be found on the UCSC Genome Bioinformatics website.
PlasMapper (Plasmid Mapper) is a freely available web server that automatically generates and annotates high-quality circular plasmid maps. It is a particularly useful online service for molecular biologists wishing to generate plasmid maps without having to purchase or maintain expensive, commercial software. PlasMapper accepts plasmid/vector DNA sequence as input (FASTA format) and uses sequence pattern matching and BLAST sequence alignment to automatically identify and label common promoters, terminators, cloning sites, restriction sites, reporter genes, affinity tags, selectable marker genes, origins of replication and open reading frames. PlasMapper then reformats and presents the identified features in both a simple textual form and as high-resolution, multicolored image.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
The Human Metabolome Database (HMDB) is a comprehensive, high-quality, freely accessible, online database of small molecule metabolites found in the human body. It has been created by the Human Metabolome Project funded by Genome Canada and is one of the first dedicated metabolomics databases. The HMDB facilitates human metabolomics research, including the identification and characterization of human metabolites using NMR spectroscopy, GC-MS spectrometry and LC/MS spectrometry. To aid in this discovery process, the HMDB contains three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data (Fig. 1–3). The chemical data includes 41,514 metabolite structures with detailed descriptions along with nearly 10,000 NMR, GC-MS and LC/MS spectra.
The Toxin and Toxin-Target Database (T3DB), also known as the Toxic Exposome Database, is a freely accessible online database of common substances that are toxic to humans, along with their protein, DNA or organ targets. The database currently houses nearly 3,700 toxic compounds or poisons described by nearly 42,000 synonyms. This list includes various groups of toxins, including common pollutants, pesticides, drugs, food toxins, household and industrial/workplace toxins, cigarette toxins, and uremic toxins. These toxic substances are linked to 2,086 corresponding protein/DNA target records. In total there are 42,433 toxic substance-toxin target associations. Each toxic compound record (ToxCard) in T3DB contains nearly 100 data fields and holds information such as chemical properties and descriptors, mechanisms of action, toxicity or lethal dose values, molecular and cellular interactions, medical information, NMR and MS spectra, and up- and down-regulated genes. This information has been extracted from over 18,000 sources, which include other databases, government documents, books, and scientific literature.
MetaboAnalyst is a set of online tools for metabolomic data analysis and interpretation, created by members of the Wishart Research Group at the University of Alberta. It was first released in May 2009 and version 2.0 was released in January 2012. MetaboAnalyst provides a variety of analysis methods that have been tailored for metabolomic data. These methods include metabolomic data processing, normalization, multivariate statistical analysis, and data annotation. The current version is focused on biomarker discovery and classification.
Ensembl Genomes is a scientific project to provide genome-scale data from non-vertebrate species.
αr14 is a family of bacterial small non-coding RNAs with representatives in a broad group of α-proteobacteria. The first member of this family (Smr14C2) was found in a Sinorhizobium meliloti 1021 locus located in the chromosome (C). It was later renamed NfeR1 and shown to be highly expressed in salt stress and during the symbiotic interaction on legume roots. Further homology and structure conservation analysis identified 2 other chromosomal copies and 3 plasmidic ones. Moreover, full-length Smr14C homologs have been identified in several nitrogen-fixing symbiotic rhizobia, in the plant pathogens belonging to Agrobacterium species as well as in a broad spectrum of Brucella species. αr14C RNA species are 115-125 nt long and share a well defined common secondary structure. Most of the αr14 transcripts can be catalogued as trans-acting sRNAs expressed from well-defined promoter regions of independent transcription units within intergenic regions (IGRs) of the α-proteobacterial genomes.
In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.
BacMap is a freely available web-accessible database containing fully annotated, fully zoomable and fully searchable chromosome maps from more than 2500 prokaryotic species. BacMap was originally developed in 2005 to address the challenges of viewing and navigating through the growing numbers of bacterial genomes that were being generated through large-scale sequencing efforts. Since it was first introduced, the number of bacterial genomes in BacMap has grown by more than 15X. Essentially BacMap functions as an on-line visual atlas of microbial genomes. All of the genome annotations in BacMap were generated through the BASys genome annotation system. BASys is a widely used microbial annotation infrastructure that performs comprehensive bioinformatic analyses on raw bacterial genome sequence data. All of the genome (chromosome) maps in BacMap were constructed using the program known as CGView. CGView is a popular visualization program for generating interactive, web-compatible circular chromosome maps. Each chromosome map in BacMap is extensively hyperlinked and each chromosome image can be interactively navigated, expanded and rotated using navigation buttons or hyperlinks. All identified genes in a BacMap chromosome map are colored according to coding direction and, when sufficiently zoomed in, gene labels are visible. Each gene label on a BacMap genome map is also hyperlinked to a 'gene card'. The gene cards provide detailed information about the corresponding DNA and protein sequences. Each genome map in BacMap is searchable via BLAST and a gene name/synonym search.
CGView is a freely available downloadable Java software program, applet and API for generating colorful, zoomable, hyperlinked, richly annotated images of circular genomes such as bacterial chromosomes, mitochondrial DNA and plasmids. It is commonly used in bacterial sequence annotation pipelines to generate visual output suitable for the web. It has also been used in a variety of popular web servers and databases (BacMap).
METAGENassist is a freely available web server for comparative metagenomic analysis. Comparative metagenomic studies involve the large-scale comparison of genomic or taxonomic census data from bacterial samples across different environments. Historically this has required a sound knowledge of statistics, computer programming, genetics and microbiology. As a result, only a small number of researchers are routinely able to perform comparative metagenomic studies. To circumvent these limitations, METAGENassist was developed to allow metagenomic analyses to be performed by non-specialists, easily and intuitively over the web. METAGENassist is particularly notable for its rich graphical output and its extensive database of bacterial phenotypic information.
Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.