SoyBase Database

SoyBase
Content
Description: SoyBase, the USDA-ARS soybean genetics and genomics database
Data types captured: Nucleic Acid, Protein, Expression, Metabolism, Epigenetics
Organisms: Glycine max, Glycine soja (soy, soya, soybean)
Contact
Research center: USDA Agricultural Research Service
Laboratory: Corn Insects and Crop Genetics Research Unit
Primary citation: PMID 20008513
Access
Data format: Various
Website: soybase.org
Miscellaneous
License: Public domain (US Government)
Versioning: None
Data release frequency: Continuously
Curation policy: Professionally curated

SoyBase is a database created by the United States Department of Agriculture that contains genetic information about soybeans, including genetic maps, information about Mendelian genetics, and molecular data on genes and sequences. It was started in 1990 and is freely available to individuals and organizations worldwide.

History

SoyBase was instituted by the Corn Insects and Crop Genetics Research Unit (CICGRU) in Ames, Iowa, as a central repository for the soybean genetics community's published information.[1] Originally, the database concentrated on genetic information such as genetic linkage maps and other Mendelian information. SoyBase genetic maps are a manually curated composite of all published mapping and QTL studies, and thus provide a species-level view of markers and QTL.

In 2010, the soybean genome sequence was released,[2] along with gene models and many other types of genome annotation, all of which were integrated into SoyBase. SoyBase genetic linkage maps were integral to the assembly of the soybean genome.

In 2018, the database received approximately 63,000 page requests per month from about 2,600 users in 130 countries. About 40 organizations in the United States and 82 foreign educational institutions access SoyBase yearly. SoyBase supplies data to U.S. and foreign government organizations and corporate entities.

Data submission and release policy

Data are accepted only from the researchers who originally generated them. Users who independently identify data suitable for inclusion in the database can contact SoyBase directly. A number of Excel-based spreadsheet templates are available to facilitate the inclusion of data in SoyBase.

All data in SoyBase are available without restrictions. A number of data sub-setting and download tools are provided and, when needed, ad hoc subsets of the data can be requested from the SoyBase curator.

Search tool

[Figure: The SoyBase Toolbox.]

The SoyBase Database Search Tool uses a text entry box for queries. Results are returned both as text and as graphical displays, which present soybean genetic and genomic data using Generic Model Organism Database (GMOD) open-source software. In addition to returning SoyBase objects identified by exact lexical matches to the query term, the tool uses a soybean-specific ontology to identify biologically related SoyBase objects.
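The two-step behaviour described above (exact lexical matching, followed by ontology-based expansion) can be illustrated with a minimal Python sketch. It does not reproduce SoyBase's actual implementation: the ontology terms, the object names, and the functions `descendants` and `search` are all hypothetical and exist only to show the general idea.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its more specific child terms.
ONTOLOGY_CHILDREN = {
    "seed composition": ["seed oil", "seed protein"],
    "seed oil": ["oleic acid content"],
}

# Hypothetical database objects, each annotated with one ontology term.
OBJECTS = {
    "Seed oil": "seed oil",
    "QTL-A": "seed oil",
    "QTL-B": "seed protein",
    "QTL-C": "oleic acid content",
    "QTL-D": "plant height",
}


def descendants(term, children):
    """Return the term itself plus every term below it in the ontology."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def search(query, objects, children):
    """Return exact name matches and ontology-related objects for a query."""
    q = query.strip().lower()
    # Step 1: exact (case-insensitive) lexical match on the object name.
    exact = sorted(name for name in objects if name.lower() == q)
    # Step 2: objects annotated with the query term or any descendant term.
    related_terms = descendants(q, children)
    related = sorted(
        name
        for name, term in objects.items()
        if term in related_terms and name not in exact
    )
    return {"exact": exact, "related": related}


if __name__ == "__main__":
    print(search("Seed oil", OBJECTS, ONTOLOGY_CHILDREN))
```

In this sketch, a query for "Seed oil" returns the object with that exact name, and separately lists objects annotated with "seed oil" or with the more specific term "oleic acid content".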

Some SoyBase sequence data and annotations are available through an InterMine instance (SoyMine), which is a collaboration with the Legume Information System project.[3]

Graphical displays

Genetic maps contain information on markers (SSR, RFLP, SNP, etc.), genes, and quantitative trait loci (QTL) from both biparental and genome-wide association studies (GWAS). Soybean genetic maps are displayed using the CMap comparative genetic map viewer. Soybean genomic sequence and gene model data are displayed using the GBrowse sequence viewer. Other genome annotations in this viewer include epigenetic data such as DNA methylation, and gene expression data from various soybean strains subjected to different treatments and from different soybean tissues and cultivars. Metabolic data and biochemical pathway information are displayed using Pathway Tools. Soybean metabolic pathway information (SoyCyc) was inferred by the Plant Metabolic Network project[4] and was used to populate the Pathway Tools displays.

References

  1. Grant, David; Nelson, Rex T.; Cannon, Steven B.; Shoemaker, Randy C. (2010). "SoyBase, the USDA-ARS soybean genetics and genomics database". Nucleic Acids Research. 38 (Database issue): D843–D846. doi:10.1093/nar/gkp798. PMC 2808871. PMID 20008513.
  2. Schmutz, Jeremy; Cannon, Steven B.; Schlueter, Jessica; et al. (2010). "Genome sequence of the palaeopolyploid soybean". Nature. 463 (7278): 178–183. Bibcode:2010Natur.463..178S. doi:10.1038/nature08670. PMID 20075913. S2CID 4372224.
  3. Dash, S.; Campbell, J.D.; Cannon, E.K.; et al. (2016). "Legume information system (LegumeInfo.org): a key component of a set of federated data resources for the legume family". Nucleic Acids Research. 44 (D1): D1181–D1188. doi:10.1093/nar/gkv1159. PMC 4702835. PMID 26546515.
  4. Schlapfer, P.; Zhang, P.; Wang, C.; et al. (2017). "Genome-Wide Prediction of Metabolic Enzymes, Pathways, and Gene Clusters in Plants". Plant Physiology. 173 (4): 2041–2059. doi:10.1104/pp.16.01942. PMC 5373064. PMID 28228535.