SEA-PHAGES stands for Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science; it was formerly called the National Genomics Research Initiative. [1] It was the first initiative of the Howard Hughes Medical Institute (HHMI) Science Education Alliance (SEA), launched in 2008 under its director Tuajuanda C. Jordan to improve the retention of science, technology, engineering, and mathematics (STEM) students. [2] SEA-PHAGES is a two-semester undergraduate research program administered by Graham Hatfull's group at the University of Pittsburgh and the Howard Hughes Medical Institute's Science Education Division. Students from over 100 universities nationwide engage in authentic individual research that includes a wet-bench laboratory and a bioinformatics component. [3]
During the first semester of this program, classes of around 18-24 undergraduate students work under the supervision of one or two university faculty members and a graduate student assistant, who have completed two week-long training workshops. Each student isolates, from local soil samples, their own bacteriophage that infects a specific bacterial host cell, and then characterizes it. [1] Once students have successfully isolated a phage, they can classify it by visualizing it in electron microscope (EM) images. [3] The students also extract and purify the phage DNA, and one sample is sent for sequencing so that it is ready for the second semester's curriculum. [1]
The second semester consists of annotating the genome the class sent to be sequenced. Students work together to evaluate the genes for start-stop coordinates, ribosome-binding sites, and the possible functions of the proteins the sequence encodes. Once the annotation is completed, it is submitted to GenBank, the DNA sequence database of the National Center for Biotechnology Information (NCBI). [1] If there is still time in the semester, or if the submitted DNA could not be sequenced, the class can request from the University of Pittsburgh a genome file that has yet to be annotated. In addition to acquiring laboratory and bioinformatic skills, students have the opportunity to publish their work in academic journals and to attend the national SEA-PHAGES conference in Washington, D.C. or a regional symposium.
All of the details regarding each student's phage are made public by entering them into the online database PhagesDB, which expands the knowledge of the SEA-PHAGES community as a whole. [4]
Starterator creates a report by comparing the called start sites of genes in the same Pham across annotated phage genomes and other drafts, so students can suggest an appropriate start for the auto-annotated genes in their actinobacteriophage genome. [3] [4] It is not usually a primary source for calling a gene start, because its suggestion is not always supported by information from other programs, or its start-stop coordinates differ from those of the gene as called by DNA Master. [5]
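The comparison of called starts within a Pham can be pictured with a small tally. The following Python sketch is only an illustration and is not Starterator's actual algorithm: the function name, the gene names, and the idea of a shared numbering of candidate starts are all hypothetical assumptions.

```python
from collections import Counter

def tally_called_starts(called_starts):
    """Count how often each candidate start is called across Pham members.

    `called_starts` maps a gene name to the number of the candidate start
    called for that gene (a hypothetical numbering shared by all members).
    Returns (start_number, count) pairs, most frequently called first.
    """
    return Counter(called_starts.values()).most_common()

# Hypothetical calls for four genes assigned to the same Pham.
pham_calls = {"PhageA_12": 3, "PhageB_14": 3, "PhageC_11": 5, "PhageD_13": 3}
ranking = tally_called_starts(pham_calls)
```

A start that most members of the Pham agree on would appear first in `ranking`; as the text notes, such a tally is supporting evidence rather than a primary basis for a call.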
These compare the amino acid sequence of a gene against other sequenced or annotated phage genomes in the database, so that students in the SEA-PHAGES community can predict the starts and functions of their proteins. [5]
This software uses its algorithm to generate a report showing the coding potential in each of the six possible reading frames of a genome, so that the probability of a gene's existence can be assessed during annotation. [5]
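The six frames arise because a double-stranded sequence can be read in codons from three offsets on each strand. The Python sketch below only splits a sequence into those six frames; it does not compute coding potential, and the function names and frame labels are illustrative assumptions rather than part of any SEA-PHAGES tool.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Split a DNA sequence into codons for all six reading frames.

    Returns a dict keyed by frame label ("+1".."+3" for the forward
    strand, "-1".."-3" for the reverse strand); each value is a list
    of complete codons read from that offset.
    """
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames

frames = six_frames("ATGAAATAG")
```

A coding-potential program would then score each frame separately, which is why a report can show where along the genome one frame looks more gene-like than the other five.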
DNA Master is a free software tool that students can download on a Windows computer. It uses the programs GLIMMER, GeneMark, Aragorn, and tRNAscan-SE to auto-annotate a genome that is uploaded as a FASTA format file. [3] Since this is done by a computer algorithm that relies only on these programs, whose bundled versions may not be as up to date as the online versions, each suggested gene has to be confirmed by student annotation. The annotations go through several rounds of peer review before they are accepted for review by experts from PhagesDB; they can then be submitted to GenBank. [1]
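For readers unfamiliar with the FASTA format mentioned above, a record is a header line beginning with `>` followed by one or more lines of sequence. The following minimal Python sketch reads such text; it is not how DNA Master parses files, and the function name and the example record are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    Sequence lines are concatenated and upper-cased; blank lines are skipped.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records

# Hypothetical two-line record, used only to show the format.
example = ">ExamplePhage hypothetical record\nATGACC\nggttaa\n"
genome = parse_fasta(example)
```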
GLIMMER and GeneMark are used by DNA Master to predict the starts of genes by assessing the coding probability of open reading frames (ORFs) in all six reading frames together with the ribosome binding site (RBS) signals. [6] GLIMMER and GeneMark often agree on their predictions during auto-annotation, but sometimes they give different starts, which have to be assessed during manual annotation; GLIMMER is currently the more recently updated of the two, and its call is usually used for the final start coordinate.
Aragorn is an algorithm used by DNA Master, and there is an online version that can be used to cross-reference the calls made by the software. [3] It identifies tRNAs and tmRNAs within a genome by looking for very specific sequences that would fold into the distinctive cloverleaf secondary structure. [7] Although this algorithm is considered very accurate given how fast it produces results, it can miss some tRNAs that do not fall exactly within its search parameters. [7]
tRNAscan-SE allows students to identify possible coding regions for tRNAs that Aragorn would have missed, because it includes detection of unusual tRNA homologues, although both programs have sensitivities between 99% and 100%. [8] tRNAscan-SE does not detect tRNAs itself; instead, it outputs the combined results of three independent tRNA prediction programs: tRNAscan, EufindtRNA, and tRNA covariance model search. [8]
Phamerator shows a visual representation of the genes and their similarity to those of other selected phage genomes by marking them with colored rectangles based on the Phamily (Pham) into which each gene is grouped. [3] Students can then view, compare, save, and print color-coded genome maps during their annotations. [5] Possible insertions or deletions can be seen through connecting lines between the selected phage genomes. [9] The nucleotide and protein sequences can also be accessed through this program; however, the starts and stops do not always match those of DNA Master, so the sequences may be incorrect. [10]
These online programs are used to predict the functions of proteins by comparing amino acid or nucleotide sequences against those of all sequenced genomes, not just those of phages. HHPred detects homology between the sequence and other proteins, from any organism, whose functions have been called. [11] If the protein has been identified in another sequence, a computer-generated structure might also be provided to visualize the possible folding of the amino acids. [11]
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.
Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.
In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and were instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases. EST approaches have largely been superseded by whole genome and transcriptome sequencing and metagenome sequencing.
BioJava is an open-source software project dedicated to providing Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a huge range of data, from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks, such as parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.
BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In molecular genetics, an open reading frame (ORF) is the part of a reading frame that has the ability to be translated. An ORF is a continuous stretch of codons that may begin with a start codon and ends at a stop codon. An ATG codon within the ORF may indicate where translation starts. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation. In eukaryotic genes with multiple exons, introns are removed and exons are then joined together after transcription to yield the final mRNA for protein translation. In the context of gene finding, the start-stop definition of an ORF therefore only applies to spliced mRNAs, not genomic DNA, since introns may contain stop codons and/or cause shifts between reading frames. An alternative definition says that an ORF is a sequence that has a length divisible by three and is bounded by stop codons. This more general definition can also be useful in the context of transcriptomics and/or metagenomics, where start and/or stop codons may not be present in the obtained sequences. Such an ORF corresponds to parts of a gene rather than the complete gene.
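The start-stop definition of an ORF can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it scans only the forward strand, treats only ATG as a start codon by default, and reports 0-based, end-exclusive coordinates; the function name and parameters are hypothetical and do not belong to any tool described here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, start_codons=("ATG",), min_codons=2):
    """Find ORFs on the forward strand under the start-stop definition.

    An ORF runs from the first start codon seen in a frame to the next
    in-frame stop codon. Returns sorted (start, end) pairs, 0-based and
    end-exclusive, with the stop codon included in the span.
    `min_codons` counts every codon in the span, start and stop included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in start_codons:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume searching after this stop codon
    return sorted(orfs)

orfs = find_orfs("CCATGCCCTGA")
```

Here the only ORF lies in the third frame and spans `ATGCCCTGA`. Every reported span has a length divisible by three, as both definitions in the text require; scanning the reverse complement in the same way would cover the other three frames.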
In bioinformatics, GLIMMER is used to find genes in prokaryotic DNA. "It is effective at finding genes in bacteria, archaea, viruses, typically finding 98-99% of all relatively long protein coding genes". GLIMMER was the first system that used the interpolated Markov model to identify coding regions. The GLIMMER software is open source and is maintained by Steven Salzberg, Art Delcher, and their colleagues at the Center for Computational Biology at Johns Hopkins University. The original GLIMMER algorithms and software were designed by Art Delcher, Simon Kasif and Steven Salzberg and applied to bacterial genome annotation in collaboration with Owen White.
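To give a feel for how a Markov model can score a candidate coding region, the Python sketch below trains a fixed-order Markov chain on example sequences and computes a log-likelihood for a query. This is a deliberate simplification and not GLIMMER's method: an interpolated Markov model combines several context lengths, whereas this sketch uses a single order with add-one smoothing. All function names and the toy training data are hypothetical.

```python
import math
from collections import defaultdict

def train_markov(sequences, order=2):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_likelihood(seq, counts, order=2, alphabet="ACGT"):
    """Sum log-probabilities of each base given its preceding context.

    Add-one smoothing keeps unseen contexts and transitions from
    producing a probability of zero.
    """
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        seen = sum(context.values())
        p = (context.get(seq[i], 0) + 1) / (seen + len(alphabet))
        total += math.log(p)
    return total

# Toy training set standing in for known coding sequences.
model = train_markov(["ATGATGATGATG"])
score_similar = log_likelihood("ATGATG", model)
score_different = log_likelihood("CCCCCC", model)
```

A query resembling the training data receives the higher score. Gene finders typically compare such a score against that of a model of non-coding sequence rather than interpreting it on its own.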
Cis-regulatory elements (CREs) or Cis-regulatory modules (CRMs) are regions of non-coding DNA which regulate the transcription of neighboring genes. CREs are vital components of genetic regulatory networks, which in turn control morphogenesis, the development of anatomy, and other aspects of embryonic development, studied in evolutionary developmental biology.
Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster at performing mRNA/DNA alignments and ~50 times faster at protein/protein alignments.
The Viral Bioinformatics Resource Center (VBRC) is an online resource providing access to a database of curated viral genomes and a variety of tools for bioinformatic genome analysis. This resource was one of eight BRCs funded by NIAID with the goal of promoting research against emerging and re-emerging pathogens, particularly those seen as potential bioterrorism threats. The VBRC is now supported by Dr. Chris Upton at the University of Victoria.
DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. Genes in a eukaryotic genome can be annotated using FINDER.
BASys is a freely available web server that can be used to perform automated, comprehensive annotation of bacterial genomes. With the advent of next-generation DNA sequencing it is now possible to sequence the complete genome of a bacterium within a single day. This has led to an explosion in the number of fully sequenced microbes: as of 2013, there were more than 2700 fully sequenced bacterial genomes deposited with GenBank. However, a continuing challenge in microbial genomics is finding the resources or tools for annotating the large number of newly sequenced genomes. BASys was developed in 2005 in anticipation of these needs, and it was the world’s first publicly accessible microbial genome annotation web server. Because of its widespread popularity, the BASys server was updated in 2011 through the addition of multiple server nodes to handle the large number of queries it was receiving.
Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.
The Actinobacteriophage database, more commonly known as PhagesDB, is a database-backed website that gathers and shares information related to the discovery, characterization and genomics of viruses that prefer to infect Actinobacterial hosts. It is a bioinformatics tool that is used worldwide to compare multiple phages and their genomic annotations. To date, more than 8,000 bacteriophages, including over 1,600 with sequenced genomes, have been entered into the database. It adds to the wide range of existing bioinformatic tools, such as those of NCBI. It provides results for already sequenced phage genomes and aims to allow access to drafted phage genomes to provide a larger spectrum of information.
ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome. It can annotate the human genome builds hg18, hg19, and hg38, as well as the genomes of model organisms such as mouse, zebrafish, fruit fly, roundworm, yeast and many others. The annotations can be used to determine the functional consequences of the mutations on the genes and organisms, infer cytogenetic bands, report functional importance scores, and/or find variants in conserved regions. ANNOVAR, SNP effect (SnpEff) and Variant Effect Predictor (VEP) are three of the most commonly used variant annotation tools.