General feature format

Last updated
General feature format
Filename extensions
.gff, .gff3
Internet media type
text/gff3
Developed bySanger Centre (v2), Sequence Ontology Project (v3)
Type of format Bioinformatics
Extended from Tab-separated values
Open format?yes
Website github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

In bioinformatics, the general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.

Contents

GFF Versions

The following versions of GFF exist:

GFF2/GTF had a number of deficiencies, notably that it can only represent two-level feature hierarchies and thus cannot handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels, and gives specific meanings to certain tags in the attributes field.

The GTF is identical to GFF, version 2. [1]

GFF general structure

All GFF formats (GFF2, GFF3 and GTF) are tab delimited with 9 fields per line. They all share the same structure for the first 7 fields, while differing in the content and format of the ninth field. Some field names have been changed in GFF3 to avoid confusion. For example, the "seqid" field was formerly referred to as "sequence", which may be confused with a nucleotide or amino acid chain. The general structure is as follows:

General GFF3 structure
Position indexPosition nameDescription
1seqidThe name of the sequence where the feature is located.
2sourceThe algorithm or procedure that generated the feature. This is typically the name of a software or database.
3typeThe feature type name, like "gene" or "exon". In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4startGenomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.
5endGenomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.[ citation needed ]
6scoreNumeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7strandSingle character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8phasephase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). See the section below for a detailed explanation.
9attributesA list of tag-value pairs separated by a semicolon with additional information about the feature.

The 8th field: phase of CDS features

Simply put, CDS means "CoDing Sequence". The exact meaning of the term is defined by Sequence Ontology (SO). According to the GFF3 specification: [2] [3]

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.

Meta Directives

In GFF files, additional meta information can be included and follows after the ## directive. This meta information can detail GFF version, sequence region, or species (full list of meta data types can be found at Sequence Ontology specifications).

GFF software

Servers

Servers that generate this format:

ServerExample file
UniProt

Clients

Clients that use this format:

NameDescriptionLinks
GBrowseGMOD genome viewer GBrowse
IGBIntegrated Genome Browser Integrated Genome Browser
JalviewA multiple sequence alignment editor & viewer Jalview
STRAPUnderlining sequence features in multiple alignments. Example output:
JBrowseJBrowse is a fast, embeddable genome browser built completely with JavaScript and HTML5 JBrowse.org
ZENBUA collaborative, omics data integration and interactive visualization system

Validation

The modENCODE project hosts an online GFF3 validation tool with generous limits of 286.10 MB and 15 million lines.

The Genome Tools software collection contains a gff3validator tool that can be used offline to validate and possibly tidy GFF3 files. An online validation service is also available.

See also

Related Research Articles

The XML Metadata Interchange (XMI) is an Object Management Group (OMG) standard for exchanging metadata information via Extensible Markup Language (XML).

BioJava is an open-source software project dedicated to provide Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a huge range of data, starting from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks such as to parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.

<span class="mw-page-title-main">Biopython</span> Collection of open-source Python software tools for computational biology

The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers. It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology.

<span class="mw-page-title-main">Ensembl genome database project</span> Scientific project at the European Bioinformatics Institute

Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.

The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.

The completion of the human genome sequencing in the early 2000s was a turning point in genomics research. Scientists have conducted series of research into the activities of genes and the genome as a whole. The human genome contains around 3 billion base pairs nucleotide, and the huge quantity of data created necessitates the development of an accessible tool to explore and interpret this information in order to investigate the genetic basis of disease, evolution, and biological processes. The field of genomics has continued to grow, with new sequencing technologies and computational tool making it easier to study the genome.

<span class="mw-page-title-main">MagicDraw</span> Systems modelling software

MagicDraw is a proprietary visual UML, SysML, BPMN, and UPDM modeling tool with team collaboration support.

The Gene transfer format (GTF) is a file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF), but contains some additional conventions specific to gene information. A significant feature of the GTF that can be validated: given a sequence and a GTF file, one can check that the format is correct. This significantly reduces problems with the interchange of data between groups.

<span class="mw-page-title-main">Generic Model Organism Database</span>

The Generic Model Organism Database (GMOD) project provides biological research communities with a toolkit of open-source software components for visualizing, annotating, managing, and storing biological data. The GMOD project is funded by the United States National Institutes of Health, National Science Foundation and the USDA Agricultural Research Service.

FlyBase is an online bioinformatics database and the primary repository of genetic and molecular data for the insect family Drosophilidae. For the most extensively studied species and model organism, Drosophila melanogaster, a wide range of data are presented in different formats.

GENCODE is a scientific project in genome research and part of the ENCODE scale-up project.

<span class="mw-page-title-main">Integrated Genome Browser</span>

Integrated Genome Browser (IGB) is an open-source genome browser, a visualization tool used to observe biologically-interesting patterns in genomic data sets, including sequence data, gene models, alignments, and data from DNA microarrays.

GeneCards is a database of human genes that provides genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes. It is being developed and maintained by the Crown Human Genome Center at the Weizmann Institute of Science, in collaboration with LifeMap Sciences.

<span class="mw-page-title-main">DNA annotation</span> The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome.

Ensembl Genomes is a scientific project to provide genome-scale data from non-vertebrate species.

In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.

Echinobase is a Model Organism Database (MOD). It supports the international research community by providing a centralized, integrated web based resource to access the diverse and rich, functional genomics data of echinoderm evolution, development and gene regulatory networks.

ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.

The BED format is a text file format used to store genomic regions as coordinates and associated annotations. The data are presented in the form of columns separated by spaces or tabs. This format was developed during the Human Genome Project and then adopted by other sequencing projects. As a result of this increasingly wide use, this format had already become a de facto standard in bioinformatics before a formal specification was written.

References

  1. "GFF/GTF File Format". Ensembl . Archived from the original on 2022-06-15. Retrieved 2023-11-04.
  2. "GFF3 specification". GitHub . 2018-11-24. Archived from the original on 2023-07-04.
  3. "GFF3". GMOD. 2016-07-12. Archived from the original on 2023-08-25.