BED (file format)

Filename extension: .bed
Internet media type: text/x-bed
Type of format: Text file
Website: https://samtools.github.io/hts-specs/BEDv1.pdf

The BED (Browser Extensible Data) format is a text file format used to store genomic regions as coordinates and associated annotations. The data are presented in the form of columns separated by spaces or tabs. This format was developed during the Human Genome Project [1] and then adopted by other sequencing projects. As a result of this increasingly wide use, this format had already become a de facto standard in bioinformatics before a formal specification was written.

One of the advantages of this format is that it manipulates coordinates instead of nucleotide sequences, which reduces the computing power and time needed when comparing all or part of genomes. In addition, its simplicity makes it easy to parse and manipulate coordinates or annotations using text-processing and scripting languages such as Python, Ruby or Perl, or more specialized tools such as BEDTools.

History

The end of the 20th century saw the emergence of the first projects to sequence complete genomes. Among these projects, the Human Genome Project was the most ambitious at the time, aiming to sequence for the first time a genome of several gigabases. This required the sequencing centres to carry out major methodological development in order to automate the processing of sequences and their analyses. Thus, many formats were created, such as FASTQ, [2] GFF or BED. [1] However, no official specifications were published at the time, which affected some formats such as FASTQ when sequencing projects multiplied at the beginning of the 21st century.

The wide use of the BED format within genome browsers has kept its definition relatively stable, since many tools rely on the same description.

Format

Initially the BED format did not have any official specification. Instead, the description provided by the UCSC Genome Browser [3] has been widely used as a reference.

A formal BED specification [4] was published in 2021 [5] under the auspices of the Global Alliance for Genomics and Health.

Description

A BED file consists of a minimum of three columns to which nine optional columns can be added for a total of twelve columns. The first three columns contain the names of chromosomes or scaffolds, the start, and the end coordinates of the sequences considered. The next nine columns contain annotations related to these sequences. These columns must be separated by spaces or tabs, the latter being recommended for reasons of compatibility between programs. [6] Each row of a file must have the same number of columns. The order of the columns must be respected: if columns of high numbers are used, the columns of intermediate numbers must be filled in.
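The column rules above can be checked mechanically. The following is a minimal sketch, not part of any BED tool: the function name `check_column_counts` is hypothetical, and it assumes that no field contains embedded spaces, so that splitting on any whitespace is safe.

```python
def check_column_counts(lines):
    """Return the column count shared by all rows, or raise ValueError.

    Assumes fields contain no embedded whitespace (an assumption of this
    sketch), so rows may be separated by spaces or tabs.
    """
    expected = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        n_columns = len(line.split())  # splits on runs of spaces or tabs
        if not 3 <= n_columns <= 12:
            raise ValueError(
                f"line {number}: expected 3 to 12 columns, got {n_columns}"
            )
        if expected is None:
            expected = n_columns
        elif n_columns != expected:
            raise ValueError(
                f"line {number}: {n_columns} columns, but earlier rows had {expected}"
            )
    return expected
```

Called on the rows of a three-column file, the function returns 3; a file mixing three-column and four-column rows raises `ValueError`.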

Columns of BED files (columns 1 to 3 are obligatory)

| Column number | Title | Definition |
|---|---|---|
| 1 | chrom | Chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671) name |
| 2 | chromStart | Start coordinate on the chromosome or scaffold for the sequence considered (the first base on the chromosome is numbered 0, i.e. the number is zero-based) |
| 3 | chromEnd | End coordinate on the chromosome or scaffold for the sequence considered. This position is non-inclusive, unlike chromStart (the first base on the chromosome is numbered 1, i.e. the number is one-based) |
| 4 | name | Name of the line in the BED file |
| 5 | score | Score between 0 and 1000 |
| 6 | strand | DNA strand orientation (positive ["+"] or negative ["-"] or "." if no strand) |
| 7 | thickStart | Starting coordinate from which the annotation is displayed in a thicker way on a graphical representation (e.g. the start codon of a gene) |
| 8 | thickEnd | End coordinate from which the annotation is no longer displayed in a thicker way on a graphical representation (e.g. the stop codon of a gene) |
| 9 | itemRgb | RGB value in the form R,G,B (e.g. 255,0,0) determining the display color of the annotation contained in the BED file |
| 10 | blockCount | Number of blocks (e.g. exons) on the line of the BED file |
| 11 | blockSizes | List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the "blockCount") |
| 12 | blockStarts | List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the "blockCount") |
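As an illustration of the column definitions, the sketch below maps one tab-separated line onto the column titles and converts the block columns into absolute intervals. The helper names `parse_bed_line` and `block_intervals` and the example values are hypothetical; the code assumes tab-separated input and is not taken from any particular tool.

```python
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]
INT_COLUMNS = {"chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount"}


def parse_bed_line(line):
    """Map a tab-separated BED line onto the column titles (3 to 12 columns)."""
    values = line.rstrip("\n").split("\t")
    if not 3 <= len(values) <= 12:
        raise ValueError(f"expected 3 to 12 columns, got {len(values)}")
    record = {}
    for title, value in zip(BED_COLUMNS, values):
        record[title] = int(value) if title in INT_COLUMNS else value
    return record


def block_intervals(record):
    """Return absolute (start, end) pairs for the blocks of a 12-column record.

    blockStarts are relative to chromStart, so each one is offset by it.
    """
    # "if s" tolerates a trailing comma at the end of either list
    sizes = [int(s) for s in record["blockSizes"].split(",") if s]
    starts = [int(s) for s in record["blockStarts"].split(",") if s]
    if not len(sizes) == len(starts) == record["blockCount"]:
        raise ValueError("blockSizes and blockStarts must match blockCount")
    origin = record["chromStart"]
    return [(origin + start, origin + start + size) for start, size in zip(starts, sizes)]


# Hypothetical two-block record, made up for this example
line = "chr1\t1000\t5000\tgeneA\t0\t+\t1200\t4800\t0,0,0\t2\t500,1000\t0,3000"
record = parse_bed_line(line)
blocks = block_intervals(record)  # [(1000, 1500), (4000, 5000)]
```

The first block starts at chromStart and the last block ends at chromEnd, which is the usual layout for exon blocks.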

A BED file can optionally contain a header. However, there is no official description of the format of the header. It may contain one or more lines and be introduced by different words or symbols, [6] depending on whether its role is functional or simply descriptive. For instance, the header lines in the UCSC Genome Browser example in the Examples section begin with the words "browser" and "track".

Coordinate system

Unlike the coordinate system used by other standards such as GFF, the system used by the BED format is zero-based for the coordinate start and one-based for the coordinate end. [6] Thus, the nucleotide with the coordinate 1 in a genome will have a value of 0 in column 2 and a value of 1 in column 3.

A thousand-base BED interval with the following start and end:

chr7    0    1000

would convert to the following 1-based "human" genome coordinates, as used by a genome browser such as UCSC:

chr7    1    1000

This choice is justified by the method of calculating the lengths of the genomic regions considered, this calculation being a simple subtraction of the start coordinate (column 2) from the end coordinate (column 3): $x_{end} - x_{start}$. When the coordinate system is based on the use of 1 to designate the first position, the calculation becomes slightly more complex: $x_{end} - x_{start} + 1$. This slight difference can have a relatively large impact in terms of computation time when data sets with several thousand to hundreds of thousands of lines are used.

Alternatively, we can view both coordinates as zero-based, where the end position is non-inclusive. In other words, the zero-based end position denotes the index of the first position after the feature. For the example above, the zero-based end position of 1000 marks the first position after the feature including positions 0 through 999.
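The conversions described in this section can be sketched as follows. The function names are hypothetical and chosen for this illustration only; the arithmetic follows directly from the definitions above.

```python
def bed_to_one_based(chrom_start, chrom_end):
    """Convert a BED interval to 1-based, fully inclusive coordinates."""
    return chrom_start + 1, chrom_end


def one_based_to_bed(start, end):
    """Convert 1-based, fully inclusive coordinates to a BED interval."""
    return start - 1, end


def bed_length(chrom_start, chrom_end):
    """Length of a BED interval: end minus start, with no correction term."""
    return chrom_end - chrom_start


def one_based_length(start, end):
    """Length of a 1-based inclusive interval needs the extra +1."""
    return end - start + 1


# The thousand-base interval from the text: chr7 0 1000
start, end = bed_to_one_based(0, 1000)   # (1, 1000)
length_bed = bed_length(0, 1000)         # 1000
length_one = one_based_length(start, end)  # 1000
```

Both length functions agree on the same region; only the form of the calculation differs.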

Examples

Here is a minimal example:

chr7    127471196    127472363
chr7    127472363    127473530
chr7    127473530    127474697

Here is a typical example with nine columns from the UCSC Genome Browser. The first three lines are settings for the UCSC Genome Browser and are unrelated to the data specified in BED format:

browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7    127471196    127472363    Pos1    0    +    127471196    127472363    255,0,0
chr7    127472363    127473530    Pos2    0    +    127472363    127473530    255,0,0
chr7    127473530    127474697    Pos3    0    +    127473530    127474697    255,0,0
chr7    127474697    127475864    Pos4    0    +    127474697    127475864    255,0,0
chr7    127475864    127477031    Neg1    0    -    127475864    127477031    0,0,255
chr7    127477031    127478198    Neg2    0    -    127477031    127478198    0,0,255
chr7    127478198    127479365    Neg3    0    -    127478198    127479365    0,0,255
chr7    127479365    127480532    Pos5    0    +    127479365    127480532    255,0,0
chr7    127480532    127481699    Neg4    0    -    127480532    127481699    0,0,255
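A reader that wants only the data rows of such a file has to skip the settings lines first. The sketch below does this under stated assumptions: the function name `iter_data_rows` is hypothetical, lines whose first word is "browser" or "track" are treated as headers because the example above uses them, and treating lines starting with "#" as comments is an additional assumption of this sketch.

```python
HEADER_WORDS = {"browser", "track"}  # first words of the settings lines above


def iter_data_rows(lines):
    """Yield the fields of each data row, skipping header and blank lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):  # assumption: '#' marks a comment line
            continue
        fields = stripped.split()  # any run of spaces or tabs separates fields
        if fields[0] in HEADER_WORDS:
            continue
        yield fields


example = """browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7    127471196    127472363    Pos1    0    +    127471196    127472363    255,0,0
chr7    127475864    127477031    Neg1    0    -    127475864    127477031    0,0,255
"""
rows = list(iter_data_rows(example.splitlines()))
```

Here `rows` holds two lists of nine fields each, one per data row. Comparing the first word exactly, rather than testing a prefix, avoids skipping a sequence whose name merely starts with "track".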

File extension

There is currently no standard file extension for BED files, but the ".bed" extension is the most frequently used. The number of columns is sometimes noted in the file extension, for example: ".bed3", ".bed4", ".bed6", ".bed12". [7]

Usage

The use of BED files has spread rapidly with the emergence of new sequencing techniques and the manipulation of larger and larger sequence files. The comparison of genomic sequences or even entire genomes by comparing the sequences themselves can quickly require significant computational resources and become time-consuming. Handling BED files makes this work more efficient by using coordinates to extract sequences of interest from sequencing sets or to directly compare and manipulate two sets of coordinates.
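Comparing two sets of coordinates reduces to integer comparisons on interval endpoints. The following is a minimal sketch with hypothetical function names; it uses a simple all-against-all comparison for clarity, which is not how specialized tools are implemented and would be slow on large files.

```python
def overlap(a, b):
    """Return the shared part of two (chrom, start, end) intervals, or None.

    Intervals follow the BED convention: start is zero-based and the end is
    non-inclusive, so two intervals that merely touch do not overlap.
    """
    if a[0] != b[0]:
        return None
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    if start < end:
        return (a[0], start, end)
    return None


def intersect(set_a, set_b):
    """Collect every pairwise overlap between two lists of intervals."""
    hits = []
    for a in set_a:
        for b in set_b:
            shared = overlap(a, b)
            if shared is not None:
                hits.append(shared)
    return hits


set_a = [("chr7", 0, 1000), ("chr7", 1000, 2000)]
set_b = [("chr7", 500, 1500), ("chr8", 0, 1000)]
hits = intersect(set_a, set_b)  # [("chr7", 500, 1000), ("chr7", 1000, 1500)]
```

No nucleotide sequence is read at any point, which is the source of the efficiency gain described above.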

To perform these tasks, various programs can be used to manipulate BED files, including but not limited to BEDTools, [6] BEDOPS, [8] BedTk [9] and covtobed. [10]

.genome Files

BEDTools also uses .genome files to determine chromosomal boundaries and ensure that padding operations do not extend past chromosome boundaries. Genome files are formatted as shown below: a two-column tab-separated file with a one-line header.

chrom   size
chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
chr5    181538259
chr6    170805979
chr7    159345973
...
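The role of the chromosome sizes in padding can be sketched as follows. This is an illustration of the idea only, not BEDTools' implementation; the function names `read_genome_sizes` and `pad_interval` are hypothetical.

```python
def read_genome_sizes(lines):
    """Build a {chromosome: size} mapping from the lines of a genome file.

    Lines whose second field is not a number (such as the "chrom size"
    header) and lines without exactly two fields are skipped.
    """
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        sizes[fields[0]] = int(fields[1])
    return sizes


def pad_interval(chrom, start, end, padding, sizes):
    """Extend an interval on both sides, clamped to [0, chromosome size]."""
    new_start = max(0, start - padding)
    new_end = min(sizes[chrom], end + padding)
    return chrom, new_start, new_end


genome_lines = [
    "chrom\tsize",
    "chr1\t248956422",
    "chr7\t159345973",
]
sizes = read_genome_sizes(genome_lines)
padded = pad_interval("chr7", 100, 159345900, 500, sizes)
# padded == ("chr7", 0, 159345973): both ends were clamped
```

Without the size lookup, the padded end coordinate would be 159346400, which lies past the end of chr7.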

References

  1. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002). "The human genome browser at UCSC". Genome Research. 12 (6): 996–1006. doi:10.1101/gr.229102. ISSN 1088-9051. PMC 186604. PMID 12045153.
  2. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research. 38 (6): 1767–71. doi:10.1093/nar/gkp1137. ISSN 1362-4962. PMC 2847217. PMID 20015970.
  3. "Frequently Asked Questions: Data File Formats. BED format". UCSC Genome Browser. University of California Santa Cruz Genomics Institute. Retrieved 2 October 2019.
  4. "The Browser Extensible Data (BED) format" (PDF). samtools.github.io.
  5. "GA4GH BED v1.0: A formal standard sets ground rules for genomic features". www.ga4gh.org. 2022-03-30.
  6. Quinlan AR, Hall IM (21 September 2010). The BEDTools manual (PDF). Retrieved 3 October 2019.
  7. "Datatypes". Galaxy Community Hub. Retrieved 3 October 2019.
  8. Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, Sandstrom R, Humbert R, Stamatoyannopoulos JA (15 July 2012). "BEDOPS: high-performance genomic feature operations". Bioinformatics. 28 (14): 1919–20. doi:10.1093/bioinformatics/bts277. PMC 3389768. PMID 22576172.
  9. Li, Heng. "BedTk". GitHub. Retrieved 22 July 2020.
  10. Birolo G, Telatin A (6 March 2020). "covtobed: a simple and fast tool to extract coverage tracks from BAM files". Journal of Open Source Software. 5 (47): 2119. Bibcode:2020JOSS....5.2119B. doi:10.21105/joss.02119.