UGENE

Original author(s) Fursov M.
Developer(s) Unipro
Initial release 2008
Stable release 49 / 8 November 2023
Written in C++, Qt
Operating system Windows, macOS, Linux
Available in English, Russian
Type Bioinformatics toolkit
License GPLv2
Website ugene.net

UGENE is computer software for bioinformatics. [1] [2] It runs on personal computer operating systems such as Windows, macOS, and Linux. It is released as free and open-source software under the GNU General Public License (GPL) version 2.

UGENE helps biologists analyze various kinds of biological and genetic data, such as sequences, annotations, multiple alignments, phylogenetic trees, NGS assemblies, and others. The data can be stored both locally (on a personal computer) and in shared storage (e.g., a lab database).

UGENE integrates dozens of well-known biological tools, algorithms, and original tools in the context of genomics, evolutionary biology, virology, and other branches of life science. UGENE provides a graphical user interface (GUI) for the pre-built tools so biologists with no computer programming skills can access those tools more easily.

Using UGENE Workflow Designer, it is possible to streamline a multi-step analysis. The workflow consists of blocks such as data readers, blocks executing embedded tools and algorithms, and data writers. Blocks can be created with command line tools or a script. A set of sample workflows is available in the Workflow Designer, to annotate sequences, convert data formats, analyze NGS data, etc.
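The reader, processing, and writer structure described above can be illustrated with a minimal Python sketch. This is not UGENE's API or its workflow file format. The function names (`read_fasta`, `reverse_complement`, `write_fasta`, `run_workflow`) are hypothetical and only show how such blocks might be chained.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A sequence record as (name, sequence); a hypothetical type for illustration.
Record = Tuple[str, str]


def read_fasta(text: str) -> Iterator[Record]:
    """Reader block: yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reverse_complement(records: Iterable[Record]) -> Iterator[Record]:
    """Processing block: reverse-complement each DNA sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    for name, seq in records:
        yield name, seq.translate(table)[::-1]


def write_fasta(records: Iterable[Record]) -> str:
    """Writer block: serialize records back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def run_workflow(source: str, steps: List[Callable]) -> str:
    """Chain the blocks: reader -> processing steps in order -> writer."""
    data = read_fasta(source)
    for step in steps:
        data = step(data)
    return write_fasta(data)


if __name__ == "__main__":
    print(run_workflow(">s1\nAACG\n", [reverse_complement]), end="")
```

Because each block only consumes and produces a stream of records, further steps can be added to the list without changing the reader or the writer.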

Besides the graphical interface, UGENE also has a command-line interface, which can be used to run workflows.

To improve performance, UGENE takes advantage of multi-core processors (CPUs) and graphics processing units (GPUs) in some of its algorithms. [3] [4]

Key features

The software supports the following features:

Sequence View

The Sequence View is used to visualize, analyze and modify nucleic acid or protein sequences. Depending on the sequence type and the options selected, the following views can be present in the Sequence View window:

Alignment Editor

The Alignment Editor allows working with multiple nucleic acid or protein sequences: aligning them, editing the alignment, analyzing it, storing the consensus sequence, building a phylogenetic tree, and so on.

Phylogenetic Tree Viewer

The Phylogenetic Tree Viewer helps to visualize and edit phylogenetic trees. It is possible to synchronize a tree with the corresponding multiple alignment used to build the tree.

Assembly Browser

The Assembly Browser project was started in 2010 as an entry for the Illumina iDEA Challenge 2011. [19] The browser allows users to visualize and browse large next-generation sequencing assemblies of up to hundreds of millions of short reads. It supports the SAM, [20] BAM (the binary version of SAM), and ACE formats. Before assembly data can be browsed in UGENE, the input file is automatically converted to a UGENE database file. This approach involves a trade-off: it allows viewing the whole assembly, navigating within it, and jumping to well-covered regions rapidly, but the conversion may take time for a large file and requires enough disk space to store the database.
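The following Python sketch illustrates why converting an assembly into an indexed database can make region navigation fast. It is only an illustration: the structure of UGENE's own database file is not described here, and the table layout, the function names (`build_read_index`, `reads_in_region`), and the use of SQLite are assumptions. Read end positions are approximated from read length, ignoring the CIGAR string.

```python
import sqlite3


def build_read_index(sam_lines, db_path=":memory:"):
    """Load mapped reads from SAM text lines into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE reads ("
        "name TEXT, ref TEXT, start_pos INTEGER, end_pos INTEGER, seq TEXT)"
    )
    rows = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM records have 11 mandatory fields
            continue
        # Mandatory SAM columns: QNAME (0), RNAME (2), POS (3, 1-based), SEQ (9)
        name, ref, pos, seq = fields[0], fields[2], int(fields[3]), fields[9]
        if ref == "*" or pos == 0:  # unmapped read
            continue
        # Simplification: span on the reference is taken as the read length.
        rows.append((name, ref, pos, pos + len(seq) - 1, seq))
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?)", rows)
    # The one-time cost of building this index is what makes later lookups fast.
    conn.execute("CREATE INDEX idx_reads_ref_start ON reads (ref, start_pos)")
    conn.commit()
    return conn


def reads_in_region(conn, ref, start, end):
    """Return names of reads overlapping the 1-based inclusive region."""
    cur = conn.execute(
        "SELECT name FROM reads "
        "WHERE ref = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos, name",
        (ref, end, start),
    )
    return [row[0] for row in cur]
```

The sketch shows the same trade-off on a small scale: loading and indexing take time and storage up front, after which a query for any region does not require rescanning the original file.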

Workflow Designer

UGENE Workflow Designer allows creating and running complex computational workflow schemas. [21]

The distinguishing feature of Workflow Designer, relative to other bioinformatics workflow management systems, is that workflows are executed on a local computer. This avoids the data transfer issues that can arise with tools that rely on remote file storage and internet connectivity.

The elements that make up a workflow correspond to the bulk of the algorithms integrated into UGENE. Workflow Designer also allows creating custom workflow elements, which can be based on a command-line tool or a script.

Workflows are stored in a special text format, which allows them to be reused and transferred between users.

A workflow can be run using the graphical interface or launched from the command line. The graphical interface also allows controlling the workflow execution, storing the parameters, and so on.

There is an embedded library of workflow samples to convert, filter, and annotate data, with several pipelines to analyze NGS data developed in collaboration with NIH NIAID. [22] A wizard is available for each workflow sample.

Supported biological data formats

Release cycle

UGENE is primarily developed by Unipro LLC [23] with headquarters in Akademgorodok of Novosibirsk, Russia. Each iteration lasts about 1–2 months, followed by a new release. Development snapshots may also be downloaded.

The features included in each release are mostly requested by users.

See also

References

  1. Okonechnikov K, Golosova O, Fursov M, the UGENE team (2012). "Unipro UGENE: a unified bioinformatics toolkit". Bioinformatics. 28 (8): 1166–7. doi:10.1093/bioinformatics/bts091. PMID 22368248.
  2. Fursov, M.; Novikova, O. (2008). "Multitasking software system for DNA analysis" (PDF). Proceedings of the Sixth International Conference on Bioinformatics of Genome Regulation and Structure. 1: 78. ISBN 978-5-91291-005-0.
  3. Fursov, M. Y.; Oshchepkov, D. Y.; Novikova, O. S. (2009). "UGENE: interactive computational schemes for genome analysis" (PDF). Proceedings of the Fifth Moscow International Congress on Biotechnology. 3: 14–15. ISBN 978-5-7237-0372-8.
  4. Efremov, I. E.; Fursov, M. Y.; Danilova, Yu. E. (2009). "UGENE: high performance genome analysis suite". Proceedings of the Fifth Moscow International Congress on Biotechnology. 2: 405–406. ISBN 978-5-7237-0372-8.
  5. "NEW REBASE HOME". rebase.neb.com. Retrieved 18 October 2019.
  6. "Primer3 Input (version 0.4.0)". bioinfo.ut.ee. Retrieved 18 October 2019.
  7. "Burrows–Wheeler Aligner". bio-bwa.sourceforge.net. Retrieved 18 October 2019.
  8. "SAMtools". samtools.sourceforge.net. Retrieved 18 October 2019.
  9. "TopHat". ccb.jhu.edu. Retrieved 18 October 2019.
  10. "IU Webmaster redirect". cufflinks.cbcb.umd.edu. Retrieved 18 October 2019.
  11. "MACS - Model-based Analysis for ChIP-Seq". liulab.dfci.harvard.edu. Retrieved 18 October 2019.
  12. "CEAS - Cis-regulatory Element Annotation System". liulab.dfci.harvard.edu. Retrieved 18 October 2019.
  13. "MrBayes | index". nbisweden.github.io. Retrieved 18 October 2019.
  14. "ATGC: PhyML". atgc.lirmm.fr. Retrieved 18 October 2019.
  15. CAP3
  16. "Macromolecular Structures Resource Group". www.ncbi.nlm.nih.gov. Retrieved 18 October 2019.
  17. "Spidey is superceded[sic] by Splign". www.ncbi.nlm.nih.gov. Retrieved 18 October 2019.
  18. Vaskin, Y.; Khomicheva, I.; Ignatieva, E.; Vityaev, E. (2012). "ExpertDiscovery and UGENE integrated system for intelligent analysis of regulatory regions of genes". In Silico Biology. 11 (3–4): 97–108. doi:10.3233/ISB-2012-0448. PMID 22935964.
  19. "Illumina - iDEA Challenge". Archived from the original on 2013-01-26. Retrieved 18 October 2019.
  20. "SAM" (PDF). Retrieved 18 October 2019.
  21. Fursov, M. Y.; Varlamov, A. (2009). "UGENE - A practical approach for complex computational analysis in molecular biology" (PDF). Proceedings of the 10th Annual Bioinformatics Open Source Conference: 7.
  22. "NIH: National Institute of Allergy and Infectious Diseases | Leading research to understand, treat, and prevent infectious, immunologic, and allergic diseases". www.niaid.nih.gov. Retrieved 18 October 2019.
  23. "УНИПРО, Новосибирский центр информационных технологий. | СОФТ. Разработка, тестирование, реинжиниринг, поддержка ПО" ["UNIPRO, Novosibirsk center of information technologies. | SOFT. Development, testing, reengineering, software support"]. Retrieved 18 October 2019.