Phylo (video game)

Last updated
Phylo
Developer(s) McGill University
McGill Centre for Bioinformatics
Initial release2010
Engine
  • Adobe Flash
OOjs UI icon edit-ltr-progressive.svg
Platform Unity (web browser, Android, iOS)
Available inEnglish, French
Type Video game
Website phylo.cs.mcgill.ca [1]

Phylo is an experimental video game about multiple sequence alignment optimisation. [1] Developed by the McGill Centre for Bioinformatics, it was originally released as a free Flash game in November 2010. Designed as a game with a purpose, players solve pattern-matching puzzles that represent nucleotide sequences of different phylogenetic taxa to optimize alignments over a computer algorithm. By aligning together each nucleotide sequence, represented as differently coloured blocks, players attempt to create the highest point value score for each set of sequences by matching as many colours as possible and minimizing gaps.

Contents

The nucleotide sequences generated by Phylo are obtained from actual sequence data from the UCSC Genome Browser. High-scoring player alignments are collected as data and sent back to the McGill Centre for Bioinformatics to be further evaluated with a stronger scoring algorithm. Those player alignments that score higher than the current computer-generated score will be re-introduced into the global alignment as an optimization.

Background

The goal of multiple sequence alignments in phylogenetics is to determine the most likely nucleotide sequence of each species by comparing the sequences of children species with those of a most recent common ancestor. Producing such an optimal multiple sequence alignment is usually determined with a dynamic programming algorithm that finds the most probable evolutionary outcome by minimizing the number of mutations required. These algorithms generate phylogenetic trees for each nucleotide in a sequence for each species, and determine the genetic sequence for a common ancestor by comparing the trees of the child species. The algorithms then score and sort the completed phylogenetic tree, and the alignment with the maximum parsimony score is determined to be the optimal, and thus most evolutionarily likely, multiple sequence alignment. However, finding such an optimal alignment for a large number of sequences has been determined to be an NP-complete problem.

Phylo uses human-based computation to create an interactive genetic algorithm to solve the multiple sequence alignment problem instead. Generation of the ancestral sequences and parsimony scoring is still computed using a variation of the Fitch–Margoliash method, but Phylo abstracts the genetic sequences obtained from the UCSC Genome Browser into a pattern-matching game, allowing human players to suggest the most likely alignment rather than algorithmically considering all possible trees.

Gameplay

A screenshot of Phylo, with eight sequence alignments to be matched and scored Phylo screenshot.png
A screenshot of Phylo, with eight sequence alignments to be matched and scored

Each puzzle in Phylo is categorized based on the number of total sequence fragments to be aligned and a disease that is associated with that fragment in humans. Once a puzzle is chosen, a few of the genetic sequence fragments for each species to be aligned, represented as coloured blocks, are each placed on a single row of a grid. Each nucleotide of a genetic sequence fragment is free to move along the grid. Players can then adjust the sequences as necessary in order to create the largest number of colour matches in each column between them, while minimizing the number of the gaps that appear.

Scoring of the sequence alignment is done by comparing each of the player-aligned sequences with an algorithm-determined ancestral sequence generated at each node. A colour match yields +1 to the score, a mismatch yields -1, an opening of a gap yields -5, and an extension of any existing gap yields -1. The sum of all comparisons is then determined every several seconds, which provides the final score for that player's alignment. For each puzzle, only a few sequences are initially available at the beginning of the game. A computer-determined par score must be beaten by the player before moving on to the next round and unlocking more sequences to match. A player wins and is allowed to submit their sequence alignment to the database by matching or surpassing the final par score generated by the computer for each puzzle.

Levels

As of May 2019 (v 3.1.5), Phylo comes in three game modes:

Results

Compared to the computer output, players were able to improve 70% of the alignments. [1] In 2013, Phylo developers built a webserver called Open-Phylo (now defunct) that allows researchers to upload their own sets of sequences for players to align. Compared to computer alignments, expert players were able to make mostly small improvements over what sequence alignment algorithms could do. There were also some minor cases of significantly better alignments proposed by humans. [3] An 2017 report on five years of historical Phylo data reaches a similar conclusion. [4]

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

Molecular phylogenetics is the branch of phylogeny that analyzes genetic, hereditary molecular differences, predominantly in DNA sequences, to gain information on an organism's evolutionary relationships. From these analyses, it is possible to determine the processes by which diversity among species has been achieved. The result of a molecular phylogenetic analysis is expressed in a phylogenetic tree. Molecular phylogenetics is one aspect of molecular systematics, a broader term that also includes the use of molecular data in taxonomy and biogeography.

Protein engineering is the process of developing useful or valuable proteins through the design and production of unnatural polypeptides, often by altering amino acid sequences found in nature. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It has been used to improve the function of many enzymes for industrial catalysis. It is also a product and services market, with an estimated value of $168 billion by 2017.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an N-glycosylation site motif can be defined as Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro residue.

<span class="mw-page-title-main">Needleman–Wunsch algorithm</span> Method for aligning biological sequences

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

<span class="mw-page-title-main">Clustal</span>

Clustal is a series of widely used computer programs used in bioinformatics for multiple sequence alignment. There have been many versions of Clustal over the development of the algorithm that are listed below. The analysis of each tool and its algorithm is also detailed in their respective categories. Available operating systems listed in the sidebar are a combination of the software availability and may not be supported for every current version of the Clustal tools. Clustal Omega has the widest variety of operating systems out of all the Clustal tools.

In computational phylogenetics, tree alignment is a computational problem concerned with producing multiple sequence alignments, or alignments of three or more sequences of DNA, RNA, or protein. Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa. The edit distances between sequences are calculated for each of the tree's internal vertices, such that the sum of all edit distances within the tree is minimized. Tree alignment can be accomplished using one of several algorithms with various trade-offs between manageable tree size and computational effort.

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

Nucleic acid structure prediction is a computational method to determine secondary and tertiary nucleic acid structure from its sequence. Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster with performing mRNA/DNA alignments and ~50 times faster with protein/protein alignments.

A human-based computation game or game with a purpose (GWAP) is a human-based computation technique of outsourcing steps within a computational process to humans in an entertaining way (gamification).

Phyloscan is a web service for DNA sequence analysis that is free and open to all users. For locating matches to a user-specified sequence motif for a regulatory binding site, Phyloscan provides a statistically sensitive scan of user-supplied mixed aligned and unaligned DNA sequence data. Phyloscan's strength is that it brings together

<span class="mw-page-title-main">Molecular Evolutionary Genetics Analysis</span> Software for statistical analysis of molecular evolution

Molecular Evolutionary Genetics Analysis (MEGA) is computer software for conducting statistical analysis of molecular evolution and for constructing phylogenetic trees. It includes many sophisticated methods and tools for phylogenomics and phylomedicine. It is licensed as proprietary freeware. The project for developing this software was initiated by the leadership of Masatoshi Nei in his laboratory at the Pennsylvania State University in collaboration with his graduate student Sudhir Kumar and postdoctoral fellow Koichiro Tamura. Nei wrote a monograph (pp. 130) outlining the scope of the software and presenting new statistical methods that were included in MEGA. The entire set of computer programs was written by Kumar and Tamura. The personal computers then lacked the ability to send the monograph and software electronically, so they were delivered by postal mail. From the start, MEGA was intended to be easy-to-use and include solid statistical methods only.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

References

  1. 1 2 3 Kawrykow, A.; Roumanis, G.; Kam, A.; Kwak, D.; Leung, C.; Wu, C.; Zarour, E.; Phylo, L.; Sarmenta, M.; Blanchette, M.; Waldispühl, J.; Phylo Players (2012). Michalak, Pawel (ed.). "Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment". PLOS ONE. 7 (3): e31362. Bibcode:2012PLoSO...731362K. doi: 10.1371/journal.pone.0031362 . PMC   3296692 . PMID   22412834.
  2. Waldispühl, J; Kam, A; Gardner, PP (2015). "Crowdsourcing RNA structural alignments with an online computer game" (PDF). Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing: 330–41. PMID   25592593.
  3. Kwak, D; Kam, A; Becerra, D; Zhou, Q; Hops, A; Zarour, E; Kam, A; Sarmenta, L; Blanchette, M; Waldispühl, J (2013). "Open-Phylo: a customizable crowd-computing platform for multiple sequence alignment". Genome Biology. 14 (10): R116. doi: 10.1186/gb-2013-14-10-r116 . PMC   4014878 . PMID   24148814.
  4. Waldispühl, Jérôme; Blanchette, Mathieu; Ahsan, Faizy; Singh, Akash (21 September 2017). "Lessons from an Online Massive Genomics Computer Game". Fifth AAAI Conference on Human Computation and Crowdsourcing.