Segregating site


Segregating sites are positions in a sequence alignment that show differences (polymorphisms) between related genes; that is, positions that are not conserved. [1] Segregating sites include conservative, semi-conservative and non-conservative mutations.
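As a rough illustration of this definition, the Python sketch below scans the columns of a toy alignment and reports those containing more than one distinct residue. The function name `segregating_sites`, the toy sequences, and the choice to ignore gap characters (`-`) are assumptions made for this example, not part of any standard tool.

```python
def segregating_sites(alignment):
    """Return 0-based column indices at which the aligned sequences differ."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col in range(length):
        # Distinct residues in this column; gaps are ignored (an assumption).
        residues = {seq[col] for seq in alignment} - {"-"}
        if len(residues) > 1:
            sites.append(col)
    return sites


# Hypothetical toy alignment of four sequences, eight columns each.
alignment = [
    "ATGCTAGC",
    "ATGATAGC",
    "ATGCTAGT",
    "ATGCTAGC",
]
sites = segregating_sites(alignment)
print(sites)                           # columns that vary
print(len(sites) / len(alignment[0]))  # proportion of segregating sites
```

Whether a column containing gaps should count as segregating depends on the analysis; here gaps are simply skipped.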

Polymorphism (biology) Occurrence of two or more clearly different morphs or forms in the population of a species

Polymorphism in biology and zoology is the occurrence of two or more clearly different morphs or forms, also referred to as alternative phenotypes, in the population of a species. To be classified as such, morphs must occupy the same habitat at the same time and belong to a panmictic population.

Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the edit distance cost between strings in a natural language or in financial data.

Conserved sequence Similar DNA, RNA or protein sequences within genomes or among species

In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.

The proportion of segregating sites within a gene is an important statistic in population genetics, since it can be used to estimate the mutation rate under the assumption of no selection. For example, it is used to calculate Tajima's D, a test statistic for neutral evolution.

Population genetics Study of genetic differences within and between populations including the study of adaptation, speciation, and population structure

Population genetics is a subfield of genetics that deals with genetic differences within and between populations, and is a part of evolutionary biology. Studies in this branch of biology examine such phenomena as adaptation, speciation, and population structure.

Mutation rate A measure of the rate at which mutations occur during some unit of time

In genetics, the mutation rate is the frequency of new mutations in a single gene or organism over time. Mutation rates are not constant and are not limited to a single type of mutation; there are many different types of mutations, and rates are given for specific classes of mutations. Point mutations are a class of mutations that change a single base. Missense and nonsense mutations are two subtypes of point mutations. The rate of these types of substitutions can be further subdivided into a mutation spectrum, which describes the influence of the genetic context on the mutation rate.

Natural selection Mechanism of evolution by differential survival and reproduction of individuals

Natural selection is the differential survival and reproduction of individuals due to differences in phenotype. It is a key mechanism of evolution, the change in the heritable traits characteristic of a population over generations. Charles Darwin popularised the term "natural selection", contrasting it with artificial selection, which in his view is intentional, whereas natural selection is not.

A sequence alignment, produced by ClustalO, of mammalian histone proteins.
Sequences are the amino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).
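The key described in the caption can be sketched in Python as follows. The residue groups below are the "strong" and "weak" groups as commonly listed for Clustal's consensus line; they are reproduced here from memory as an assumption and should be checked against the Clustal documentation. The function names and the rule that any column containing a gap receives a blank are likewise choices made for this example.

```python
# Residue groups assumed for the ":" (strong) and "." (weak) symbols.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # assumption: columns with gaps are left unmarked
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # conservative mutation
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # semi-conservative mutation
    return " "      # non-conservative mutation


def consensus_line(alignment):
    """Build the symbol line shown beneath an aligned block of sequences."""
    return "".join(column_symbol(col) for col in zip(*alignment))


# Hypothetical toy protein alignment (three sequences, five columns).
toy = ["MKSCA", "MRTSC", "MKAAW"]
print(repr(consensus_line(toy)))
```

Every column marked with anything other than `*` is a segregating site in the sense used in this article.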

See also

An ultra-conserved element (UCE) is a region of DNA that is identical in at least two different species. One of the first studies of UCEs showed that certain human DNA sequences of length 200 nucleotides or greater were entirely conserved in humans, rats, and mice. Despite often being noncoding DNA, some ultra-conserved elements have been found to be transcriptionally active, giving non-coding RNA molecules.

Related Research Articles

Protein engineering is the process of developing useful or valuable proteins. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It is also a product and services market, with an estimated value of $168 billion by 2017.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others. Since the development of methods of high-throughput production of gene and protein sequences, the rate of addition of new sequences to the databases increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of an organism from which the new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by the study of the similarities between the compared sequences. Nowadays, there are many tools and techniques that provide the sequence comparisons and analyze the alignment product to understand its biology.

Nucleic acid sequence A succession of nucleotides in a nucleic acid

A nucleic acid sequence is a succession of letters that indicate the order of nucleotides forming alleles within a DNA (using GACT) or RNA (GACU) molecule. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, the sense strand is used. Because nucleic acids are normally linear (unbranched) polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure.

Protein family certain functional class or family of proteins

A protein family is a group of evolutionarily-related proteins. In many cases a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term protein family should not be confused with family as it is used in taxonomy.

In molecular biology and bioinformatics, the consensus sequence is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated. Such information is important when considering sequence-dependent enzymes such as RNA polymerase.
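A minimal Python sketch of this calculation, assuming an ungapped toy alignment and breaking ties in favour of the residue seen first in the column, might look like the following. The function name `consensus` and the example sequences are illustrative only.

```python
from collections import Counter


def consensus(alignment):
    """Return the most frequent residue at each alignment column."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        # most_common(1) gives the top residue; equal counts keep the
        # order of first appearance in the column.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Hypothetical toy alignment of four short DNA motifs.
motifs = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
print(consensus(motifs))
```

Columns where the consensus residue is not shared by every sequence are exactly the segregating sites of the alignment.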

Clustal software for multiple sequence alignment

Clustal is a series of widely used computer programs for multiple sequence alignment in bioinformatics. Many versions of Clustal have been released over the development of the algorithm. The operating systems on which Clustal is available vary by tool, and not every current version supports all of them. Clustal Omega supports the widest variety of operating systems of all the Clustal tools.

Regulator gene gene involved in controlling the expression of one or more other genes

A regulator gene, regulator, or regulatory gene is a gene involved in controlling the expression of one or more other genes. Regulatory sequences, which encode regulatory genes, are often 5' to the start site of transcription of the gene they regulate. In addition, these sequences can also be found 3' to the transcription start site. In both cases, whether the regulatory sequence occurs before (5') or after (3') the gene it regulates, the sequence is often many kilobases away from the transcription start site. A regulator gene may encode a protein, or it may work at the level of RNA, as in the case of genes encoding microRNAs. An example of a regulator gene is a gene that codes for a repressor protein that inhibits the activity of an operator gene.

Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species and the relationships between specific genes shared by many types of organisms. Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed. The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.

Multiple sequence alignment

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is computer software for multiple sequence alignment of protein and nucleotide sequences. It is licensed as public domain. The method was published by Robert C. Edgar in two papers in 2004. The first paper, published in Nucleic Acids Research, introduced the sequence alignment algorithm. The second paper, published in BMC Bioinformatics, presented more technical details.

In population genetics, the Watterson estimator is a method for describing the genetic diversity in a population. It was developed by Margaret Wu and G. A. Watterson in the 1970s. It is estimated by counting the number of polymorphic sites. It is a measure of the "population mutation rate" from the observed nucleotide diversity of a population: $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the per-generation mutation rate of the population of interest. The assumptions made are that there is a sample of $n$ haploid individuals from the population of interest, that there are infinitely many sites capable of varying, and that $n \ll N_e$. Because the number of segregating sites counted will increase with the number of sequences looked at, the correction factor $a_n = \sum_{i=1}^{n-1} 1/i$ is used.
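As a sketch of how the estimate is obtained in practice: with S segregating sites observed in a sample of n sequences, the Watterson estimate is S divided by the harmonic correction factor 1 + 1/2 + ... + 1/(n-1). The Python below implements that arithmetic; the function names and the example numbers are hypothetical.

```python
def harmonic_correction(n):
    """Return the correction factor 1 + 1/2 + ... + 1/(n - 1)."""
    if n < 2:
        raise ValueError("at least two sequences are required")
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(num_segregating_sites, n):
    """Estimate theta from the segregating-site count and sample size."""
    return num_segregating_sites / harmonic_correction(n)


# Hypothetical sample: 11 segregating sites observed in 4 sequences.
print(watterson_theta(11, 4))
```

Dividing the result by the alignment length gives a per-site estimate, which is often the more convenient quantity when comparing genes of different lengths.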

Tajima's D is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima. Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.
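The computation can be sketched in Python as below, using the normalising constants from Tajima's 1989 formulation. The function name `tajimas_d`, the toy alignment, the requirement of at least four sequences, and the decision to return NaN when no sites segregate are assumptions of this example; gaps and missing data are not handled.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(alignment):
    """Compute Tajima's D for an ungapped alignment of equal-length sequences."""
    n = len(alignment)
    if n < 4:
        raise ValueError("this sketch requires at least four sequences")

    # S: number of columns with more than one distinct residue.
    s = sum(1 for col in zip(*alignment) if len(set(col)) > 1)
    if s == 0:
        return float("nan")  # D is undefined without segregating sites

    # pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(alignment, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants that scale the two diversity measures and their variance.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignment: four sequences, two segregating sites.
sample = ["ATGCTAGC", "ATGATAGC", "ATGCTAGT", "ATGCTAGC"]
print(tajimas_d(sample))
```

In this toy sample both variable sites are singletons, so the pairwise measure falls below the segregating-site measure and D comes out negative.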

A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred. Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent. Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolases superfamilies based on the MEROPS and CAZy classification systems.

A conservative replacement is an amino acid replacement that changes a given amino acid to a different amino acid with similar biochemical properties.

Desmond G. Higgins Professor of Bioinformatics at University College Dublin

Desmond Gerard Higgins is a Professor of Bioinformatics at University College Dublin, widely known for CLUSTAL, a series of computer programs for performing multiple sequence alignment. According to Nature, Higgins' papers describing CLUSTAL are among the top ten most highly cited scientific papers of all time.

The Infinite sites model (ISM) is a mathematical model of molecular evolution first proposed by Motoo Kimura in 1969. Like other mutation models, the ISM provides a basis for understanding how mutation develops new alleles in DNA sequences. Using allele frequencies, it allows for the calculation of heterozygosity, or genetic diversity, in a finite population and for the estimation of genetic distances between populations of interest.

DNADynamo is a commercial DNA sequence analysis software package produced by Blue Tractor Software Ltd that runs on Microsoft Windows, Mac OS X and Linux. It is used by molecular biologists to analyze DNA and protein sequences. A free demo is available from the software developer's website.

Zinc Finger Protein 800 or ZNF800 is a protein that in humans is encoded by the ZNF800 gene. The specific function of ZNF800 is not yet well understood by the scientific community.

References

  1. Fu, YX (Oct 1995). "Statistical properties of segregating sites". Theoretical Population Biology. 48 (2): 172–97. doi:10.1006/tpbi.1995.1025. PMID 7482370.
  2. "Clustal FAQ #Symbols". Clustal. Retrieved 8 December 2014.