Gene family

Phylogenetic tree of the Mup gene family

A gene family is a set of several similar genes, formed by duplication of a single original gene, and generally with similar biochemical functions. One such family is the set of genes for human hemoglobin subunits; the ten genes lie in two clusters on different chromosomes, called the α-globin and β-globin loci. These two gene clusters are thought to have arisen from the duplication of a precursor gene approximately 500 million years ago. [1]


Genes are categorized into families based on shared nucleotide or protein sequences. Phylogenetic techniques can be used as a more rigorous test. The positions of exons within the coding sequence can be used to infer common ancestry. When the sequence of the protein encoded by a gene is known, researchers can compare protein sequences directly, which can provide more information than similarities or differences among DNA sequences.
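As a rough illustration of grouping genes by shared sequence, the sketch below computes the fraction of identical positions between pre-aligned sequences and links genes whose identity meets a cutoff. The function names, the toy sequences, and the 0.7 threshold are all assumptions made for illustration; real family assignment relies on proper alignment and phylogenetic methods rather than a single cutoff.

```python
from itertools import combinations

def percent_identity(a, b):
    """Fraction of identical, non-gap positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    # A column counts as a match only if both residues agree and are not gaps.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)

def group_into_families(seqs, threshold=0.7):
    """Single-linkage grouping: two genes share a family if any chain of
    pairwise identities at or above `threshold` connects them."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g1, g2 in combinations(seqs, 2):
        if percent_identity(seqs[g1], seqs[g2]) >= threshold:
            parent[find(g1)] = find(g2)

    families = {}
    for name in seqs:
        families.setdefault(find(name), set()).add(name)
    return list(families.values())

# Hypothetical toy sequences, already aligned to equal length.
toy = {
    "geneA1": "ATGGCCAAGT",
    "geneA2": "ATGGCCAAGA",
    "geneB1": "TTCAGGTCCA",
}
families = group_into_families(toy)
```

With these toy inputs, geneA1 and geneA2 fall into one group and geneB1 stays on its own. Single linkage is used only because it is short to write; it tends to merge groups too readily on real data.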

If the genes of a gene family encode proteins, the term protein family is often used in an analogous manner to gene family.

The expansion or contraction of gene families along a specific lineage can be due to chance, or can be the result of natural selection. [2] To distinguish between these two cases is often difficult in practice. Recent work uses a combination of statistical models and algorithmic techniques to detect gene families that are under the effect of natural selection. [3]
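One way to picture expansion or contraction "due to chance" is a random birth-death process in which every gene copy has the same small probability of being duplicated or lost. The sketch below is an illustrative assumption, not the method of the cited work; the function name, the rate, and the generation count are hypothetical.

```python
import random

def simulate_family_size(start_size, rate, generations, rng):
    """Neutral birth-death sketch: in each generation every copy is
    duplicated with probability `rate`, lost with probability `rate`,
    and otherwise left unchanged."""
    if not 0 <= rate <= 0.5:
        raise ValueError("rate must lie between 0 and 0.5")
    size = start_size
    for _ in range(generations):
        change = 0
        for _ in range(size):
            u = rng.random()
            if u < rate:
                change += 1   # duplication adds a copy
            elif u < 2 * rate:
                change -= 1   # loss removes a copy
        size += change
    return size

# Twenty hypothetical lineages that all start with a ten-member family.
rng = random.Random(42)
final_sizes = [simulate_family_size(10, 0.01, 200, rng) for _ in range(20)]
```

Even with equal gain and loss rates, the final sizes differ between lineages. That is why an observed size difference alone does not demonstrate selection.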

The HUGO Gene Nomenclature Committee (HGNC) creates nomenclature schemes using a "stem" (or "root") symbol for members of a gene family (by homology or function), with a hierarchical numbering system to distinguish the individual members. [4] [5] For example, for the peroxiredoxin family, PRDX is the root symbol, and the family members are PRDX1, PRDX2, PRDX3, PRDX4, PRDX5, and PRDX6.
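The root-plus-number pattern in the PRDX example can be sketched with a simple parser. The regular expression below is a deliberate simplification and an assumption: it handles only symbols made of uppercase letters followed by a single number, whereas real HGNC symbols can be more complex. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Simplified pattern: an uppercase-letter root followed by a member number.
SYMBOL_PATTERN = re.compile(r"^([A-Z]+)(\d+)$")

def split_symbol(symbol):
    """Split a symbol such as 'PRDX4' into its root and member number."""
    match = SYMBOL_PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"symbol does not fit the simplified pattern: {symbol!r}")
    return match.group(1), int(match.group(2))

def group_by_root(symbols):
    """Collect member numbers under each root symbol, sorted ascending."""
    groups = defaultdict(list)
    for symbol in symbols:
        root, number = split_symbol(symbol)
        groups[root].append(number)
    return {root: sorted(numbers) for root, numbers in groups.items()}

members = group_by_root(["PRDX3", "PRDX1", "PRDX6", "PRDX2", "PRDX5", "PRDX4"])
```

Here `members` maps the root PRDX to the member numbers 1 through 6.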

Basic structure

Gene phylogeny as lines within grey species phylogeny. Top: An ancestral gene duplication produces two paralogs (histone H1.1 and 1.2). A speciation event produces orthologs in the two daughter species (human and chimpanzee). Bottom: in a separate species (E. coli), a gene has a similar function (histone-like nucleoid-structuring protein) but has a separate evolutionary origin and so is an analog.

One level of genome organization is the grouping of genes into several gene families. [6] [7] Gene families are groups of related genes that share a common ancestor. Members of gene families may be paralogs or orthologs. Gene paralogs are genes with similar sequences from within the same species while gene orthologs are genes with similar sequences in different species. Gene families are highly variable in size, sequence diversity, and arrangement. Depending on the diversity and functions of the genes within the family, families can be classified as multigene families or superfamilies. [6] [8]
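The duplication-versus-speciation distinction drawn in the figure caption can be made concrete with a small gene tree. In the sketch below, two genes are called orthologs if their most recent common ancestor is a speciation event and paralogs if it is a duplication. The tuple encoding of the tree, the leaf names, and the function names are assumptions made for illustration.

```python
# A gene tree as nested tuples: (event, left_subtree, right_subtree).
# Leaves are gene names. Event labels and leaf names are illustrative.
TREE = (
    "duplication",
    ("speciation", "human_H1.1", "chimp_H1.1"),
    ("speciation", "human_H1.2", "chimp_H1.2"),
)

def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def relationship(node, gene_a, gene_b):
    """Classify two genes by the event at their most recent common ancestor."""
    if isinstance(node, str):
        raise ValueError("genes must be two distinct leaves of the tree")
    event, left, right = node
    for child in (left, right):
        child_leaves = leaves(child)
        if gene_a in child_leaves and gene_b in child_leaves:
            # Both genes lie in one subtree, so their ancestor is deeper.
            return relationship(child, gene_a, gene_b)
    all_leaves = leaves(node)
    if gene_a not in all_leaves or gene_b not in all_leaves:
        raise ValueError("both genes must be present in the tree")
    # The genes sit in different subtrees: this node is their common ancestor.
    return "ortholog" if event == "speciation" else "paralog"
```

Under this reading, `relationship(TREE, "human_H1.1", "chimp_H1.1")` gives "ortholog", while human H1.1 and human H1.2 are paralogs. Human H1.1 and chimpanzee H1.2 also come out as paralogs, because the lineages leading to them separated at the duplication rather than at the speciation.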

Multigene families typically consist of members with similar sequences and functions, though a high degree of divergence (at the sequence and/or functional level) does not lead to the removal of a gene from a gene family. Individual genes in the family may be arranged close together on the same chromosome or dispersed throughout the genome on different chromosomes. Due to the similarity of their sequences and their overlapping functions, individual genes in the family often share regulatory control elements. [6] [8] In some instances, gene members have identical (or nearly identical) sequences. Such families allow for massive amounts of gene product to be expressed in a short time as needed. Other families allow for similar but specific products to be expressed in different cell types or at different stages of an organism's development. [6]

Superfamilies are much larger than single multigene families. Superfamilies contain up to hundreds of genes, including multiple multigene families as well as single, individual gene members. The large number of members allows superfamilies to be widely dispersed with some genes clustered and some spread far apart. The genes are diverse in sequence and function displaying various levels of expression and separate regulation controls. [6] [8]

Some gene families also contain pseudogenes, sequences of DNA that closely resemble established gene sequences but are non-functional. [9] Different types of pseudogenes exist. Non-processed pseudogenes are genes that acquired mutations over time and became non-functional. Processed pseudogenes are genes that have lost their function after being moved around the genome by retrotransposition. [8] [9] Pseudogenes that have become isolated from the gene family they originated in are referred to as orphans. [6]


Gene families arose from multiple duplications of an ancestral gene, followed by mutation and divergence. [6] Duplications can occur within a lineage (e.g., humans might have two copies of a gene that is found only once in chimpanzees) or they are the result of speciation. For example, a single gene in the ancestor of humans and chimpanzees now occurs in both species and can be thought of as having been 'duplicated' via speciation. As a result of duplication by speciation, a gene family might include 15 genes, one copy in each of 15 different species.


In the formation of gene families, four levels of duplication exist: 1) exon duplication and shuffling, 2) entire gene duplication, 3) multigene family duplication, and 4) whole genome duplication. Exon duplication and shuffling give rise to variation and new genes. Genes are then duplicated to form multigene families, which duplicate to form superfamilies spanning multiple chromosomes. Whole genome duplication doubles the number of copies of every gene and gene family. [6] Whole genome duplication, or polyploidization, can be either autopolyploidization or allopolyploidization. Autopolyploidization is the duplication of the same genome and allopolyploidization is the duplication of two closely related genomes or hybridized genomes from different species. [8]

Duplication occurs primarily through unequal crossing over events in meiosis of germ cells. When two chromosomes misalign, crossing over (the exchange of gene alleles) results in one chromosome expanding, or increasing in gene number, and the other contracting, or decreasing in gene number. The expansion of a gene cluster is the duplication of genes that leads to larger gene families. [6] [8]
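The bookkeeping of a misaligned crossover can be sketched by treating each chromosome as an ordered list of gene copies and exchanging the segments to the right of two different breakpoints. The function name, gene labels, and breakpoints below are hypothetical.

```python
def unequal_crossover(chrom_a, chrom_b, break_a, break_b):
    """Exchange the segments to the right of each breakpoint.

    chrom_a and chrom_b are lists of gene copies in chromosomal order.
    When break_a differs from break_b the chromosomes were misaligned,
    so one product gains copies and the other loses them."""
    recombinant_1 = chrom_a[:break_a] + chrom_b[break_b:]
    recombinant_2 = chrom_b[:break_b] + chrom_a[break_a:]
    return recombinant_1, recombinant_2

# Two homologous chromosomes, each carrying the same three-gene cluster.
parent_1 = ["g1", "g2", "g3"]
parent_2 = ["g1", "g2", "g3"]
expanded, contracted = unequal_crossover(parent_1, parent_2, break_a=2, break_b=1)
# expanded   == ["g1", "g2", "g2", "g3"]  (g2 is duplicated)
# contracted == ["g1", "g3"]              (g2 is lost)
```

The total number of gene copies across the two products is unchanged. The crossover only redistributes them, which is why expansion on one chromosome is paired with contraction on the other.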


Gene members of a multigene family or multigene families within superfamilies exist on different chromosomes due to relocation of those genes after duplication of the ancestral gene. Transposable elements play a role in the movement of genes. Transposable elements are recognized by inverted repeats at their 5' and 3' ends. When two transposable elements are close enough in the same region on a chromosome, they can form a composite transposon. The protein transposase recognizes the outermost inverted repeats, cutting the DNA segment. Any genes between the two transposable elements are relocated as the composite transposon jumps to a new area of the genome. [6]

Reverse transcription is another method of gene movement. An mRNA transcript of a gene is reverse transcribed, or copied, back into DNA. This new DNA copy of the mRNA is integrated into another part of the genome, resulting in gene family members being dispersed. [8]
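At the sequence level, copying an mRNA back into DNA amounts to building the complementary DNA strand. The sketch below shows only that base-pairing step for a toy transcript; the function names and the example sequence are assumptions, and the enzymatic details and genomic integration are not modelled.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the DNA strand complementary to an mRNA.

    Both the input and the result are written 5' to 3', so the
    complement is read in reverse order."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))

def sense_dna(mrna):
    """Return the DNA strand with the same sense as the mRNA (U replaced by T)."""
    return mrna.upper().replace("U", "T")

transcript = "AUGGCC"  # hypothetical toy transcript
cdna = first_strand_cdna(transcript)   # "GGCCAT"
coding_copy = sense_dna(transcript)    # "ATGGCC"
```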

A special type of multigene family is implicated in the movement of gene families and gene family members. LINE (Long INterspersed Elements) and SINE (Short INterspersed Elements) families are highly repetitive DNA sequences spread throughout the genome. The LINEs contain a sequence that encodes a reverse transcriptase protein. This protein aids in copying the RNA transcripts of LINEs and SINEs back into DNA, and integrates them into different areas of the genome. This self-perpetuates the growth of LINE and SINE families. Due to the highly repetitive nature of these elements, LINEs and SINEs that lie close together can also trigger unequal crossing over events, which result in single-gene duplications and the formation of gene families. [6] [8]


Non-synonymous mutations, which result in the substitution of amino acids, accumulate in duplicate gene copies. Duplication gives rise to multiple copies of the same gene, giving a level of redundancy where mutations are tolerated. As long as one copy of the gene remains functional, the other copies are able to acquire mutations without being extremely detrimental to the organism. Mutations allow duplicate genes to acquire new or different functions. [8]

Concerted evolution

Some multigene families are extremely homogeneous, with individual gene members sharing identical or almost identical sequences. The process by which gene families maintain high homogeneity is concerted evolution. Concerted evolution occurs through repeated cycles of unequal crossing over events and repeated cycles of gene transfer and conversion. Unequal crossing over leads to the expansion and contraction of gene families. Gene families have an optimal size range that natural selection acts towards. Contraction deletes divergent gene copies and keeps gene families from becoming too large. Expansion replaces lost gene copies and prevents gene families from becoming too small. Repeated cycles of gene transfer and conversion make gene family members increasingly similar. [6]
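The homogenizing effect of repeated gene conversion can be sketched with a toy simulation in which, at each step, one family member is overwritten by the sequence of another. This sketch assumes unbiased conversion between randomly chosen copies and ignores unequal crossing over and selection on family size; the function names, sequences, and step limit are hypothetical.

```python
import random

def gene_conversion_step(copies, rng):
    """Overwrite one randomly chosen copy with the sequence of another copy."""
    donor, recipient = rng.sample(range(len(copies)), 2)
    copies[recipient] = copies[donor]

def homogenize(copies, rng, max_steps=10_000):
    """Repeat conversion steps until every copy is identical or the step
    limit is reached. Returns the final copies and the steps used."""
    copies = list(copies)  # work on a copy so the input is left untouched
    steps = 0
    while len(set(copies)) > 1 and steps < max_steps:
        gene_conversion_step(copies, rng)
        steps += 1
    return copies, steps

# A hypothetical four-member family with three distinct sequence variants.
rng = random.Random(0)
family = ["ACGT", "ACGA", "TCGT", "ACGT"]
final, steps = homogenize(family, rng)
```

The family size stays constant while the number of distinct variants can only stay the same or fall, so the copies drift towards a single shared sequence. Which variant wins is a matter of chance in this unbiased version.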

In the process of gene transfer, allelic gene conversion is biased. The spread of a mutant allele through a gene family towards homogeneity is the same process as the spread of an advantageous allele through a population towards fixation. Gene conversion also aids in creating genetic variation in some cases. [10]


Gene families, part of a hierarchy of information storage in a genome, play a large role in the evolution and diversity of multicellular organisms. Gene families are large units of information and genetic variability. [6] Over evolutionary time, gene families have expanded and contracted, with new gene families being formed and some gene families being lost. In several evolutionary lineages, genes are gained and lost at similar rates. Adaptive expansion of gene families occurs when natural selection favours additional gene copies, as can happen when an environmental stressor acts on a species. Gene amplification is more common in bacteria and is a reversible process. Adaptive contraction of gene families commonly results from the accumulation of loss-of-function mutations. For example, a nonsense mutation, which prematurely halts translation of the gene's transcript, can become fixed in the population, leading to the loss of the gene. This process occurs when changes in the environment render a gene redundant. [7]

New gene families originate from orphan genes (isolated pseudogenes). These isolated genes arise in different ways: a gene duplicate accumulates enough mutations that it is no longer recognized as part of the original gene family, a new gene enters the genome by horizontal transfer, or a new gene originates de novo from non-coding sequences. These orphan genes would then go through the processes of duplication, relocation and divergence to form a family. Gene family death occurs when the loss of genes leads to the loss of the entire gene family. The continuous loss of genes eventually leads to the extinction of the gene family. Gene loss may occur through the deletion of genes or through complete loss of function, in which case the genes become pseudogenes. [7]

Functional family

In addition to classification by evolution (structural gene family), the HGNC also makes "gene families" by function in their stem nomenclature. [11] As a result, a stem can also refer to genes that have the same function, often part of the same protein complex. For example, BRCA1 and BRCA2 are unrelated genes that are both named for their role in breast cancer and RPS2 and RPS3 are unrelated ribosomal proteins found in the same small subunit.

The HGNC also maintains a "gene group" (formerly "gene family") classification. A gene can be a member of multiple groups, and all groups form a hierarchy. As with the stem classification, both structural and functional groups exist. [4] [5]


  1. Nussbaum, Robert L.; McInnes, Roderick R.; Willard, Huntington F. (2016). Thompson & Thompson Genetics in Medicine (8th ed.). Philadelphia, PA: Elsevier. p. 25. ISBN 978-1-4377-0696-3.
  2. Hartl, D.L. and Clark A.G. 2007. Principles of population genetics. Chapter 7, page 372.
  3. Demuth, Jeffery P.; Bie, Tijl De; Stajich, Jason E.; Cristianini, Nello; Hahn, Matthew W.; Borevitz, Justin (20 December 2006). "The Evolution of Mammalian Gene Families". PLOS ONE. 1 (1): e85. Bibcode:2006PLoSO...1...85D. doi:10.1371/journal.pone.0000085. PMC 1762380. PMID 17183716.
  4. Daugherty, LC; Seal, RL; Wright, MW; Bruford, EA (Jul 5, 2012). "Gene family matters: expanding the HGNC resource". Human Genomics. 6 (1): 4. doi:10.1186/1479-7364-6-4. PMC 3437568. PMID 23245209.
  5. HGNC. "Gene group help". Retrieved 2020-10-13.
  6. Hartwell, Leland H.; et al. (2011). Genetics: from genes to genomes (4th ed.). New York: McGraw-Hill. ISBN 978-0073525266.
  7. Demuth, JP; Hahn, MW (January 2009). "The life and death of gene families". BioEssays. 31 (1): 29–39. doi:10.1002/bies.080085. PMID 19153999. S2CID 9528185.
  8. Ohta, Tomoko (2008). "Gene families: multigene families and superfamilies". eLS. doi:10.1038/npg.els.0005126. ISBN 978-0470015902.
  9. Nussbaum, Robert L; et al. (2015). Genetics in Medicine (8th ed.). Philadelphia: Elsevier. ISBN 9781437706963.
  10. Ohta, T (30 September 2010). "Gene conversion and evolution of gene families: an overview". Genes. 1 (3): 349–56. doi:10.3390/genes1030349. PMC 3966226. PMID 24710091.
  11. "What is a stem symbol?". HGNC FAQ.