Yaniv Erlich

יניב ארליך
Alma mater: Watson School of Biological Sciences
Scientific career
Fields: Genomics, Bioinformatics, Genetic Privacy, Crowdsourcing
Institutions: Columbia University
Doctoral advisor: Greg Hannon
Website: https://teamerlich.org/

Yaniv Erlich is an Israeli-American scientist. He formerly served as an Associate Professor of Computer Science at Columbia University and was the Chief Science Officer of MyHeritage. [1] Erlich's work combines computer science and genomics.

Biography

Erlich was born in Israel. He earned a BSc in Brain Sciences in 2006 from Tel Aviv University and a PhD in bioinformatics in 2010 from the Watson School of Biological Sciences at Cold Spring Harbor Laboratory. From 2010 to 2015, Erlich was a Fellow at the Whitehead Institute, MIT. From 2015 to 2019, he led a computational genomics lab at Columbia University. [2] Since 2020, he has served as CEO of Eleven Therapeutics. [3]

Scientific work

Crowdsourcing genomic information

Erlich's team published a study in the journal Science that reported the crowdsourcing of tens of millions of genealogical records from the website Geni.com. [4] The team assembled a single family tree of 13 million connected people that spans tens of generations and over 600 years of history. [5] The study used the data to analyze the genetics of longevity and familial dispersion. [6]

In a separate line of work, Erlich and Joe Pickrell built a website called DNA.Land to crowdsource genomic datasets from customers of consumer genomics companies. [7] The website had collected over 130,000 datasets by November 2018.

Genetic privacy

The Erlich group published several studies on genetic privacy. In 2013, they reported that the surname of a male can be recovered from his ostensibly anonymous genomic dataset, which can lead to tracing his full identity. [8] The technique exploits the co-inheritance of surnames and Y-chromosomes in most societies. By comparing the Y-chromosome of the person of interest to genetic genealogy databases of Y-chromosomes, it is possible in some cases to infer the surname. The team estimated that 12% of males in the US are subject to successful surname recovery. The team also showed that once the surname is recovered, basic demographic identifiers such as age and state of residency can permit tracing the identity of the individual. To demonstrate the power of the technique, they recovered the identities of multiple 1000 Genomes participants by surname inference.
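The database comparison described above can be illustrated with a minimal Python sketch. It is not the method used in the study: it swaps in a plain count of mismatching markers between a query Y-chromosome profile and each database record, and returns a surname only when the closest record is within a tolerance. The function names, the `max_mismatches` threshold, the surnames, and the repeat counts are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Profile = Dict[str, int]  # marker name -> repeat count

# Hypothetical toy database. Surnames and repeat counts are invented.
TOY_DATABASE: List[Tuple[str, Profile]] = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 14}),
    ("Brown", {"DYS19": 14, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
]


def count_mismatches(a: Profile, b: Profile) -> Optional[int]:
    """Count markers typed in both profiles whose repeat counts differ.

    Returns None when the profiles share no markers, so that an
    uninformative comparison is never treated as a perfect match.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(1 for marker in shared if a[marker] != b[marker])


def infer_surname(query: Profile,
                  database: List[Tuple[str, Profile]],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the surname of the closest record, or None if none is close.

    Assumption: a simple mismatch threshold stands in for a proper
    statistical confidence measure.
    """
    best_surname: Optional[str] = None
    best_distance: Optional[int] = None
    for surname, profile in database:
        distance = count_mismatches(query, profile)
        if distance is None:
            continue
        if best_distance is None or distance < best_distance:
            best_surname, best_distance = surname, distance
    if best_distance is not None and best_distance <= max_mismatches:
        return best_surname
    return None


if __name__ == "__main__":
    query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 12}
    print(infer_surname(query, TOY_DATABASE))
```

Returning `None` when no record is close enough mirrors the point in the text that inference succeeds only "in some cases"; a realistic approach would also need to weigh how common each surname and profile is.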

In 2014, Erlich and Arvind Narayanan published a survey of techniques for breaching the privacy of genomic datasets. [9] They predicted that autosomal searches in GEDmatch could be used to trace the identity of anonymous people once the GEDmatch user base reached a certain size. This indeed happened in 2018, when the website was used to capture the Golden State Killer.

In 2018, the Erlich team published a study in Science that reported that about 60% of US individuals of European descent have at least a third-cousin match in GEDmatch, which can theoretically permit their identification. [10] The study projected that within two to three years, virtually any person in this ethnic group could theoretically be traced using this technique if GEDmatch continued to grow at its then-current rate. [11] The team suggested a cryptographic signature technique to reduce the chance of direct-to-consumer websites being misused in police searches.
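The link between database size and the chance of a relative match can be sketched with a deliberately simplified model. This is an illustration under stated assumptions, not the model used in the study: it assumes each of a person's relatives appears in the database independently with probability equal to the database's coverage of the population. The function name and the figure of 200 relatives are hypothetical.

```python
def match_probability(coverage: float, n_relatives: int) -> float:
    """Probability that at least one of n_relatives is in the database.

    Assumption: each relative is included independently with
    probability `coverage` (the fraction of the population covered).
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if n_relatives < 0:
        raise ValueError("n_relatives must be non-negative")
    return 1.0 - (1.0 - coverage) ** n_relatives


if __name__ == "__main__":
    # Hypothetical input: 200 relatives at third-cousin distance or closer.
    for coverage in (0.001, 0.005, 0.01, 0.02):
        p = match_probability(coverage, 200)
        print(f"coverage={coverage:.3f} -> P(at least one match)={p:.2f}")
```

Under these assumptions the probability rises quickly with coverage, which is the qualitative reason a database holding a small fraction of a population can still reach most of its members.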

References

  1. "Erlich lab's website".
  2. "TEDxDanubia speakers".
  3. "About Us". Eleven Therapeutics. Retrieved 2023-04-16.
  4. Kaplanis, Joanna; Gordon, Assaf; Shor, Tal; Weissbrod, Omer; Geiger, Dan; Wahl, Mary; Gershovits, Michael; Markus, Barak; Sheikh, Mona; Gymrek, Melissa; Bhatia, Gaurav; MacArthur, Daniel G.; Price, Alkes L.; Erlich, Yaniv (2018). "Quantitative analysis of population-scale family trees with millions of relatives". Science. 360 (6385): 171–175. Bibcode:2018Sci...360..171K. doi:10.1126/science.aam9309. PMC 6593158. PMID 29496957.
  5. "Crowdsourcing 600 Years of Human History". 13 March 2018.
  6. Hotz, Robert Lee (March 2018). "WSJ". Wall Street Journal.
  7. "DNA.Land is a framework to collect genomes and phenomes in the era of abundant genetic information".
  8. Gymrek, Melissa; McGuire, Amy L.; Golan, David; Halperin, Eran; Erlich, Yaniv (2013). "Identifying Personal Genomes by Surname Inference". Science. 339 (6117): 321–324. Bibcode:2013Sci...339..321G. doi:10.1126/science.1229566. PMID 23329047. S2CID 3473659.
  9. Erlich, Yaniv; Narayanan, Arvind (2014). "Routes for breaching and protecting genetic privacy". Nature Reviews Genetics. 15 (6): 409–421. doi:10.1038/nrg3723. PMC 4151119. PMID 24805122.
  10. Erlich, Yaniv; Shor, Tal; Pe'er, Itsik; Carmi, Shai (2018). "Identity inference of genomic data using long-range familial searches". Science. 362 (6415): 690–694. Bibcode:2018Sci...362..690E. doi:10.1126/science.aau4832. PMC 7549546. PMID 30309907.
  11. Murphy, Heather (11 October 2018). "Most White Americans' DNA Can Be Identified Through Genealogy Databases". The New York Times.