Forensic statistics

Forensic statistics is the application of probability models and statistical techniques to scientific evidence, such as DNA evidence, [1] and to the law. In contrast to "everyday" statistics, and to avoid introducing bias or drawing unwarranted conclusions, forensic statisticians report the weight of evidence as likelihood ratios (LR). Juries or judges then use this ratio of probabilities to draw inferences and decide legal matters. [1] In DNA cases, jurors and judges rely on the statistical strength of a DNA match to reach conclusions about guilt or innocence. [2]

In forensic science, the DNA evidence received for DNA profiling often contains a mixture of more than one person's DNA. DNA profiles are generated using a set procedure; however, interpreting a profile becomes more complicated when the sample contains a mixture of DNA. Regardless of the number of contributors to the forensic sample, statistics and probabilities must be used to give weight to the evidence and to describe what the DNA results mean. For a single-source DNA profile, the statistic used is the random match probability (RMP). RMPs can also be used in certain situations to describe the interpretation of a DNA mixture. [3] Other statistical tools for describing DNA mixture profiles include likelihood ratios (LR) and the combined probability of inclusion (CPI), also known as random man not excluded (RMNE). [4]

Computer programs have been developed that implement forensic DNA statistics for assessing the biological relationships between two or more people. These programs support several approaches, including match probability, exclusion probability, likelihood ratios, Bayesian approaches, and paternity and kinship testing. [5]

Although the precise origin of the term "forensic statistics" remains unclear, it was in use in the 1980s and 1990s. [6] Among the first forensic statistics conferences were two held in 1991 and 1993. [7]

Random match probability

Random match probabilities (RMP) are used to estimate and express the rarity of a DNA profile. The RMP can be defined as the probability that someone else in the population, chosen at random, would have the same genotype as the contributor of the forensic evidence. It is calculated from the genotype frequency at each locus, which reflects how common or rare the alleles of that genotype are. The genotype frequencies are multiplied across all loci, using the product rule, to give the RMP. This statistic gives weight to the evidence either for or against a particular suspect being a contributor to the DNA sample. [4]
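The product rule described above can be illustrated with a minimal Python sketch. It assumes that genotype frequencies follow Hardy-Weinberg proportions (p² for a homozygote, 2pq for a heterozygote), which the text does not state, and the allele frequencies and function names are hypothetical, chosen only for illustration.

```python
from math import prod


def genotype_frequency(p, q=None):
    """Genotype frequency at one locus, assuming Hardy-Weinberg proportions.

    p and q are the population frequencies of the two alleles.
    Pass q=None for a homozygous genotype.
    """
    if q is None:
        return p * p          # homozygote: p^2
    return 2 * p * q          # heterozygote: 2pq


def random_match_probability(profile):
    """Multiply the genotype frequencies across all loci (product rule)."""
    return prod(genotype_frequency(p, q) for p, q in profile)


# Hypothetical three-locus profile: (p, q) per locus, q=None if homozygous.
profile = [(0.1, 0.2), (0.05, None), (0.3, 0.1)]
rmp = random_match_probability(profile)
print(f"RMP = {rmp:.2e}")  # about 6e-06, i.e. roughly 1 in 167,000
```

Casework calculations typically involve many more loci and population-specific frequency databases; the sketch shows only the arithmetic.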

RMP can only be used to describe a DNA profile if the profile is from a single source or if the analyst is able to differentiate between the peaks on the electropherogram from the major and minor contributors of a mixture. [3] Since mixtures with more than two contributors are very difficult for analysts to interpret without computer software, RMP becomes difficult to calculate for a mixture of more than two people. [4] If the major and minor contributor peaks cannot be differentiated, other statistical methods may be used.

If the DNA mixture contains major and minor contributors in a ratio of 4:1, a modified random match probability (mRMP) may be used as a statistical tool. To calculate the mRMP, the analyst must first deduce a major and a minor contributor and their genotypes based on the peak heights in the electropherogram. Laboratories conducting DNA analysis often use computer software to calculate the mRMP more accurately, since calculations for each of the most probable genotypes at each locus are tedious and inefficient to do by hand. [2]

Likelihood ratio

Sometimes it can be very difficult to determine the number of contributors in a DNA mixture. If the peaks are easily distinguished and the number of contributors can be determined, a likelihood ratio (LR) is used. LRs consider the probabilities of events and rely on a pair of alternative hypotheses against which the evidence is assessed. [8] In forensic cases, these alternative hypotheses are the prosecutor's hypothesis and the defense hypothesis. In forensic biology cases, the hypotheses often state that the DNA came from a particular person or that it came from an unknown person. [2] For example, the prosecution may hypothesize that the DNA sample contains DNA from the victim and the suspect, while the defense may hypothesize that the sample contains DNA from the victim and an unknown person. The probabilities of the evidence under each hypothesis are expressed as a ratio, with the prosecutor's hypothesis in the numerator. [3] The ratio thus expresses how well each hypothesis explains the evidence relative to the other. Under the hypothesis that the mixture contains the suspect, the probability is taken as 1, because the peaks can be distinguished and it is easy to tell at each locus whether the suspect can be excluded as a contributor based on his or her genotype. The probability of 1 assumes that the suspect cannot be excluded as a contributor. To determine the probability under the hypothesis involving an unknown contributor, all genotype possibilities must be determined for that locus. [3]

Once the likelihood ratio has been calculated, the number is turned into a statement that gives meaning to the statistic. For the previous example, if the calculated LR is x, then the evidence is x times more probable if the sample contains DNA from the victim and the suspect than if it contains DNA from the victim and an unknown person. [8] For a single-source profile, the likelihood ratio can also be defined as 1/RMP. [3]
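The victim-and-suspect example can be sketched for a single locus as follows. This is a simplified illustration, not a casework method: it ignores peak heights, drop-out and drop-in, assumes exactly two contributors and Hardy-Weinberg genotype frequencies, and uses hypothetical allele labels, frequencies and function names.

```python
from itertools import combinations_with_replacement


def genotype_frequency(a, b, freqs):
    """Hardy-Weinberg genotype frequency (an assumption of this sketch)."""
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def locus_likelihood_ratio(observed, victim, suspect, freqs):
    """LR at one locus for Hp: victim + suspect vs Hd: victim + unknown."""
    observed = set(observed)
    # Numerator: 1 if victim and suspect together explain every observed
    # allele and nothing else, otherwise the suspect is excluded (0).
    numerator = 1.0 if set(victim) | set(suspect) == observed else 0.0
    # Denominator: sum the frequencies of every genotype an unknown person
    # could have so that victim + unknown explain the observed alleles.
    denominator = sum(
        genotype_frequency(a, b, freqs)
        for a, b in combinations_with_replacement(sorted(observed), 2)
        if set(victim) | {a, b} == observed
    )
    if denominator == 0:
        raise ValueError("victim's genotype is inconsistent with the mixture")
    return numerator / denominator


# Hypothetical locus: alleles 10, 11 and 12 observed in the mixture.
freqs = {10: 0.2, 11: 0.3, 12: 0.1}
lr = locus_likelihood_ratio({10, 11, 12}, victim=(10, 11), suspect=(11, 12), freqs=freqs)
print(f"LR = {lr:.2f}")  # about 9.09
```

Here the unknown contributor must carry allele 12, so the possible genotypes are (10, 12), (11, 12) and (12, 12), with a combined frequency of 0.11. Under the usual assumption that loci are independent, the per-locus ratios would be multiplied to give the overall LR.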

Combined probability of inclusion

Combined probability of inclusion (CPI) is a common statistic used when the analyst cannot differentiate between the peaks from the major and minor contributors to a sample and the number of contributors cannot be determined. [3] CPI is also commonly known as random man not excluded (RMNE). [3] At each locus, the frequencies of all observed alleles are added and the sum is squared, which yields the probability of inclusion (PI) for that locus. The PI values are then multiplied across all loci, resulting in the value for CPI. [2] The sum is squared so that all possible combinations of genotypes are included in the calculation. [4]
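The calculation described above can be sketched in a few lines of Python. The allele frequencies and function names are hypothetical and serve only to show the arithmetic.

```python
from math import prod


def probability_of_inclusion(allele_freqs):
    """PI at one locus: (sum of observed allele frequencies) squared."""
    return sum(allele_freqs) ** 2


def combined_probability_of_inclusion(loci):
    """CPI: product of the per-locus PI values."""
    return prod(probability_of_inclusion(freqs) for freqs in loci)


# Hypothetical frequencies of the alleles observed at three loci.
loci = [
    [0.1, 0.2, 0.2],       # sum 0.5 -> PI 0.25
    [0.3, 0.2],            # sum 0.5 -> PI 0.25
    [0.1, 0.1, 0.1, 0.1],  # sum 0.4 -> PI 0.16
]
cpi = combined_probability_of_inclusion(loci)
print(f"CPI = {cpi:.4f}")  # 0.0100
```

Note that no suspect genotype appears anywhere in the calculation; only the alleles observed in the mixture are used.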

Once the calculation is done, a statement is made about what the result means. For example, a CPI of 0.5 means that the probability that a person chosen at random from the population would not be excluded as a contributor to the DNA mixture is 0.5.

CPI relates to the evidence (the DNA mixture) and does not depend on the profile of any suspect. Therefore, CPI is a statistical tool that can be used to give weight or strength to evidence when no other information about the crime is known. [3] This is advantageous in situations where the genotypes in the DNA mixture cannot be distinguished from one another. However, this statistic is not very discriminating and is not as powerful a tool as likelihood ratios and random match probabilities can be when some information about the DNA mixture, such as the number of contributors or the genotype of each contributor, can be determined. Another limitation of CPI is that it is not usable as a tool for the interpretation of a DNA mixture. [4]

Blood stains

Blood stains are an important part of forensic statistics, as the analysis of blood drop impacts may help to reconstruct the event that took place. Blood stains are commonly elliptical in shape, which makes it straightforward to estimate the impact angle of a blood droplet through the formula α = arcsin(d/a). In this formula, d and a are estimates of the minor and major axes of the ellipse, respectively. From these calculations, a visualization of the event causing the stains can be drawn, along with further information such as the velocity of the object that caused the stains. [9]
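The impact angle formula can be evaluated directly, as in the following minimal Python sketch. It takes d as the width (minor axis) and a as the length (major axis) of the elliptical stain, measured in the same unit; the function name and measurements are hypothetical.

```python
import math


def impact_angle_degrees(d, a):
    """Impact angle alpha = arcsin(d / a), returned in degrees.

    d: width (minor axis) of the elliptical stain
    a: length (major axis) of the elliptical stain
    """
    if not 0 < d <= a:
        raise ValueError("expected 0 < d <= a")
    return math.degrees(math.asin(d / a))


# Hypothetical stain measuring 5 mm wide and 10 mm long.
angle = impact_angle_degrees(5.0, 10.0)
print(f"impact angle = {angle:.1f} degrees")  # 30.0
```

A circular stain (d equal to a) gives 90 degrees, corresponding to a perpendicular impact, while a more elongated stain gives a shallower angle.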

References

  1. Gill, Richard. "Forensic Statistics: Ready for Consumption?" (PDF). Mathematical Institute, Leiden University.
  2. Perlin, Mark (2015). "Inclusion probability for DNA mixtures is a subjective one-sided match statistic unrelated to identification information". Journal of Pathology Informatics. 6 (59): 59. doi:10.4103/2153-3539.168525. PMC 4639950. PMID 26605124.
  3. Butler, John (2005). Forensic DNA Typing (2nd ed.). Elsevier Academic Press. pp. 445–529.
  4. Butler, John (2015). Advanced Topics in Forensic DNA Typing: Interpretation. San Diego, CA: Elsevier Inc. pp. 213–333.
  5. Fung, Wing Kam (2006). "On Statistical Analysis Of Forensic DNA: Theory, Methods And Computer Programs". Forensic Science International. 162 (1–3): 17–23. doi:10.1016/j.forsciint.2006.06.025. PMID 16870375.
  6. Valentin, J (1980). "Exclusions and attributions of paternity: practical experiences of forensic genetics and statistics". Am J Hum Genet. 32 (3): 420–31. PMC 1686081. PMID 6930157.
  7. Aitken, C. G. G.; Taroni, F. (2004). Statistics and the Evaluation of Evidence for Forensic Scientists. John Wiley and Sons.
  8. "What is a likelihood ratio?" (PDF). International Society of Forensic Genetics. Forensic Science Service Ltd. 2006. Retrieved 6 November 2018.
  9. Camana, Francesco (2013). "Determining The Area Of Convergence In Bloodstain Pattern Analysis: A Probabilistic Approach". Forensic Science International. 231 (1–3): 131–136. arXiv:1210.6106. doi:10.1016/j.forsciint.2013.04.019. PMID 23890627. S2CID 18601439.