Data analysis for fraud detection

Last updated July 27, 2024

Fraud represents a significant problem for governments and businesses, and specialized analysis techniques are required to discover it. These techniques include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes.^[1]

In general, the primary reason to use data analytics techniques to tackle fraud is that many internal control systems have serious weaknesses. For example, the currently prevailing approach employed by many law enforcement agencies to detect companies involved in potential cases of fraud consists of receiving circumstantial evidence or complaints from whistleblowers.^[2] As a result, a large number of fraud cases remain undetected and unprosecuted. To effectively test, detect, validate, correct errors in, and monitor control systems against fraudulent activities, business entities and organizations rely on specialized data analytics techniques such as data mining, data matching, the "sounds like" function, regression analysis, clustering analysis, and gap analysis.^[3] Techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence.^[4]

Statistical techniques

Examples of statistical data analysis techniques are:

Data preprocessing techniques for detection, validation, error correction, and filling in of missing or incorrect data.
Calculation of various statistical parameters such as averages, quantiles, performance metrics, probability distributions, and so on. For example, the averages may include average length of call, average number of calls per month and average delays in bill payment.
Models of various business activities, expressed either in terms of various parameters or as probability distributions.
Computing user profiles.
Time-series analysis of time-dependent data.^[5]
Clustering and classification to find patterns and associations among groups of data.^[5]
Data matching is used to compare two sets of collected data. The process can be performed with algorithms or programmed loops that try to match sets of data against each other or compare complex data types. Data matching is used to remove duplicate records and to identify links between two data sets for marketing, security or other uses.^[3]
The "sounds like" function is used to find values that sound similar. Phonetic similarity is one way to locate possible duplicate values or inconsistent spelling in manually entered data. The "sounds like" function converts the comparison strings to four-character American Soundex codes, which are based on the first letter and the first three consonants after the first letter in each string.^[3]
Regression analysis examines the relationship between two or more variables of interest by estimating the relationships between independent variables and a dependent variable. This method can be used to help understand and identify relationships among variables and to predict actual results.^[3]
Gap analysis is used to determine whether business requirements are being met and, if not, what steps should be taken to meet them.
Matching algorithms to detect anomalies in the behavior of transactions or users as compared to previously known models and profiles. Techniques are also needed to eliminate false alarms, estimate risks, and predict the future behavior of current transactions or users.
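
The "sounds like" comparison described above can be illustrated with a minimal sketch of American Soundex. It assumes the commonly documented rules: the first letter is kept, later consonants are mapped to digits, adjacent letters with the same digit are collapsed, and the code is padded to four characters. The function name `soundex` is chosen here for illustration and does not refer to any particular audit tool.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")  # the first letter's digit also suppresses repeats
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != prev:
                code += digit
            prev = digit
        elif ch not in "hw":
            prev = ""  # vowels separate repeated digits; h and w do not
        if len(code) == 4:
            break
    return code.ljust(4, "0")


# Names that sound alike map to the same code and can be flagged as possible duplicates.
print(soundex("Robert"), soundex("Rupert"))
print(soundex("Smith") == soundex("Smyth"))
```

Two manually entered vendor or customer names with the same code are only candidates for review; the code match alone does not show that the records are duplicates.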

Some forensic accountants specialize in forensic analytics, which is the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are data collection, data preparation, data analysis, and reporting. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted or divertible for personal use.

Artificial intelligence

Fraud detection is a knowledge-intensive activity. The main AI techniques used for fraud detection include:

Data mining to classify, cluster, and segment the data and automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud.
Expert systems to encode expertise for detecting fraud in the form of rules.
Pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior either automatically (unsupervised) or to match given inputs.
Machine learning techniques to automatically identify characteristics of fraud.
Neural nets to independently generate classification, clustering, generalization, and forecasting that can then be compared against conclusions raised in internal audits or formal financial documents such as 10-Q.^[5]

Other techniques such as link analysis, Bayesian networks, decision theory, and sequence matching are also used for fraud detection.^[4] A newer technique called the System properties approach has also been employed wherever rank data is available.^[6]

Statistical analysis of research data is the most comprehensive method for determining if data fraud exists. Data fraud as defined by the Office of Research Integrity (ORI) includes fabrication, falsification and plagiarism.

Machine learning and data mining

Early data analysis techniques were oriented toward extracting quantitative and statistical data characteristics. These techniques facilitate useful data interpretations and can give better insight into the processes behind the data. Although traditional data analysis techniques can indirectly lead us to knowledge, that knowledge is still created by human analysts.^[7]

To go beyond this, a data analysis system has to be equipped with a substantial amount of background knowledge and be able to perform reasoning tasks involving that knowledge and the data provided.^[7] In an effort to meet this goal, researchers have turned to ideas from the machine learning field. This is a natural source of ideas, since the machine learning task can be described as turning background knowledge and examples (input) into knowledge (output).

If data mining results in discovering meaningful patterns, data turns into information. Information or patterns that are novel, valid and potentially useful are not merely information, but knowledge. One speaks of discovering knowledge that was previously hidden in the huge amount of data and is now revealed.

Machine learning and artificial intelligence solutions may be classified into two categories: 'supervised' and 'unsupervised' learning. These methods look for accounts, customers, suppliers, etc. that behave 'unusually' in order to output suspicion scores, rules or visual anomalies, depending on the method.^[8]

Whether supervised or unsupervised methods are used, note that the output gives only an indication of fraud likelihood. No standalone statistical analysis can guarantee that a particular object is fraudulent, although such analyses can identify fraudulent objects with very high accuracy. As a result, effective collaboration between machine learning models and human analysts is vital to the success of fraud detection applications.^[9]

Supervised learning

In supervised learning, a random sub-sample of all records is taken and manually classified as either 'fraudulent' or 'non-fraudulent' (the task can be decomposed into more classes to meet algorithm requirements). Relatively rare events such as fraud may need to be oversampled to obtain a large enough sample.^[10] These manually classified records are then used to train a supervised machine learning algorithm. After building a model using this training data, the algorithm should be able to classify new records as either fraudulent or non-fraudulent.
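
As a rough illustration of the oversampling step, the sketch below duplicates randomly chosen minority-class records until both classes are the same size. This is plain random oversampling, one of several possible strategies and not necessarily the one used in the cited work. The record layout, the `is_fraud` key and the function name `oversample_minority` are assumptions made for the example.

```python
import random


def oversample_minority(records, label_key="is_fraud", seed=0):
    """Return a shuffled copy of records with the rarer class duplicated
    at random until both classes have the same number of records."""
    rng = random.Random(seed)
    fraud = [r for r in records if r[label_key]]
    legit = [r for r in records if not r[label_key]]
    minority, majority = (fraud, legit) if len(fraud) < len(legit) else (legit, fraud)
    if not minority:
        return list(records)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical labelled sample: 5 fraudulent records out of 100.
records = [{"id": i, "is_fraud": i < 5} for i in range(100)]
balanced = oversample_minority(records)
print(sum(r["is_fraud"] for r in balanced), len(balanced))
```

Oversampling is normally applied only to the training portion of the labelled data, so that evaluation still reflects the true rarity of fraud.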

Supervised neural networks, fuzzy neural nets, and combinations of neural nets and rules, have been extensively explored and used for detecting fraud in mobile phone networks and financial statement fraud.^[11]^[12]

Bayesian learning neural networks have been implemented for credit card fraud detection, telecommunications fraud, auto claim fraud detection, and medical insurance fraud.^[13]

Hybrid knowledge/statistical-based systems, where expert knowledge is integrated with statistical power, use a series of data mining techniques for the purpose of detecting cellular clone fraud. Specifically, a rule-learning program to uncover indicators of fraudulent behaviour from a large database of customer transactions is implemented.^[14]

Cahill et al. (2000) design a fraud signature, based on data from fraudulent calls, to detect telecommunications fraud. To score a call for fraud, its probability under the account signature is compared to its probability under a fraud signature. The fraud signature is updated sequentially, enabling event-driven fraud detection.
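
The comparison of probabilities under two signatures can be sketched as a log-likelihood ratio. The sketch below is a simplified illustration and not the procedure of Cahill et al.: it assumes each call has been reduced to a single category (a hypothetical `call_bin` such as "long_international"), that signatures are dictionaries of category probabilities, and that a signature is updated by exponential weighting with an assumed `weight`.

```python
import math


def score_call(call_bin, account_sig, fraud_sig, floor=1e-6):
    """Log-likelihood ratio: positive when the call is more probable under
    the fraud signature than under the account's own signature."""
    p_account = max(account_sig.get(call_bin, 0.0), floor)
    p_fraud = max(fraud_sig.get(call_bin, 0.0), floor)
    return math.log(p_fraud / p_account)


def update_signature(sig, call_bin, weight=0.05):
    """Sequential update: shrink every probability by (1 - weight) and
    move the freed weight onto the category of the new call."""
    bins = set(sig) | {call_bin}
    return {b: (1 - weight) * sig.get(b, 0.0) + (weight if b == call_bin else 0.0)
            for b in bins}


# Hypothetical signatures over three call categories.
account_sig = {"short_local": 0.80, "long_local": 0.15, "long_international": 0.05}
fraud_sig = {"short_local": 0.10, "long_local": 0.20, "long_international": 0.70}

print(score_call("long_international", account_sig, fraud_sig))  # positive score
print(score_call("short_local", account_sig, fraud_sig))         # negative score

# After a call is confirmed as fraudulent, the fraud signature absorbs it.
fraud_sig = update_signature(fraud_sig, "long_international")
```

Because each update touches only one signature and one call, scoring can run as each call arrives rather than in periodic batches.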

Link analysis takes a different approach. It relates known fraudsters to other individuals, using record linkage and social network methods.^[15]^[16]
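
A minimal sketch of this idea, under stated assumptions: accounts are linked when they share an identifying attribute (a crude stand-in for record linkage), and a breadth-first search then reports how many links separate each account from a known fraudster. The account data and the function names `build_links` and `hops_from_fraud` are hypothetical.

```python
from collections import defaultdict, deque


def build_links(accounts):
    """Link any two accounts that share at least one attribute value.
    accounts maps an account id to a set of attribute strings."""
    by_attr = defaultdict(set)
    for acc, attrs in accounts.items():
        for attr in attrs:
            by_attr[attr].add(acc)
    graph = defaultdict(set)
    for members in by_attr.values():
        for acc in members:
            graph[acc] |= members - {acc}
    return graph


def hops_from_fraud(graph, known_fraud, max_hops=2):
    """Breadth-first search outward from known fraudsters; returns the
    link distance of every other account reached within max_hops."""
    dist = {f: 0 for f in known_fraud}
    queue = deque(known_fraud)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return {n: d for n, d in dist.items() if d > 0}


accounts = {
    "A": {"phone:555-0100", "addr:1 Main St"},
    "B": {"phone:555-0100"},
    "C": {"addr:9 Oak Ave", "card:1111"},
    "D": {"card:1111", "addr:1 Main St"},
    "E": {"phone:555-0199"},
}
graph = build_links(accounts)
print(hops_from_fraud(graph, {"A"}))  # B and D are one link away, C two; E is unreachable
```

Proximity to a known fraudster is only a lead for investigation, since shared addresses or phone numbers often have innocent explanations.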

This type of detection is only able to detect frauds similar to those which have occurred previously and been classified by a human. Detecting a novel type of fraud may require the use of an unsupervised machine learning algorithm.

Unsupervised learning

In contrast, unsupervised methods do not make use of labelled records.

Bolton and Hand apply Peer Group Analysis and Break Point Analysis to spending behaviour in credit card accounts.^[17] Peer Group Analysis detects individual objects that begin to behave differently from objects to which they had previously been similar. Break Point Analysis, the other tool Bolton and Hand develop for behavioural fraud detection,^[17] operates on the account level, unlike Peer Group Analysis. A break point is an observation at which anomalous behaviour for a particular account is detected.
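
The account-level idea behind break point detection can be illustrated with a simplified sketch, which is not Bolton and Hand's exact procedure: the most recent transactions of a single account are compared with that account's own earlier history, and a large standardized shift in the mean amount is treated as a possible break point. The window size `recent` and any alert threshold are assumptions.

```python
from statistics import mean, stdev


def break_point_score(amounts, recent=5):
    """Standardized shift of the mean of the last `recent` amounts
    relative to the account's earlier transactions."""
    if len(amounts) < recent + 2:
        raise ValueError("need at least recent + 2 transactions")
    baseline, latest = amounts[:-recent], amounts[-recent:]
    spread = stdev(baseline) or 1e-9  # guard against a perfectly flat history
    return (mean(latest) - mean(baseline)) / spread


# Hypothetical spending histories for two accounts.
steady = [20, 25, 22, 19, 24, 21, 23, 20, 22, 21, 24, 20, 23]
spiking = [20, 25, 22, 19, 24, 21, 23, 20, 250, 310, 280, 295, 270]

print(break_point_score(steady))   # close to zero
print(break_point_score(spiking))  # very large, suggesting a break point
```

No labelled fraud cases are needed here: each account serves as its own reference, which is what makes the approach unsupervised.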

A combination of unsupervised and supervised methods for credit card fraud detection is presented in Carcillo et al. (2019).^[18]

Geolocation

Online retailers and payment processors use geolocation to detect possible credit card fraud by comparing the user's location to the billing address on the account or the shipping address provided. A mismatch – an order placed from the US on an account number from Tokyo, for example – is a strong indicator of potential fraud. IP address geolocation can also be used in fraud detection to check the user's location against the billing address postal code or area code.^[19] Banks can prevent "phishing" attacks, money laundering and other security breaches by determining the user's location as part of the authentication process. Whois databases can also help verify IP addresses and registrants.^[20]
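
The location comparison can be sketched as follows. The lookup table is a hypothetical stand-in for a real IP geolocation database, and it uses address ranges reserved for documentation; the function names are likewise chosen for illustration.

```python
import ipaddress

# Hypothetical mapping from networks to countries (documentation ranges only).
# A production system would query a geolocation database instead.
IP_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "JP",
}


def country_for_ip(ip):
    """Return the country code for an IP address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, country in IP_COUNTRY.items():
        if addr in network:
            return country
    return None


def location_mismatch(ip, billing_country):
    """True when the IP resolves to a country other than the billing country.
    Unknown locations are not flagged in this sketch."""
    country = country_for_ip(ip)
    return country is not None and country != billing_country


print(location_mismatch("192.0.2.10", "JP"))   # order from a US address, Japanese billing
print(location_mismatch("198.51.100.7", "JP"))  # locations agree
```

A mismatch would typically feed a risk score or trigger additional verification rather than block the order outright, since travellers and VPN users produce legitimate mismatches.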

Government, law enforcement and corporate security teams use geolocation as an investigatory tool, tracking the Internet routes of online attackers to find the perpetrators and prevent future attacks from the same location.

Available datasets

A major limitation for the validation of existing fraud detection methods is the lack of public datasets.^[21] One of the few examples is the Credit Card Fraud Detection dataset^[22] made available by the ULB Machine Learning Group.^[23]

References

[1] Chuprina, Roman (13 April 2020). "The In-depth 2020 Guide to E-commerce Fraud Detection". www.datasciencecentral.com. Retrieved 2020-05-24.
[2] Velasco, Rafael B.; Carpanese, Igor; Interian, Ruben; Paulo Neto, Octávio C. G.; Ribeiro, Celso C. (2020-05-28). "A decision support system for fraud detection in public procurement". International Transactions in Operational Research. 28: 27–47. doi:10.1111/itor.12811. ISSN 0969-6016.
[3] Bolton, R. and Hand, D. (2002). Statistical fraud detection: A review. Statistical Science 17 (3), pp. 235–255.
[4] G. K. Palshikar, The Hidden Truth – Frauds and Their Control: A Critical Application for Business Intelligence, Intelligent Enterprise, vol. 5, no. 9, 28 May 2002, pp. 46–51.
[5] Al-Khatib, Adnan M. (2012). "Electronic Payment Fraud Detection Techniques". World of Computer Science and Information Technology Journal. 2. S2CID 214778396.
[6] Vani, G. K. (February 2018). "How to detect data collection fraud using System properties approach". Multilogic in Science. VII (SPECIAL ISSUE ICAAASTSD-2018). ISSN 2277-7601. Retrieved February 2, 2019.
[7] Michalski, R. S., I. Bratko, and M. Kubat (1998). Machine Learning and Data Mining – Methods and Applications. John Wiley & Sons Ltd.
[8] Bolton, R. & Hand, D. (2002). Statistical Fraud Detection: A Review (With Discussion). Statistical Science 17(3): 235–255.
[9] Tax, N. & de Vries, K.J. & de Jong, M. & Dosoula, N. & van den Akker, B. & Smith, J. & Thuong, O. & Bernardi, L. Machine Learning for Fraud Detection in E-Commerce: A Research Agenda. Proceedings of the KDD International Workshop on Deployable Machine Learning for Security Defense (MLHat). Springer, Cham, 2021.
[10] Dal Pozzolo, A. & Caelen, O. & Le Borgne, Y. & Waterschoot, S. & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications 41(10): 4915–4928.
[11] Green, B. & Choi, J. (1997). Assessing the Risk of Management Fraud through Neural Network Technology. Auditing 16(1): 14–28.
[12] Estevez, P., C. Held, and C. Perez (2006). Subscription fraud prevention in telecommunications using fuzzy rules and neural networks. Expert Systems with Applications 31, 337–344.
[13] Bhowmik, Rekha. "35 Data Mining Techniques in Fraud Detection". Journal of Digital Forensics, Security and Law. University of Texas at Dallas.
[14] Fawcett, T. (1997). AI Approaches to Fraud Detection and Risk Management: Papers from the 1997 AAAI Workshop. Technical Report WS-97-07. AAAI Press.
[15] Phua, C.; Lee, V.; Smith-Miles, K.; Gayler, R. (2005). "A Comprehensive Survey of Data Mining-based Fraud Detection Research". arXiv:1009.6119. doi:10.1016/j.chb.2012.01.002. S2CID 50458504.
[16] Cortes, C. & Pregibon, D. (2001). Signature-Based Methods for Data Streams. Data Mining and Knowledge Discovery 5: 167–182.
[17] Bolton, R. & Hand, D. (2001). Unsupervised Profiling Methods for Fraud Detection. Credit Scoring and Credit Control VII.
[18] Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Kessaci, Yacine; Oblé, Frédéric; Bontempi, Gianluca (16 May 2019). "Combining unsupervised and supervised learning in credit card fraud detection". Information Sciences. 557: 317–331. doi:10.1016/j.ins.2019.05.042. ISSN 0020-0255. S2CID 181839660.
[19] Vacca, John R. (2003). Identity Theft. Prentice Hall Professional. p. 400. ISBN 9780130082756.
[20] Barba, Robert (2017-11-18). "Sharing your location with your bank seems creepy, but it's useful". The Morning Call. Archived from the original on 2018-01-11. Retrieved 2018-01-10.
[21] Le Borgne, Yann-Aël; Bontempi, Gianluca (2021). "Machine Learning for Credit Card Fraud Detection - Practical Handbook". Retrieved 26 April 2021.
[22] "Credit Card Fraud Detection". kaggle.com.
[23] "ULB Machine Learning Group". mlg.ulb.ac.be.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
