Data analysis techniques for fraud detection

Fraud represents a significant problem for governments and businesses, and specialized analysis techniques for discovering it are required. These include knowledge discovery in databases (KDD), data mining, machine learning, and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes. In general, the primary reason to use data analytics techniques is to tackle fraud, since many internal control systems have serious weaknesses. For example, the approach currently prevailing among many law enforcement agencies for detecting companies involved in potential cases of fraud consists of receiving circumstantial evidence or complaints from whistleblowers. As a result, a large number of fraud cases remain undetected and unprosecuted. In order to effectively test, detect, validate, correct errors in, and monitor control systems against fraudulent activities, businesses and organizations rely on specialized data analytics techniques such as data mining, data matching, the ''sounds like'' function, regression analysis, clustering analysis, and gap analysis. Techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence.


Statistical techniques

Examples of statistical data analysis techniques are:

* Data preprocessing techniques for detection, validation, error correction, and filling in of missing or incorrect data.
* Calculation of various statistical parameters such as averages, quantiles, performance metrics, and probability distributions. For example, the averages may include the average length of call, the average number of calls per month, and the average delay in bill payment.
* Models and probability distributions of various business activities, either in terms of various parameters or probability distributions.
* Computing user profiles.
* Time-series analysis of time-dependent data.
* Clustering and classification to find patterns and associations among groups of data.
* Data matching, which compares two sets of collected data. The process can be performed with algorithms or programmed loops, matching sets of data against each other or comparing complex data types. Data matching is used to remove duplicate records and to identify links between two data sets for marketing, security, or other uses.
* The ''sounds like'' function, used to find values that sound similar. Phonetic similarity is one way to locate possible duplicate values or inconsistent spelling in manually entered data. The ''sounds like'' function converts the comparison strings to four-character American Soundex codes, which are based on the first letter and the first three consonants after the first letter in each string.
* Regression analysis, which estimates relationships between independent variables and a dependent variable. This method can be used to understand and identify relationships among variables and to predict actual results.
* Gap analysis, used to determine whether business requirements are being met and, if not, what steps should be taken to meet them.
* Matching algorithms to detect anomalies in the behavior of transactions or users as compared to previously known models and profiles.

Techniques are also needed to eliminate false alarms, estimate risks, and predict the future of current transactions or users.

Some forensic accountants specialize in forensic analytics, the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are data collection, data preparation, data analysis, and reporting. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted or divertible for personal use.
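As a minimal sketch of the ''statistical parameters'' and ''matching algorithms'' items above, the following compares a new observation against a user's own historical profile via a standardised distance; the call-length figures are invented for illustration, not real data.

```python
import statistics

# Hypothetical historical profile for one subscriber: call lengths in minutes.
history = [3.1, 2.8, 4.0, 3.5, 2.9, 3.3, 3.8, 3.0]
mu, sd = statistics.fmean(history), statistics.stdev(history)

def suspicion_score(observation):
    """Standardised distance from the user's own profile; large values
    flag behaviour that departs from the previously known model."""
    return abs(observation - mu) / sd

print(round(suspicion_score(3.2), 2))   # an ordinary call, score well below 1
print(round(suspicion_score(45.0), 2))  # far outside the profile
```

In a real system the profile parameters would be recomputed periodically, and the flagging threshold tuned to balance detection against false alarms.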
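The ''sounds like'' function described in the list above can be sketched in a few lines. This is a standard American Soundex encoding (first letter plus three digits); matching codes point to possible duplicates or inconsistent spellings. The sample names are illustrative.

```python
def soundex(name):
    """American Soundex: first letter plus three digits coding the
    remaining consonants, so similar-sounding names share a code."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip vowels and adjacent repeats
            result += code
        if ch not in "HW":          # H and W do not separate duplicate codes
            prev = code
    return (result + "000")[:4]     # pad or truncate to four characters

# Duplicate-looking names collapse to the same code:
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```

In a fraud review this would be run over, for example, vendor or payee name columns, grouping rows by code to surface manually entered near-duplicates.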


Artificial intelligence

Fraud detection is a knowledge-intensive activity. The main AI techniques used for fraud detection include:

* Data mining to classify, cluster, and segment the data and automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud.
* Expert systems to encode expertise for detecting fraud in the form of rules.
* Pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior, either automatically (unsupervised) or by matching given inputs.
* Machine learning techniques to automatically identify characteristics of fraud.
* Neural nets to independently generate classification, clustering, generalization, and forecasting that can then be compared against conclusions raised in internal audits or formal financial documents such as 10-Q filings.

Other techniques such as link analysis, Bayesian networks, decision theory, and sequence matching are also used for fraud detection. A newer technique, the system properties approach, has also been employed wherever rank data is available.

Statistical analysis of research data is the most comprehensive method for determining whether data fraud exists. Data fraud, as defined by the Office of Research Integrity (ORI), includes fabrication, falsification, and plagiarism.
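An expert system of the kind listed above encodes human expertise as explicit rules. The following is a toy sketch, assuming invented rules, weights, and field names; a production rule base would be far larger and tuned by domain experts.

```python
# Each rule is (name, predicate over a transaction dict, weight in points).
# The rules and weights here are illustrative assumptions, not a real rule set.
RULES = [
    ("round amount",    lambda t: t["amount"] % 100 == 0,              2),
    ("foreign + large", lambda t: t["foreign"] and t["amount"] > 5000, 5),
    ("night-time",      lambda t: t["hour"] < 5,                       3),
]

def suspicion(transaction):
    """Fire every matching rule; return the total score and the reasons."""
    fired = [(name, w) for name, rule, w in RULES if rule(transaction)]
    return sum(w for _, w in fired), [name for name, _ in fired]

score, reasons = suspicion({"amount": 8000, "foreign": True, "hour": 3})
print(score, reasons)  # 10 ['round amount', 'foreign + large', 'night-time']
```

Because each fired rule is named, an analyst reviewing a flagged transaction can see exactly why it scored highly, which is a key advantage of expert systems over opaque models.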


Machine learning and data mining

Early data analysis techniques were oriented toward extracting quantitative and statistical data characteristics. These techniques facilitate useful data interpretations and can help to provide better insights into the processes behind the data. Although traditional data analysis techniques can indirectly lead us to knowledge, that knowledge is still created by human analysts. To go beyond, a data analysis system has to be equipped with a substantial amount of background knowledge and be able to perform reasoning tasks involving that knowledge and the data provided. In an effort to meet this goal, researchers have turned to ideas from the machine learning field, a natural source, since the machine learning task can be described as turning background knowledge and examples (input) into knowledge (output). If data mining results in discovering meaningful patterns, data turns into information. Information or patterns that are novel, valid, and potentially useful are not merely information, but knowledge. One speaks of discovering knowledge that was previously hidden in the huge amount of data but is now revealed.

Machine learning and artificial intelligence solutions may be classified into two categories: 'supervised' and 'unsupervised' learning. These methods search for accounts, customers, suppliers, etc. that behave 'unusually' in order to output suspicion scores, rules, or visual anomalies, depending on the method. Whether supervised or unsupervised methods are used, note that the output gives only an indication of fraud likelihood: no stand-alone statistical analysis can assure that a particular object is fraudulent, though it can identify likely fraud with a very high degree of accuracy. As a result, effective collaboration between machine learning models and human analysts is vital to the success of fraud detection applications.


Supervised learning

In supervised learning, a random sub-sample of all records is taken and manually classified as either 'fraudulent' or 'non-fraudulent' (the task can be decomposed into more classes to meet algorithm requirements). Relatively rare events such as fraud may need to be oversampled to obtain a large enough sample size. These manually classified records are then used to train a supervised machine learning algorithm. After building a model on this training data, the algorithm should be able to classify new records as either fraudulent or non-fraudulent.

Supervised neural networks, fuzzy neural nets, and combinations of neural nets and rules have been extensively explored and used for detecting fraud in mobile phone networks and financial statement fraud. Bayesian learning neural networks have been implemented for credit card fraud detection, telecommunications fraud, auto claim fraud detection, and medical insurance fraud. Hybrid knowledge/statistical-based systems, where expert knowledge is integrated with statistical power, use a series of data mining techniques to detect cellular clone fraud; specifically, a rule-learning program is implemented to uncover indicators of fraudulent behaviour from a large database of customer transactions. Cahill et al. (2000) design a fraud signature, based on data of fraudulent calls, to detect telecommunications fraud. To score a call for fraud, its probability under the account signature is compared to its probability under a fraud signature. The fraud signature is updated sequentially, enabling event-driven fraud detection.

Link analysis takes a different approach: it relates known fraudsters to other individuals using record linkage and social network methods. This type of detection is only able to detect frauds similar to those which have occurred previously and been classified by a human. Detecting a novel type of fraud may require the use of an unsupervised machine learning algorithm.
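The supervised workflow above (label a sample, oversample the rare fraud class, train, classify new records) can be sketched with a deliberately simple nearest-centroid classifier. The transaction features and values are invented; a real system would use a proper learning algorithm on far richer data.

```python
import random
import statistics

# Toy manually classified records: (amount, calls_per_day). Hypothetical data.
legit = [(12.0, 3), (15.5, 4), (9.9, 2), (14.0, 3), (11.2, 5)]
fraud = [(480.0, 40), (520.0, 55)]

# Fraud is rare, so oversample it to balance the training set.
fraud_oversampled = random.choices(fraud, k=len(legit))

def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return tuple(statistics.fmean(c) for c in zip(*points))

legit_c = centroid(legit)
fraud_c = centroid(fraud_oversampled)

def classify(x):
    """Nearest-centroid rule: label a record by the closer class centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return "fraudulent" if dist2(fraud_c) < dist2(legit_c) else "non-fraudulent"

print(classify((500.0, 50)))  # near the fraud centroid
print(classify((13.0, 3)))    # near the legitimate centroid
```

The oversampling step matters: without it, a classifier trained on a realistic class ratio can achieve high accuracy by labelling everything non-fraudulent.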


Unsupervised learning

In contrast, unsupervised methods do not make use of labelled records. Bolton and Hand apply two such tools to spending behaviour in credit card accounts: Peer Group Analysis and Break Point Analysis. Peer Group Analysis detects individual objects that begin to behave in a way different from objects to which they had previously been similar. Break Point Analysis, unlike Peer Group Analysis, operates at the account level: a break point is an observation where anomalous behaviour is detected for a particular account. A combination of unsupervised and supervised methods for credit card fraud detection is presented in Carcillo et al. (2019).
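A toy version of Break Point Analysis can be written by comparing each observation to the account's own recent history; the window size, threshold, and spending figures below are illustrative assumptions, not Bolton and Hand's actual method parameters.

```python
import statistics

def break_points(series, window=5, threshold=3.0):
    """Flag indices where an observation deviates sharply from the
    account's own recent history (a toy Break Point Analysis)."""
    flags = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu = statistics.fmean(recent)
        sd = statistics.stdev(recent)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            flags.append(i)
    return flags

# Weekly card spend for one account; the sudden jump is the break point.
spend = [42, 38, 45, 40, 44, 41, 39, 43, 650, 42]
print(break_points(spend))  # [8]
```

No labels are needed: the account is compared only to itself, which is what lets unsupervised methods surface novel fraud patterns.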


Available datasets

A major limitation for the validation of existing fraud detection methods is the lack of public datasets. One of the few examples is the Credit Card Fraud Detection dataset made available by the ULB Machine Learning Group.


See also

* Fraud
* Fraud deterrence
* Profiling (information science)
* Data mining
* Geolocation software
* Neural networks
* Artificial intelligence
* Patterns
* Data clustering
* Statistics
* Labelling
* Decision tree learning
* Regression analysis
* Synthetic data
* Benford's law
* Beneish M-score


References

* Bolton, R. & Hand, D. (2002). Statistical Fraud Detection: A Review (With Discussion). Statistical Science 17(3): 235–255.
* Bolton, R. & Hand, D. (2001). Unsupervised Profiling Methods for Fraud Detection. Credit Scoring and Credit Control VII.
* Palshikar, G. K. (2002). The Hidden Truth – Frauds and Their Control: A Critical Application for Business Intelligence. Intelligent Enterprise 5(9), 28 May 2002, pp. 46–51.
* Michalski, R. S., Bratko, I. & Kubat, M. (1998). Machine Learning and Data Mining – Methods and Applications. John Wiley & Sons Ltd.
* Phua, C., Lee, V., Smith-Miles, K. & Gayler, R. (2005). A Comprehensive Survey of Data Mining-based Fraud Detection Research. arXiv:1009.6119. doi:10.1016/j.chb.2012.01.002.
* Green, B. & Choi, J. (1997). Assessing the Risk of Management Fraud through Neural Network Technology. Auditing 16(1): 14–28.
* Tax, N., de Vries, K. J., de Jong, M., Dosoula, N., van den Akker, B., Smith, J., Thuong, O. & Bernardi, L. (2021). Machine Learning for Fraud Detection in E-Commerce: A Research Agenda. Proceedings of the KDD International Workshop on Deployable Machine Learning for Security Defense (MLHat). Springer, Cham.
* Estevez, P., Held, C. & Perez, C. (2006). Subscription Fraud Prevention in Telecommunications Using Fuzzy Rules and Neural Networks. Expert Systems with Applications 31: 337–344.
* Fawcett, T. (1997). AI Approaches to Fraud Detection and Risk Management. Papers from the 1997 AAAI Workshop. Technical Report WS-97-07. AAAI Press.
* Cortes, C. & Pregibon, D. (2001). Signature-Based Methods for Data Streams. Data Mining and Knowledge Discovery 5: 167–182.
* Dal Pozzolo, A., Caelen, O., Le Borgne, Y., Waterschoot, S. & Bontempi, G. (2014). Learned Lessons in Credit Card Fraud Detection from a Practitioner Perspective. Expert Systems with Applications 41(10): 4915–4928.
* Al-Khatib, Adnan M. (2012). Electronic Payment Fraud Detection Techniques. World of Computer Science and Information Technology Journal 2.