Predictive analytics encompasses a variety of

statistical Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industr ...

techniques from data mining,

predictive modeling Predictive modelling uses statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive mod ...

, and

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

that analyze current and historical facts to make predictions about future or otherwise unknown events. In business, predictive models exploit

patterns A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated li ...

found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding

decision-making In psychology, decision-making (also spelled decision making and decisionmaking) is regarded as the cognitive process resulting in the selection of a belief or a course of action among several possible alternative options. It could be either r ...

for candidate transactions. The defining functional effect of these technical approaches is that predictive analytics provides a predictive score (probability) for each individual (customer, employee, healthcare patient, product SKU, vehicle, component, machine, or other organizational unit) in order to determine, inform, or influence organizational processes that pertain across large numbers of individuals, such as in marketing, credit risk assessment, fraud detection, manufacturing, healthcare, and government operations including law enforcement.

Definition

Predictive analytics is a set of business intelligence (BI) technologies that uncovers relationships and patterns within large volumes of data that can be used to predict behavior and events. Unlike other BI technologies, predictive analytics is forward-looking, using past events to anticipate the future. Predictive analytics statistical techniques include

data modeling Data modeling in software engineering is the process of creating a data model for an information system by applying certain formal techniques. Overview Data modeling is a process used to define and analyze data requirements needed to su ...

, AI, deep learning algorithms and data mining. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown whether it be in the past, present or future. For example, identifying suspects after a crime has been committed, or credit card fraud as it occurs. The core of predictive analytics relies on capturing relationships between

explanatory variables Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or demand ...

and the predicted variables from past occurrences, and exploiting them to predict the unknown outcome. It is important to note, however, that the accuracy and usability of results will depend greatly on the level of data analysis and the quality of assumptions. Predictive analytics is often defined as predicting at a more detailed level of granularity, i.e., generating predictive scores (probabilities) for each individual organizational element. This distinguishes it from

forecasting Forecasting is the process of making predictions based on past and present data. Later these can be compared (resolved) against what happens. For example, a company might estimate their revenue in the next year, then compare it against the actual ...

. For example, "Predictive analytics—Technology that learns from experience (data) to predict the future behavior of individuals in order to drive better decisions." In future industrial systems, the value of predictive analytics will be to predict and prevent potential issues to achieve near-zero break-down and further be integrated into prescriptive analytics for decision optimization.

Big Data

While there is no universal definition of big data, most of them refer to the processing of a large set of data points to get a finished product. When the dataset is too large to be analyzed using traditional analyzation techniques, big data analytics comes into play. However, size is not the only factor that defines big data. Gartner's definition of big data is useful in explaining the defining properties of big data: "Big data is high-volume, high-velocity and/or high variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." These properties are sometimes referred to as the 3 Vs of big data. When we talk about volume of data, think about its size. There is no universal criteria for size that determines whether a dataset is "big" or not, because size is relative. Terabytes of data could be considered big data to one firm while another firm uses a larger unit of storage as criteria for big data such as a petabyte or an exabyte. The velocity of data refers to the speed of data and how much time it takes to create, store, and analyze it. Batch processing was traditionally used to process large blocks of data, but this takes a lot of time and is only useful if decision making can be successful without fast-paced data processing. The markets of the modern day however require real-time processing for powerful and successful decision making in highly versatile and competitive environments. There are also a few different types of data, which is what Gartner means by variety. Data can be structured, semi-structured, or unstructured. "Structured data is data that adheres to a predefined data model and is therefore straightforward to analyze." Structured data generally has rows and columns that can be sorted and searched with basic techniques. Spreadsheets and relational databases are typical examples of structured data. Unstructured data is basically the opposite of structured data in that it doesn't adhere to a predefined data model and doesn't contain columns or rows to help organize the data. This makes unstructured data more difficult to understand than structured data, which can be easily processed using traditional programs like Excel and SQL. Some examples of unstructured data include emails, PDF files, and Google searches. Storing and processing unstructured data has become much easier in recent years due to programs like Power BI and Tableau. "Semi-structured data lies in between structured and unstructured data. It does not adhere to a formal data structure yet does contain tags and other markers to organize the data." The semi-structured category of data is much easier to analyze than unstructured data. Many big data tools can 'read' and process semi-structured forms of data like XML or JSON files. The volume, variety and velocity of big data have introduced challenges across the board for capture, storage, search, sharing, analysis, and visualization. Examples of big data sources include web logs,

RFID Radio-frequency identification (RFID) uses electromagnetic fields to automatically identify and track tags attached to objects. An RFID system consists of a tiny radio transponder, a radio receiver and transmitter. When triggered by an electroma ...

, sensor data,

social networks A social network is a social structure made up of a set of social actors (such as individuals or organizations), sets of dyadic ties, and other social interactions between actors. The social network perspective provides a set of methods for ...

, Internet search indexing, call detail records, military surveillance, and complex data in astronomic, biogeochemical, genomics, and atmospheric sciences. Thanks to technological advances in computer hardware—faster CPUs, cheaper memory, and MPP architectures—and new technologies such as

Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...

MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filteri ...

, and in-database and

text analytics Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extrac ...

for processing big data, it is now feasible to collect, analyze, and mine massive amounts of structured and

unstructured data Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, n ...

for new insights. It is also possible to run predictive algorithms on streaming data. Today, exploring big data and using predictive analytics is within reach of more organizations than ever before and new methods that are capable of handling such datasets are proposed.

Analytical Techniques

The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.

Machine Learning

Machine learning can be defined as the ability of a machine to learn and then mimic human behavior that requires intelligence. This is accomplished through artificial intelligence, algorithms, and models.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA models are a common example of time series models. These models use autoregression, which means the model can be fitted with a regression software that will use machine learning to do most of the regression analysis and smoothing. ARIMA models are known to have no overall trend, but instead have a variation around the average that has a constant amplitude, resulting in statistically similar time patterns. Through this, variables are analyzed and data is filtered in order to better understand and predict future values. One example of an ARIMA method is exponential smoothing models. Exponential smoothing takes into account the difference in importance between older and newer data sets, as the more recent data is more accurate and valuable in predicting future values. In order to accomplish this, exponents are utilized to give newer data sets a larger weight in the calculations than the older sets.

Time Series Models

Time series models are a subset of machine learning that utilize time series in order to understand and forecast data using past values. A time series is the sequence of a variable's value over equally spaced periods, such as years or quarters in business applications. To accomplish this, the data must be smoothed, or the random variance of the data must be removed in order to reveal trends in the data. There are multiple ways to accomplish this.

= Single Moving Average

= Single moving average methods utilize smaller and smaller numbered sets of past data to decrease error that is associated with taking a single average, making it a more accurate average than it would be to take the average of the entire data set.

= Centered Moving Average

= Centered moving average methods utilize the data found in the single moving average methods by taking an average of the median-numbered data set. However, as the median-numbered data set is difficult to calculate with even-numbered data sets, this method works better with odd-numbered data sets than even.

Predictive Modeling

Predictive Modeling is a statistical technique used to predict future behavior. It utilizes predictive models to analyze a relationship between a specific unit in a given sample and one or more features of the unit. The objective of these models is to assess the possibility that a unit in another sample will display the same pattern. Predictive model solutions can be considered a type of data mining technology. The models can analyze both historical and current data and generate a model in order to predict potential future outcomes. Regardless of the methodology used, in general, the process of creating predictive models involves the same steps. First, it is necessary to determine the project objectives and desired outcomes and translate these into predictive analytic objectives and tasks. Then, analyze the source data to determine the most appropriate data and model building approach (models are only as useful as the applicable data used to build them). Select and transform the data in order to create models. Create and test models in order to evaluate if they are valid and will be able to meet project goals and metrics. Apply the model's results to appropriate business processes (identifying patterns in the data doesn't necessarily mean a business will understand how to take advantage or capitalize on it). Afterward, manage and maintain models in order to standardize and improve performance (demand will increase for model management in order to meet new compliance regulations).

Regression Techniques

Generally, regression analysis uses structural data along with the past values of independent variables and the relationship between them and the dependent variable to form predictions.

Linear Regression

In linear regression, a plot is constructed with the previous values of the dependent variable plotted on the Y-axis and the independent variable that is being analyzed plotted on the X-axis. A regression line is then constructed by a statistical program representing the relationship between the independent and dependent variables which can be used to predict values of the dependent variable based only on the independent variable. With the regression line, the program also shows a slope intercept equation for the line which includes an addition for the error term of the regression, where the higher the value of the error term the less precise the regression model is. In order to decrease the value of the error term, other independent variables are introduced to the model, and similar analyses are performed on these independent variables.

Applications

Analytical Review and Conditional Expectations in Auditing

An important aspect of auditing includes analytical review. In analytical review, the reasonableness of reported account balances being investigated is determined. Auditors accomplish this process through predictive modeling to form predictions called conditional expectations of the balances being audited using autoregressive integrated moving average (ARIMA) methods and general regression analysis methods, specifically through the Statistical Technique for Analytical Review (STAR) methods. The ARIMA method for analytical review uses time-series analysis on past audited balances in order to create the conditional expectations. These conditional expectations are then compared to the actual balances reported on the audited account in order to determine how close the reported balances are to the expectations. If the reported balances are close to the expectations, the accounts are not audited further. If the reported balances are very different from the expectations, there is a higher possibility of a material accounting error and a further audit is conducted. Regression analysis methods are deployed in a similar way, except the regression model used assumes the availability of only one independent variable. The materiality of the independent variable contributing to the audited account balances are determined using past account balances along with present structural data. Materiality is the importance of an independent variable in its relationship to the dependent variable. In this case, the dependent variable is the account balance. Through this the most important independent variable is used in order to create the conditional expectation and, similar to the ARIMA method, the conditional expectation is then compared to the account balance reported and a decision is made based on the closeness of the two balances. The STAR methods operate using regression analysis, and fall into two methods. The first is the STAR monthly balance approach, and the conditional expectations made and regression analysis used are both tied to one month being audited. The other method is the STAR annual balance approach, which happens on a larger scale by basing the conditional expectations and regression analysis on one year being audited. Besides the difference in the time being audited, both methods operate the same, by comparing expected and reported balances to determine which accounts to further investigate.

Business Value

As we move into a world of technological advances where more and more data is created and stored digitally, businesses are looking for ways to take advantage of this opportunity and use this information to help generate profits. Predictive analytics can be used and is capable of providing many benefits to a wide range of businesses, including asset management firms, insurance companies, communication companies, and many other firms. In a study conducted by IDC Analyze the Future, Dan Vasset and Henry D. Morris explain how an asset management firm used predictive analytics to develop a better marketing campaign. They went from a mass marketing approach to a customer-centric approach, where instead of sending the same offer to each customer, they would personalize each offer based on their customer. Predictive analytics was used to predict the likelihood that a possible customer would accept a personalized offer. Due to the marketing campaign and predictive analytics, the firm's acceptance rate skyrocketed, with three times the number of people accepting their personalized offers. Technological advances in predictive analytics have increased its value to firms. One technological advancement is more powerful computers, and with this predictive analytics has become able to create forecasts on large data sets much faster. With the increased computing power also comes more data and applications, meaning a wider array of inputs to use with predictive analytics. Another technological advance includes a more user-friendly interface, allowing a smaller barrier of entry and less extensive training required for employees to utilize the software and applications effectively. Due to these advancements, many more corporations are adopting predictive analytics and seeing the benefits in employee efficiency and effectiveness, as well as profits.

Cash-flow Prediction

ARIMA Arima, officially The Royal Chartered Borough of Arima is the easternmost and second largest in area of the three boroughs of Trinidad and Tobago. It is geographically adjacent to Sangre Grande and Arouca at the south central foothills of ...

univariate and multivariate models can be used in forecasting a company's future

cash flow A cash flow is a real or virtual movement of money: *a cash flow in its narrow sense is a payment (in a currency), especially from one central bank account to another; the term 'cash flow' is mostly used to describe payments that are expected ...

s, with its equations and calculations based on the past values of certain factors contributing to cash flows. Using time-series analysis, the values of these factors can be analyzed and extrapolated to predict the future cash flows for a company. For the univariate models, past values of cash flows are the only factor used in the prediction. Meanwhile the multivariate models use multiple factors related to accrual data, such as operating income before depreciation. Another model used in predicting cash-flows was developed in 1998 and is known as the Dechow, Kothari, and Watts model, or DKW (1998). DKW (1998) uses regression analysis in order to determine the relationship between multiple variables and cash flows. Through this method, the model found that cash-flow changes and accruals are negatively related, specifically through current earnings, and using this relationship predicts the cash flows for the next period. The DKW (1998) model derives this relationship through the relationships of accruals and cash flows to accounts payable and receivable, along with inventory.

Child protection

Some child welfare agencies have started using predictive analytics to flag high risk cases. For example, in

Hillsborough County, Florida Hillsborough County is located in the west central portion of the U.S. state of Florida. In the 2020 census, the population was 1,459,762, making it the fourth-most populous county in Florida and the most populous county outside the Miami metrop ...

, the child welfare agency's use of a predictive modeling tool has prevented abuse-related child deaths in the target population.

Clinical decision support systems

Predictive analysis have found use in health care primarily to determine which patients are at risk of developing conditions such as diabetes, asthma, or heart disease. Additionally, sophisticated clinical decision support systems incorporate predictive analytics to support medical decision making. A 2016 study of

neurodegenerative disorders A neurodegenerative disease is caused by the progressive loss of structure or function of neurons, in the process known as neurodegeneration. Such neuronal damage may ultimately involve cell death. Neurodegenerative diseases include amyotrophic ...

provides a powerful example of a CDS platform to diagnose, track, predict and monitor the progression of

Parkinson's disease Parkinson's disease (PD), or simply Parkinson's, is a long-term degenerative disorder of the central nervous system that mainly affects the motor system. The symptoms usually emerge slowly, and as the disease worsens, non-motor symptoms becom ...

Predicting outcomes of legal decisions

The predicting of the outcome of juridical decisions can be done by AI programs. These programs can be used as assistive tools for professions in this industry.

Portfolio, product or economy-level prediction

Often the focus of analysis is not the consumer but the product, portfolio, firm, industry or even the economy. For example, a retailer might be interested in predicting store-level demand for inventory management purposes. Or the Federal Reserve Board might be interested in predicting the unemployment rate for the next year. These types of problems can be addressed by predictive analytics using time series techniques (see below). They can also be addressed via machine learning approaches which transform the original time series into a feature vector space, where the learning algorithm finds patterns that have predictive power.

Underwriting

Many businesses have to account for risk exposure due to their different services and determine the costs needed to cover the risk. Predictive analytics can help

underwrite Underwriting (UW) services are provided by some large financial institutions, such as banks, insurance companies and investment houses, whereby they guarantee payment in case of damage or financial loss and accept the financial risk for liabili ...

these quantities by predicting the chances of illness, default, bankruptcy, etc. Predictive analytics can streamline the process of customer acquisition by predicting the future risk behavior of a customer using application level data. Predictive analytics in the form of credit scores have reduced the amount of time it takes for loan approvals, especially in the mortgage market. Proper predictive analytics can lead to proper pricing decisions, which can help mitigate future risk of default. Predictive analytics can be used to mitigate moral hazard and prevent accidents from occurring.