Oracle Data Mining
   HOME

TheInfoList



OR:

Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations,
feature selection In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...
,
anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
,
feature extraction In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.


Overview

Oracle Corporation has implemented a variety of data mining algorithms inside its Oracle Database relational database product. These implementations integrate directly with the Oracle database kernel and operate natively on data stored in the relational database tables. This eliminates the need for extraction or
transfer Transfer may refer to: Arts and media * ''Transfer'' (2010 film), a German science-fiction movie directed by Damir Lukacevic and starring Zana Marjanović * ''Transfer'' (1966 film), a short film * ''Transfer'' (journal), in management studies ...
of data into standalone mining/analytic servers. The relational database platform is leveraged to securely manage models and to efficiently execute SQL queries on large volumes of data. The system is organized around a few generic operations providing a general unified interface for data-mining functions. These operations include functions to create, apply,
test Test(s), testing, or TEST may refer to: * Test (assessment), an educational assessment intended to measure the respondents' knowledge or other abilities Arts and entertainment * ''Test'' (2013 film), an American film * ''Test'' (2014 film), ...
, and
manipulate Manipulation may refer to: * Manipulation (psychology) - the action of manipulating someone in a clever or unscrupulous way *Crowd manipulation - use of crowd psychology to direct the behavior of a crowd toward a specific action ::* Internet mani ...
data-mining models. Models are created and stored as
database object In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...
s, and their management is done within the database - similar to tables, views, indexes and other database objects. In data mining, the process of using a model to derive predictions or descriptions of behavior that is yet to occur is called "scoring". In traditional analytic workbenches, a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data is moved from relational tables into the analytical workbench - most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions to score data stored right in the database. This way, the user/application-developer can leverage the full power of Oracle SQL - in terms of the ability to pipeline and manipulate the results over several levels, and in terms of parallelizing and partitioning data access for performance. Models can be created and managed by one of several means. Oracle Data Miner provides a
graphical user interface The GUI ( "UI" by itself is still usually pronounced . or ), graphical user interface, is a form of user interface that allows users to interact with electronic devices through graphical icons and audio indicator such as primary notation, inst ...
that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the
CRISP-DM Cross-industry standard process for data mining, known as CRISP-DM,Shearer C., ''The CRISP-DM model: the new blueprint for data mining'', J Data Warehousing (2000); 5:13—22. is an open standard process model that describes common approaches use ...
methodology). Application- and tools-developers can embed predictive and descriptive mining capabilities using
PL/SQL PL/SQL (Procedural Language for SQL) is Oracle Corporation's procedural extension for SQL and the Oracle relational database. PL/SQL is available in Oracle Database (since version 6 - stored PL/SQL procedures/functions/packages/triggers since ...
or
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
s. Business analysts can quickly experiment with, or demonstrate the power of,
predictive analytics Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modeling, and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events. In busine ...
using Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated
Microsoft Excel Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS. It features calculation or computation capabilities, graphing tools, pivot tables, and a macro programming language called Visual Basic for App ...
adaptor interface. ODM offers a choice of well-known
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
approaches such as
Decision Trees A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
,
Naive Bayes In statistics, naive Bayes classifiers are a family of simple " probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...
, Support vector machines, Generalized linear model (GLM) for predictive mining,
Association rules Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.P ...
,
K-means ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster with the nearest mean (cluster centers o ...
and Orthogonal Partitioning Boriana L. Milenova and Marcos M. Campos (2002)
''O-Cluster: Scalable Clustering of Large High Dimensional Data Sets''
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining, pages 290-297, .
Clustering, and
Non-negative matrix factorization Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...
for descriptive mining. A
minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occa ...
based technique to grade the relative importance of input mining attributes for a given problem is also provided. Most Oracle Data Mining functions also allow
text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
by accepting text (
unstructured data Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, num ...
) attributes as input. Users do not need to configure text-mining options - the Database_options database option handles this behind the scenes.


History

Oracle Data Mining was first introduced in 2002 and its releases are named according to the corresponding Oracle database release: * Oracle Data Mining 9iR2 (9.2.0.1.0 - May 2002) * Oracle Data Mining 10gR1 (10.1.0.2.0 - February 2004) * Oracle Data Mining 10gR2 (10.2.0.1.0 - July 2005) * Oracle Data Mining 11gR1 (11.1 - September 2007) * Oracle Data Mining 11gR2 (11.2 - September 2009) Oracle Data Mining is a logical successor of the Darwin data mining toolset developed by
Thinking Machines Corporation Thinking Machines Corporation was a supercomputer manufacturer and artificial intelligence (AI) company, founded in Waltham, Massachusetts, in 1983 by Sheryl Handler and W. Daniel "Danny" Hillis to turn Hillis's doctoral work at the Massachusett ...
in the mid-1990s and later distributed by Oracle after its acquisition of Thinking Machines in 1999. However, the product itself is a complete redesign and rewrite from ground-up - while Darwin was a classic GUI-based analytical workbench, ODM offers a data mining development/deployment platform integrated into the Oracle database, along with the Oracle Data Miner GUI. The Oracle Data Miner 11gR2 New Workflow GUI was previewed at Oracle Open World 2009. An updated Oracle Data Miner GUI was released in 2012. It is free, and is available as an extension to Oracle SQL Developer 3.1 .


Functionality

As of release 11gR1 Oracle Data Mining contains the following data mining functions: * Data transformation and model analysis: ** Data sampling, binning, discretization, and other data transformations. ** Model exploration, evaluation and analysis. *
Feature selection In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...
(Attribute Importance). **
Minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occa ...
(MDL). * Classification. **
Naive Bayes In statistics, naive Bayes classifiers are a family of simple " probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...
(NB). ** Generalized linear model (GLM) for
Logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression a ...
. ** Support Vector Machine (SVM). **
Decision Trees A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
(DT). *
Anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
. ** One-class Support Vector Machine (SVM). * Regression ** Support Vector Machine (SVM). ** Generalized linear model (GLM) for
Multiple regression In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
* Clustering: ** Enhanced
k-means ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster with the nearest mean (cluster centers o ...
(EKM). ** Orthogonal Partitioning Clustering (O-Cluster). *
Association rule learning Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.Pi ...
: ** Itemsets and association rules (AM). *
Feature extraction In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
. **
Non-negative matrix factorization Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...
(NMF). *
Text Text may refer to: Written word * Text (literary theory), any object that can be read, including: **Religious text, a writing that a religious tradition considers to be sacred **Text, a verse or passage from scripture used in expository preachin ...
and spatial mining: ** Combined text and non-text columns of input data. ** Spatial/ GIS data.


Input sources and data preparation

Most Oracle Data Mining functions accept as input one relational table or view. Flat data can be combined with transactional data through the use of nested columns, enabling mining of data involving one-to-many relationships (e.g. a
star schema In computing, the star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dim ...
). The full functionality of SQL can be used when preparing data for data mining, including dates and spatial data. Oracle Data Mining distinguishes numerical, categorical, and unstructured (text) attributes. The product also provides utilities for data preparation steps prior to model building such as outlier treatment, discretization, normalization and binning ( sorting in general speak)


Graphical user interface: Oracle Data Miner

Users can access Oracle Data Mining through Oracle Data Miner, a
GUI The GUI ( "UI" by itself is still usually pronounced . or ), graphical user interface, is a form of user interface that allows users to interact with electronic devices through graphical icons and audio indicator such as primary notation, inste ...
client application that provides access to the data mining functions and structured templates (called Mining Activities) that automatically prescribe the order of operations, perform required data transformations, and set model parameters. The user interface also allows the automated generation of
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
and/or SQL code associated with the data-mining activities. The Java Code Generator is an extension to
Oracle JDeveloper JDeveloper is a freeware Integrated development environment, IDE supplied by Oracle Corporation. It offers features for development in Java (programming language), Java, XML, SQL and PL/SQL, HTML, JavaScript, BPEL and PHP. JDeveloper covers the f ...
. An independent interface also exists: the Spreadsheet Add-In for Predictive Analytics which enables access to the Oracle Data Mining Predictive Analytics
PL/SQL PL/SQL (Procedural Language for SQL) is Oracle Corporation's procedural extension for SQL and the Oracle relational database. PL/SQL is available in Oracle Database (since version 6 - stored PL/SQL procedures/functions/packages/triggers since ...
package from
Microsoft Excel Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS. It features calculation or computation capabilities, graphing tools, pivot tables, and a macro programming language called Visual Basic for App ...
. From version 11.2 of the Oracle database, Oracle Data Miner integrates with
Oracle SQL Developer Oracle SQL Developer is an Integrated development environment (IDE) for working with SQL in Oracle databases. Oracle Corporation provides this product free; it uses the Java Development Kit. Features Oracle SQL Developer supports Oracle pr ...
.


PL/SQL and Java interfaces

Oracle Data Mining provides a native
PL/SQL PL/SQL (Procedural Language for SQL) is Oracle Corporation's procedural extension for SQL and the Oracle relational database. PL/SQL is available in Oracle Database (since version 6 - stored PL/SQL procedures/functions/packages/triggers since ...
package (DBMS_DATA_MINING) to create, destroy, describe, apply, test, export and import models. The code below illustrates a typical call to build a classification model: BEGIN DBMS_DATA_MINING.CREATE_MODEL ( model_name => 'credit_risk_model', function => DBMS_DATA_MINING.classification, data_table_name => 'credit_card_data', case_id_column_name => 'customer_id', target_column_name => 'credit_risk', settings_table_name => 'credit_risk_model_settings'); END; where 'credit_risk_model' is the model name, built for the express purpose of classifying future customers' 'credit_risk', based on training data provided in the table 'credit_card_data', each case distinguished by a unique 'customer_id', with the rest of the model parameters specified through the table 'credit_risk_model_settings'. Oracle Data Mining also supports a
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
consistent with the Java Data Mining (JDM) standard for data mining (JSR-73) for enabling integration with web and
Java EE Jakarta EE, formerly Java Platform, Enterprise Edition (Java EE) and Java 2 Platform, Enterprise Edition (J2EE), is a set of specifications, extending Java SE with specifications for enterprise features such as distributed computing and web ser ...
applications and to facilitate portability across platforms.


SQL scoring functions

As of release 10gR2, Oracle Data Mining contains built-in SQL functions for scoring data mining models. These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction. The code below illustrates a typical usage of a classification model: SELECT customer_name FROM credit_card_data WHERE PREDICTION (credit_risk_model USING *) = 'LOW' AND customer_value = 'HIGH';


PMML

In Release 11gR2 (11.2.0.2), ODM supports the import of externally created
PMML The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format conceived by Dr. Robert Lee Grossman, then the director of the National Center for Data Mining at the University of Illinois at Chicago. PMML provi ...
for some of the data mining models.
PMML The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format conceived by Dr. Robert Lee Grossman, then the director of the National Center for Data Mining at the University of Illinois at Chicago. PMML provi ...
is an XML-based standard for representing data mining models.


Predictive analytics Microsoft Excel add-in

The
PL/SQL PL/SQL (Procedural Language for SQL) is Oracle Corporation's procedural extension for SQL and the Oracle relational database. PL/SQL is available in Oracle Database (since version 6 - stored PL/SQL procedures/functions/packages/triggers since ...
package DBMS_PREDICTIVE_ANALYTICS automates the data mining process including
data preprocessing Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to ...
, model building and evaluation, and scoring of new data. The PREDICT operation is used for predicting target values classification or regression while EXPLAIN ranks attributes in order of influence in explaining a target column feature selection. The new 11g feature PROFILE finds customer segments and their profiles, given a target attribute. These operations can be used as part of an operational pipeline providing actionable results or displayed for interpretation by end users.


References and further reading

* T. H. Davenport
Competing on Analytics
Harvard Business Review, January 2006. * I. Ben-Ga
Outlier detection
In: Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers," Kluwer Academic Publishers, 2005, . * M. M. Campos, P. J. Stengard, and B. L. Milenova, Data-centric Automated Data Mining. In proceedings of the ''Fourth International Conference on Machine Learning and Applications 2005'', 15–17 December 2005. pp8, * M. F. Hornick, Erik Marcade, and Sunil Venkayala. Java Data Mining: Strategy, Standard, and Practice. Morgan-Kaufmann, 2006, . * B. L. Milenova, J. S. Yarmus, and M. M. Campos. SVM in Oracle database 10g: removing the barriers to widespread adoption of support vector machines. In Proceedings of the ''31st international Conference on Very Large Data Bases'' (Trondheim, Norway, August 30 - September 2, 2005). pp1152–1163, . * B. L. Milenova and M. M. Campos. O-Cluster: scalable clustering of large high dimensional data sets. In proceedings of the ''2002 IEEE International Conference on Data Mining: ICDM 2002''. pp290–297, . * P. Tamayo, C. Berger, M. M. Campos, J. S. Yarmus, B. L.Milenova, A. Mozes, M. Taft, M. Hornick, R. Krishnan, S.Thomas, M. Kelly, D. Mukhin, R. Haberstroh, S. Stephens and J. Myczkowski. Oracle Data Mining - Data Mining in the Database Environment. In Part VII of ''Data Mining and Knowledge Discovery Handbook'', Maimon, O.; Rokach, L. (Eds.) 2005, p315-1329, . * Brendan Tierney, Predictive Analytics using Oracle Data Miner: for the data scientist, oracle analyst, oracle developer & DBA, Oracle Press, McGraw Hill, Spring 2014.


See also

* Oracle LogMiner - in contrast to generic data mining, targets the extraction of information from the internal logs of an Oracle database


References

{{Reflist


External links


Oracle Data Mining at Oracle Technology Network

Oracle Data Mining Blog



Oracle Data Mining and Analytics Blog

Oracle Wiki for Data Mining

Oracle Data Mining RSS Feed



Oracle Data Mining related blog by Brendan Tierney (Oracle ACE Director)

Oracle Data Mining Examples (on Panoply Technology)
Oracle software Data mining and machine learning software