Oracle Data Mining

Last updated

Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

Contents

Oracle Data Mining
Developer(s) Oracle Corporation
Stable release
11gR2 / September, 2009
Type data mining and analytics
License proprietary
Website Oracle Data Mining

Overview

Oracle Corporation has implemented a variety of data mining algorithms inside its Oracle Database relational database product. These implementations integrate directly with the Oracle database kernel and operate natively on data stored in the relational database tables. This eliminates the need for extraction or transfer of data into standalone mining/analytic servers. The relational database platform is leveraged[ by whom? ] to securely manage models and to efficiently execute SQL queries on large volumes of data. The system is organized around a few generic operations providing a general unified interface for data-mining functions. These operations include functions to create, apply, test, and manipulate data-mining models. Models are created and stored as database objects, and their management is done within the database - similar to tables, views, indexes and other database objects.

In data mining, the process of using a model to derive predictions or descriptions of behavior that is yet to occur is called "scoring". In traditional analytic workbenches, a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data is moved from relational tables into the analytical workbench - most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions to score data stored right in the database. This way, the user/application-developer can leverage the full power of Oracle SQL - in terms of the ability to pipeline and manipulate the results over several levels, and in terms of parallelizing and partitioning data access for performance.

Models can be created and managed by one of several means. Oracle Data Miner provides a graphical user interface that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the CRISP-DM methodology). Application- and tools-developers can embed predictive and descriptive mining capabilities using PL/SQL or Java APIs. Business analysts can quickly experiment with, or demonstrate the power of, predictive analytics using Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated Microsoft Excel adaptor interface. ODM offers a choice of well-known machine learning approaches such as Decision Trees, Naive Bayes, Support vector machines, Generalized linear model (GLM) for predictive mining, Association rules, K-means and Orthogonal Partitioning [1] [2] Clustering, and Non-negative matrix factorization for descriptive mining. A minimum description length based technique to grade the relative importance of input mining attributes for a given problem is also provided. Most Oracle Data Mining functions also allow text mining by accepting text (unstructured data) attributes as input. Users do not need to configure text-mining options - the Database_options database option handles this behind the scenes.

History

Oracle Data Mining was first introduced in 2002 and its releases are named according to the corresponding Oracle database release:

Oracle Data Mining is a logical successor of the Darwin data mining toolset developed by Thinking Machines Corporation in the mid-1990s and later distributed by Oracle after its acquisition of Thinking Machines in 1999. However, the product itself is a complete redesign and rewrite from ground-up - while Darwin was a classic GUI-based analytical workbench, ODM offers a data mining development/deployment platform integrated into the Oracle database, along with the Oracle Data Miner GUI.

The Oracle Data Miner 11gR2 New Workflow GUI was previewed at Oracle Open World 2009. An updated Oracle Data Miner GUI was released in 2012. It is free, and is available as an extension to Oracle SQL Developer 3.1 .

Functionality

As of release 11gR1 Oracle Data Mining contains the following data mining functions:

Input sources and data preparation

Most Oracle Data Mining functions accept as input one relational table or view. Flat data can be combined with transactional data through the use of nested columns, enabling mining of data involving one-to-many relationships (e.g. a star schema). The full functionality of SQL can be used when preparing data for data mining, including dates and spatial data.

Oracle Data Mining distinguishes numerical, categorical, and unstructured (text) attributes. The product also provides utilities for data preparation steps prior to model building such as outlier treatment, discretization, normalization and binning (sorting in general speak)

Graphical user interface: Oracle Data Miner

Users can access Oracle Data Mining through Oracle Data Miner, a GUI client application that provides access to the data mining functions and structured templates (called Mining Activities) that automatically prescribe the order of operations, perform required data transformations, and set model parameters. The user interface also allows the automated generation of Java and/or SQL code associated with the data-mining activities. The Java Code Generator is an extension to Oracle JDeveloper. An independent interface also exists: the Spreadsheet Add-In for Predictive Analytics which enables access to the Oracle Data Mining Predictive Analytics PL/SQL package from Microsoft Excel.

From version 11.2 of the Oracle database, Oracle Data Miner integrates with Oracle SQL Developer. [3]

PL/SQL and Java interfaces

Oracle Data Mining provides a native PL/SQL package (DBMS_DATA_MINING) to create, destroy, describe, apply, test, export and import models. The code below illustrates a typical call to build a classification model:

BEGINDBMS_DATA_MINING.CREATE_MODEL(model_name=>'credit_risk_model',function=>DBMS_DATA_MINING.classification,data_table_name=>'credit_card_data',case_id_column_name=>'customer_id',target_column_name=>'credit_risk',settings_table_name=>'credit_risk_model_settings');END;

where 'credit_risk_model' is the model name, built for the express purpose of classifying future customers' 'credit_risk', based on training data provided in the table 'credit_card_data', each case distinguished by a unique 'customer_id', with the rest of the model parameters specified through the table 'credit_risk_model_settings'.

Oracle Data Mining also supports a Java API consistent with the Java Data Mining (JDM) standard for data mining (JSR-73) for enabling integration with web and Java EE applications and to facilitate portability across platforms.

SQL scoring functions

As of release 10gR2, Oracle Data Mining contains built-in SQL functions for scoring data mining models. These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction. The code below illustrates a typical usage of a classification model:

SELECTcustomer_nameFROMcredit_card_dataWHEREPREDICTION(credit_risk_modelUSING*)='LOW'ANDcustomer_value='HIGH';

PMML

In Release 11gR2 (11.2.0.2), ODM supports the import of externally created PMML for some of the data mining models. PMML is an XML-based standard for representing data mining models.

Predictive analytics Microsoft Excel add-in

The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process including data preprocessing, model building and evaluation, and scoring of new data. The PREDICT operation is used for predicting target values classification or regression while EXPLAIN ranks attributes in order of influence in explaining a target column feature selection. The new 11g feature PROFILE finds customer segments and their profiles, given a target attribute. These operations can be used as part of an operational pipeline providing actionable results or displayed for interpretation by end users.

References and further reading

See also

Related Research Articles

Database Organized collection of data

A database is an organized collection of data, generally stored and accessed electronically from a computer system. Where databases are more complex they are often developed using formal design and modeling techniques.

MySQL SQL database engine software

MySQL is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter, and "SQL", the abbreviation for Structured Query Language. A relational database organizes data into one or more data tables in which data types may be related to each other; these relations help structure the data. SQL is a language programmers use to create, modify and extract data from the relational database, as well as control user access to the database. In addition to relational databases and SQL, an RDBMS like MySQL works with an operating system to implement a relational database in a computer's storage system, manages users, allows for network access and facilitates testing database integrity and creation of backups.

A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970. A software system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems have an option of using the SQL for querying and maintaining the database.

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables.

Object-relational mapping in computer science is a programming technique for converting data between incompatible type systems using object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to construct their own ORM tools.

Object–relational database

An object–relational database (ORD), or object–relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, just as with pure relational systems, it supports extension of the data model with custom data types and methods.

IBM Db2 Family Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. They initially supported the relational model, but were extended to support object-relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

In computing, Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of database systems and operating systems. An application written using ODBC can be ported to other platforms, both on the client and server side, with few changes to the data access code.

ABAP is a high-level programming language created by the German software company SAP SE. It is extracted from the base computing languages Java, C, C++ and Python. It is currently positioned, alongside Java, as the language for programming the SAP NetWeaver Application Server, which is part of the SAP NetWeaver platform for building business applications.

Dataphor is an open-source truly-relational database management system (RDBMS) and its accompanying user interface technologies, which together are designed to provide highly declarative software application development. The Dataphor Server has its own storage engine or it can be a virtual, or federated, DBMS, meaning that it can utilize other database engines for storage.

TerraLib

TerraLib is an open-source GIS software library that extends object-relational DBMS technology to handle spatiotemporal data types. The library supports different DBMS, including MySQL, PostgreSQL, and Oracle. Its vector data model is upwards compliant with OGC standards. TerraLib supports the development of geographical applications using spatial databases.

Weka (machine learning)

Waikato Environment for Knowledge Analysis (Weka), developed at the University of Waikato, New Zealand, is free software licensed under the GNU General Public License, and the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques".

Oracle Spatial and Graph, formerly Oracle Spatial, is a free option component of the Oracle Database. The spatial features in Oracle Spatial and Graph aid users in managing geographic and location-data in a native type within an Oracle database, potentially supporting a wide range of applications — from automated mapping, facilities management, and geographic information systems (AM/FM/GIS), to wireless location services and location-enabled e-business. The graph features in Oracle Spatial and Graph include Oracle Network Data Model (NDM) graphs used in traditional network applications in major transportation, telcos, utilities and energy organizations and RDF semantic graphs used in social networks and social interactions and in linking disparate data sets to address requirements from the research, health sciences, finance, media and intelligence communities.

Database administration is the function of managing and maintaining database management systems (DBMS) software. Mainstream DBMS software such as Oracle, IBM DB2 and Microsoft SQL Server need ongoing management. As such, corporations that use DBMS software often hire specialized information technology personnel called database administrators or DBAs.

World Programming System

The World Programming System, also known as WPS Analytics or WPS, is a software product developed by a company called World Programming.

Exasol

Exasol is an analytics database management software company. Its product is called Exasol, an in-memory, column-oriented, relational database management system

Database Workbench

Database Workbench is a software application for development and administration of multiple relational databases using SQL, with interoperationality between different database systems, developed by Upscene Productions.

The following outline is provided as an overview of and topical guide to MySQL:

Oracle TopLink is a mapping and persistence framework for Java developers. TopLink is produced by Oracle and is a part of Oracle's OracleAS, WebLogic, and OC4J servers. It is an object-persistence and object-transformation framework. TopLink provides development tools and run-time functionalities that ease the development process and help increase functionality. Persistent object-oriented data is stored in relational databases which helps build high-performance applications. Storing data in either XML or relational databases is made possible by transforming it from object-oriented data.

References

  1. 1 2 USpatent 7174344,Campos, Marcos M.&Milenova, Boriana L.,"Orthogonal partitioning clustering",issued 2007-02-06, assigned to Oracle International Corporation
  2. 1 2 Boriana L. Milenova and Marcos M. Campos (2002); O-Cluster: Scalable Clustering of Large High Dimensional Data Sets, ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining, pages 290-297, ISBN   0-7695-1754-4.
  3. "Oracle Data Miner". Oracle technology Network. Oracle Corporation. 2014. Retrieved 2014-07-17. The Oracle Data Miner is an Oracle SQL Developer extension that enables data analysts to work directly with data inside the database, explore the data graphically, build and evaluate multiple data mining models, apply Oracle Data Mining models to new data and deploy Oracle Data Mining's predictions and insights throughout the enterprise. [...] Oracle Data Miner is comprised of three components: Oracle Database 12c or Oracle Database 11g Release 2 SQL Developer (client) which bundles the Oracle Data Miner work flow GUI Data Miner Repository - installed in the Oracle Database