In-database processing

Last updated

In-database processing, sometimes referred to as in-database analytics, refers to the integration of data analytics into data warehousing functionality. Today, many large databases, such as those used for credit card fraud detection and investment bank risk management, use this technology because it provides significant performance improvements over traditional methods. [1]

Contents

History

Traditional approaches to data analysis require data to be moved out of the database into a separate analytics environment for processing, and then back to the database. (SPSS from IBM are examples of tools that still do this today). Doing the analysis in the database, where the data resides, eliminates the costs, time and security issues associated with the old approach by doing the processing in the data warehouse itself. [2]

Though in-database capabilities were first commercially offered in the mid-1990s, as object-related database systems from vendors including IBM, Illustra/Informix (now IBM) and Oracle, the technology did not begin to catch on until the mid-2000s. [3] The concept of migrating analytics from the analytical workstation and into the Enterprise Data Warehouse was first introduced by Thomas Tileston in his presentation entitled, “Have Your Cake & Eat It Too! Accelerate Data Mining Combining SAS & Teradata” at the Teradata Partners 2005 "Experience the Possibilities" conference in Orlando, FL, September 18–22, 2005. Mr. Tileston later presented this technique globally in 2006, [4] 2007 [5] [6] [7] and 2008. [8]

At that point, the need for in-database processing had become more pressing as the amount of data available to collect and analyze continues to grow exponentially (due largely to the rise of the Internet), from megabytes to gigabytes, terabytes and petabytes. This “big data” is one of the primary reasons it has become important to collect, process and analyze data efficiently and accurately.

Also, the speed of business has accelerated to the point where a performance gain of nanoseconds can make a difference in some industries. [2] Additionally, as more people and industries use data to answer important questions, the questions they ask become more complex, demanding more sophisticated tools and more precise results.

All of these factors in combination have created the need for in-database processing. The introduction of the column-oriented database, specifically designed for analytics, data warehousing and reporting, has helped make the technology possible.

Types

There are three main types of in-database processing: translating a model into SQL code, loading C or C++ libraries into the database process space as a built-in user-defined function (UDF), and out-of-process libraries typically written in C, C++ or Java and registering them in the database as a built-in UDFs in a SQL statement.

Translating models into SQL code

In this type of in-database processing, a predictive model is converted from its source language into SQL that can run in the database usually in a stored procedure. Many analytic model-building tools have the ability to export their models in either SQL or PMML (Predictive Modeling Markup Language). Once the SQL is loaded into a stored procedure, values can be passed in through parameters and the model is executed natively in the database. Tools that can use this approach include SAS, SPSS, R and KXEN.

Loading C or C++ libraries into the database process space

With C or C++ UDF libraries that run in process, the functions are typically registered as built-in functions within the database server and called like any other built-in function in a SQL statement. Running in process allows the function to have full access to the database server's memory, parallelism and processing management capabilities. Because of this, the functions must be well-behaved so as not to negatively impact the database or the engine. This type of UDF gives the highest performance out of any method for OLAP, mathematical, statistical, univariate distributions and data mining algorithms.

Out-of-process

Out-of-process UDFs are typically written in C, C++ or Java. By running out of process, they do not run the same risk to the database or the engine as they run in their own process space with their own resources. Here, they wouldn't be expected to have the same performance as an in-process UDF. They are still typically registered in the database engine and called through standard SQL, usually in a stored procedure. Out-of-process UDFs are a safe way to extend the capabilities of a database server and are an ideal way to add custom data mining libraries.

Uses

In-database processing makes data analysis more accessible and relevant for high-throughput, real-time applications including fraud detection, credit scoring, risk management, transaction processing, pricing and margin analysis, usage-based micro-segmenting, behavioral ad targeting and recommendation engines, such as those used by customer service organizations to determine next-best actions. [9]

Vendors

In-database processing is performed and promoted as a feature by many of the major data warehousing vendors, including Teradata (and Aster Data Systems, which it acquired), IBM (with its Netezza, PureData Systems, and Db2 Warehouse products), IEMC Greenplum, Sybase, ParAccel, SAS, and EXASOL. Some of the products offered by these vendors, such as CWI's MonetDB or IBM's Db2 Warehouse, offer users the means to write their own functions (UDFs) or extensions (UDXs) to enhance the products' capabilities. [10] Fuzzy Logix offers libraries of in-database models used for mathematical, statistical, data mining, simulation, and classification modelling, as well as financial models for equity, fixed income, interest rate, and portfolio optimization. In-DataBase Pioneers collaborates with marketing and IT teams to institutionalize data mining and analytic processes inside the data warehouse for fast, reliable, and customizable consumer-behavior and predictive analytics.

In-database processing is one of several technologies focused on improving data warehousing performance. Others include parallel computing, shared everything architectures, shared nothing architectures and massive parallel processing. It is an important step towards improving predictive analytics capabilities. [11]

Related Research Articles

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form.

<span class="mw-page-title-main">SPSS</span> Statistical analysis software

SPSS Statistics is a statistical software suite developed by IBM for data management, advanced analytics, multivariate analysis, business intelligence, and criminal investigation. Long produced by SPSS Inc., it was acquired by IBM in 2009. Versions of the software released since 2015 have the brand name IBM SPSS Statistics.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">SAS (software)</span> Statistical software

SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics. SAS' analytical software is built upon artificial intelligence and utilizes machine learning, deep learning and generative AI to manage and model data. The software is widely used in industries such as finance, insurance, health care and education.

The Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.

Business intelligence software is a type of application software designed to retrieve, analyze, transform and report data for business intelligence (BI). The applications generally read data that has been previously stored, often - though not necessarily - in a data warehouse or data mart.

SAP IQ is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.

<span class="mw-page-title-main">World Programming System</span> Data analysis software

The World Programming System, also known as WPS Analytics or WPS, is a software product developed by a company called World Programming.

In computing, the term data warehouse appliance (DWA) was coined by Foster Hinshaw for a computer architecture for data warehouses (DW) specifically marketed for big data analysis and discovery that is simple to use and has a high performance for the workload. A DWA includes an integrated set of servers, storage, operating systems, and databases.

<span class="mw-page-title-main">KXEN Inc.</span>

KXEN was an American software company which existed from 1998 to 2013 when it was acquired by SAP AG.

<span class="mw-page-title-main">Netezza</span> Provider of Integrated Data Warehouse Hardware and Software

IBM Netezza is a subsidiary of American technology company IBM that designs and markets high-performance data warehouse appliances and advanced analytics applications for the most demanding analytic uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning.

<span class="mw-page-title-main">Exasol</span> German database management software company

Exasol is an analytics database management software company. Its product is called Exasol, an in-memory, column-oriented, relational database management system

<span class="mw-page-title-main">Greenplum</span> American data technology company

Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. The technology was created by a company of the same name headquartered in San Mateo, California around 2005. Greenplum was acquired by EMC Corporation in July 2010.

<span class="mw-page-title-main">SPSS Modeler</span> Data analytics software

IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.

The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information.

Revolution Analytics is a statistical software company focused on developing open source and "open-core" versions of the free and open source software R for enterprise, academic and analytics customers. Revolution Analytics was founded in 2007 as REvolution Computing providing support and services for R in a model similar to Red Hat's approach with Linux in the 1990s as well as bolt-on additions for parallel processing. In 2009 the company received nine million in venture capital from Intel along with a private equity firm and named Norman H. Nie as their new CEO. In 2010 the company announced the name change as well as a change in focus. Their core product, Revolution R, would be offered free to academic users and their commercial software would focus on big data, large scale multiprocessor computing, and multi-core functionality.

Fuzzy Logix is a company that develops analytics solutions for operating on large amounts of data.

References

  1. What Is In-Database Processing?, Wise Geek, retrieved May 14, 2012
  2. 1 2 Das, Joydeep (May 10, 2010), Adding Competitive Muscle with In-Database Analytics, Database Trends & Applications
  3. Grimes, Seth (December 15, 2008), In-Database Analytics: A Passing Lane for Complex Analysis, Intelligent Enterprise
  4. "Business Intelligence – Taking the sting out of forecasting | IT World Canada News". 31 October 2006.
  5. http://www2.sas.com/proceedings/forum2007/371-2007.pdf [ bare URL PDF ]
  6. "SAS Global Forum 2007 – SAS-Wiki". Archived from the original on 2014-08-21. Retrieved 2014-08-21.
  7. "Archived copy". Archived from the original on 2014-08-22. Retrieved 2014-08-21.{{cite web}}: CS1 maint: archived copy as title (link)
  8. http://www.teradata.kr/teradatauniverse/PDF/Track_2/2_2_Warner_Home_Thomas_Tileston.pdf [ bare URL PDF ]
  9. Kobelius, James (June 22, 2011), The Power of Predictions: Case Studies in CRM Next Best Action, Forrester, archived from the original on April 13, 2012, retrieved May 15, 2012
  10. "Embedded R in MonetDB". 22 December 2014. Archived from the original on 13 November 2014. Retrieved 22 December 2014.
  11. http://timmanns.blogspot.com/2009/01/isnt-in-database-processing-old-news.html "Isn't In-database processing old news yet?," "Blog by Tim Manns (Data Mining Blog)," January 8, 2009