Cross-industry standard process for data mining

The Cross-industry standard process for data mining, known as CRISP-DM, [1] is an open standard process model that describes common approaches used by data mining experts. It is the most widely used analytics model. [2]

In 2015, IBM released a new methodology called Analytics Solutions Unified Method for Data Mining/Predictive Analytics [3] [4] (also known as ASUM-DM), which refines and extends CRISP-DM.

History

CRISP-DM was conceived in 1996 and became a European Union project under the ESPRIT funding initiative in 1997. The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation, and OHRA, an insurance company.

This core consortium brought different experiences to the project. ISL was later acquired and merged into SPSS. The computer giant NCR Corporation produced the Teradata data warehouse and its own data mining software. Daimler-Benz had a significant data mining team. OHRA was starting to explore the potential use of data mining.

The first version of the methodology was presented at the 4th CRISP-DM SIG Workshop in Brussels in March 1999, [5] and published as a step-by-step data mining guide later that year. [6]

Between 2006 and 2008, a CRISP-DM 2.0 SIG was formed, and there were discussions about updating the CRISP-DM process model. [7] The current status of these efforts is not known. However, the original crisp-dm.org website cited in the reviews [8] [9] and the CRISP-DM 2.0 SIG website are both no longer active. [7]

While many non-IBM data mining practitioners use CRISP-DM, [10] [11] [12] IBM is the primary corporation that currently uses the CRISP-DM process model. IBM makes some of the old CRISP-DM documents available for download and has incorporated the methodology into its SPSS Modeler product. [6]

Based on current research, CRISP-DM is the most widely used data mining process model because of its various advantages, which addressed existing problems in the data mining industry. One drawback of the model is that it does not cover project management activities. The success of CRISP-DM is largely attributable to the fact that it is industry, tool, and application neutral. [13]

Major phases

Figure: Process diagram showing the relationship between the different phases of CRISP-DM.

CRISP-DM breaks the process of data mining into six major phases: [14]

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

The sequence of the phases is not strict and moving back and forth between different phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.
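As an illustration only (CRISP-DM prescribes a process, not code), the phase order and its iterative nature can be sketched in Python. The phase names follow the standard; the run_phase callable and the back-transition rule are hypothetical simplifications for this sketch:

    from enum import Enum

    class Phase(Enum):
        """The six CRISP-DM phases, in their nominal order."""
        BUSINESS_UNDERSTANDING = 1
        DATA_UNDERSTANDING = 2
        DATA_PREPARATION = 3
        MODELING = 4
        EVALUATION = 5
        DEPLOYMENT = 6

    def run_project(run_phase):
        """Walk the phases in order, allowing any phase to send the
        project back to an earlier one (e.g. Evaluation back to
        Business Understanding), mirroring the arrows in the diagram.
        run_phase is a user-supplied callable that performs the work
        of a phase and returns the Phase to revisit, or None to move
        forward."""
        phases = list(Phase)
        i = 0
        while i < len(phases):
            back_to = run_phase(phases[i])
            i = phases.index(back_to) if back_to is not None else i + 1
        # The outer circle of the diagram: lessons learned in
        # deployment seed the next, more focused cycle.

A single pass of run_project corresponds to one loop around the diagram; the cyclic nature of data mining is then captured by starting a new project with the questions raised by the previous one.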

Polls and Alternative Process Frameworks

Polls conducted on the website KDnuggets in 2002, 2004, 2007, and 2014 show that CRISP-DM was the leading methodology used by the industry data miners who chose to respond to the survey. [10] [11] [12] [15] The only other data mining approach named in these polls was SEMMA. However, SAS Institute clearly states that SEMMA is not a data mining methodology, but rather a "logical organization of the functional toolset of SAS Enterprise Miner." A 2009 review and critique of data mining process models called CRISP-DM the "de facto standard for developing data mining and knowledge discovery projects." [16] Other reviews of CRISP-DM and data mining process models include Kurgan and Musilek's 2006 review [8] and Azevedo and Santos' 2008 comparison of CRISP-DM and SEMMA. [9] Efforts to update the methodology started in 2006 but, as of June 2015, had not led to a new version, and the "Special Interest Group" (SIG) responsible, along with its website, has long since disappeared (see History above).

In 2024, Harvard Business Review published an updated framework, bizML, designed to be more relevant to business personnel and specific to machine learning projects in particular, rather than to analytics, data science, or data mining projects in general. [17]

Related Research Articles

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

SPSS Statistics is a statistical software suite developed by IBM for data management, advanced analytics, multivariate analysis, business intelligence, and criminal investigation. Long produced by SPSS Inc., it was acquired by IBM in 2009. Versions of the software released since 2015 have the brand name IBM SPSS Statistics.

Design for Six Sigma (DFSS) is a collection of best practices for the development of new products and processes. It is sometimes deployed as an engineering design process or business process management method. DFSS originated at General Electric to build on the success they had with traditional Six Sigma, but instead of process improvement, DFSS was made to target new product development. It is used in many industries, such as finance, marketing, basic engineering, process industries, waste management, and electronics. It is based on the use of statistical tools like linear regression and enables empirical research similar to that performed in other fields, such as social science. While the tools and order used in Six Sigma require a process to be in place and functioning, DFSS has the objective of determining the needs of customers and the business, and driving those needs into the product solution so created. It is used for product or process design, in contrast with process improvement. Measurement is the most important part of most Six Sigma or DFSS tools, but whereas in Six Sigma measurements are made from an existing process, DFSS focuses on gaining a deep insight into customer needs and using these to inform every design decision and trade-off.

A Chief Data Officer (CDO) is a corporate officer responsible for enterprise-wide governance and utilization of information as an asset, via data processing, analysis, data mining, information trading, and other means. CDOs usually report to the chief executive officer (CEO), although this can vary depending on the area of expertise. The CDO is a member of the executive management team and manager of enterprise-wide data processing and data mining.

Waikato Environment for Knowledge Analysis (Weka) is a collection of free machine learning and data analysis software licensed under the GNU General Public License. It was developed at the University of Waikato, New Zealand, and is the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques".

RapidMiner is a data science platform that provides an integrated environment for data preparation, machine learning, and predictive analytics. It was acquired by Altair Engineering in September 2022.

SIGKDD, representing the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining, hosts an influential annual conference.

IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.

In information science, profiling refers to the process of construction and application of user profiles generated by computerized data analysis.

In-database processing, sometimes referred to as in-database analytics, refers to the integration of data analytics into data warehousing functionality. Today, many large databases, such as those used for credit card fraud detection and investment bank risk management, use this technology because it provides significant performance improvements over traditional methods.

Rexer Analytics' Annual Data Miner Survey is the largest survey of data mining, data science, and analytics professionals in the industry. It consists of approximately 50 multiple-choice and open-ended questions that cover seven general areas of data mining science and practice: (1) Field and goals, (2) Algorithms, (3) Models, (4) Tools, (5) Technology, (6) Challenges, and (7) Future. It is conducted as a service to the data mining community, and the results are usually announced at the PAW conferences and shared via freely available summary reports. In the 2013 survey, 1,259 data miners from 75 countries participated. After 2011, Rexer Analytics moved to a biennial schedule.

SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential steps developed by SAS Institute, one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is "rather a logical organization of the functional tool set of" one of their products, SAS Enterprise Miner, "for carrying out the core tasks of data mining".

Usama M. Fayyad is an American-Jordanian data scientist and co-founder of the KDD conferences and the ACM SIGKDD association for Knowledge Discovery and Data Mining. He is a speaker on Business Analytics, Data Mining, Data Science, and Big Data. He previously served as the Chief Data Officer at Barclays Bank.

Eureqa was a proprietary modeling engine created in Cornell's Artificial Intelligence Lab and later commercialized by Nutonian, Inc. The software used genetic algorithms to determine mathematical equations that describe sets of data in their simplest form, a technique referred to as symbolic regression.

International School of Engineering (INSOFE) is an applied engineering school focused on data science and big data analytics. It is located in Hyderabad, Telangana; Bengaluru, Karnataka; and Mumbai, Maharashtra, in India. It opened in 2011. The program is delivered through classroom-only sessions and is suitable for students and working professionals.

In business analysis, the Decision Model and Notation (DMN) is a standard published by the Object Management Group. It is a standard approach for describing and modeling repeatable decisions within organizations to ensure that decision models are interchangeable across organizations.

Gregory I. Piatetsky-Shapiro is a data scientist and the co-founder of the KDD conferences, and co-founder and past chair of the Association for Computing Machinery SIGKDD group for Knowledge Discovery, Data Mining and Data Science. He is the founder and president of KDnuggets, a discussion and learning website for Business Analytics, Data Mining and Data Science.

Industrial artificial intelligence, or industrial AI, usually refers to the application of artificial intelligence to industry and business. Unlike general artificial intelligence, which is a frontier research discipline to build computerized systems that perform tasks requiring human intelligence, industrial AI is more concerned with the application of such technologies to address industrial pain points for customer value creation, productivity improvement, cost reduction, site optimization, predictive analysis, and insight discovery.

References

  1. Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13–22.
  2. What IT Needs To Know About The Data Mining Process, published by Forbes, 29 July 2015; retrieved June 24, 2018.
  3. Have you seen ASUM-DM?, by Jason Haffar, 16 October 2015, SPSS Predictive Analytics, IBM. Archived 8 March 2016 at the Wayback Machine.
  4. Analytics Solutions Unified Method - Implementations with Agile principles, published by IBM, 1 March 2016; retrieved October 5, 2018.
  5. Pete Chapman (1999); The CRISP-DM User Guide.
  6. Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth (2000); The CRISP-DM User Guide (entry on Semantic Scholar, including links to PDFs; PDF version with high-resolution graphics archived 12 September 2020 at the Wayback Machine).
  7. Colin Shearer (2006); First CRISP-DM 2.0 Workshop Held.
  8. Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review, 21 (1), March 2006, pp 1–24. Cambridge University Press, New York, NY, USA. doi:10.1017/S0269888906000737.
  9. Azevedo, A. and Santos, M. F. (2008); KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.
  10. Gregory Piatetsky-Shapiro (2002); KDnuggets Methodology Poll.
  11. Gregory Piatetsky-Shapiro (2004); KDnuggets Methodology Poll.
  12. Gregory Piatetsky-Shapiro (2007); KDnuggets Methodology Poll.
  13. Mariscal, G., Marban, O., and Fernandez, C. (2010); "A Survey of Data Mining and knowledge discovery process Models and methodologies". The Knowledge Engineering Review, 25 (2): 137–166. doi:10.1017/S0269888910000032. S2CID 31359633.
  14. Harper, Gavin and Stephen D. Pickett (August 2006); "Methods for mining HTS data". Drug Discovery Today, 11 (15–16): 694–699. doi:10.1016/j.drudis.2006.06.006. PMID 16846796.
  15. Gregory Piatetsky-Shapiro (2014); KDnuggets Methodology Poll.
  16. Martínez-Plumed, Fernando; Contreras-Ochando, Lidia; Ferri, Cèsar; Flach, Peter; Hernández-Orallo, José; Kull, Meelis; Lachiche, Nicolas; Ramírez-Quintana, María José (19 September 2017); "CASP-DM: Context Aware Standard Process for Data Mining". arXiv:1709.09003 [cs.DB].
  17. Eric Siegel (2024); Getting Machine Learning Projects from Idea to Execution. Harvard Business Review.