SEMMA

SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential steps developed by SAS Institute, one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications. [1] Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is "rather a logical organization of the functional tool set of" one of their products, SAS Enterprise Miner, "for carrying out the core tasks of data mining". [2]

Background

As the field of data mining has expanded, practitioners have called for a standard methodology, or at least a simple list of best practices, that users can apply to the varied and iterative work of data mining projects regardless of industry. The Cross Industry Standard Process for Data Mining (CRISP-DM), launched under the European Strategic Program on Research in Information Technology (ESPRIT) initiative, aimed to create a neutral methodology; SAS also offered a pattern to follow in its data mining tools.

Phases of SEMMA

The phases of SEMMA are, in order, Sample, Explore, Modify, Model, and Assess. [2]
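As a rough illustration of how the five phases can chain together, the following Python sketch runs a toy data set through Sample, Explore, Modify, Model, and Assess in order. The task chosen for each phase (random sampling with a train/validation split, summary statistics, mean imputation, a least-squares line, and mean absolute error) is an assumption made for this example. All function names and the synthetic data are hypothetical, and none of this reproduces SAS Enterprise Miner functionality.

```python
import random
import statistics


def sample(records, fraction, seed=0):
    """Sample: draw a random subset, then split it 70/30 into train and validation."""
    rng = random.Random(seed)
    subset = rng.sample(records, max(2, int(len(records) * fraction)))
    cut = len(subset) * 7 // 10
    return subset[:cut], subset[cut:]


def explore(rows):
    """Explore: summarise the input variable to spot missing values."""
    xs = [r["x"] for r in rows if r["x"] is not None]
    return {"n": len(rows), "missing_x": len(rows) - len(xs), "mean_x": statistics.mean(xs)}


def modify(rows, fill_value):
    """Modify: replace missing x values with a fill value (here, the training mean)."""
    return [{"x": fill_value if r["x"] is None else r["x"], "y": r["y"]} for r in rows]


def model(rows):
    """Model: fit y = a + b * x by ordinary least squares."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def assess(params, rows):
    """Assess: mean absolute error of the fitted line on held-out rows."""
    a, b = params
    return statistics.mean(abs(r["y"] - (a + b * r["x"])) for r in rows)


def run_semma(records):
    """Run the five phases in sequence and return the validation error."""
    train, valid = sample(records, fraction=0.5)
    summary = explore(train)
    train = modify(train, summary["mean_x"])
    valid = modify(valid, summary["mean_x"])
    params = model(train)
    return assess(params, valid)


# Synthetic data (an assumption for this sketch): y = 2x + 1, every tenth x missing.
RECORDS = [{"x": None if i % 10 == 0 else float(i), "y": 2.0 * i + 1.0} for i in range(100)]

if __name__ == "__main__":
    print("validation MAE:", run_semma(RECORDS))
```

In practice the phases are revisited iteratively rather than run once; the single pass here only shows the order in which each phase hands its output to the next.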

Criticism

SEMMA focuses mainly on the modeling tasks of data mining projects and leaves out the business aspects (unlike, e.g., CRISP-DM and its Business Understanding phase). Additionally, SEMMA is designed to help the users of the SAS Enterprise Miner software, so applying it outside Enterprise Miner may be ambiguous. [3] However, effective sampling in the Sample phase of SEMMA itself requires a deep understanding of the business aspects, so business understanding is in effect needed even though SEMMA does not name it as a phase. [4]

Related Research Articles

Data mining: Process of extracting and discovering patterns in large data sets

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Design for Six Sigma (DFSS) is an engineering design process and business process management method related to traditional Six Sigma. It is used in many industries, such as finance, marketing, basic engineering, process industries, waste management, and electronics. It is based on the use of statistical tools like linear regression and enables empirical research similar to that performed in other fields, such as social science. While the tools and order used in Six Sigma require a process to be in place and functioning, DFSS has the objective of determining the needs of customers and the business, and driving those needs into the resulting product solution. It is used for product or process design, in contrast with process improvement. Measurement is the most important part of most Six Sigma or DFSS tools, but whereas in Six Sigma measurements are made from an existing process, DFSS focuses on gaining a deep insight into customer needs and using these to inform every design decision and trade-off.

SAS (software): Statistical software

SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics.

Enterprise integration

Enterprise integration is a technical field of enterprise architecture, which is focused on the study of topics such as system interconnection, electronic data interchange, product data exchange and distributed computing environments.

Object-oriented analysis and design (OOAD) is a technical approach for analyzing and designing an application, system, or business by applying object-oriented programming, as well as using visual modeling throughout the software development process to guide stakeholder communication and product quality.

Data analysis: The process of analyzing data to discover useful information and support decision-making

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

The Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely used analytics model.

Microsoft SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks.

Process mining is a family of techniques relating the fields of data science and process management to support the analysis of operational processes based on event logs. The goal of process mining is to turn event data into insights and actions. Process mining is an integral part of data science, fueled by the availability of event data and the desire to improve processes. Process mining techniques use event data to show what people, machines, and organizations are really doing. Process mining provides novel insights that can be used to identify the execution paths taken by operational processes and address their performance and compliance problems.

Enterprise modelling

Enterprise modelling is the abstract representation, description and definition of the structure, processes, information and resources of an identifiable business, government body, or other large organization.

Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

In information systems, applications architecture or application architecture is one of several architecture domains that form the pillars of an enterprise architecture (EA).

Generalised Enterprise Reference Architecture and Methodology

Generalised Enterprise Reference Architecture and Methodology (GERAM) is a generalised enterprise architecture framework for enterprise integration and business process engineering. It identifies the set of components recommended for use in enterprise engineering.

James G. Nell: American engineer (born 1938)

James G. "Jim" Nell is an American engineer. He was the principal investigator of the Manufacturing Enterprise Integration Project at the National Institute of Standards and Technology (NIST), and is known for his work on enterprise integration.

Business process management (BPM) is the discipline in which people use various methods to discover, model, analyze, measure, improve, optimize, and automate business processes. Any combination of methods used to manage a company's business processes is BPM. Processes can be structured and repeatable or unstructured and variable. Though not required, enabling technologies are often used with BPM.

Data thinking is a product design framework with a particular emphasis on data science. It integrates elements of computational thinking, statistical thinking, and domain thinking. In the context of product development, data thinking is a framework to explore, design, develop and validate data-driven solutions. Data thinking combines data science with design thinking and therefore, the focus of this approach includes user experience as well as data analytics and data collection.

Enterprise Architect (software): Visual modeling and design tool

Sparx Systems Enterprise Architect is a visual modeling and design tool based on the OMG UML. The platform supports: the design and construction of software systems; modeling business processes; and modeling industry based domains. It is used by businesses and organizations to not only model the architecture of their systems, but to process the implementation of these models across the full application development life-cycle.

Audit technology is the use of computer technology to improve an audit. Audit technology is used by accounting firms to improve the efficiency of the external audit procedures they perform.

References

  1. Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182-185. Archived January 9, 2013, at the Wayback Machine.
  2. SAS Enterprise Miner website. Archived March 8, 2012, at the Wayback Machine.
  3. Rohanizadeh, S. S. and Moghadam, M. B. A Proposed Data Mining Methodology and its Application to Industrial Procedures. Journal of Industrial Engineering 4 (2009), pp 37-50.
  4. Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview.