Predictive Model Markup Language

Predictive Model Markup Language
Developed by	Robert Lee Grossman
Latest release	4.4; November 2019;3 years ago
Type of format	Predictive modelling
Extended from	XML

Last updated June 23, 2023

The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format conceived by Dr. Robert Lee Grossman, then the director of the National Center for Data Mining at the University of Illinois at Chicago. PMML provides a way for analytic applications to describe and exchange predictive models produced by data mining and machine learning algorithms. It supports common models such as logistic regression and other feedforward neural networks. Version 0.9 was published in 1998.^[1] Subsequent versions have been developed by the Data Mining Group.^[2]

PMML Components

A PMML file can be described by the following components:^[4]^[5]

Header : contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model such as name and version. It also contains an attribute for a timestamp which can be used to specify the date of model creation.
Data Dictionary : contains definitions for all the possible fields used by the model. It is here that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as, string or double).
Data Transformations : transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transformations.
- Normalization: map values to numbers, the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions (custom and built-in): derive a value by applying a function to one or more parameters.
- Aggregation: used to summarize or collect groups of values.
Model : contains the definition of the data mining model. E.g., A multi-layered feedforward neural network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:
- Model Name (attribute modelName)
- Function Name (attribute functionName)
- Algorithm Name (attribute algorithmName)
- Activation Function (attribute activationFunction)
- Number of Layers (attribute numberOfLayers)

This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These attributes are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other types of models including support vector machines, association rules, Naive Bayes classifier, clustering models, text models, decision trees, and different regression models.

Mining Schema: a list of all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:
- Name (attribute name): must refer to a field in the data dictionary
- Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier Treatment (attribute outliers): defines the outlier treatment to be use. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
- Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified then a missing value is automatically replaced by the given values.
- Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).
Targets: allows for post-processing of the predicted value in the format of scaling if the output of the model is continuous. Targets can also be used for classification tasks. In this case, the attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result. This can happen, e.g., if an input value is missing and there is no other method for treating missing values.
Output: this element can be used to name all the desired output fields expected from the model. These are features of the predicted field and so are typically the predicted value itself, the probability, cluster affinity (for clustering models), standard error, etc. The latest release of PMML, PMML 4.1, extended Output to allow for generic post-processing of model outputs. In PMML 4.1, all the built-in and custom functions that were originally available only for pre-processing became available for post-processing too.

PMML 4.0, 4.1, 4.2 and 4.3

PMML 4.0 was released on June 16, 2009.^[6]^[7]^[8]

Examples of new features included:

Improved Pre-Processing Capabilities: Additions to built-in functions include a range of Boolean operations and an If-Then-Else function.
Time Series Models: New exponential Smoothing models; also place holders for ARIMA, Seasonal Trend Decomposition, and Spectral density estimation, which are to be supported in the near future.
Model Explanation: Saving of evaluation and model performance measures to the PMML file itself.
Multiple Models: Capabilities for model composition, ensembles, and segmentation (e.g., combining of regression and decision trees).
Extensions of Existing Elements: Addition of multi-class classification for Support Vector Machines, improved representation for Association Rules, and the addition of Cox Regression Models.

PMML 4.1 was released on December 31, 2011.^[9]^[10]

New features included:

New model elements for representing Scorecards, k-Nearest Neighbors (KNN) and Baseline Models.
Simplification of multiple models. In PMML 4.1, the same element is used to represent model segmentation, ensemble, and chaining.
Overall definition of field scope and field names.
A new attribute that identifies for each model element if the model is ready or not for production deployment.
Enhanced post-processing capabilities (via the Output element).

PMML 4.2 was released on February 28, 2014.^[11]^[12]

New features include:

Transformations: New elements for implementing text mining
New built-in functions for implementing regular expressions: matches, concat, and replace
Simplified outputs for post-processing
Enhancements to Scorecard and Naive Bayes model elements

PMML 4.3 was released on August 23, 2016.^[13]^[14]

New features include:

New Model Types:
- Gaussian Process
- Bayesian Network
New built-in functions
Usage clarifications
Documentation improvements

Version 4.4 was released in November 2019.^[15]^[16]

Release history

Version	Release date
Version 0.7	July 1997
Version 0.9	July 1998
Version 1.0	August 1999
Version 1.1	August 2000
Version 2.0	August 2001
Version 2.1	March 2003
Version 3.0	October 2004
Version 3.1	December 2005
Version 3.2	May 2007
Version 4.0	June 2009
Version 4.1	December 2011
Version 4.2	February 2014
Version 4.2.1	March 2015
Version 4.3	August 2016
Version 4.4	November 2019

Data Mining Group

The Data Mining Group is a consortium managed by the Center for Computational Science Research, Inc., a nonprofit founded in 2008.^[17] The Data Mining Group also developed a standard called Portable Format for Analytics, or PFA, which is complementary to PMML.

Related Research Articles

Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labeled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning algorithms is learning a function that maps feature vectors (inputs) to labels (output), based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power. These activities can be viewed as two facets of the same field of application, and they have undergone substantial development over the past few decades.

Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks.

<span class="mw-page-title-main">Time series</span> Sequence of data points over time

In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

Java Data Mining (JDM) is a standard Java API for developing data mining applications and tools. JDM defines an object model and Java API for data mining objects and processes. JDM enables applications to integrate data mining technology for developing predictive analytics applications and tools. The JDM 1.0 standard was developed under the Java Community Process as JSR 73. In 2006, the JDM 2.0 specification was being developed under JSR 247, but has been withdrawn in 2011 without standardization.

A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons ; see § Terminology. Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer.

Neural network software is used to simulate, research, develop, and apply artificial neural networks, software concepts adapted from biological neural networks, and in some cases, a wider array of adaptive systems such as artificial intelligence and machine learning.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of $independent Bernoulli trials, where each trial has probability of success . In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.$

Oracle Data Mining (ODM) is an option of Oracle Database Enterprise Edition. It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

Fraud represents a significant problem for governments and businesses and specialized analysis techniques for discovering fraud using them are required. Some of these methods include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes.

A probabilistic neural network (PNN) is a feedforward neural network, which is widely used in classification and pattern recognition problems. In the PNN algorithm, the parent probability distribution function (PDF) of each class is approximated by a Parzen window and a non-parametric function. Then, using PDF of each class, the class probability of a new input data is estimated and Bayes’ rule is then employed to allocate the class with highest posterior probability to new input data. By this method, the probability of mis-classification is minimized. This type of artificial neural network (ANN) was derived from the Bayesian network and a statistical algorithm called Kernel Fisher discriminant analysis. It was introduced by D.F. Specht in 1966. In a PNN, the operations are organized into a multilayered feedforward network with four layers:

Neural Designer is a software tool for machine learning based on neural networks, a main area of artificial intelligence research, and contains a graphical user interface which simplifies data entry and interpretation of results.

oneAPI Data Analytics Library, is a library of optimized algorithmic building blocks for data analysis stages most commonly associated with solving Big Data problems.

The Portable Format for Analytics (PFA) is a JSON-based predictive model interchange format conceived and developed by Jim Pivarski. PFA provides a way for analytic applications to describe and exchange predictive models produced by analytics and machine learning algorithms. It supports common models such as logistic regression and decision trees. Version 0.8 was published in 2015. Subsequent versions have been developed by the Data Mining Group.

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

<i>Data Science and Predictive Analytics</i>

The first edition of the textbook Data Science and Predictive Analytics: Biomedical and Health Applications using R, authored by Ivo D. Dinov, was published in August 2018 by Springer. The second edition of the book was printed in 2023.

Conformal prediction (CP) is a statistical technique for producing prediction sets without assumptions on the predictive algorithm (often a machine learning system) and only assuming exchangeability of the data. CP works by computing a nonconformity measure, often called a score function, on previously labeled data, and using these to create prediction sets on a new (unlabeled) test data point. A version of CP was first proposed in 1998 by Gammerman, Vovk, and Vapnik, and since, several variants of conformal prediction have been developed with different computational complexities, formal guarantees, and practical applications.

References

↑ "The management and mining of multiple predictive models using the predictive modeling markup language". ResearchGate. doi:10.1016/S0950-5849(99)00022-1 . Retrieved 2015-12-21.
↑ "Data Mining Group" . Retrieved December 14, 2017. The DMG is proud to host the working groups that develop the Predictive Model Markup Language (PMML) and the Portable Format for Analytics (PFA), two complementary standards that simplify the deployment of analytic models.
↑ "PMML Powered". Data Mining Group. Retrieved December 14, 2017.
↑ A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009.
↑ A. Guazzelli, W. Lin, T. Jena (2010). PMML in Action (2nd Edition): Unleashing the Power of Open Standards for Data Mining and Predictive Analytics. CreateSpace.
↑ Data Mining Group website | PMML 4.0 - Changes from PMML 3.2 Archived 2012-07-28 at archive.today
↑ "Zementis website | PMML 4.0 is here!". Archived from the original on 2011-10-03. Retrieved 2009-06-17.
↑ R. Pechter. What's PMML and What's New in PMML 4.0? The ACM SIGKDD Explorations Newsletter, Volume 11/1, July 2009.
↑ Data Mining Group website | PMML 4.1 - Changes from PMML 4.0
↑ Predictive Analytics Info website | PMML 4.1 is here!
↑ Data Mining Group website | PMML 4.2 - Changes from PMML 4.1 Archived 2014-05-20 at archive.today
↑ Predictive Analytics Info website | PMML 4.2 is here!
↑ Data Mining Group website | PMML 4.3 - Changes from PMML 4.2.1
↑ Predictive Model Markup Language product website | Project activity
↑ "The Data Mining Group releases Predictive Model Markup Language v4.4" . Retrieved 12 July 2021.
↑ "PMML 4.4.1 - General Structure". Data Mining Group. Retrieved 12 July 2021.
↑ "2008 EO 990" . Retrieved 16 Oct 2014.

External links

Data Pre-processing in PMML and ADAPA - A Primer
Video of Dr. Alex Guazzelli's PMML presentation for the ACM Data Mining Group (hosted by LinkedIn)
PMML 3.2 Specification
PMML 4.0 Specification
PMML 4.1 Specification
PMML 4.2.1 Specification
PMML 4.4 Specification
Representing predictive solutions in PMML: Move from raw data to predictions - Article published on the IBM developerWorks website.
Predictive analytics in healthcare: The importance of open standards - Article published on the IBM developerWorks website.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "The management and mining of multiple predictive models using the predictive modeling markup language". ResearchGate. doi:10.1016/S0950-5849(99)00022-1 . Retrieved 2015-12-21.

[2] "Data Mining Group" . Retrieved December 14, 2017. The DMG is proud to host the working groups that develop the Predictive Model Markup Language (PMML) and the Portable Format for Analytics (PFA), two complementary standards that simplify the deployment of analytic models.

[3] "PMML Powered". Data Mining Group. Retrieved December 14, 2017.

[4] A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009.

[5] A. Guazzelli, W. Lin, T. Jena (2010). PMML in Action (2nd Edition): Unleashing the Power of Open Standards for Data Mining and Predictive Analytics. CreateSpace.

[6] Data Mining Group website | PMML 4.0 - Changes from PMML 3.2 Archived 2012-07-28 at archive.today

[7] "Zementis website | PMML 4.0 is here!". Archived from the original on 2011-10-03. Retrieved 2009-06-17.

[8] R. Pechter. What's PMML and What's New in PMML 4.0? The ACM SIGKDD Explorations Newsletter, Volume 11/1, July 2009.

[9] Data Mining Group website | PMML 4.1 - Changes from PMML 4.0

[10] Predictive Analytics Info website | PMML 4.1 is here!

[11] Data Mining Group website | PMML 4.2 - Changes from PMML 4.1 Archived 2014-05-20 at archive.today

[12] Predictive Analytics Info website | PMML 4.2 is here!

[13] Data Mining Group website | PMML 4.3 - Changes from PMML 4.2.1

[14] Predictive Model Markup Language product website | Project activity

[15] "The Data Mining Group releases Predictive Model Markup Language v4.4" . Retrieved 12 July 2021.

[16] "PMML 4.4.1 - General Structure". Data Mining Group. Retrieved 12 July 2021.

[17] "2008 EO 990" . Retrieved 16 Oct 2014.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]