Business process conformance checking (conformance checking for short) is a family of process mining techniques for comparing a process model with an event log of the same process. [1] It is used to check whether the actual execution of a business process, as recorded in the event log, conforms to the model and vice versa.
For instance, there may be a process model indicating that purchase orders of more than one million euros require two checks. Analysis of the event log will show whether this rule is followed or not.
Another example is the checking of the so-called “four-eyes” principle stating that particular activities should not be executed by one and the same person. By scanning the event log using a model specifying these requirements, one can discover potential cases of fraud. Hence, conformance checking may be used to detect, locate and explain deviations, and to measure the severity of these deviations. [2]
Conformance checking techniques take as input a process model and an event log and return a set of differences between the behavior captured in the process model and the behavior captured in the event log. These differences may be represented visually (e.g. overlaid on top of the process model) or textually as lists of natural language statements (e.g., activity x is executed multiple times in the log, but this is not allowed according to the model). Some techniques may also produce a normalized measure (between 0 and 1) indicating to what extent the process model and the event log match each other.
The interpretation of non-conformance depends on the purpose of the model: if the model is intended to be descriptive, discrepancies indicate that the model should be improved to better capture the observed behavior, whereas if the model is normative, discrepancies point to deviations from the prescribed behavior.
Conformance checking aims to identify two types of discrepancies: unfitting log behavior (behavior observed in the log that is not allowed by the model) and additional model behavior (behavior allowed by the model that is never observed in the log).
There are broadly three families of techniques for detecting unfitting log behavior: replay, trace alignment and behavioral alignment.
In replay techniques, [3] each trace is replayed against the process model one event at a time. When a replay error is detected, it is reported and a local correction is made to resume the replay procedure. The local correction may be for example to skip/ignore a task in the process model or to skip/ignore an event in the log.
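For illustration, the following is a minimal Python sketch of such a replay procedure, under the simplifying assumption that the model is abstracted as a set of allowed directly-follows pairs; the `ALLOWED` set and `replay` function are illustrative names, not part of any particular tool, and real replay techniques operate on richer models such as Petri nets.

```python
# A minimal sketch of replay-style conformance checking, assuming the model is
# abstracted as a set of allowed directly-follows pairs (a deliberate
# simplification; real replay works on richer models such as Petri nets).

ALLOWED = {("start", "A"), ("A", "B"), ("B", "C"), ("C", "end")}  # hypothetical model

def replay(trace):
    """Replay one trace event by event, reporting errors and recovering locally."""
    errors = []
    previous = "start"
    for event in trace:
        if (previous, event) not in ALLOWED:
            # Replay error: the event is not allowed here. Local correction:
            # report it and resume the replay from the observed event.
            errors.append((previous, event))
        previous = event
    if (previous, "end") not in ALLOWED:
        errors.append((previous, "end"))
    return errors

print(replay(["A", "B", "C"]))       # [] -> the trace fits the model
print(replay(["A", "C", "B", "C"]))  # two local replay errors are reported
```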
A general limitation of replay methods is that error recovery is performed locally each time that an error is encountered. Hence, these methods might not identify the minimum number of errors that can explain the unfitting log behavior. This limitation is addressed by trace alignment techniques. [4] These latter techniques identify, for each trace in the log, the closest corresponding trace that can be parsed by the model. Trace alignment techniques also compute an alignment showing the points of divergence between these two traces. The output is a set of pairs of aligned traces. Each pair shows a trace in the log that does not match exactly a trace in the model, together with the corresponding closest trace(s) produced by the model.
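As a sketch of the idea, the following Python code computes an optimal alignment between a log trace and a model trace using a standard edit-distance dynamic program, assuming (for simplicity) that the model's behavior is given as an explicit, finite set of traces; actual trace alignment techniques search the model's state space instead. The functions `align` and `closest_alignment` are illustrative names.

```python
# A minimal sketch of trace alignment: find the minimum number of non-synchronous
# moves needed to align a log trace with a model trace, and recover the moves.

def align(log_trace, model_trace):
    """Return a minimum-cost alignment as a list of moves:
    (a, a) synchronous move, (a, '>>') move on log only, ('>>', b) move on model only."""
    n, m = len(log_trace), len(model_trace)
    # dp[i][j] = minimal number of non-synchronous moves for the prefixes of length i and j
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif log_trace[i - 1] == model_trace[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the moves.
    moves, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and log_trace[i - 1] == model_trace[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            moves.append((log_trace[i - 1], model_trace[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            moves.append((log_trace[i - 1], ">>")); i -= 1        # move on log only
        else:
            moves.append((">>", model_trace[j - 1])); j -= 1      # move on model only
    return list(reversed(moves))

def closest_alignment(log_trace, model_traces):
    """Pick the model trace whose alignment with the log trace has the fewest deviations."""
    return min((align(log_trace, t) for t in model_traces),
               key=lambda a: sum(1 for x, y in a if ">>" in (x, y)))

print(closest_alignment(["A", "C", "D"], [["A", "B", "C", "D"], ["A", "B", "D"]]))
# [('A', 'A'), ('>>', 'B'), ('C', 'C'), ('D', 'D')] -> B was skipped in the log
```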
Trace alignment techniques do not explicitly handle concurrent tasks nor cyclic behavior (repetition of tasks). If for example four tasks can occur only in a fixed order in the process model (e.g. [A, B, C, D]), but they can occur concurrently in the log (i.e. in any order), this difference cannot be detected directly by trace alignment, because it cannot be observed at the level of individual traces.
Other methods to identify additional behavior are based on negative events. [5] These methods start by enhancing the log with artificial (negative) events, inserted into all or some of its traces. A negative event is inserted after a given prefix of a trace if this event is never observed preceded by that prefix anywhere in the log.
For example, if event C is never observed after prefix AB, then C can be inserted as a negative event after AB. Thereafter, the log enhanced with negative events is replayed against the process model. If the process model can replay the negative events, it means that there is behavior captured in the process model that is not captured in the log (since the negative events correspond to behavior that is never observed in the log).
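A minimal sketch of this idea in Python, assuming negative events are induced from exact trace prefixes (practical techniques use windowed or weighted variants of this rule); the function name `negative_events` is illustrative.

```python
# A minimal sketch of negative-event generation: for each prefix in the log,
# list the activities never observed directly after that prefix.

from itertools import chain

def negative_events(log):
    activities = set(chain.from_iterable(log))
    observed = set()      # (prefix, next event) pairs actually seen in the log
    prefixes = set()
    for trace in log:
        for k, event in enumerate(trace):
            prefix = tuple(trace[:k])
            prefixes.add(prefix)
            observed.add((prefix, event))
        prefixes.add(tuple(trace))         # the complete trace is also a prefix
    return {prefix: sorted(a for a in activities if (prefix, a) not in observed)
            for prefix in prefixes}

log = [["A", "B"], ["A", "C", "D"]]
print(negative_events(log)[("A", "B")])
# ['A', 'B', 'C', 'D'] -> nothing is ever observed after <A, B>,
# so C (among others) can be inserted as a negative event there
```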
Footprint matrices display the causal dependencies between pairs of activities in an event log, e.g., whether activity a is directly followed by activity b in some trace while b is never directly followed by a. [6] To express this kind of dependency, a set of ordering relations is defined.
Let L be an event log over the set A of all activities, and let a, b be two activities in A. Then a → b (causality) holds if a is directly followed by b somewhere in L but b is never directly followed by a; a ← b holds if b → a; a || b (parallelism) holds if a is directly followed by b and b is directly followed by a; and a # b holds if neither activity is ever directly followed by the other. The footprint matrix of L records, for every pair of activities, which of these relations holds.
For a process model, such a matrix can be derived in the same way from execution sequences generated by playing out the model. Based on the footprint matrices, one can reason that if an event log fully conforms with a process model, the two footprint matrices of the log and the model are identical, i.e., every ordering relation captured by the model (in this case the causal dependencies) also appears in the event log.
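For illustration, the following minimal Python sketch derives a footprint matrix from a set of traces; the same function could be applied to traces played out from a model. The function name `footprint` and the textual relation symbols are illustrative choices.

```python
# A minimal sketch of deriving a footprint matrix from a set of traces.

from itertools import chain

def footprint(traces):
    activities = sorted(set(chain.from_iterable(traces)))
    follows = {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}  # directly-follows pairs
    matrix = {}
    for a in activities:
        for b in activities:
            if (a, b) in follows and (b, a) in follows:
                matrix[a, b] = "||"      # parallelism
            elif (a, b) in follows:
                matrix[a, b] = "->"      # causality
            elif (b, a) in follows:
                matrix[a, b] = "<-"      # reversed causality
            else:
                matrix[a, b] = "#"       # no direct relation
    return matrix

log_matrix = footprint([["a", "b"], ["a", "c", "d"]])
print(log_matrix["a", "d"])   # '#' -> a is never directly followed by d in the log
```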
Example: Let L = {<a, b>, <a, c, d>} and let M be a model of L. Assume the footprint matrices of L and M are as follows:
Footprint matrix of the event log L:

|   | a | b | c | d |
|---|---|---|---|---|
| a | # | → | → | # |
| b | ← | # | # | # |
| c | ← | # | # | → |
| d | # | # | ← | # |
Footprint matrix of the model M:

|   | a | b | c | d |
|---|---|---|---|---|
| a | # | → | → | → |
| b | ← | # | # | # |
| c | ← | # | # | → |
| d | ← | # | ← | # |
We can notice that, in the footprint matrix of model M, the pattern (a, d) is allowed to occur although it never occurs in the event log; hence it causes a deviation between the event log and the model (the two matrices differ in the cells (a, d) and (d, a)). The fitness between the event log and the model is computed as follows:

$\text{fitness} = 1 - \frac{\text{number of differing cells}}{\text{total number of cells}}$

In this example, the fitness is $1 - \frac{2}{16} = 0.875$.
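The computation above can be reproduced with a short Python sketch that compares the two matrices cell by cell; the helper `matrix_from_rows` is just a convenience for writing down the example matrices.

```python
# A short sketch of the footprint-based fitness computation for the example above
# (rows and columns in the order a, b, c, d).

def matrix_from_rows(rows):
    acts = "abcd"
    return {(a, b): rows[i][j] for i, a in enumerate(acts) for j, b in enumerate(acts)}

log_fp   = matrix_from_rows([["#", "->", "->", "#"],
                             ["<-", "#", "#", "#"],
                             ["<-", "#", "#", "->"],
                             ["#", "#", "<-", "#"]])
model_fp = matrix_from_rows([["#", "->", "->", "->"],
                             ["<-", "#", "#", "#"],
                             ["<-", "#", "#", "->"],
                             ["<-", "#", "<-", "#"]])

different = sum(1 for cell in log_fp if log_fp[cell] != model_fp[cell])
fitness = 1 - different / len(log_fp)
print(different, fitness)   # 2 differing cells ((a, d) and (d, a)) -> fitness = 1 - 2/16 = 0.875
```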
Token-based replay is a technique that uses four counters (produced tokens, consumed tokens, missing tokens and remaining tokens) to compute the fitness of an observed trace with respect to a given process model in Petri net notation. [7] These four counters record the status of tokens when a trace is replayed on the Petri net. When a token is produced by a transition, produced tokens is increased by 1. When a token is consumed to fire a transition, consumed tokens is increased by 1. When a token required to fire a transition is missing, missing tokens is increased by 1. Remaining tokens records the total number of tokens left over after the trace is complete. The trace conforms with the process model if and only if there are no missing tokens during the replay and no remaining tokens at the end.
The fitness between an event log and a process model is computed as follows:

$\text{fitness} = \frac{1}{2}\left(1 - \frac{m}{c}\right) + \frac{1}{2}\left(1 - \frac{r}{p}\right)$

where m is the number of missing tokens, c is the number of consumed tokens, r is the number of remaining tokens, and p is the number of produced tokens.
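As an illustration, the following sketch replays traces on a small, hypothetical workflow net (A moves the token from the source place to p1, B from p1 to p2, and C from p2 to the sink place) and computes the fitness from the four counters; the counting convention that the initial source token counts as produced and the final sink token as consumed follows the usual description of token-based replay.

```python
# A minimal sketch of token-based replay on a small, hypothetical workflow net.

from collections import Counter

NET = {  # transition -> (input places, output places)
    "A": (["source"], ["p1"]),
    "B": (["p1"], ["p2"]),
    "C": (["p2"], ["sink"]),
}

def replay_fitness(trace):
    marking = Counter({"source": 1})
    produced, consumed, missing = 1, 0, 0   # the initial source token counts as produced
    for event in trace:
        inputs, outputs = NET[event]
        for place in inputs:                # consume one token per input place
            if marking[place] > 0:
                marking[place] -= 1
            else:
                missing += 1                # token missing: record it and fire anyway
            consumed += 1
        for place in outputs:               # produce one token per output place
            marking[place] += 1
            produced += 1
    if marking["sink"] > 0:                 # finally consume the token from the sink place
        marking["sink"] -= 1
    else:
        missing += 1
    consumed += 1
    remaining = sum(marking.values())
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)

print(replay_fitness(["A", "B", "C"]))  # 1.0   -> the trace fits the model
print(replay_fitness(["A", "C"]))       # ~0.67 -> missing and remaining tokens lower the fitness
```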
Although the token-replay technique is efficient and easy to understand, it is designed for the Petri net notation and does not indicate a suitable model path for unfit cases. Alignments were introduced to overcome these limitations; they are considered a highly accurate conformance checking technique and can be applied to any process modeling notation. [8] The idea is that the algorithm performs an exhaustive search to find the optimal alignment between the observed trace and the process model. Hence, it is guaranteed to find the model run that is closest to the trace.
A Petri net, also known as a place/transition (PT) net, is one of several mathematical modeling languages for the description of distributed systems. It is a class of discrete event dynamic system. A Petri net is a directed bipartite graph that has two types of elements, places and transitions. Place elements are depicted as white circles and transition elements are depicted as rectangles. A place can contain any number of tokens, depicted as black circles. A transition is enabled if all places connected to it as inputs contain at least one token. Some sources state that Petri nets were invented in August 1939 by Carl Adam Petri—at the age of 13—for the purpose of describing chemical processes.
In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.
An event-driven process chain (EPC) is a type of flow chart for business process modeling. EPC can be used to configure enterprise resource planning execution, and for business process improvement. It can be used to control an autonomous workflow instance in work sharing.
In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.
A point accepted mutation — also known as a PAM — is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection. This definition does not include all point mutations in the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are mutations that are lethal or that are rejected by natural selection in other ways.
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.
In the analysis of data, a correlogram is a chart of correlation statistics. For example, in time series analysis, a plot of the sample autocorrelations versus the time lags is an autocorrelogram. If cross-correlation is plotted, the result is called a cross-correlogram.
In operant conditioning, the matching law is a quantitative relationship that holds between the relative rates of response and the relative rates of reinforcement in concurrent schedules of reinforcement. For example, if two response alternatives A and B are offered to an organism, the ratio of response rates to A and B equals the ratio of reinforcements yielded by each response. This law applies fairly well when non-human subjects are exposed to concurrent variable interval schedules; its applicability in other situations is less clear, depending on the assumptions made and the details of the experimental situation. The generality of applicability of the matching law is subject of current debate.
In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.
Process mining is a family of techniques relating the fields of data science and process management to support the analysis of operational processes based on event logs. The goal of process mining is to turn event data into insights and actions. Process mining is an integral part of data science, fueled by the availability of event data and the desire to improve processes. Process mining techniques use event data to show what people, machines, and organizations are really doing. Process mining provides novel insights that can be used to identify the executional path taken by operational processes and address their performance and compliance problems.
Business process discovery (BPD), related to business process management and process mining, is a set of techniques that manually or automatically construct a representation of an organisation's current business processes and their major process variations. These techniques use data recorded in the existing organisational methods of work, documentations, and technology systems that run business processes within an organisation. The type of data required for process discovery is called an event log: any record of data that contains a case id, an activity name, and a timestamp qualifies as an event log and can be used to discover the underlying process model. The event log can contain additional information related to the process, such as the resources executing the activity, the type or nature of the events, or any other relevant details. Process discovery aims to obtain a process model that describes the event log as closely as possible. The process model acts as a graphical representation of the process. The event logs used for discovery could contain noise, irregular information, and inconsistent/incorrect timestamps. Process discovery is challenging due to such noisy event logs and because the event log contains only a part of the actual process hidden behind the system. The discovery algorithms should solely depend on a small percentage of data provided by the event logs to develop the closest possible model to the actual behaviour.
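For illustration, a minimal Python sketch of this event-log structure: each record carries at least a case id, an activity name, and a timestamp, and grouping the records by case id and sorting them by timestamp yields the traces used for discovery (the field names and the `traces_from_log` helper are illustrative).

```python
# A minimal sketch of an event log: hypothetical raw records grouped into traces.

from collections import defaultdict

events = [
    {"case": 1, "activity": "A", "timestamp": "2024-01-01T09:00"},
    {"case": 1, "activity": "B", "timestamp": "2024-01-01T09:30"},
    {"case": 2, "activity": "A", "timestamp": "2024-01-01T10:00"},
    {"case": 2, "activity": "C", "timestamp": "2024-01-01T10:15"},
    {"case": 2, "activity": "D", "timestamp": "2024-01-01T10:45"},
]

def traces_from_log(events):
    cases = defaultdict(list)
    for e in events:
        cases[e["case"]].append(e)
    return {case: [e["activity"] for e in sorted(es, key=lambda e: e["timestamp"])]
            for case, es in cases.items()}

print(traces_from_log(events))   # {1: ['A', 'B'], 2: ['A', 'C', 'D']}
```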
In probability and statistics, the Tweedie distributions are a family of probability distributions which include the purely continuous normal, gamma and inverse Gaussian distributions, the purely discrete scaled Poisson distribution, and the class of compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. Tweedie distributions are a special case of exponential dispersion models and are often used as distributions for generalized linear models.
In probability theory and statistics, the index of dispersion, dispersion index, coefficient of dispersion, relative variance, or variance-to-mean ratio (VMR), like the coefficient of variation, is a normalized measure of the dispersion of a probability distribution: it is a measure used to quantify whether a set of observed occurrences are clustered or dispersed compared to a standard statistical model.
Willibrordus Martinus Pancratius van der Aalst is a Dutch computer scientist and full professor at RWTH Aachen University, leading the Process and Data Science (PADS) group. His research and teaching interests include information systems, workflow management, Petri nets, process mining, specification languages, and simulation. He is also known for his work on workflow patterns.
The α-algorithm or α-miner is an algorithm used in process mining, aimed at reconstructing causality from a set of sequences of events. It was first put forward by van der Aalst, Weijters and Măruşter. The goal of the alpha miner is to convert the event log into a workflow net based on the relations between the various activities in the event log. An event log is a multi-set of traces, and a trace is a sequence of activity names. Several extensions or modifications of it have since been presented.
Robust Principal Component Analysis (RPCA) is a modification of the widely used statistical procedure of principal component analysis (PCA) which works well with respect to grossly corrupted observations. A number of different approaches exist for Robust PCA, including an idealized version of Robust PCA, which aims to recover a low-rank matrix L0 from highly corrupted measurements M = L0 + S0. This decomposition in low-rank and sparse matrices can be achieved by techniques such as Principal Component Pursuit method (PCP), Stable PCP, Quantized PCP, Block based PCP, and Local PCP. Then, optimization methods are used such as the Augmented Lagrange Multiplier Method (ALM), Alternating Direction Method (ADM), Fast Alternating Minimization (FAM), Iteratively Reweighted Least Squares (IRLS) or alternating projections (AP).
Inductive miner belongs to a class of algorithms used in process discovery. Earlier algorithms can produce process models of slightly different types from the same input, and the quality of the output model depends on its soundness. Techniques such as the alpha miner and the genetic miner work by converting an event log into a workflow model; however, they do not always produce sound models. The inductive miner relies on building a directly-follows graph from the event log and uses this graph to detect various process relations.
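As a sketch of that first step, the following code builds a directly-follows graph (with edge frequencies) from an event log; the function name `directly_follows_graph` is illustrative.

```python
# A minimal sketch of the directly-follows graph: edge (a, b) counts how many
# times activity b directly follows activity a in the log.

from collections import Counter

def directly_follows_graph(log):
    dfg = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            dfg[a, b] += 1
    return dfg

print(directly_follows_graph([["a", "b"], ["a", "c", "d"], ["a", "b"]]))
# Counter({('a', 'b'): 2, ('a', 'c'): 1, ('c', 'd'): 1})
```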
Process mining is a technique used to turn event data into insights and actions. Techniques used in process mining, such as process discovery and conformance checking, depend only on the order of activities executed in the operations. The event log, however, not only contains the activity details, but also timestamps, resources and data accompanying process execution. Careful analysis of these additional details in the event log can reveal useful information that can be used for predicting future decisions, for understanding the efficiency and working dynamics of a team, and for performance analysis.
Streaming conformance checking is a form of conformance checking in which deviations are reported as soon as they occur. Instead of an event log, streaming conformance checking techniques take an event stream and a process model as input; each event received from the stream is compared with the model.
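A minimal sketch of such a streaming checker, again abstracting the model as a set of allowed directly-follows pairs and assuming events arrive as (case id, activity) pairs; the class and names are illustrative.

```python
# A minimal sketch of streaming conformance checking: keep the last activity per
# case and report a deviation the moment an event violates the model.

ALLOWED = {("start", "A"), ("A", "B"), ("B", "C")}   # hypothetical model

class StreamingChecker:
    def __init__(self):
        self.last = {}                      # case id -> last observed activity

    def on_event(self, case, activity):
        previous = self.last.get(case, "start")
        self.last[case] = activity
        if (previous, activity) not in ALLOWED:
            print(f"deviation in case {case}: {activity} after {previous}")

checker = StreamingChecker()
for case, activity in [(1, "A"), (2, "A"), (1, "B"), (2, "C"), (1, "C")]:
    checker.on_event(case, activity)        # the deviation in case 2 is reported immediately
```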
Token-based replay is a conformance checking algorithm that checks how well a process conforms with its model by replaying each trace on the model. Using the four counters produced tokens, consumed tokens, missing tokens, and remaining tokens, it records the situations where a transition is forced to fire despite missing tokens, as well as the tokens remaining after the replay ends. Based on these counters, the fitness value between the trace and the model can be computed.