Data Analytics Library

Developer(s)	Intel
Initial release	August 25, 2015
Stable release	2021 Update 4 / 2021
Repository	github.com/oneapi-src/oneDAL
Written in	C++, Java, Python
Operating system	Microsoft Windows, Linux, macOS
Platform	Intel Atom, Intel Core, Intel Xeon
Type	Library or framework
License	Apache License 2.0
Website	software.intel.com/content/www/us/en/develop/tools/data-analytics-acceleration-library.html

Last updated September 17, 2024

oneAPI Data Analytics Library (oneDAL; formerly Intel Data Analytics Acceleration Library or Intel DAAL) is a library of optimized algorithmic building blocks for the data analysis stages most commonly associated with solving Big Data problems.^[4]^[5]^[6]^[7]

History

Intel first released the library on August 25, 2015 as the Intel Data Analytics Acceleration Library 2016 (Intel DAAL 2016).^[9] It launched the library under its current name, oneDAL, on December 8, 2020. oneDAL is bundled with the Intel oneAPI Base Toolkit as a commercial product. A standalone version is available either commercially or free of charge;^[3]^[10] the two differ only in support and maintenance.

License

Apache License 2.0

Details

Functional categories

Intel DAAL provides algorithms in the following categories:^[11]^[4]^[12]

Analysis
- Low Order Moments: Includes computing min, max, mean, standard deviation, variance, etc. for a dataset.
- Quantiles: Splitting observations into equal-sized groups defined by quantile orders.
- Correlation matrix and variance-covariance matrix: A basic tool for understanding statistical dependence among variables. The degree of correlation indicates how strongly a change in one variable tends to accompany a change in another.
- Cosine distance matrix: Measuring pairwise distance using cosine distance.
- Correlation distance matrix: Measuring pairwise distance between items using correlation distance.
- Clustering: Grouping data into unlabeled groups. This is a typical technique in “unsupervised learning,” where there is no established model to rely on. Intel DAAL provides two algorithms for clustering: K-Means and “EM for GMM.”
- Principal Component Analysis (PCA): One of the most widely used algorithms for dimensionality reduction.
- Association rules mining: Detecting co-occurrence patterns. Commonly known as “shopping basket mining.”
- Data transformation through matrix decomposition: DAAL provides Cholesky, QR, and SVD decomposition algorithms.
- Outlier detection: Identifying observations that are abnormally distant from the typical distribution of other observations.
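
To make the two distance-matrix entries above concrete, here is a minimal sketch in plain Python. It does not use the DAAL or oneDAL API; the function names are hypothetical, and it assumes the common definitions in which cosine distance is 1 minus the cosine similarity and correlation distance is 1 minus the Pearson correlation.

```python
import math

def cosine_distance(x, y):
    """Return 1 - cos(angle between x and y). Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def correlation_distance(x, y):
    """Return 1 - Pearson correlation, i.e. cosine distance of the centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])

def distance_matrix(rows, dist):
    """Build the square matrix of pairwise distances between observations."""
    return [[dist(r, s) for s in rows] for r in rows]

# Illustrative data: three observations with three features each.
observations = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
cos_d = distance_matrix(observations, cosine_distance)
corr_d = distance_matrix(observations, correlation_distance)
```

In this toy data the first two observations point in the same direction, so their cosine distance is approximately zero, while the first and third are perfectly anti-correlated, so their correlation distance is approximately 2.
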
Training and Prediction
- Regression
  - Linear regression: The simplest regression method. It fits a linear equation to model the relationship between dependent variables (things to be predicted) and explanatory variables (things known).
- Classification: Building a model to assign items to different labeled groups. DAAL provides multiple algorithms in this area, including Naïve Bayes classifier, Support Vector Machine, and multi-class classifiers.
- Recommendation systems
- Neural networks
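
The split between a training stage and a prediction stage can be illustrated with linear regression. The following is a rough sketch in plain Python rather than DAAL code: the function names are hypothetical, and it assumes a single explanatory variable fitted by ordinary least squares.

```python
def fit_simple_linear_regression(xs, ys):
    """Training stage: estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Prediction stage: apply the fitted linear equation to a new value."""
    slope, intercept = model
    return slope * x + intercept

# Illustrative data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
model = fit_simple_linear_regression(xs, ys)
prediction = predict(model, 4.0)
```

The model returned by the training function is the only thing the prediction function needs, which is what allows the two stages to run at different times or on different machines.
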

Intel DAAL supports three processing modes:

- Batch processing: When all data fits in memory, a function is called to process the data all at once.
- Online processing (also called streaming): When the data does not all fit in memory, Intel DAAL processes data chunks individually and combines all partial results at the finalizing stage.
- Distributed processing: DAAL supports a model similar to MapReduce. Consumers in a cluster process local data (map stage), and then the Producer process collects and combines partial results from the Consumers (reduce stage). Intel DAAL offers flexibility in this mode by leaving the communication functions completely to the developer. Developers can use the data movement of a framework such as Hadoop or Spark, or code the communication explicitly, most likely with MPI.
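
The idea of combining partial results, which underlies both the online and the distributed modes, can be sketched in plain Python. This is not the DAAL API: the function names and the choice of statistic (mean and sample variance from a count, a sum, and a sum of squares) are assumptions made for illustration.

```python
from functools import reduce

def partial_result(chunk):
    """Summarize one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))

def merge(a, b):
    """Combine two partial results by adding them component-wise."""
    return tuple(x + y for x, y in zip(a, b))

def finalize(partial):
    """Finalizing stage: turn the combined partial result into mean and sample variance."""
    n, s, ss = partial
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]

# Batch: all data is in memory and processed at once.
batch = finalize(partial_result([v for chunk in chunks for v in chunk]))

# Online: chunks arrive one at a time and are folded into a running state.
state = (0, 0.0, 0.0)
for chunk in chunks:
    state = merge(state, partial_result(chunk))
online = finalize(state)

# Distributed: each worker summarizes its local data (map), one process combines (reduce).
partials = [partial_result(chunk) for chunk in chunks]
distributed = finalize(reduce(merge, partials))
```

All three modes yield the same mean and variance here because the partial result is additive. The sum-of-squares formula is kept for brevity; it can lose precision on large or poorly scaled data, so a production implementation would likely use a more stable update.
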

References

External links

OneAPI (compute acceleration)
oneAPI oneDAL Specification Archived 2021-10-07 at the Wayback Machine
oneDAL on GitHub
DAAL Official Product Website
DAAL Support
DAAL User Forum
DAAL Support Channel

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "Intel® Data Analytics Acceleration Library Release Notes". software.intel.com.

[2] "Intel® Data Analytics Acceleration Library (Intel® DAAL) | Intel® Software".

[3] "Open Source Project: Intel Data Analytics Acceleration Library (DAAL)".

[4] "DAAL github".

[5] "Intel Updates Developer Toolkit with Data Analytics Acceleration Library".

[6] "Intel adds big data functions to math libraries".

[7] "Intel Leverages HPC Core for Analytics Tooling Push". nextplatform.com. 2015-08-25.

[8] "Try Out Intel DAAL to Process Big Data".

[9] "Intel Data Analytics Acceleration Library".

[10] "Community Licensing of Intel Performance Libraries".

[11] "Developer Guide for Intel(R) Data Analytics Acceleration Library 2020".

[12] "Introduction to Intel DAAL, Part 1: Polynomial Regression with Batch Mode Computation".
