Feature engineering

Feature engineering is a preprocessing step in supervised machine learning and statistical modeling [1] that transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering can significantly improve their predictive accuracy and decision-making capability. [2] [3] [4]

Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics. [5]
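
As a minimal sketch of this idea, the Reynolds number Re = ρvL/μ collapses four raw measurements into a single dimensionless feature. The function name and the sample values below (roughly water in a small pipe) are illustrative assumptions, not taken from the cited report.

```python
def reynolds_number(density, velocity, length, viscosity):
    """Return the dimensionless Reynolds number Re = rho * v * L / mu."""
    return density * velocity * length / viscosity


# Hypothetical raw measurements in SI units (assumed values for illustration).
raw = {
    "density": 998.0,     # kg/m^3
    "velocity": 2.0,      # m/s
    "length": 0.05,       # m, characteristic length such as pipe diameter
    "viscosity": 1.0e-3,  # Pa*s, dynamic viscosity
}

# One engineered feature replaces four raw inputs.
re_number = reynolds_number(**raw)
print(round(re_number))
```

A model given this single feature no longer has to learn the product-and-ratio relationship between the four raw quantities.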

Clustering

One application of feature engineering is the clustering of feature-objects or sample-objects in a dataset. In particular, feature engineering based on matrix decomposition has been used extensively for data clustering under non-negativity constraints on the feature coefficients. These methods include Non-Negative Matrix Factorization (NMF), [6] Non-Negative Matrix-Tri Factorization (NMTF), [7] and Non-Negative Tensor Decomposition/Factorization (NTF/NTD). [8] The non-negativity constraints on the coefficients of the feature vectors mined by these algorithms yield a part-based representation, and different factor matrices exhibit natural clustering properties. Several extensions of these methods have been reported in the literature, including orthogonality-constrained factorization for hard clustering, and manifold learning to overcome inherent issues with these algorithms.
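
The following is a minimal sketch of NMF-based clustering, assuming NumPy is available. It uses the multiplicative update rules associated with Lee and Seung for the squared-error objective; the toy matrix, the function name `nmf`, and the choice to read a hard cluster label from the largest coefficient in each row of W are illustrative assumptions rather than a method prescribed by the sources above.

```python
import numpy as np


def nmf(V, rank, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy data: samples 0-2 use features 0-1, samples 3-5 use features 2-3.
V = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, 5.0],
])

W, H = nmf(V, rank=2)

# Assumed simplification: assign each sample to its dominant component.
labels = W.argmax(axis=1)
print(labels)
```

The rows of H play the role of parts (groups of co-occurring features), while the rows of W give each sample's non-negative loading on those parts.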

Other classes of feature engineering algorithms leverage a common hidden structure across multiple inter-related datasets to obtain a consensus (common) clustering scheme. An example is Multi-view Classification based on Consensus Matrix Decomposition (MCMD), [2] which mines a common clustering scheme across multiple datasets. MCMD is designed to output two types of class labels (scale-variant and scale-invariant clustering).

Coupled matrix and tensor decompositions are popular in multi-view feature engineering. [9]

Predictive modelling

Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key components include feature creation from existing data, transforming and imputing missing or invalid features, reducing data dimensionality through methods like Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA), and selecting the most relevant features for model training based on importance scores and correlation matrices. [10]
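
Two of these components, imputing missing values and creating a feature from existing data, can be sketched in plain Python. The column names, the sample rows, and the choice of mean imputation and a body-mass-index feature are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing measurement.
rows = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.60, "weight_kg": None},
    {"height_m": 1.70, "weight_kg": 57.8},
]


def impute_mean(rows, column):
    """Replace missing values in one column with the mean of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def add_bmi(rows):
    """Create a new feature (body mass index) from two existing ones."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]


features = add_bmi(impute_mean(rows, "weight_kg"))
for row in features:
    print(row)
```

The order of operations matters here: the derived feature is computed only after imputation, so no row produces an error or an undefined value.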

Features vary in significance. [11] Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting). [12]
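
A simple filter-style selection can be sketched as follows: each feature is scored by the absolute Pearson correlation with the target, and features below a threshold are dropped. The data, the threshold of 0.5, and the function names are hypothetical, and correlation is only one of several possible relevance scores.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def select_features(columns, target, threshold=0.5):
    """Keep features whose |correlation| with the target meets the threshold."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    return selected, scores


# Hypothetical housing data: two informative features and one noisy one.
columns = {
    "size": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "noise": [3, 1, 4, 1, 5],
}
target = [150, 180, 240, 310, 360]

selected, scores = select_features(columns, target)
print(selected)
```

Because weakly correlated features may still contribute to a model, the threshold is a trade-off between model simplicity and discarded information.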

Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization.

Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection. [13]
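
To give a sense of scale, the sketch below counts the features produced when raw features are combined into polynomial terms. Treating polynomial combinations as the source of growth is an assumption chosen for illustration; the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(names, degree):
    """List every product of the given features up to the given degree."""
    terms = []
    for d in range(1, degree + 1):
        terms.extend(combinations_with_replacement(names, d))
    return terms


def n_polynomial_terms(n_features, degree):
    """Closed-form count of those terms (constant term excluded)."""
    return comb(n_features + degree, degree) - 1


terms = polynomial_terms(["x1", "x2", "x3"], degree=2)
print(len(terms))                  # 3 linear terms plus 6 quadratic terms
print(n_polynomial_terms(100, 3))  # count for 100 raw features at degree 3
```

With 100 raw features and degree 3, the count already exceeds 170,000, which is why regularization or feature selection is usually applied alongside such expansions.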

Automation

Automation of feature engineering is a research topic that dates back to the 1990s. [14] Machine learning software that incorporates automated feature engineering has been commercially available since 2016. [15] Related academic literature can be roughly separated into two types: multi-relational decision tree learning (MRDTL) and deep feature synthesis (DFS).

Multi-relational decision tree learning (MRDTL)

Multi-relational decision tree learning (MRDTL) extends traditional decision tree methods to relational databases, handling complex data relationships across tables. It uses selection graphs as decision nodes, which are refined systematically until a specific termination criterion is reached. [14]

Most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation. [16] [17]

Open-source implementations

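Several open-source libraries and tools automate feature engineering on relational data and time series, including Featuretools, [18] [19] [20] MCMD, [21] getML, [23] [24] tsfresh, [25] [26] tsflex, [27] [28] seglearn, [29] TSFEL, [30] and Kats. [31]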

Deep feature synthesis

The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition. [32] [33]
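
The general idea behind this kind of synthesis, stacking aggregations across related tables, can be sketched in plain Python. This is a simplified illustration under assumed table and column names; it is neither the published DFS algorithm nor the API of any library.

```python
from collections import defaultdict
from statistics import mean

# A small set of aggregation primitives (assumed for this sketch).
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "count": len}


def aggregate(child_rows, parent_key, value, prefix):
    """Aggregate one child column per parent id with every primitive."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[parent_key]].append(row[value])
    return {
        parent_id: {
            f"{name}({prefix}.{value})": fn(values)
            for name, fn in AGGREGATIONS.items()
        }
        for parent_id, values in groups.items()
    }


# Hypothetical relational data: customers <- orders <- items.
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 7.0},
]
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]

# Depth 1: summarize items for each order.
order_features = aggregate(items, "order_id", "price", "items")
orders_enriched = [{**o, **order_features[o["order_id"]]} for o in orders]

# Depth 2: summarize an order-level feature for each customer.
customer_features = aggregate(
    orders_enriched, "customer_id", "sum(items.price)", "orders"
)
print(customer_features["a"]["mean(orders.sum(items.price))"])
```

Each additional level of stacking multiplies the number of candidate features, so such approaches are typically paired with a feature selection step.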

Feature stores

A feature store is where features are stored and organized for the explicit purpose of training models (by data scientists) or making predictions (by applications that use a trained model). It is a central location where groups of features can be created or updated from multiple data sources, and where new datasets can be created and updated from those feature groups, either for training models or for use in applications that retrieve precomputed features when making predictions rather than computing them. [34]

A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used. [35]
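
These capabilities can be sketched with a toy in-memory class. The class name, method names, and data are hypothetical; production feature stores add persistence, access policies, and low-latency serving that this sketch omits.

```python
class FeatureStore:
    """Toy in-memory feature store keyed by feature name and version."""

    def __init__(self):
        self._definitions = {}  # (name, version) -> function of a raw row
        self._values = {}       # (name, version, entity_id) -> computed value

    def register(self, name, version, fn):
        """Store the code used to generate a feature."""
        self._definitions[(name, version)] = fn

    def materialize(self, name, version, raw_rows, id_key="id"):
        """Apply the stored code to raw data and keep the results."""
        fn = self._definitions[(name, version)]
        for row in raw_rows:
            self._values[(name, version, row[id_key])] = fn(row)

    def get(self, name, version, entity_id):
        """Serve a precomputed feature value on request."""
        return self._values[(name, version, entity_id)]


store = FeatureStore()
# Version 1 sums all purchases; version 2 ignores refunds (negative amounts).
store.register("total_spend", 1, lambda r: sum(r["purchases"]))
store.register("total_spend", 2, lambda r: sum(p for p in r["purchases"] if p > 0))

raw = [{"id": "u1", "purchases": [20.0, -5.0, 10.0]}]
store.materialize("total_spend", 1, raw)
store.materialize("total_spend", 2, raw)

print(store.get("total_spend", 1, "u1"), store.get("total_spend", 2, "u1"))
```

Keeping both versions side by side lets a model trained on version 1 continue to receive the values it was trained on while newer models adopt version 2.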

Feature stores can be standalone software tools or built into machine learning platforms.

Alternatives

Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error. [36] [37] Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering. [38] However, deep learning algorithms still require careful preprocessing and cleaning of the input data. [39] In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process. [40]

References

  1. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN 978-0-387-84884-6.
  2. Sharma, Shubham; Nayak, Richi; Bhaskar, Ashish (2024-05-01). "Multi-view feature engineering for day-to-day joint clustering of multiple traffic datasets". Transportation Research Part C: Emerging Technologies. 162: 104607. Bibcode:2024TRPC..16204607S. doi:10.1016/j.trc.2024.104607. ISSN 0968-090X.
  3. Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge: Cambridge University Press. ISBN 9781107057135.
  4. Murphy, Kevin P. (2022). Probabilistic Machine Learning. Cambridge, Massachusetts: The MIT Press (Copyright 2022 Massachusetts Institute of Technology, this work is subject to a Creative Commons CC-BY-NC-ND license). ISBN 9780262046824.
  5. MacQueron C (2021). SOLID-LIQUID MIXING IN STIRRED TANKS : Modeling, Validation, Design Optimization and Suspension Quality Prediction (Report). doi:10.13140/RG.2.2.11074.84164/1.
  6. Lee, Daniel D.; Seung, H. Sebastian (1999). "Learning the parts of objects by non-negative matrix factorization". Nature. 401 (6755): 788–791. Bibcode:1999Natur.401..788L. doi:10.1038/44565. ISSN 1476-4687. PMID 10548103.
  7. Wang, Hua; Nie, Feiping; Huang, Heng; Ding, Chris (2011). "Nonnegative Matrix Tri-factorization Based High-Order Co-clustering and Its Fast Implementation". 2011 IEEE 11th International Conference on Data Mining. IEEE. pp. 774–783. doi:10.1109/icdm.2011.109. ISBN 978-1-4577-2075-8.
  8. Lim, Lek-Heng; Comon, Pierre (2009-04-12). "Nonnegative approximations of nonnegative tensors". arXiv:0903.4530 [cs.NA].
  9. Nayak, Richi; Luong, Khanh (2023). "Multi-aspect Learning". Intelligent Systems Reference Library. 242. doi:10.1007/978-3-031-33560-0. ISBN 978-3-031-33559-4. ISSN 1868-4394.
  10. "Feature engineering - Machine Learning Lens". docs.aws.amazon.com. Retrieved 2024-03-01.
  11. "Feature Engineering" (PDF). 2010-04-22. Retrieved 12 November 2015.
  12. "Feature engineering and selection" (PDF). Alexandre Bouchard-Côté. October 1, 2009. Retrieved 12 November 2015.
  13. "Feature engineering in Machine Learning" (PDF). Zdenek Zabokrtsky. Archived from the original (PDF) on 4 March 2016. Retrieved 12 November 2015.
  14. Knobbe AJ, Siebes A, Van Der Wallen D (1999). "Multi-relational Decision Tree Induction" (PDF). Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. pp. 378–383. doi:10.1007/978-3-540-48247-5_46. ISBN 978-3-540-66490-1.
  15. "Its all about the features". Reality AI Blog. September 2017.
  16. Yin X, Han J, Yang J, Yu PS (2004). "CrossMine: Efficient classification across multiple database relations". Proceedings. 20th International Conference on Data Engineering. pp. 399–410. doi:10.1109/ICDE.2004.1320014. ISBN 0-7695-2065-0. S2CID 1183403.
  17. Frank R, Moser F, Ester M (2007). "A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions". Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science. Vol. 4702. pp. 430–437. doi:10.1007/978-3-540-74976-9_43. ISBN 978-3-540-74975-2.
  18. "What is Featuretools?". Retrieved September 7, 2022.
  19. "Featuretools - An open source python framework for automated feature engineering". Retrieved September 7, 2022.
  20. "github: alteryx/featuretools". GitHub. Retrieved September 7, 2022.
  21. Sharma, Shubham, mcmd: Multi-view Classification framework based on Consensus Matrix Decomposition developed by Shubham Sharma at QUT, retrieved 2024-04-14
  22. Thanh Lam, Hoang; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017-06-01). "One button machine for automating feature engineering in relational databases". arXiv:1706.00327 [cs.DB].
  23. "getML documentation". Retrieved September 7, 2022.
  24. "github: getml/getml-community". GitHub. Retrieved September 7, 2022.
  25. "tsfresh documentation". Retrieved September 7, 2022.
  26. "Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)". Retrieved September 7, 2022.
  27. "predict-idlab/tsflex". GitHub. Retrieved September 7, 2022.
  28. Van Der Donckt, Jonas; Van Der Donckt, Jeroen; Deprost, Emiel; Van Hoecke, Sofie (2022). "tsflex: Flexible time series processing & feature extraction". SoftwareX. 17: 100971. arXiv:2111.12429. Bibcode:2022SoftX..1700971V. doi:10.1016/j.softx.2021.100971. S2CID 244527198. Retrieved September 7, 2022.
  29. "seglearn user guide". Retrieved September 7, 2022.
  30. "Welcome to TSFEL documentation!". Retrieved September 7, 2022.
  31. "github: facebookresearch/Kats". GitHub. Retrieved September 7, 2022.
  32. "Automating big-data analysis". 16 October 2015.
  33. Kanter, James Max; Veeramachaneni, Kalyan (2015). "Deep feature synthesis: Towards automating data science endeavors". 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). pp. 1–10. doi:10.1109/DSAA.2015.7344858. ISBN 978-1-4673-8272-4. S2CID 206610380.
  34. "What is a feature store". Retrieved 2022-04-19.
  35. "An Introduction to Feature Stores". Retrieved 2021-04-15.
  36. "Feature Engineering in Machine Learning". Engineering Education (EngEd) Program | Section. Retrieved 2023-03-21.
  37. explorium_admin (2021-10-25). "5 Reasons Why Feature Engineering is Challenging". Explorium. Retrieved 2023-03-21.
  38. Spiegelhalter, D. J. (2019). The art of statistics : learning from data. [London] UK. ISBN 978-0-241-39863-0. OCLC 1064776283.
  39. Sarker IH (November 2021). "Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions". SN Computer Science. 2 (6): 420. doi:10.1007/s42979-021-00815-1. PMC 8372231. PMID 34426802.
  40. Bengio, Yoshua (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures", Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, vol. 7700, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 437–478, arXiv:1206.5533, doi:10.1007/978-3-642-35289-8_26, ISBN 978-3-642-35288-1, S2CID 10808461, retrieved 2023-03-21
