Feature engineering

Feature engineering is a preprocessing step in supervised machine learning and statistical modeling [1] that transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By supplying models with relevant information, feature engineering can substantially improve their predictive accuracy and decision-making capability. [2] [3]

Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics. [4]
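As a minimal sketch of this idea, the Reynolds number can be computed from four raw measurements and used as a single derived feature. The function name and the sample values (roughly water at room temperature flowing in a small pipe) are illustrative assumptions, not taken from the cited source.

```python
def reynolds_number(density: float, velocity: float,
                    length: float, viscosity: float) -> float:
    """Return Re = density * velocity * length / viscosity (dimensionless)."""
    if viscosity <= 0:
        raise ValueError("dynamic viscosity must be positive")
    return density * velocity * length / viscosity


# Illustrative raw measurements in SI units (assumed values).
raw = {
    "density": 998.0,      # kg/m^3
    "velocity": 1.0,       # m/s
    "length": 0.05,        # m, characteristic length (pipe diameter)
    "viscosity": 1.0e-3,   # Pa*s, dynamic viscosity
}

# Four raw inputs collapse into one physically meaningful feature.
re_value = reynolds_number(**raw)
```

Replacing four correlated raw quantities with one dimensionless group is the same move that feature engineering makes in machine learning: the derived quantity carries the information a model needs in a more compact form.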

Predictive modelling

Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key components include creating features from existing data; transforming features and imputing missing or invalid values; reducing data dimensionality through methods such as principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA); and selecting the most relevant features for model training based on importance scores and correlation matrices. [5]
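Two of these components, imputation and feature creation, can be sketched in a few lines of standard-library Python. The column names, the median fill strategy, and the body-mass-index feature are assumptions chosen for illustration only.

```python
from statistics import median

# Toy records; one weight is missing (None).
rows = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.65, "weight_kg": None},
    {"height_m": 1.75, "weight_kg": 70.0},
    {"height_m": 1.60, "weight_kg": 55.0},
]


def impute_median(rows, column):
    """Replace missing values in `column` with the median of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def add_bmi(rows):
    """Create a new feature from two existing ones: weight / height^2."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]


# Impute first so the derived feature is defined for every row.
features = add_bmi(impute_median(rows, "weight_kg"))
```

The order matters here: imputing before creating the derived feature avoids propagating missing values into the new column.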

Features vary in significance. [6] Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting). [7]
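One simple way to act on this is a filter-style selection that keeps only features whose absolute Pearson correlation with the target exceeds a threshold. The sketch below is one possible approach under assumed data and an assumed threshold of 0.5; it is not the method of any cited source, and correlation only captures linear relationships.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_features(columns, target, threshold):
    """Keep feature names whose |correlation with target| >= threshold."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical housing data: two informative columns, two uninformative ones.
columns = {
    "size_m2": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "door_color_code": [3, 1, 2, 1, 3],
    "constant": [1, 1, 1, 1, 1],
}
price = [150, 180, 240, 300, 360]

selected = select_features(columns, price, threshold=0.5)
```

With fewer, more relevant inputs, the model has less opportunity to fit noise that is specific to the training set.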

Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization. Common causes include:

Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection. [8]
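To give a sense of scale, the following sketch counts how many polynomial feature combinations arise from a fixed number of inputs. It relies on the standard combinatorial result that there are C(n + d, d) monomials of total degree at most d in n variables; the choice of 100 inputs is an arbitrary assumption.

```python
from math import comb


def n_polynomial_features(n_inputs: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in n_inputs variables.

    C(n + d, d) counts all monomials of degree <= d; subtracting 1
    excludes the constant term.
    """
    return comb(n_inputs + degree, degree) - 1


# Growth in feature count for 100 raw inputs as the degree increases.
counts = {degree: n_polynomial_features(100, degree) for degree in (1, 2, 3)}
```

Going from degree 1 to degree 3 turns 100 inputs into 176,850 candidate features, which is why combination-based feature generation usually needs to be paired with regularization or selection.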

Automation

Automation of feature engineering is a research topic that dates back to the 1990s. [9] Machine learning software that incorporates automated feature engineering has been commercially available since 2016. [10] Related academic literature can be roughly separated into two types:

Multi-relational decision tree learning (MRDTL)

Multi-relational decision tree learning (MRDTL) extends traditional decision tree methods to relational databases, handling complex data relationships across tables. It uses selection graphs as decision nodes, which are refined systematically until a specified termination criterion is reached. [9]

Most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation. [11] [12]

Open-source implementations

There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:

Deep feature synthesis

The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition. [29] [30]
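The central idea of DFS, stacking aggregation functions along the relationships between tables, can be illustrated with a small sketch. This is not the published algorithm or the API of any library: the tables, column names, and the `aggregate` helper are hypothetical, and only one hand-picked feature of depth 2 is built instead of an automatic enumeration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: customers -> orders -> items.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 7.0},
]


def aggregate(child_rows, key, value, func):
    """Group child rows by a foreign key and reduce one column with func."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Depth 1: SUM(items.price) for each order.
order_total = aggregate(items, "order_id", "price", sum)
orders_with_total = [
    {**o, "sum_price": order_total.get(o["order_id"], 0.0)} for o in orders
]

# Depth 2: MEAN(orders.SUM(items.price)) for each customer.
customer_features = aggregate(orders_with_total, "customer_id", "sum_price", mean)
```

Each additional level of stacking reuses the features from the level below, so a handful of aggregation functions can generate many candidate features across a relational schema.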

Feature stores

A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location in which users can create or update groups of features derived from multiple data sources, and create or update new datasets from those feature groups. These datasets serve model training, or applications that retrieve precomputed features when they need them to make predictions instead of computing the features themselves. [31]

A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used. [32]
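These three responsibilities (storing feature code, applying it to raw data, and serving values on request) can be sketched with a toy in-memory class. The class name, method names, and the rule that a registered version cannot be overwritten are all assumptions made for illustration; real feature stores differ considerably in design and scale.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

FeatureFn = Callable[[dict], float]


@dataclass
class InMemoryFeatureStore:
    """Hypothetical toy store: keeps feature code, applies it, serves values."""

    # (feature name, version) -> code that computes the feature
    definitions: Dict[Tuple[str, int], FeatureFn] = field(default_factory=dict)
    # (feature name, version, entity id) -> computed value
    values: Dict[Tuple[str, int, str], float] = field(default_factory=dict)

    def register(self, name: str, version: int, fn: FeatureFn) -> None:
        """Store feature code; an existing version cannot be overwritten."""
        key = (name, version)
        if key in self.definitions:
            raise ValueError(f"{name} v{version} is already registered")
        self.definitions[key] = fn

    def materialize(self, name: str, version: int, raw: Dict[str, dict]) -> None:
        """Apply the stored code to raw records and keep the results."""
        fn = self.definitions[(name, version)]
        for entity_id, record in raw.items():
            self.values[(name, version, entity_id)] = fn(record)

    def get(self, name: str, version: int, entity_id: str) -> float:
        """Serve a precomputed feature value to a model or application."""
        return self.values[(name, version, entity_id)]


store = InMemoryFeatureStore()
store.register("order_value", 1, lambda r: r["qty"] * r["unit_price"])
store.register(
    "order_value", 2, lambda r: r["qty"] * r["unit_price"] * (1 - r["discount"])
)

raw_orders = {"o1": {"qty": 2, "unit_price": 4.0, "discount": 0.25}}
for version in (1, 2):
    store.materialize("order_value", version, raw_orders)
```

Keeping both versions side by side means a model trained on version 1 can keep requesting exactly the values it was trained on, while newer models adopt version 2.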

Feature stores can be standalone software tools or built into machine learning platforms.

Alternatives

Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error. [33] [34] Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering. [35] However, deep learning algorithms still require careful preprocessing and cleaning of the input data. [36] In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process. [37]

References

  1. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN   978-0-387-84884-6.
  2. Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge: Cambridge University Press. ISBN   9781107057135.
  3. Murphy, Kevin P. (2022). Probabilistic Machine Learning. Cambridge, Massachusetts: The MIT Press (Copyright 2022 Massachusetts Institute of Technology, this work is subject to a Creative Commons CC-BY-NC-ND license). ISBN   9780262046824.
  4. MacQueron C (2021). SOLID-LIQUID MIXING IN STIRRED TANKS : Modeling, Validation, Design Optimization and Suspension Quality Prediction (Report). doi:10.13140/RG.2.2.11074.84164/1.
  5. "Feature engineering - Machine Learning Lens". docs.aws.amazon.com. Retrieved 2024-03-01.
  6. "Feature Engineering" (PDF). 2010-04-22. Retrieved 12 November 2015.
  7. "Feature engineering and selection" (PDF). Alexandre Bouchard-Côté. October 1, 2009. Retrieved 12 November 2015.
  8. "Feature engineering in Machine Learning" (PDF). Zdenek Zabokrtsky. Archived from the original (PDF) on 4 March 2016. Retrieved 12 November 2015.
  9. Knobbe AJ, Siebes A, Van Der Wallen D (1999). "Multi-relational Decision Tree Induction" (PDF). Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. pp. 378–383. doi:10.1007/978-3-540-48247-5_46. ISBN 978-3-540-66490-1.
  10. "Its all about the features". Reality AI Blog. September 2017.
  11. Yin X, Han J, Yang J, Yu PS (2004). "CrossMine: Efficient classification across multiple database relations". Proceedings. 20th International Conference on Data Engineering. pp. 399–410. doi:10.1109/ICDE.2004.1320014. ISBN   0-7695-2065-0. S2CID   1183403.
  12. Frank R, Moser F, Ester M (2007). "A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions". Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science. Vol. 4702. pp. 430–437. doi:10.1007/978-3-540-74976-9_43. ISBN   978-3-540-74975-2.
  13. "What is Featuretools?" . Retrieved September 7, 2022.
  14. "Featuretools - An open source python framework for automated feature engineering" . Retrieved September 7, 2022.
  15. "github: alteryx/featuretools". GitHub . Retrieved September 7, 2022.
  16. Thanh Lam, Hoang; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017-06-01). "One button machine for automating feature engineering in relational databases". arXiv: 1706.00327 [cs.DB].
  17. Thanh Lam, Hoang; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017-06-01). "One button machine for automating feature engineering in relational databases". arXiv: 1706.00327 [cs.DB].
  18. "getML documentation" . Retrieved September 7, 2022.
  19. "github: getml/getml-community". GitHub . Retrieved September 7, 2022.
  20. "github: getml/getml-community". GitHub . Retrieved September 7, 2022.
  21. "github: getml/getml-community". GitHub . Retrieved September 7, 2022.
  22. "tsfresh documentation" . Retrieved September 7, 2022.
  23. "Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)" . Retrieved September 7, 2022.
  24. "predict-idlab/tsflex". GitHub . Retrieved September 7, 2022.
  25. Van Der Donckt, Jonas; Van Der Donckt, Jeroen; Deprost, Emiel; Van Hoecke, Sofie (2022). "tsflex: Flexible time series processing & feature extraction". SoftwareX. 17: 100971. arXiv: 2111.12429 . Bibcode:2022SoftX..1700971V. doi:10.1016/j.softx.2021.100971. S2CID   244527198 . Retrieved September 7, 2022.
  26. "seglearn user guide" . Retrieved September 7, 2022.
  27. "Welcome to TSFEL documentation!" . Retrieved September 7, 2022.
  28. "github: facebookresearch/Kats". GitHub . Retrieved September 7, 2022.
  29. "Automating big-data analysis". 16 October 2015.
  30. Kanter, James Max; Veeramachaneni, Kalyan (2015). "Deep feature synthesis: Towards automating data science endeavors". 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). pp. 1–10. doi:10.1109/DSAA.2015.7344858. ISBN   978-1-4673-8272-4. S2CID   206610380.
  31. "What is a feature store" . Retrieved 2022-04-19.
  32. "An Introduction to Feature Stores" . Retrieved 2021-04-15.
  33. "Feature Engineering in Machine Learning". Engineering Education (EngEd) Program | Section. Retrieved 2023-03-21.
  34. explorium_admin (2021-10-25). "5 Reasons Why Feature Engineering is Challenging". Explorium. Retrieved 2023-03-21.
  35. Spiegelhalter, D. J. (2019). The art of statistics : learning from data. [London] UK. ISBN 978-0-241-39863-0. OCLC 1064776283.
  36. Sarker IH (November 2021). "Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions". SN Computer Science. 2 (6): 420. doi:10.1007/s42979-021-00815-1. PMC   8372231 . PMID   34426802.
  37. Bengio, Yoshua (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures", Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, vol. 7700, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 437–478, arXiv: 1206.5533 , doi:10.1007/978-3-642-35289-8_26, ISBN   978-3-642-35288-1, S2CID   10808461 , retrieved 2023-03-21
