Feature Selection Toolbox

Developer(s): UTIA, Czech Academy of Sciences
Stable release: 3.1.1 / 9 September 2012
Written in: C++
Operating system: Cross-platform (v3)
Type: Machine learning, pattern recognition
License: Free for non-commercial use
Website: fst.utia.cz

Feature Selection Toolbox (FST) is software primarily for feature selection in the machine learning domain.[1] It is written in C++ and was developed at the Institute of Information Theory and Automation (UTIA) of the Czech Academy of Sciences.

Version 1

The first generation of Feature Selection Toolbox (FST1) was a Windows application with a graphical user interface that allowed users to apply several sub-optimal, optimal, and mixture-based feature selection methods to data stored in a simple proprietary textual flat-file format.[2]

Version 3

The third generation of Feature Selection Toolbox (FST3) is a library without a user interface, written to be more efficient and versatile than the original FST1.[3]

FST3 supports several standard data mining tasks, specifically data preprocessing and classification, but its main focus is feature selection. In the feature selection context, it implements several common as well as less usual techniques, with particular emphasis on threaded implementations of various sequential search methods, a form of hill-climbing (a minimal sketch of this greedy scheme follows the list below). Implemented methods include:

  - individual feature ranking
  - floating search
  - oscillating search (suitable for very high-dimensional problems), in randomized or deterministic form
  - optimal methods of the branch and bound type
  - probabilistic class-distance criteria
  - various classifier accuracy estimators
  - feature subset size optimization
  - feature selection with pre-specified feature weights
  - criteria ensembles
  - hybrid methods
  - detection of all equivalent solutions
  - two-criterion optimization

FST3 is more narrowly specialized than popular software such as Weka (Waikato Environment for Knowledge Analysis), RapidMiner, or PRTools.[4]
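The sequential search that FST3 emphasizes follows a simple greedy scheme, which the following C++ fragment sketches as sequential forward selection. This is an illustration of the general technique only, not FST3's actual API; the toy criterion in main() is a hypothetical stand-in for the subset evaluation measures (class-distance criteria, estimated classifier accuracy) that would be plugged into this role in practice.

```cpp
#include <functional>
#include <iostream>
#include <set>

// Evaluates a candidate feature subset; higher is better. In a real
// toolbox this would be a class-distance criterion or an estimate of
// classifier accuracy on the subset.
using Criterion = std::function<double(const std::set<int>&)>;

// Sequential forward selection: starting from the empty set, greedily
// add the single feature that most improves the criterion, until the
// target subset size is reached (assumes targetSize <= nFeatures).
std::set<int> forwardSelection(int nFeatures, int targetSize, const Criterion& J) {
    std::set<int> selected;
    while (static_cast<int>(selected.size()) < targetSize) {
        int bestF = -1;
        double bestScore = -1e300;
        for (int f = 0; f < nFeatures; ++f) {
            if (selected.count(f)) continue;   // skip already-chosen features
            std::set<int> candidate = selected;
            candidate.insert(f);
            double score = J(candidate);
            if (score > bestScore) { bestScore = score; bestF = f; }
        }
        selected.insert(bestF);                // commit the best addition
    }
    return selected;
}

int main() {
    // Toy criterion (hypothetical): rewards features with index
    // divisible by 3, so the greedy search should pick 0, 3, 6.
    Criterion J = [](const std::set<int>& s) {
        double v = 0;
        for (int f : s) v += (f % 3 == 0) ? 1.0 : 0.1;
        return v;
    };
    for (int f : forwardSelection(10, 3, J)) std::cout << f << ' ';
    std::cout << '\n';
}
```

Floating and oscillating search refine this scheme by also allowing conditional removals, which lets the search escape some of the local optima that plain hill-climbing gets stuck in.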

By default, the techniques implemented in the toolbox assume that the data are available as a single flat file, either in a simple proprietary format or in Weka's ARFF format, where each data point is described by a fixed number of numeric attributes. FST3 is provided without a user interface and is meant for users familiar with both machine learning and C++ programming. The older FST1 software is more suitable for simple experimentation or educational purposes, because it can be used without writing any C++ code.
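As an illustration of this kind of input, here is a minimal sketch of a reader for a flat numeric data file, assuming one sample per line with comma-separated attribute values. It is not FST3's actual loader; both the proprietary format and ARFF parsing differ in detail (ARFF, for instance, carries an @attribute header section before the @data block).

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Reads a flat file where each line is one sample consisting of a
// fixed number of comma-separated numeric attribute values, e.g.
//   5.1,3.5,1.4,0.2
// Returns one row per sample. Illustrative stand-in only.
std::vector<std::vector<double>> loadFlatFile(const std::string& path) {
    std::vector<std::vector<double>> data;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;            // skip blank lines
        std::vector<double> row;
        std::stringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ','))
            row.push_back(std::stod(field));   // one numeric attribute per field
        data.push_back(row);
    }
    return data;
}
```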

Related Research Articles

Machine learning: Study of algorithms that improve automatically through experience

Machine learning (ML) is an umbrella term for solving problems for which the development of algorithms by human programmers would be cost-prohibitive; instead, the problems are solved by helping machines 'discover' their 'own' algorithms, without needing to be explicitly told what to do by any human-developed algorithm. Recently, generative artificial neural networks have been able to surpass the results of many previous approaches. Machine learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture, and medicine, where it is too costly to develop algorithms to perform the needed tasks.

In software engineering, the terms frontend and backend refer to the separation of concerns between the presentation layer (frontend) and the data access layer (backend) of a piece of software, or of the physical infrastructure or hardware. In the client–server model, the client is usually considered the frontend and the server the backend, even when some presentation work is actually done on the server itself.

Feature selection: Procedure in machine learning and statistics

Feature selection is the process of selecting a subset of relevant features for use in model construction. Stylometry and DNA microarray analysis are two cases where feature selection is used. It should be distinguished from feature extraction.

OpenCV: Computer vision library

OpenCV is a library of programming functions mainly for real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage, then Itseez. The library is cross-platform and licensed as free and open-source software under Apache License 2. Starting in 2011, OpenCV features GPU acceleration for real-time operations.

Kernel density estimation

In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.
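For reference, the estimator has the standard textbook form (standard notation, not specific to this source): given samples $x_1, \dots, x_n$, a kernel $K$, and a bandwidth $h > 0$,

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$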

Orange (software)

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative qualitative data analysis and interactive data visualization.

Weka (software)

Waikato Environment for Knowledge Analysis (Weka) is a collection of free software for machine learning and data analysis, licensed under the GNU General Public License. It was developed at the University of Waikato, New Zealand, and is the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques".

Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.

In machine learning the random subspace method, also called attribute bagging or feature bagging, is an ensemble learning method that attempts to reduce the correlation between estimators in an ensemble by training them on random samples of features instead of the entire feature set.
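The mechanism is straightforward to sketch. The following minimal C++ fragment (an illustration under assumed names, not any particular library's API) draws the random feature subset on which one ensemble member would be trained:

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Draws one random feature subset of size k out of nFeatures, as used
// when training a single estimator in a random subspace ensemble.
// The ensemble training and prediction aggregation are omitted.
std::vector<int> randomSubspace(int nFeatures, int k, std::mt19937& rng) {
    std::vector<int> all(nFeatures);
    std::iota(all.begin(), all.end(), 0);       // 0, 1, ..., nFeatures-1
    std::shuffle(all.begin(), all.end(), rng);  // random permutation
    all.resize(k);                              // keep the first k indices
    return all;
}
```

Each estimator in the ensemble gets its own independent draw, and their predictions are then aggregated, for example by majority vote.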

Waffles is a collection of command-line tools for performing machine learning operations developed at Brigham Young University. These tools are written in C++, and are available under the GNU Lesser General Public License.

Symbolic regression: Type of regression analysis

Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity.

Neural Designer is a software tool for machine learning based on neural networks, a main area of artificial intelligence research, and contains a graphical user interface which simplifies data entry and interpretation of results.

Glossary of artificial intelligence: List of definitions of terms and concepts commonly used in the study of artificial intelligence

This glossary of artificial intelligence is a list of definitions of terms and concepts relevant to the study of artificial intelligence, its sub-disciplines, and related fields. Related glossaries include Glossary of computer science, Glossary of robotics, and Glossary of machine vision.

Outline of machine learning: Overview of and topical guide to machine learning

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

Automated machine learning: Process of automating the application of machine learning

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. AutoML was proposed as an artificial intelligence-based solution to the growing challenge of applying machine learning. The high degree of automation in AutoML aims to allow non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models. Common techniques used in AutoML include hyperparameter optimization, meta-learning and neural architecture search.

References

  1. Petr Somol; Jana Novovičová; Pavel Pudil (2010). "Efficient Feature Subset Selection and Subset Size Optimization" (PDF). Pattern Recognition Recent Advances. INTECH. pp. 75–97. ISBN 978-953-7619-90-9.
  2. Petr Somol; Pavel Pudil (2002). "Feature Selection Toolbox" (PDF). Pattern Recognition. 35 (12). Elsevier. pp. 2749–2759.
  3. Petr Somol; Pavel Vácha; Stanislav Mikeš; Jan Hora; Pavel Pudil; Pavel Žid (2010). "Introduction to Feature Selection Toolbox 3 – The C++ Library for Subset Search, Data Modeling and Classification" (PDF). UTIA Tech. Report No. 2287. pp. 1–12. Retrieved 2 November 2010.
  4. PRTools

Official website