Weka (machine learning)

Weka
Developer(s)	University of Waikato
Stable release	3.8.6 (stable) / January 28, 2022
Preview release	3.9.6 / January 28, 2022
Repository	git.cms.waikato.ac.nz/weka/weka
Written in	Java
Operating system	Windows, macOS, Linux
Platform	IA-32, x86-64, ARM; Java SE
Type	Machine learning
License	GNU General Public License
Website	www.cs.waikato.ac.nz/~ml/weka

Last updated December 16, 2022

Waikato Environment for Knowledge Analysis (Weka), developed at the University of Waikato, New Zealand, is free software licensed under the GNU General Public License, and the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques".^[1]

Description

Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions.^[1] The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other programming languages, plus data preprocessing utilities in C, and a makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains,^[2]^[3] but the more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research. Advantages of Weka include:

Free availability under the GNU General Public License.
Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
A comprehensive collection of data preprocessing and modeling techniques.
Ease of use due to its graphical user interfaces.

Weka supports several standard data mining tasks, specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format (ARFF), with the filename bearing the .arff extension. All of Weka's techniques are predicated on the assumption that the data is available as one flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity (JDBC) and can process the result returned by a database query. It also provides access to deep learning through Deeplearning4j.^[4] Weka is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using Weka.^[5] Another important area that is currently not covered by the algorithms included in the Weka distribution is sequence modeling.
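
As a rough illustration of the flat, fixed-attribute layout described above, the following Java sketch reads a small ARFF-style text into a relation name, a list of attribute declarations, and a list of data rows. It is not Weka's own loader: the class ArffSketch, its Relation holder, and the toy weather data are hypothetical names chosen for this example, and the sketch deliberately ignores quoting, sparse instances, and date or string attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified ARFF reader sketch (hypothetical; not part of Weka). */
public class ArffSketch {

    /** Holds one flat relation: a name, attribute declarations, and rows. */
    static final class Relation {
        String name = "";
        final List<String> attributeNames = new ArrayList<>();
        final List<String> attributeTypes = new ArrayList<>();
        final List<String[]> rows = new ArrayList<>();
    }

    /** Parses the header (@relation, @attribute) and the @data section. */
    static Relation parse(String arffText) {
        Relation rel = new Relation();
        boolean inData = false;
        for (String raw : arffText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and comments
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (inData) {
                String[] values = line.split(",");
                // Every data point must have the same, fixed number of attributes.
                if (values.length != rel.attributeNames.size()) {
                    throw new IllegalArgumentException("Wrong number of values: " + line);
                }
                for (int i = 0; i < values.length; i++) {
                    values[i] = values[i].trim();
                }
                rel.rows.add(values);
            } else if (lower.startsWith("@relation")) {
                rel.name = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                String rest = line.substring("@attribute".length()).trim();
                String[] parts = rest.split("\\s+", 2); // name, then type or nominal set
                if (parts.length < 2) {
                    throw new IllegalArgumentException("Attribute without a type: " + line);
                }
                rel.attributeNames.add(parts[0]);
                rel.attributeTypes.add(parts[1]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            }
        }
        return rel;
    }

    public static void main(String[] args) {
        // Toy data invented for this example.
        String arff = String.join("\n",
            "% hypothetical toy data",
            "@relation weather",
            "@attribute outlook {sunny, overcast, rainy}",
            "@attribute temperature numeric",
            "@attribute play {yes, no}",
            "@data",
            "sunny,85,no",
            "overcast,83,yes",
            "rainy,70,yes");
        Relation rel = parse(arff);
        System.out.println(rel.name + ": " + rel.attributeNames.size()
            + " attributes, " + rel.rows.size() + " instances");
    }
}
```

The length check in the data section mirrors the single-table assumption: a row with more or fewer values than declared attributes is rejected rather than silently padded.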

Extension packages

In version 3.7.2, a package manager was added to make it easier to install extension packages.^[6] Some functionality that was included with Weka before this version has since been moved into such extension packages. The change also makes it easier for others to contribute extensions to Weka and to maintain the software, because the modular architecture allows the Weka core and individual extensions to be updated independently.

History

In 1993, the University of Waikato in New Zealand began development of the original version of Weka, which became a mix of Tcl/Tk, C, and makefiles.
In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of modeling algorithms.^[7]
In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award.^[8]^[9]
In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence.^[10] It forms the data mining and predictive analytics component of the Pentaho business intelligence suite. Pentaho has since been acquired by Hitachi Vantara, and Weka now underpins the PMI (Plugin for Machine Intelligence) open source component.^[11]

Related tools

Auto-WEKA is an automated machine learning system for Weka.^[12]
Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a similar project to Weka with a focus on cluster analysis, i.e., unsupervised methods.
H2O.ai is an open-source data science and machine learning platform.
KNIME is a machine learning and data mining software implemented in Java.
Massive Online Analysis (MOA) is an open-source project for large scale mining of data streams, also developed at the University of Waikato in New Zealand.
Neural Designer is a data mining software based on deep learning techniques written in C++.
Orange is a similar open-source project for data mining, machine learning and visualization based on scikit-learn.
RapidMiner is a commercial machine learning framework implemented in Java which integrates Weka.
scikit-learn is a popular machine learning library in Python.

Related Research Articles

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities.

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative rapid qualitative data analysis and interactive data visualization.

Neural network software is used to simulate, research, develop, and apply artificial neural networks, software concepts adapted from biological neural networks, and in some cases, a wider array of adaptive systems such as artificial intelligence and machine learning.

Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks of Analytics" concept. A graphical user interface and use of JDBC allow assembly of nodes blending different data sources, including preprocessing, for modeling, data analysis and visualization without, or with only minimal, programming.

Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. Its headquarters are in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015 and in 2017 became part of Hitachi Vantara.

ELKI is a data mining software framework developed for use in research and teaching. It was originally developed at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and is now continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Feature Selection Toolbox (FST) is software primarily for feature selection in the machine learning domain, written in C++, developed at the Institute of Information Theory and Automation (UTIA), of the Czech Academy of Sciences.

Massive Online Analysis (MOA) is a free open-source software project specifically for data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand.

Ian H. Witten is a computer scientist at the University of Waikato, New Zealand. He is a Chartered Engineer with the Institute of Electrical Engineers in London who graduated from the University of Cambridge with a BA and MA in mathematics in 1969 and an M.Sc. in mathematics and computer science from the University of Calgary, where he was a Commonwealth Scholar, in 1970. He received his Ph.D. for Learning to Control in 1976 from the University of Essex, England. Witten discovered temporal-difference learning, inventing the tabular TD(0), the first temporal-difference learning rule for reinforcement learning. Witten is a co-creator of the Sequitur algorithm and conceived and obtained funding for the development of the original WEKA software package for data mining. Witten further made considerable contributions to the field of compression, creating novel algorithms for text and image compression with Alistair Moffat and Timothy C. Bell. He is also one of the major contributors to the digital libraries field, and founder of the Greenstone Digital Library Software.

Feature engineering or feature extraction or feature discovery is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.

Neural Designer is a software tool for machine learning based on neural networks, a main area of artificial intelligence research, and contains a graphical user interface which simplifies data entry and interpretation of results.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned.

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. AutoML was proposed as an artificial intelligence-based solution to the growing challenge of applying machine learning. The high degree of automation in AutoML aims to allow non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models. Common techniques used in AutoML include hyperparameter optimization, meta-learning and neural architecture search.

scikit-multiflow is a free and open source software machine learning library for multi-output/multi-label and stream data written in Python.

References

[1] Witten, Ian H.; Frank, Eibe; Hall, Mark A.; Pal, Christopher J. (2011). "Data Mining: Practical machine learning tools and techniques, 3rd Edition". Morgan Kaufmann, San Francisco (CA). Retrieved 2011-01-19.
[2] Holmes, Geoffrey; Donkin, Andrew; Witten, Ian H. (1994). "Weka: A machine learning workbench" (PDF). Proceedings of the Second Australia and New Zealand Conference on Intelligent Information Systems, Brisbane, Australia. Retrieved 2007-06-25.
[3] Garner, Stephen R.; Cunningham, Sally Jo; Holmes, Geoffrey; Nevill-Manning, Craig G.; Witten, Ian H. (1995). "Applying a machine learning workbench: Experience with agricultural databases" (PDF). Proceedings of the Machine Learning in Practice Workshop, Machine Learning Conference, Tahoe City (CA), USA. pp. 14–21. Retrieved 2007-06-25.
[4] "Weka Package Metadata". SourceForge. 2017. Retrieved 2017-11-11.
[5] Reutemann, Peter; Pfahringer, Bernhard; Frank, Eibe (2004). "Proper: A Toolbox for Learning from Relational Data with Propositional and Multi-Instance Learners". 17th Australian Joint Conference on Artificial Intelligence (AI2004). Springer-Verlag. CiteSeerX 10.1.1.459.8443.
[6] "weka-wiki - Packages". Retrieved 27 January 2020.
[7] Witten, Ian H.; Frank, Eibe; Trigg, Len; Hall, Mark A.; Holmes, Geoffrey; Cunningham, Sally Jo (1999). "Weka: Practical Machine Learning Tools and Techniques with Java Implementations" (PDF). Proceedings of the ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems. pp. 192–196. Retrieved 2007-06-26.
[8] Piatetsky-Shapiro, Gregory I. (2005-06-28). "KDnuggets news on SIGKDD Service Award 2005". Retrieved 2007-06-25.
[9] "Overview of SIGKDD Service Award winners". 2005. Retrieved 2007-06-25.
[10] "Pentaho Acquires Weka Project". Pentaho. Retrieved 2018-02-06.
[11] "Plugin for Machine Intelligence".
[12] Thornton, Chris; Hutter, Frank; Hoos, Holger H.; Leyton-Brown, Kevin (2013). Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. KDD '13 Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 847–855.

External links

Official website at University of Waikato in New Zealand

