Hutter Prize

The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression improvements on a specific 1 GB English text file, with the goal of encouraging research in artificial intelligence (AI).

Launched in 2006, the prize awards 5,000 euros for each one percent improvement (with 500,000 euros total funding) [1] in the compressed size of the file enwik9, which is the larger of two files used in the Large Text Compression Benchmark (LTCB); [2] enwik9 consists of the first 10^9 bytes of a specific version of English Wikipedia. [3] The ongoing [4] competition is organized by Hutter, Matt Mahoney, and Jim Bowery. [1]

The prize was announced on August 6, 2006 [1] with a smaller text file: enwik8, consisting of the first 100 MB of the same Wikipedia dump. On February 21, 2020, the contest was expanded by a factor of 10: enwik8 was replaced by the 1 GB enwik9, and the prize fund grew from 50,000 to 500,000 euros.

Goals

The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal-seeking agent in an unknown but computable environment is to guess at each step that the environment is probably controlled by one of the shortest programs consistent with all interaction so far. [5] However, there is no general solution, because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl) where the environment's programs are restricted to time t and space l, a solution can be computed in time O(t·2^l), which is still intractable.
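The "shortest consistent program" idea can be made precise with Solomonoff's universal prior, a standard formulation from algorithmic information theory (stated here for context; the article itself does not spell it out):

    M(x) = \sum_{p : U(p) = x*} 2^{-\ell(p)}

where U is a universal prefix Turing machine, \ell(p) is the length in bits of program p, and the sum runs over all programs whose output begins with the string x (written x*). Short programs dominate the sum, so sequences with simple explanations receive high probability, which is exactly the regularity a strong compressor must exploit.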

The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences. [6]
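To make the prediction-compression link concrete, here is a minimal sketch (illustrative only; the function name and toy model are our own, and real entries use vastly stronger models): a predictor that assigns probability p to the next character costs about -log2(p) bits under ideal arithmetic coding, so better prediction means smaller output.

    import math
    from collections import Counter

    def code_length_bits(text, order=2):
        """Estimate the compressed size of `text` under a toy order-2
        context model with add-one smoothing: each character costs
        -log2(p) bits, the ideal arithmetic-coding cost."""
        counts = {}                      # context -> Counter of next chars
        alphabet_size = len(set(text))
        total_bits = 0.0
        for i, ch in enumerate(text):
            ctx = text[max(0, i - order):i]
            ctr = counts.setdefault(ctx, Counter())
            p = (ctr[ch] + 1) / (sum(ctr.values()) + alphabet_size)
            total_bits += -math.log2(p)  # bits spent on this character
            ctr[ch] += 1                 # adapt the model as we go
        return total_bits

    sample = "the quick brown fox jumps over the lazy dog " * 20
    print(f"{code_length_bits(sample) / 8:.0f} bytes for {len(sample)} chars")

A better predictor, one that assigns higher probability to what actually comes next, drives the total down; Hutter Prize entries push this idea to its limit with large mixtures of context models.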

Large language models such as ChatGPT are not well suited to the Hutter Prize: among other reasons, they typically require far more computation time and storage space than the competition allows.

Rules

The contest is open-ended and open to everyone. To enter, a competitor must submit a compression program and a decompressor that decompresses to the file enwik9. [3] It is also possible to submit a compressed file instead of the compression program. The total size of the compressed file and decompressor (as a Win32 or Linux executable) must be less than or equal to 99% of the previous prize-winning entry. For each one percent of improvement, the competitor wins 5,000 euros. The decompression program must also meet execution-time and memory constraints.
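The payout scales linearly with the improvement. Below is a minimal sketch assuming the published funding formula Z = 500,000 € × (1 − S/L), where L is the previous record's total size and S is the new one; actual awards can differ when other rule details are adjusted.

    PRIZE_FUND_EUR = 500_000

    def award(new_size: int, previous_record: int) -> float:
        """Prize for beating the record: 5,000 EUR per percent saved."""
        improvement = 1 - new_size / previous_record
        if improvement < 0.01:      # entries must improve by at least 1%
            return 0.0
        return PRIZE_FUND_EUR * improvement

    # fx-cmix (2024) improving on fast cmix (2023):
    print(round(award(112_578_322, 114_156_155)))   # -> 6911, as in the table below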

Submissions must be published in order to allow independent verification. There is a 30-day waiting period for public comment before awarding a prize. In 2017, the rules were changed to require the release of the source code under a free software license, out of concern that "past submissions [which did not disclose their source code] had been useless to others and the ideas in them may be lost forever." [4]

Winners

Author                     | Date               | Program    | Total size (bytes) | Award
Kaido Orav and Byron Knoll | September 3, 2024  | fx2-cmix   | 110,793,128        | 7,950€
Kaido Orav                 | February 2, 2024   | fx-cmix    | 112,578,322        | 6,911€
Saurabh Kumar              | July 16, 2023      | fast cmix  | 114,156,155        | 5,187€
Artemiy Margaritov         | May 31, 2021       | starlit    | 115,352,938        | 9,000€
Alexander Rhatushnyak      | July 4, 2019       | phda9 v1.8 | 116,673,681        | No prize
Alexander Rhatushnyak      | November 4, 2017   | phda9      | 15,284,944         | 2,085€
Alexander Rhatushnyak      | May 23, 2009       | decomp8    | 15,949,688         | 1,614€
Alexander Rhatushnyak      | May 14, 2007       | paq8hp12   | 16,481,655         | 1,732€
Alexander Rhatushnyak      | September 25, 2006 | paq8hp5    | 17,073,018         | 3,416€
Matt Mahoney               | March 24, 2006     | paq8f      | 18,324,887         | No prize

Sizes through 2017 are for the 100 MB enwik8; entries from 2019 onward target the 1 GB enwik9.

Related Research Articles

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.
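As a minimal illustration of the encoder/decoder round trip (using Python's standard zlib module, which is unrelated to the programs in this article), "lossless" means the decoder reproduces the input exactly:

    import zlib

    original = b"Lossless compression removes statistical redundancy. " * 100
    compressed = zlib.compress(original, level=9)   # encoder
    restored = zlib.decompress(compressed)          # decoder

    assert restored == original                     # no information lost
    print(len(original), "->", len(compressed), "bytes")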

gzip: GNU file compression/decompression tool

gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and intended for use by GNU. Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

bzip2: File compression software

bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.

compress is a Unix shell compression program based on the LZW compression algorithm. Compared to gzip's fastest setting, compress is slightly slower at compression, slightly faster at decompression, and has a significantly lower compression ratio. Compressing the Hutter Prize data takes 1.8 MiB of memory, slightly more than gzip's slowest setting uses.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Rapid progress in the field of deep learning, beginning in the 2010s, allowed neural networks to surpass many previous approaches in performance.

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.
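A small sketch of the variable dictionary size, using Python's standard lzma module (the toy data is our own; real gains from a large window appear when repeated material lies far apart):

    import lzma

    data = b"Wikipedia is a free online encyclopedia. " * 50_000  # ~2 MB, highly redundant

    for dict_size in (1 << 16, 1 << 26):        # 64 KiB vs 64 MiB window
        filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
        out = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
        print(f"dictionary {dict_size:>10} bytes -> {len(out):,} bytes compressed")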

rzip is a data compression program for very large files, designed around initial LZ77-style string matching over a 900 MB dictionary window, followed by a bzip2-based Burrows–Wheeler transform and Huffman entropy coding on 900 kB output chunks.

Fabrice Bellard is a French computer programmer known for writing FFmpeg, QEMU, and the Tiny C Compiler. He developed Bellard's formula for calculating single digits of pi. In 2012, Bellard co-founded Amarisoft, a telecommunications company, with Franck Spinelli.

PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio. Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge. PAQ is free software distributed under the GNU General Public License.

Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield a prediction that is often more accurate than any of the individual predictions. For example, one simple method is to average the probabilities assigned by each model. The random forest is another method: it outputs the prediction that is the mode of the predictions output by individual models. Combining models is an active area of research in machine learning.
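A minimal sketch of the averaging method described above (toy numbers; real mixers such as PAQ's use weighted logistic mixing rather than a plain average):

    def mix(predictions):
        """Average several models' probabilities that the next bit is 1."""
        return sum(predictions) / len(predictions)

    p_order0 = 0.30   # a low-order context model's P(next bit = 1)
    p_match  = 0.90   # a match model's P(next bit = 1)
    print(mix([p_order0, p_match]))   # -> 0.6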

Intelligent agent: Software agent which acts autonomously

In intelligence and artificial intelligence, an intelligent agent (IA) is an agent that perceives its environment, takes actions autonomously in order to achieve goals, and may improve its performance with learning or acquiring knowledge. An intelligent agent may be simple or complex: A thermostat or other control system is considered an example of an intelligent agent, as is a human being, as is any system that meets the definition, such as a firm, a state, or a biome.

Marcus Hutter: AI researcher

Marcus Hutter is a computer scientist, professor and artificial intelligence researcher. As a senior researcher at DeepMind, he studies the mathematical foundations of artificial general intelligence.

The Calgary corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms. It was created by Ian Witten, Tim Bell and John Cleary from the University of Calgary in 1987 and was commonly used in the 1990s. In 1997 it was replaced by the Canterbury corpus, based on concerns about how representative the Calgary corpus was, but the Calgary corpus still exists for comparison and is still useful for its originally intended purpose.

In computing, computer performance is the amount of useful work accomplished by a computer system. Outside of specific contexts, computer performance is estimated in terms of accuracy, efficiency, and speed of executing computer program instructions.

There are a number of competitions and prizes to promote research in artificial intelligence.

A Gödel machine is a hypothetical self-improving computer program that solves problems in an optimal way. It uses a recursive self-improvement protocol in which it rewrites its own code when it can prove the new code provides a better strategy. The machine was invented by Jürgen Schmidhuber, but is named after Kurt Gödel who inspired the mathematical theories.

ZPAQ is an open source command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents.

This glossary of artificial intelligence is a list of definitions of terms and concepts relevant to the study of artificial intelligence, its sub-disciplines, and related fields. Related glossaries include Glossary of computer science, Glossary of robotics, and Glossary of machine vision.

Zstandard is a lossless data compression algorithm developed by Yann Collet at Facebook. Zstd is the corresponding reference implementation in C, released as open-source software on 31 August 2016.

References

  1. "500'000€ Prize for Compressing Human Knowledge". Hutter Prize. Retrieved 2023-01-08.
  2. Mahoney, Matt (2022-12-02). "Large Text Compression Benchmark". Retrieved 2023-01-08.
  3. Mahoney, Matt (2011-09-01). "About the Test Data". Retrieved 2022-11-16.
  4. "Human Knowledge Compression Contest Frequently Asked Questions & Answers". Hutter Prize. Retrieved 2022-10-14.
  5. Hutter, Marcus (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Texts in Theoretical Computer Science. An EATCS Series. Springer. doi:10.1007/b138233. ISBN 3-540-22139-5.
  6. Mahoney, Matt (2009-07-23). "Rationale for a Large Text Compression Benchmark". Retrieved 2022-11-16.