Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected for their ability to provide representative performance results.[1]

Contents

In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents,[2] totaling 2,810,784 bytes, as follows.

Size (bytes)   File name      Description
     152,089   alice29.txt    English text
     125,179   asyoulik.txt   Shakespeare
      24,603   cp.html        HTML source
      11,150   fields.c       C source
       3,721   grammar.lsp    LISP source
   1,029,744   kennedy.xls    Excel spreadsheet
     426,754   lcet10.txt     Technical writing
     481,861   plrabn12.txt   Poetry (Paradise Lost)
     513,216   ptt5           CCITT test set
      38,240   sum            SPARC executable
       4,227   xargs.1        GNU manual page
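In practice, benchmarking against the corpus means running a compressor over each file and reporting the per-file compressed size or compression ratio. A minimal sketch of that measurement, using Python's built-in zlib (DEFLATE) purely as an example compressor (the corpus itself does not prescribe one), might look like:

```python
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size; smaller is better."""
    return len(zlib.compress(data, level)) / len(data)

# To benchmark the corpus, apply this per file, e.g.:
# for name in ("alice29.txt", "kennedy.xls", "ptt5"):
#     with open(name, "rb") as f:
#         print(name, compression_ratio(f.read()))
```

Because additional files may appear in the corpora over time, reporting per-file ratios (rather than a single aggregate) keeps results comparable across studies.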

The University of Canterbury also offers several additional corpora. Files may be added to these over time, so results should be reported only for individual files.[3]

References

  1. Ian H. Witten; Alistair Moffat; Timothy C. Bell (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 978-1-55860-570-1.
  2. Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 978-1-84628-603-2.
  3. "The Canterbury Corpus: Descriptions". corpus.canterbury.ac.nz.