Silesia corpus

The Silesia corpus is a collection of files used as a benchmark for lossless data compression algorithms. It was created in 2003 as an alternative to the Canterbury corpus and Calgary corpus, motivated by concerns about how well those corpora represented modern files. It contains a variety of data types, including large text documents, executable files, and databases. [1] It is widely used in data compression research. [2]

Contents

The corpus consists of 12 files totaling 211,938,580 bytes (about 212 MB). The files were chosen to represent data types that the author expected to grow rapidly in size over time, such as computer programs and databases, along with more traditional compression benchmarks, such as large text files. [1]

Overview of files, their sizes, descriptions, and data types
| File | Size (B) | Description | Type of data |
|---|---|---|---|
| dickens | 10192446 | The works of Charles Dickens | English text |
| mozilla | 51220480 | Executable files for Mozilla 1.0 | Executable |
| mr | 9970564 | MRI images | 3D image |
| nci | 33553445 | A database of chemical structures | Database |
| ooffice | 6152192 | A shared library from OpenOffice | Executable |
| osdb | 10085684 | A sample MySQL database from the Open Source Database Benchmark | Database |
| reymont | 6627202 | The text of the book Chłopi by Władysław Reymont | PDF in Polish |
| samba | 21606400 | The source code of Samba 2-2.3 | Source code |
| sao | 7251944 | The SAO star catalogue | Binary database |
| webster | 41458703 | The 1913 Webster Unabridged Dictionary | HTML |
| xml | 5345280 | Collected XML files | XML |
| x-ray | 8474240 | A medical X-ray | Image |
| Total | 211938580 | | |
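As a quick sanity check on a downloaded copy, the total in the table can be compared with the sizes of the files on disk. The following is a minimal sketch, not an official tool: the directory name `silesia` and the helper names `corpus_size` and `describe_total` are assumptions made for illustration.

```python
from pathlib import Path

EXPECTED_TOTAL_BYTES = 211_938_580  # total from the table above


def corpus_size(directory):
    """Return (file_count, total_bytes) for the regular files in `directory`."""
    sizes = [p.stat().st_size for p in Path(directory).iterdir() if p.is_file()]
    return len(sizes), sum(sizes)


def describe_total(total_bytes):
    """Format a byte count in decimal megabytes and binary mebibytes."""
    return f"{total_bytes / 10**6:.2f} MB ({total_bytes / 2**20:.2f} MiB)"


if __name__ == "__main__":
    corpus_dir = Path("silesia")  # hypothetical location of the unpacked corpus
    if corpus_dir.is_dir():
        count, total = corpus_size(corpus_dir)
        print(count, "files,", describe_total(total))
        print("matches table total:", total == EXPECTED_TOTAL_BYTES)
    else:
        print("expected total:", describe_total(EXPECTED_TOTAL_BYTES))
```

Reporting both units helps avoid confusion when comparing published figures, since the same byte count is roughly 212 MB in decimal units but roughly 202 MiB in binary units.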

Because it covers a broader and more modern selection of data types, the Silesia corpus is considered a better source of test data for compression algorithms than the Calgary corpus. [3]
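To illustrate how a corpus like this is typically used, the sketch below computes a compression ratio per file and per codec. It is an illustrative assumption rather than a method prescribed by the corpus author: it uses only the `zlib`, `bz2`, and `lzma` modules from the Python standard library, the directory name `silesia` is hypothetical, and the function names `compression_ratio` and `benchmark` are invented for this example.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Standard-library codecs only; each maps bytes -> compressed bytes.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 6),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_ratio(data, compress):
    """Original size divided by compressed size (higher is better)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(data) / len(compress(data))


def benchmark(directory):
    """Return {file name: {codec name: ratio}} for each file in `directory`."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            data = path.read_bytes()  # the largest corpus file is about 51 MB
            results[path.name] = {
                name: compression_ratio(data, fn) for name, fn in CODECS.items()
            }
    return results


if __name__ == "__main__":
    corpus_dir = Path("silesia")  # hypothetical location of the unpacked corpus
    if corpus_dir.is_dir():
        for file_name, ratios in benchmark(corpus_dir).items():
            row = "  ".join(f"{name}={ratio:.2f}" for name, ratio in ratios.items())
            print(f"{file_name:10s} {row}")
```

A fuller benchmark would also record compression and decompression time, since ratio alone does not capture the speed trade-offs between codecs.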

References

  1. Deorowicz, Sebastian. Universal Lossless Data Compression Algorithms (PDF) (Thesis). Silesian University of Technology. pp. 93–95. Archived from the original (PDF) on 2024-08-28.
  2. Maulidina, Alysha Puti; Wijaya, Rachel Anastasia; Mazel, Kimberly; Astriani, Maria Seraphina (2024). "Comparative Study of Data Compression Algorithms: Zstandard, zlib & LZ4". Science, Engineering Management and Information Technology: 394–406. doi:10.1007/978-3-031-72284-4_24.
  3. Gupta, Apoorv; Bansal, Aman; Khanduja, Vidhi (2017-02-22). "Modern lossless compression techniques: Review, comparison and analysis". 2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT). IEEE. pp. 1–8. doi:10.1109/ICECCT.2017.8117850. ISBN 978-1-5090-3239-6.