Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand and designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.^[1]

Size (bytes)	File name	Description
152,089	alice29.txt	English text
125,179	asyoulik.txt	Shakespeare
24,603	cp.html	HTML source
11,150	fields.c	C source
3,721	grammar.lsp	LISP source
1,029,744	kennedy.xls	Excel spreadsheet
426,754	lcet10.txt	Technical writing
481,861	plrabn12.txt	Poetry (Paradise Lost)
513,216	ptt5	CCITT test set
38,240	sum	SPARC executable
4,227	xargs.1	GNU manual page

The University of Canterbury also offers the following corpora. Additional files may be added, so results should be only reported for individual files.^[3]

The Artificial Corpus, a set of files with highly "artificial" data designed to evoke pathological or worst-case behavior. Last updated 2000 (tar timestamp).
The Large Corpus, a set of large (megabyte-size) files. Contains an E. coli genome, a King James bible, and the CIA world fact book. Last updated 1997 (tar timestamp).
The Miscellaneous Corpus. Contains one million digits of pi. Last updated 2000 (tar timestamp).

References

^ Ian H. Witten; Alistair Moffat; Timothy C. Bell (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.
^ Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.
^ "The Canterbury Corpus: Descriptions". corpus.canterbury.ac.nz.

External links

Official website

This computer science article is a stub. You can help Wikipedia by expanding it.

[1] Ian H. Witten; Alistair Moffat; Timothy C. Bell (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.

[2] Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.

[3] "The Canterbury Corpus: Descriptions". corpus.canterbury.ac.nz.

[1]

[2]

[3]

Canterbury corpus

Contents

See also

References

External links