Text corpora (singular: text corpus) are large and structured sets of texts, which have been systematically collected. Text corpora are used by corpus linguists and within other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.[1]
Corpus Resource Database (CoRD), a collection of more than 80 English-language corpora.[2]
Coruña Corpus, a corpus of late Modern English scientific writing covering the period 1700–1900, developed by the MuStE research group at the University of A Coruña
Spanish text corpus by Molino de Ideas, which contains 660 million words.[7]
CorALit: the Corpus of Academic Lithuanian, comprising academic texts published in 1999–2009 (approx. 9 million words). Compiled at the University of Vilnius, Lithuania.[8]
Reference Corpus of Contemporary Portuguese (CRPC)
TS Corpus - a large set of Turkish corpora. TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets...
MacMorpho - an annotated corpus of Brazilian Portuguese text
PTC: Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
Chinese/English Political Interpreting Corpus (CEPIC)[28][29] consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington DC and London, as well as their translated/interpreted texts. Developed by Jun Pan and HKBU Library.
Europarl Corpus - proceedings of the European Parliament from 1996 to 2012
EUR-Lex corpus - a collection covering all official languages of the European Union, created from the EUR-Lex database[30]
OPUS: Open source Parallel Corpus in many languages[31]
Tatoeba: a parallel corpus which contains over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each; a further 81 languages have from 100 to 1,000 sentences each.[32]
SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[34]
GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)
The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together.[35]
Corpus of Political Speeches contains four collections of political speeches in English and Chinese from The Corpus of U.S. Presidential Speeches (1789–2015), The Corpus of Policy Address by Hong Kong Governors (1984–1996) and Hong Kong Chief Executives (1997–2014), The Corpus of Speeches given on New Year's days and Double Tenth days by Taiwan Presidents (1978–2014), and The Corpus of Report on the Work of the Government by Premiers of the People's Republic of China (1984–2013). Developed by HKBU Library.
Timestamped JSI web corpora – web corpora of news articles crawled from a list of RSS feeds. The newsfeed corpora are prepared within a project implemented by the Jožef Stefan Institute, a Slovenian scientific research institute,[43] and are published in Sketch Engine.
Corpus of Academic Written and Spoken English (CAWSE),[45] a collection of Chinese students’ English language samples in academic settings. Freely downloadable online.
English as a Lingua Franca in Academic Settings (ELFA),[46] an academic ELF corpus.[47][48]
International Corpus of Learner English (ICLE),[49] a corpus of learner written English.
Louvain International Database of Spoken English Interlanguage (LINDSEI),[50] a corpus of learner spoken English.
Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[51][52]
University of Pittsburgh English Language Institute Corpus (PELIC)[53]
Vienna-Oxford International Corpus of English (VOICE),[54] an ELF corpus.[47]
References
Leech, Geoffrey (2007). "Teaching and language corpora: a convergence". In Wichmann, A.; et al. (eds.). Teaching and Language Corpora. London: Longman. p. 9.
Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
"PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434).
Mauranen, A (2010). "English as an academic lingua franca: The ELFA project". English for Specific Purposes. 29 (3): 183–190. doi:10.1016/j.esp.2009.10.001.