Corpus and Word List Development

One of the key areas of research in how computers can facilitate language learning is the field of corpus linguistics. Simply put, corpus linguistics is the study of language as expressed in samples (corpora) of “real world” texts.

To conduct a corpus-based study of language (or develop a corpus-based product), it is necessary to either gain access to or develop a corpus, and then analyze it using dedicated tools such as concordancing programs. A corpus is a databank of natural texts, compiled from writing and/or transcriptions of recorded speech. A concordancer is a software program that analyzes corpora and ranks or lists the results, showing which words and phrases are most frequent (and thus most important to study). The main focus of corpus linguistics is to discover patterns of authentic language use through analysis of actual usage.
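To make the idea of a concordancer concrete, here is a minimal Python sketch of its two basic outputs: a ranked frequency list and keyword-in-context lines. It is an illustration only, not the software we use. The function names, the simple regular-expression tokenizer, and the two-sentence sample corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(texts, top_n=10):
    """Count every token across all texts and return the top_n most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(top_n)


def concordance(texts, keyword, width=2):
    """Return each occurrence of keyword with up to `width` words on either side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical two-text corpus, invented for this example.
sample_corpus = [
    "The meeting starts at nine. The manager will lead the meeting.",
    "Please send the report to the manager before the meeting.",
]

print(frequency_list(sample_corpus, top_n=3))
print(concordance(sample_corpus, "meeting"))
```

A real corpus runs to millions of words, so production tools also group related word forms, handle multi-word phrases, and balance the text sources, none of which this sketch attempts.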

Many of our products and services are based on the careful development and analysis of focused corpora. Depending on your specific needs, we can quickly create and analyze corpora, and provide output from them, for a wide range of purposes.

For example, the New General Service List (NGSL), one of the most important lists of vocabulary words in the past 60 years, was developed from a carefully selected 273 million word subsection of the 2 billion word Cambridge English Corpus. More information about this list can be found at our website here.

In another example, the popular NHK TV show “Eigo Shaberenaito” recently hired us to develop a list of the essential English vocabulary words needed to be successful in business. In less than 2 months, we were able to create a corpus of over 100 million words of current written and spoken business English. Its analysis yielded a list of 1000 high frequency business English words that are now being taught on the TV show and in its accompanying online and physical textbooks.

Corpora and word lists can be used in many exciting ways. One new tool we recently created is called the OGTE (Online Graded Text Editor). This free tool allows teachers, authors, researchers and publishers both to analyze the difficulty of texts in terms of the presence or absence of high frequency vocabulary and to write or edit texts to a specific level. This tool can be found here.
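The idea of judging a text by how much of it falls inside a high frequency word list can be sketched in a few lines of Python. This is not the OGTE's actual implementation. The `coverage_profile` function, the tokenizer, and the six-word toy list are assumptions made for illustration, and the sketch matches exact word forms only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def coverage_profile(text, word_list):
    """Return the share of tokens found in word_list and the sorted off-list words."""
    tokens = tokenize(text)
    covered = sum(1 for token in tokens if token in word_list)
    off_list = sorted({token for token in tokens if token not in word_list})
    coverage = covered / len(tokens) if tokens else 0.0
    return coverage, off_list


# Hypothetical toy word list; a real list such as the NGSL is far larger.
toy_word_list = {"the", "cat", "sat", "on", "a", "mat"}

coverage, off_list = coverage_profile("The cat sat on the ottoman.", toy_word_list)
print(f"{coverage:.0%} of tokens are on the list; off-list words: {off_list}")
```

An author writing to a specific level could use the off-list words as a guide to which vocabulary to replace or gloss.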