Automation of the Compilation and Processing of a Hausa Corpus

By Eno Sam Okon
Supervised by: Dr. Tunde Adegbola
April 2014

ABSTRACT

A spell checker is an indispensable tool for text editing: it can compensate for a writer's limited language skills and it can identify and correct inevitable typing errors. With a population of over 40 million speakers, the Hausa language is the second most widely spoken language in Africa, yet it is without a standard spell checker.

To create a Hausa spell checker, a Hausa corpus was built by data entry and web crawling. The wordlist was cleaned to remove non-Hausa words and to correct typographical and other errors. In addition, to determine the extent to which the modest corpus used for the spell checker covers the Hausa language, the rate of increase in the size of the wordlist in relation to corpus size was determined.

A modest 2 million-word Hausa corpus was realized. The corpus was then tokenized to produce a wordlist of about 30,000 Hausa tokens. After cleaning, the wordlist was reduced to 23,306 tokens. Using Hausa morphology, the wordlist was compressed to 12,569 stems and 62 affix rules, which together made up the spell checker files. Also, a 700,000-word corpus drawn from the Hausa corpus was tokenized into separate files with a successive increment of 20,000 words per file.

Results showed that Hausa morphology proved effective for information compression, as expected, and a rudimentary spell checker was produced. Furthermore, results of the corpus study showed that a corpus of 20,000 words would produce an average of about 3,000 tokens and the number of new tokens produced will decrease with every
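
The wordlist growth measurement described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the tokenization rule (`WORD_RE`), the function names `tokenize` and `growth_curve`, and the lowercasing step are all assumptions made for illustration. The sketch counts how many distinct word forms have been seen after each increment of running words, with the 20,000-word increment from the abstract as the default step.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: a word is a run of Unicode letters, optionally joined by an
# internal apostrophe. The thesis does not state its tokenization rule.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into lowercased word forms (hypothetical helper)."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def growth_curve(words: Iterable[str], step: int = 20000) -> List[Tuple[int, int]]:
    """Return (running_words, distinct_word_forms) after every `step` words.

    A final point is added for a trailing partial increment, if any.
    """
    seen = set()
    curve = []
    count = 0
    for count, word in enumerate(words, start=1):
        seen.add(word)
        if count % step == 0:
            curve.append((count, len(seen)))
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


def new_forms_per_step(curve: List[Tuple[int, int]]) -> List[int]:
    """Number of previously unseen word forms contributed by each increment."""
    previous = 0
    gains = []
    for _, distinct in curve:
        gains.append(distinct - previous)
        previous = distinct
    return gains
```

Applied to a corpus, `new_forms_per_step(growth_curve(tokenize(text)))` gives the number of new wordlist entries per increment, which is the quantity the abstract reports as decreasing as the corpus grows.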
