Arabic SBD System

Unsupervised Sentence Boundary Detection Approach for Arabic

Abstract

Punkt (German for "period") is a sentence boundary detection system that divides an English text into a list of sentences using an unsupervised algorithm developed by Kiss and Strunk (2006) [6]. The algorithm is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. The Punkt system was adapted to support the Arabic language. The modified Punkt is trained on an Arabic corpus to build a model of abbreviations, collocations, and words that start sentences. An evaluation of the modified Punkt system showed an accuracy close to 99% in detecting Arabic sentence boundaries.

Keywords: Internal Periods, Arabic Collocations, Arabic Abbreviations, Dunning's Likelihood Ratio, Arabic Orthography, Frequent Sentence Starter.

Tasks such as part-of-speech tagging usually operate on individual sentences. In languages such as English, Arabic and others, a sentence is a sequence of words ending with a terminal punctuation mark, such as a colon ‘:’, period ‘.’, question mark ‘?’ or exclamation mark ‘!’. For colons, exclamation marks and question marks, sentence boundary detection (SBD) is usually not too difficult, although it is not a trivial task. About 90% of periods are sentence boundary indicators [9]. Nevertheless, periods introduce considerable ambiguity into sentence boundary detection because they can be used in a number of different ways:

1. To end a sentence. (e.g. I saw Ahmad.)
2. To represent a decimal point in fractional numbers. (e.g. My height is 1.7 m)
3. To mark ordinal numbers, as in the numbered list you are reading right now.
4. To mark abbreviations, initials and acronyms. (e.g. He has got an I.B.M. computer.)
5. To mark an abbreviation and end a sentence at the same time. (e.g. Ali got a
