Spell corrector for Bangla language using Norvig’s algorithm and Jaro-Winkler distance

Received Mar 23, 2020 Revised Apr 24, 2021 Accepted Jun 22, 2021

In the online world, especially on social media platforms, most of us write without much regard for correct spelling and grammar. Spelling mistakes make up a much larger proportion of text when it comes to the Bangla language. In this paper, we present a method for detecting and correcting errors in the spelling of Bangla words. Our system detects a misspelled Bangla word and provides two services: suggesting correct spellings for the word and correcting it. We use Norvig's algorithm for this purpose, but instead of using word occurrence probabilities to prepare the suggestions and corrections, we use the Jaro-Winkler distance. Previous work in this field for the Bangla language is either very slow or offers lower accuracy. Our system achieves 97% accuracy when evaluated on 1000 Bangla words.


INTRODUCTION
Misspelling is a common phenomenon, especially on the internet. Bad spelling makes a person appear less intelligent and less credible than they actually are. Spelling mistakes not only put a dent in someone's professional reputation but can also cost a fortune in sales and business. Spelling errors on medical packaging can be lethal and can end up costing someone's life. So checking for spelling mistakes and correcting them is a much-needed service in every language.
There is a great deal of work on spell checking and correction in other languages, but very little on Bangla (Bengali), even though Bangla is spoken by 230 million people as a native language and by 37 million as a second language. In this research paper, a process is proposed for developing an effective Bangla spell corrector that considers candidates within a maximum edit distance of two from the incorrect word and, in addition, uses a string distance algorithm instead of the document occurrence probability used in Norvig's spell correction algorithm [1]. We collected a Bangla dictionary and character set (vowels, consonants, vowel marks, and consonant conjuncts) from various sources. Our system first matches a word against the dictionary; in case of a mismatch, it produces a list of probable correct words and the single most probable word based on string similarity. This work also reports the performance and evaluation of our proposed method. The rest of the paper is arranged as follows. Section 2 explains some technical terms and methods. Section 3 reviews related work in the field of spelling detection and correction. Section 4 details our proposed approach, followed by section 5, where we explain the results achieved by our system and provide some discussion of our work. Finally, the paper ends with the conclusion in section 6 and the references.
In this paper, we address the lack of proper spell checkers and correctors for the Bangla language and suggest a possible solution to the problem. Although our work is not the first to address this issue, it introduces a method that overcomes hurdles the previous works could not. Earlier work on this topic either offers lower accuracy than ours or performs poorly on misspellings with multiple errors. We propose a spelling corrector that handles both single and multiple errors and achieves a high accuracy of 97%.

BASELINE RESEARCH
In this section we discuss the terms used in this paper. We give an overview of the Bangla language, the types of errors in Bangla, Norvig's spell corrector, and string similarity.

Bangla language
The Bangla alphabet consists of 50 letters, of which 11 are vowels and 39 are consonants. There is no distinction between uppercase and lowercase in the Bangla alphabet, but the language has several complex features, such as phonetically similar characters, consonant conjuncts, phala, matra, vowel marks, modified symbols, and many more [2], [3].

Types of errors
Kukich [4] classified misspelled words into two types: real-word errors and non-word errors. A real-word error occurs when a correctly spelled word is used but the word is wrong in its context. The latter, a non-word error, is one where the word used is neither a dictionary word nor a noun [5]. Non-word errors are further divided into two classes: cognitive mistakes and typographical mistakes. Typographical mistakes are simple errors such as mistakenly inserting, deleting, substituting, or transposing characters; Table 1 shows the percentage of each type of typographical error. A cognitive mistake occurs when the spelling has been forgotten and the word is typed in a phonetically similar way [6].

Norvig's spelling corrector
Norvig proposed an algorithm for spell correction [1] that selects, out of all possible suggestions, the correctly spelled word with the maximum probability of occurring in a data set.
Norvig's algorithm is expressed as

correction(w) = argmax_{c ∈ candidates} P(c) P(w|c)

and has four main parts: 'argmax' is the selection mechanism, 'P(c)' expresses the language model, 'c ∈ candidates' denotes the candidate model, and 'P(w|c)' denotes the error model. The candidate model makes small edits to a word by adding a letter, exchanging two adjoining letters, taking out one letter, and putting a different letter in place of a letter. For a word of length n, there are 54n + 25 possibilities in total, consisting of n − 1 transpositions, 26(n + 1) insertions, n deletions, and 26n alterations, but with some duplication, and only a few of them are dictionary words. If the edit distance is 2, the suggestion list is much bigger, and again only a few entries are dictionary words. To pick one single word from the suggestion list as the correction, the model uses the probabilities of occurrence of the words in the suggestion list. These probabilities, derived from a reference document, are used to rank the candidates; the word in the recommended list with the highest probability becomes the correction for the misspelled word.
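For readability, the core of Norvig's published Python implementation can be sketched as follows; the toy word-frequency counter and the Latin alphabet are illustrative stand-ins (the real language model is built from a large corpus):

```python
from collections import Counter

# Toy language model; Norvig derives these counts from a large text corpus.
WORDS = Counter({"the": 80000, "that": 10000, "than": 3000})

def P(word):
    """P(c): probability of a candidate, estimated from corpus counts."""
    return WORDS[word] / sum(WORDS.values())

def edits1(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: n deletions, n-1 transpositions,
    26n alterations, and 26(n+1) insertions, with duplicates removed."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only candidates that appear in the dictionary."""
    return {w for w in words if w in WORDS}

def correction(word):
    """argmax_{c in candidates} P(c); dictionary words win over 1-edit words."""
    candidates = known({word}) or known(edits1(word)) or {word}
    return max(candidates, key=P)
```

With this toy model, correction("tha") returns "the", the candidate with the highest corpus count, which illustrates the frequency-based ranking that our method later replaces with Jaro-Winkler distance.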

String similarity algorithm
According to their function and types of operation, string similarity algorithms can be categorized into a few domains: a. Edit distance based; an edit-distance-based string similarity algorithm takes two words, usually of similar length, and compares them with each other for unmatched characters. Hamming distance [8], Levenshtein distance, trigram comparison [9], and Jaro-Winkler [10] are among these edit-distance-based algorithms. Jaro-Winkler is a string matching algorithm that applies a prefix scale, which makes it more accurate; it is a modified and extended form of the Jaro distance [11].
D_Jaro = (1/3) (m/|s1| + m/|s2| + (m − t)/m)   (1)

D_Jaro-Winkler = D_Jaro + l · p · (1 − D_Jaro)   (2)

Here, in (1), D_Jaro is the Jaro distance, m is the number of matched characters that appear in both spellings, t denotes the number of transpositions divided by 2, and |s1| and |s2| are the lengths of the first and second strings. In (2), D_Jaro-Winkler is the Jaro-Winkler distance, l denotes the length of the common prefix at the beginning of the word (limited to a maximum of 4 characters), and p is the constant scaling factor that decides how much the score is boosted for shared prefixes; p has a standard value of 0.1 in Winkler's original work [12].
b. Token based; a token-based string similarity algorithm treats the words as tokens and matches one token against another to obtain a similarity percentage. c. Sequence based; a sequence-based string similarity algorithm looks for the largest common character sequence matched in both strings; the process is recursive and stops when no common substring is found. Table 2 shows how various string similarity algorithms score the candidates "the" and "that" against the misspelled word "tha" [14].
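Formulas (1) and (2) can be implemented directly; the sketch below is a minimal Python rendering (function names are ours), with matches counted inside the usual window of max(|s1|, |s2|)/2 − 1 positions:

```python
def jaro(s1, s2):
    """Jaro similarity, as in (1): (1/3)(m/|s1| + m/|s2| + (m - t)/m)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                # count matching characters
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    transpositions, k = 0, 0                  # matched chars out of order
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity, as in (2): D_Jaro + l * p * (1 - D_Jaro)."""
    d = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):          # common prefix, capped at 4
        if a != b:
            break
        l += 1
    return d + l * p * (1 - d)
```

For the misspelled word "tha", this implementation scores "that" above "the": "that" matches all three typed characters and shares a 3-character prefix, while "the" matches only two.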

RELATED WORKS
In their proposed approach, Khan et al. worked with phonetic encoding using the Soundex algorithm [15]. In 2003, Abdullah et al. worked with a direct dictionary search process and a recursive simulation algorithm to detect typographic and cognitive phonetic errors and to give suggestions for misspelled words [16]. In their 2010 work, Z. Islam et al. applied stemming and an edit distance algorithm [17]. First, the input word is stemmed by removing only the suffixes. If the stem is not correct, a suggestion generation procedure produces a list of suggestions, and among the suggested words the edit distance algorithm finds the best match. They achieved 90.8% accuracy for single-error correction and 67% for multiple-error correction, tested with 13,000 input words.
UzZaman et al. [18] tested with 1607 words and obtained 98% accuracy for 1-error correction and 100% accuracy for 2-error correction. They used an immediate lexicon search technique to detect an incorrectly spelled word, and they made use of the patterns of error in normal writing to generate correct spelling recommendations, also considering the phonetic error patterns typically seen in Bangla writing. To generate suggestions for typographic errors, they calculated the edit distance between the misspelled word and candidate words. To generate suggestions for phonetic errors, they used Double Metaphone encoding [19]. Finally, they combined the scores found for phonetic and typographical errors to rank the suggestion list.
In 2014, Chaudhuri [7] made a dictionary of phonetically similar characters, mapping those nearly indistinguishable characters to a single unit character code. A second, reversed dictionary is used, and with a string matching algorithm he found the phonetic errors. As it works only with phonetic similarity, it can correct only one error. Khan et al. [20] built on the work of Munshi et al. [21] for the evaluation of their approach. The study by Kumar et al. [22] surveyed the types of errors, error detection approaches, and error correction techniques, in the context of spell checkers for other Indian languages. Etoori et al. [23] proposed a character-level sequence-to-sequence text correction model for the Hindi and Telugu languages using an LSTM encoder and decoder; for testing and evaluation they also built their own dataset. Measured against other existing approaches, their proposed system achieved the highest accuracy, 85.4% for Hindi, where the others reached 77.6%. Jain et al. [24] proposed a method for detecting single-word out-of-vocabulary (OOV) or real-word errors that consists of three main steps: first, the data was collected in a confusion matrix used to explore the frequency and types of errors that had occurred; then edit distance or predefined phonetically similar words were used to generate a candidate list; and finally the sentence was corrected with the Viterbi algorithm. It achieved an accuracy of 86% when the threshold was greater than 5.

MATERIALS AND METHODS
In this section, our proposed process and the materials we used are briefly described.

Dataset
A huge collection of Bangla character combinations along with single characters was collected from [25], containing 14,980 individual characters; 959,232 unique words were also collected. We need this large data set because the more enriched the corpus is, the more accurate the output will be.

Process
First, the system takes a Bangla word as input; the word may be correct or misspelled. It then looks the word up in the dictionary. If the word is found in the corpus, the system declares the input correctly spelled and terminates the process for this word. If the word cannot be found in the existing dictionary, the system tries to generate a list of suggestions for the correct word. It performs the 1-character edits first and then the 2-character edits. In both cases, the word is split at every character position, and based on those splits the system performs deletion of n character(s), insertion of n character(s), transposition within n character(s), and replacement of n character(s) with new character(s) drawn from the data set of individual characters, where n can be 1 or 2. It then generates one long list containing the outputs of all the steps mentioned above. The list contains some correct words and a huge number of incorrect words.
The words in the list are then matched against the dictionary words, and the correct ones are shown as suggestions for the misspelled word. Among those suggestions, the most probable one is chosen by an index number generated from the distance given by a string similarity algorithm; the best result was found with the Jaro-Winkler distance algorithm discussed in section 2.4. The lower the distance, the higher the index, and the word with the highest index in the final list becomes the most probable correction. The whole process is shown in Figure 2.
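Putting the pieces together, the process described above can be sketched end to end in Python. The dictionary, the alphabet (Latin here for readability; the real system uses the 14,980-entry Bangla character set), and all function names are illustrative, and the Jaro-Winkler routine is a compact inline implementation of formulas (1) and (2) so the sketch is self-contained:

```python
DICTIONARY = {"the", "that", "there"}   # stand-in for the 959,232-word corpus
LETTERS = "abcdefghijklmnopqrstuvwxyz"  # stand-in for the Bangla character set

def edits1(word):
    """All 1-character edits: deletions, transpositions, replacements, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set(
        [L + R[1:] for L, R in splits if R]
        + [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        + [L + c + R[1:] for L, R in splits if R for c in LETTERS]
        + [L + c + R for L, R in splits for c in LETTERS]
    )

def suggestions(word):
    """Dictionary words within edit distance 1; fall back to distance 2."""
    one = {w for w in edits1(word) if w in DICTIONARY}
    if one:
        return one
    return {w for e in edits1(word) for w in edits1(e) if w in DICTIONARY}

def jaro_winkler(s1, s2, p=0.1):
    """Compact Jaro-Winkler similarity (higher means more similar)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i in range(len1):                 # count matches within the window
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not match2[j] and s1[i] == s2[j]:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                           # matched characters out of order
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            t += s1[i] != s2[k]
            k += 1
    d = (m / len1 + m / len2 + (m - t / 2) / m) / 3
    l = 0                                 # common prefix, capped at 4
    while l < min(4, len1, len2) and s1[l] == s2[l]:
        l += 1
    return d + l * p * (1 - d)

def correct(word):
    """Return the input if it is in the dictionary, else the most similar suggestion."""
    if word in DICTIONARY:
        return word
    cands = suggestions(word)
    return max(cands, key=lambda c: jaro_winkler(word, c)) if cands else word
```

With this toy dictionary, correct("tha") returns "that": Jaro-Winkler favors the candidate that preserves all the typed characters and the longest prefix, whereas a pure frequency model would tend to pick the more common "the".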

RESULTS AND DISCUSSION
In this proposed approach we used string similarity as the factor both for generating the list of suggested words and for picking one word as the correct one, so our system does not make use of any probability measure. Adding occurrence probability scores from a huge corpus of Bangla literature to our system would make the final output more accurate, as outlined in Algorithm 1. We focused on non-word errors in this work, so another limitation is that real-word errors are not considered. A further limitation is that our system is not as fast as some of the spell correcting algorithms available in other languages, especially English. Our algorithm is also highly dependent on the dictionary that is used, but this is an unavoidable factor when it comes to spelling correction. We tested on 1000 Bangla words, and the system correctly handled 970 of them, detecting each as a correct or misspelled word and giving a proper suggestion list for the misspelled ones. The 1000 Bangla words were collected randomly from the comments of various Facebook pages. Consider the word shown in Figure 3: the system was able to detect an error that was out of vocabulary. When choosing the most probable word, since edit distance is used, more than one word may share the highest index; in that case the last word found is taken, but this step needs to be improved by taking a contextual view of the whole sentence. As discussed in section 4.1, the performance of our system depends on the size of the corpus; if we can enlarge the corpus, the output suggestions will be more accurate. A comparison between the performance of previous works and our work is shown in Table 3.

FUTURE WORK
As future work, we will expand the scope of our work by including the correction of real-word errors through pattern matching. The present system predicts the most probable word from the suggestion list based on the Jaro-Winkler distance alone, so the suggested word may not be appropriate for the context of the text. This can be addressed by computing a word's occurrence probability in the document via TF-IDF and considering that probability along with the edit distance value to predict the most accurate word from the suggestion list generated through Norvig's algorithm. Additionally, the dictionary that was used can be further enriched by analyzing posts and comments in Bangla from different social media; from these, an occurrence probability can be calculated for each specific word and stored in the dictionary along with the word. When this dictionary is used, the system will also match the word's occurrence probability on the internet as well as in the specific document from which the misspelled word was taken.

CONCLUSION
Spelling mistakes might not be a recent phenomenon, but with the growth in the usage of social media and micro-blogging websites, they are certainly more prevalent in recent years. While there are abundant instances of work in the field of spelling mistake detection and correction, the works done for the Bangla language number only a few. Our work presents an approach to detect spelling mistakes, prepare a list of suggestions for the correct word, and choose one word from the list as the correct spelling of the input word. For this, we used Norvig's algorithm along with the Jaro-Winkler distance as the measure of string similarity. Our method achieved 97% accuracy when tested on 1000 misspelled words.