Evaluation of domain sulfur industry for DIA translator using bilingual evaluation understudy method

ABSTRACT


INTRODUCTION
Evaluating machine translation (MT) systems is an important research field for optimizing the effectiveness of the MT improvement cycle [1]. Evaluation means estimating or examining the validity of a particular thing. Whenever a new technology is under development, it requires renewed testing or assessment on particular bases; the need to evaluate MT arises in the same way [2]. Given the great development in the MT field, and the growing need for high speed and a very high degree of accuracy when information is exchanged between two or more languages [3], human evaluation is time-consuming and expensive, and is therefore unsuitable for repeated use during research on, or development of, MT engines [4].
Evaluating an MT system is as important as the MT system itself: evaluation addresses whether linguistic items are rendered precisely, fluently, and in an acceptable way [5], and thereby attests an MT algorithm. Over recent decades, a large number of metrics have been used to evaluate MT quality. They rest on a variety of similar criteria that are intended to be language-independent rather than targeted at a particular natural language. Most of them rely on comparing the automatic translation with a reference translation [6].

The accuracy of any machine translation is normally estimated by comparing its output with expert human judgements [7]. The present study follows a performance-based method. Bilingual evaluation understudy (BLEU), introduced by Papineni et al. [2], is a method for evaluating MT systems that is intended to be language-independent and to agree closely with human assessment. BLEU is built on an essential notion for determining the quality of a particular MT program: quality is measured by the closeness of the MT system's output to a translation of the same text produced by an experienced human translator [8].
The closeness of a candidate translation to the reference translation is determined by a modified n-gram precision, where n = {1, 2, 3, 4} [9]. Modified n-gram precision is the essential criterion that BLEU applies to distinguish good candidate translations from weak ones [10]. It is computed by counting the words (n-grams) that occur in both the candidate translation and the reference translation, and then dividing this count by the total number of words (n-grams) in the candidate translation [11]. Modified n-gram precision penalizes candidate sentences that are shorter than their reference counterparts [12]; in addition, it penalizes candidate sentences that over-generate correct word forms.
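The clipped counting described above can be illustrated with a minimal Python sketch. The function names `ngrams` and `modified_precision` are illustrative assumptions and are not taken from the evaluation software used in this study; the example sentences are the well-known over-generation case from Papineni et al. [2], lower-cased and split on whitespace.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so repeated words cannot be over-credited.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# A candidate that over-generates a correct word form.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = modified_precision(candidate, references, 1)
print(round(p1, 4))  # 2 clipped matches out of 7 candidate words
```

Without clipping, every one of the seven candidate words would count as a match; clipping limits the credit to the two occurrences of "the" in the first reference.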
English-to-Arabic MT has been a challenging and exciting research subject for many researchers in the field of Arabic language processing. Many attempts have been made to perform or improve MT from Arabic into many other languages [13]. This research concentrates on assessing the performance of the English-Arabic DIA MT software and the output of Google Translate. Its purpose is to estimate the performance of the DIA program in comparison with that of Google Translate on a variety of text types translated from English into Arabic, as well as the acceptability and usability of the output for end users. Adequacy scores [7], [14] and fluency scores are the main tests used to assess translation quality [15].

METHOD
This research adopts the BLEU method [16], [17] to evaluate the domain of the sulfur industry (DSI) for the English-Arabic DIA translator and the Google translator. Automatic evaluation only provides a way to compare the output texts with human references; it does not measure the quality of the translation in absolute terms. Arabic allows a variety of word forms and word orders, so it can communicate the same idea in various ways. Moreover, the existence of many dialects, and the fact that the range of possible expressions is not necessarily the same in the two languages involved, can be expected to allow many meanings for a single sentence, as noted by Alqudsi et al. [18].
In this study, the measurement of intelligibility centres on a pair of characteristics, namely fluency and adequacy, using the BLEU score formula. The score is the product of the brevity penalty (BP) and the geometric mean of the modified n-gram precisions. Therefore, we begin by calculating the geometric mean of the modified n-gram precisions. After that, the length of the candidate text (c) and the effective reference corpus length (r) are calculated in order to compute the BP. Then the closest human judgment score is determined. Equation (1) [2] gives the BP, which decays exponentially in r/c:

$$\mathrm{BP} = \begin{cases} 1, & \text{if } c > r \\ e^{(1 - r/c)}, & \text{if } c \le r \end{cases} \qquad (1)$$

Equation (2) gives the final BLEU score:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \qquad (2)$$

where p_n is the modified n-gram precision for n-grams of size n, N equals 4, and the uniform weights w_n equal 1/N [7].
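As a hedged illustration of the calculation just described, the following Python sketch computes BP = 1 when c > r and exp(1 - r/c) otherwise, and then BLEU = BP · exp(Σ w_n log p_n) with N = 4 and w_n = 1/N, following Papineni et al. [2]. Several choices here are assumptions made for the sketch rather than details taken from this study: the score is computed for a single sentence instead of a whole corpus, r is taken as the reference length closest to c, no smoothing is applied, and the names `ngram_counts`, `brevity_penalty`, and `bleu` are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of size n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/max_n (no smoothing)."""
    weight = 1.0 / max_n
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            # The geometric mean is zero as soon as one precision is zero.
            return 0.0
        log_sum += weight * math.log(clipped / total)
    c = len(candidate)
    # Assumption: effective reference length = reference length closest to c
    # (ties resolved in favour of the shorter reference).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return brevity_penalty(c, r) * math.exp(log_sum)


reference = "the cat is on the mat".split()
shorter = "the cat is on the".split()
print(bleu(reference, [reference]))  # identical text, so the score is 1
print(bleu(shorter, [reference]))    # all n-grams match, only BP lowers the score
```

In the second call every n-gram of the candidate appears in the reference, so the whole reduction comes from the brevity penalty exp(1 - 6/5).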
BLEU scores range from 0 to 1 [2]: a value of 1 means that the candidate text fully matches the reference, and a value of 0 means that the candidate text and the corresponding reference text are completely different. Because the phrase is the fundamental unit of translation in both programs assessed, it was selected as the fundamental test element. Accordingly, this research assessed only the output quality of sentences, with a focus on the preservation of meaning, which involves comparing the meaning of the output with that of the original [19].
The data are pre-processed by splitting each version into n-grams of distinct sizes: unigrams, bigrams, trigrams, and tetragrams. The precision of the DIA translator and of the Google translator is calculated for each of the four n-gram sizes, and a unified precision score is then calculated from the four sizes. These scores are compared to decide which system produces the best version [20] (to compare MT schemes, individual systems and system components are rated on the basis of how often they are judged to be better than or equal to any other scheme). The following algorithm is applied to evaluate the translation; its main steps are: i) start; ii) input (source text); iii) input (two references of the target text); iv) translate the source text with the DIA translator; v) translate the source text with the Google translator; vi) automatically evaluate the quality of the DIA translator; vii) automatically evaluate the quality of the Google translator; viii) compare the DIA and Google output quality; ix) compare the resulting quality with that of a human expert evaluator; x) print (rank MT systems from best to worst); and xi) end.
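The automatic part of these steps (per-size precision, a unified score, and ranking) can be sketched in Python as follows. This is only a sketch under stated assumptions: the translators themselves are not called, so the system outputs are passed in as ready-made strings; the names `system_a` and `system_b` and the English placeholder sentences are hypothetical and are not data from this study; and the unified score is taken as the plain arithmetic mean of the four precisions to keep the example short, whereas the BLEU formula uses a weighted geometric mean with a brevity penalty.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of size n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def precision_by_order(candidate, references, max_n=4):
    """Clipped precision for unigrams, bigrams, trigrams, and tetragrams."""
    scores = {}
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        max_ref = Counter()
        for ref in references:
            max_ref |= ngram_counts(ref, n)  # Counter union keeps the maximum count
        clipped = sum((cand & max_ref).values())  # intersection keeps the minimum
        total = sum(cand.values())
        scores[n] = clipped / total if total else 0.0
    return scores


def rank_systems(outputs, references):
    """Score each system output and rank the systems from best to worst."""
    table = {name: precision_by_order(text.split(), references)
             for name, text in outputs.items()}
    # Assumption: unified score = arithmetic mean of the four precisions.
    unified = {name: sum(p.values()) / len(p) for name, p in table.items()}
    ranking = sorted(unified, key=unified.get, reverse=True)
    return table, unified, ranking


# Hypothetical placeholder data: two references and two system outputs.
references = ["sulfur is recovered from sour gas".split(),
              "sulfur is extracted from sour gas".split()]
outputs = {"system_a": "sulfur is extracted from sour gas",
           "system_b": "sulfur gas is sour"}

table, unified, ranking = rank_systems(outputs, references)
for name in ranking:
    print(name, table[name], round(unified[name], 3))
```

The second placeholder output contains only words found in the references but in a different order, so its unigram precision is perfect while its higher-order precisions drop to zero; this is why the four sizes are scored separately before they are combined.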
BLEU is quite a rough measure of translation performance [21]. Figure 1 illustrates the main steps of the method and the way n-grams are extracted from the English text, the Arabic text, and the Arabic reference sentences in order to calculate BLEU scores for the DIA and Google MT systems. After that, the closest human assessment can be judged. Several significant factors contribute to the coarseness of BLEU [22]: i) synonyms and paraphrases are only credited if they appear in the set of multiple reference translations [23]; ii) all words are weighted equally, so there is no extra penalty for missing content-bearing words [24]; and iii) the brevity penalty is a stop-gap measure to compensate for the fairly serious problem of not being able to calculate recall [25]. Each of these weaknesses increases the number of translations that the metric cannot appropriately distinguish. Since BLEU can theoretically assign equal scores to translations of manifestly different quality [26], it follows that a higher BLEU score does not necessarily indicate better translation quality.
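The first of these factors can be made concrete with a small Python sketch. The sentences are hypothetical English placeholders, not data from this study, and `unigram_precision` is an illustrative helper name. The same candidate is scored against one reference and then against two; the synonym is credited only once a reference that contains it is added.

```python
from collections import Counter


def unigram_precision(candidate, references):
    """Clipped unigram precision against one or more references."""
    cand = Counter(candidate)
    max_ref = Counter()
    for ref in references:
        max_ref |= Counter(ref)  # keep the maximum count per word
    return sum((cand & max_ref).values()) / sum(cand.values())


# Hypothetical example: "manufactures" is an acceptable synonym of "produces".
candidate = "the plant manufactures sulfur".split()
one_reference = ["the plant produces sulfur".split()]
two_references = one_reference + ["the factory manufactures sulfur".split()]

p_one = unigram_precision(candidate, one_reference)
p_two = unigram_precision(candidate, two_references)
print(p_one, p_two)  # the same candidate scores higher with the extra reference
```

The candidate itself does not change between the two calls; only the reference set does, which is one reason why scores obtained with different numbers of references are not directly comparable.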

RESULTS AND DISCUSSION
We developed software for the automatic evaluation of Arabic MT quality (BLEU method) using ASP.NET 2017. The quality evaluation of MT is illustrated as follows. Table 1 shows adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? Table 2 shows fluency: is the output good, fluent English? This involves both grammatical correctness and idiomatic word choices. Figure 2 shows the main screen of the MT evaluation system. Most of the proposed approaches for the English-Arabic DIA translator have been tested on a limited domain, the sulfur industry. Therefore, to evaluate the outcomes of this evaluation scheme, we selected a corpus of 1,200 phrases categorized under four types (terms, phrases, limited-domain text, and general text), which were then translated into Arabic using the Babylon and Google translators.
Based on the results obtained by applying the BLEU method to the DIA program and to Google Translate, we reached the following conclusions:
- The analysis of chemical symbols shows that the Google database does not include these symbols for the sulfur industry domain, in contrast to the DIA system, which holds integrated information on the input symbols because it is a specialized translation system for this field.
- The terminology test, in which terms are usually no longer than three words, also shows that the Google database includes only a few of the terms, whereas the DIA system was able to translate them.
- Testing specialized sulfur industry expressions shows that the DIA system is considerably better than Google.
- Testing texts of no more than 50 words shows that DIA has the advantage in rendering translated synonyms, while both systems (DIA and Google) were equal with respect to the grammatical order of sentence constituents.
- Testing common texts of no more than 50 words, that is, texts outside the sulfur industry field, shows that Google has the advantage in translation because its database is richer than that of DIA.
- In general, Google Translate was observed to be inferior to the DIA system in most applications, as indicated in Table 1.
- Finally, analysing the human evaluation and comparing it with the results of the BLEU method for the translations of both systems (DIA and Google) shows that the BLEU method reached 89.875% adequacy. Regarding the accuracy of the results for each phrase of the corpus, the DIA program achieved greater translation accuracy than Google Translate in the performed tests: 73.325% for DIA versus 30.325% for Google.

The results for the MT systems are illustrated in Figure 3 and Table 3. Figure 4 shows a summary of average precision: the x-axis covers sulfur industry terms, phrases of fewer than six words, and sentences of fewer than 50 words, and the y-axis gives the range of the quality unit.

CONCLUSION
This research concludes that automatic evaluation of MT for the domain of the sulfur industry with the DIA system, using the BLEU technique, yields an assessment close to human assessment. Furthermore, many experiments were conducted on the effectiveness of both online MT systems (i.e., the Google translator and the DIA translator) in translating 1,200 English sulfur symbols, terms, and texts within the scope of the sulfur industry into Arabic. Most techniques for automatically assessing the translation accuracy of an MT system rely on comparing the candidate text with the reference text. The findings indicate that, when the BLEU technique is used, the average accuracy of the DIA translator is about 73.325%, compared with about 30.325% for the Google MT system. The efficiency of the BLEU method is about 90.478% compared with the human expert evaluator.

Figure 1. The structure of the main method steps

Figure 3. Results of MT systems evaluation

Figure 4. Summary of average precision

Evaluation of domain sulfur industry for DIA translator using bilingual ... (Tahseen Ameen Faisal)

Table 1. The scales for assigned fluency scores

Table 2. Scales of scores used for assigned adequacy

Table 3. Average precision for each type