A Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity

ABSTRACT


INTRODUCTION
Dengue outbreaks are occurring across the world, resulting in increasing numbers of patients, particularly in countries with tropical and subtropical climates [1]. The disease is caused by the dengue virus, which is transmitted by Aedes mosquitoes that accelerate its spread [2], [3]. Aedes mosquitoes are also the vector of Zika, Chikungunya, and other viral infections, but dengue alone accounts for approximately 390 million cases annually [4]. Thus, dengue is a global health threat with significant social and economic impact. The disease is endemic in more than 100 countries in Southeast Asia, the Americas, the Western Pacific, Africa, and the Eastern Mediterranean region [5], and the World Health Organization estimates that between 2.5 and 3 billion people worldwide live in dengue-endemic areas and are therefore at risk of infection [6]. At present, there is no specific treatment for dengue, and care is limited to symptomatic treatment [7], [8]. Early detection of the disease therefore allows healthcare professionals to deliver timely medical treatment, which lowers the mortality rate to below 1%. A patient with dengue suffers flu-like symptoms [4], with a high fever of 40 °C (104 °F) that is usually accompanied by at least two of the following symptoms: i) headache; ii) pain behind the eyes; iii) nausea and vomiting; iv) glandular swelling; v) joint, bone, or muscle pain; and vi) rash [9].
For patients with severe symptoms, there is a critical period about 3-7 days after the onset of the illness, during which the fever does not subside and is accompanied by symptoms including: i) severe abdominal pain; ii) persistent vomiting; iii) bleeding gums; iv) vomiting blood; v) rapid breathing; and vi) fatigue and restlessness [9]. When critical dengue is suspected, patients should seek medical treatment as soon as possible to reduce the risk of plasma leakage, circulatory failure, severe bleeding, shortness of breath, and organ failure [10]. When the patient does not receive timely medical treatment, these severe symptoms can result in hypovolemic shock and carry a risk of death [9].
In Thailand, a country in Southeast Asia, dengue outbreaks occur throughout the nation because the climate is favorable for the breeding of Aedes mosquitoes. As a result, dengue fever poses a significant threat [11], [12] to the health of the population. Furthermore, the illness also affects the public healthcare system and the nation's economy.
This research therefore aims to develop a Thai-language chatbot for analyzing Aedes-borne diseases using Jaccard similarity. The artificial intelligence (AI)-based chatbot uses natural language processing to analyze users' messages in order to diagnose diseases. It selects word features in the text with the term frequency-inverse document frequency (TF-IDF) method and then measures Jaccard similarity against the Aedes-borne disease database. The chatbot incorporates the Line Messaging API for user-system communication through the Line application. It is developed using PHP 7.2.34 and uses MySQL 5.7.32 for database management, with Apache 2.2.29 serving as the bot server. The chatbot can benefit the fields of medical services and public health, and can serve as a guideline for enhancing other service sectors in the future.

METHOD

Artificial intelligence
AI refers to technology that mimics human intelligence [13], using sophisticated mathematical algorithms to process data and produce results [14] without requiring a new command for each task. Nowadays, AI applications such as chatbots, facial recognition systems, and virtual assistants play a crucial role in daily life by performing various tasks for humans. Because it can learn automatically, AI enables processing, analysis, planning, and decision-making similar to those of humans.

Natural language processing
Natural language processing is a branch of AI technology that leverages knowledge from various fields [15], such as linguistics, computer science, and statistics, to analyze the language humans use in daily communication. It enables computers to understand the intentions and meanings that humans convey in their communication. Natural language processing technology can be applied in various domains, including healthcare, education, and business.

Chatbot
A chatbot is a computer program developed to communicate with humans in the natural language they use in everyday life. The program simulates conversation so that users interact with it in the same way they would with another person, and it can be used in real time, 24 hours a day [16], [17]. Chatbots are divided into 2 main types: i) rule-based chatbots, which process and understand human conversations according to predefined rules and conditions [18]; when a question falls outside those rules or conditions, the program either cannot answer it or provides a wrong answer [19]; and ii) AI-based chatbots, or intelligent bots, which are developed with natural language processing and understand the language that humans use in everyday life without relying on predefined answers [20]. An AI-based chatbot can therefore comprehend the user's intent and provide a correct answer to the question.

Line Messaging API
Line Messaging API is a communication channel between Line application users and service providers. Messages are sent and received between the parties' servers via the Line Platform, which enables the development of chatbots for interactive messaging with users. The API connects users with services and supports real-time communication, as shown in Figure 1. In this study, the chatbot is developed using PHP 7.2.34, with MySQL 5.7.32 for database management and Apache 2.2.29 functioning as the bot server. The server and the Line Messaging API allow users to communicate via the Line application. The architecture of the system is shown in Figure 2. The bot server must be connected to the Line Platform, which transmits data between the server and Line application users in JSON format.
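The JSON exchange between the Line Platform and the bot server can be illustrated with a minimal sketch. It is written in Python for brevity rather than the PHP used in this study, it performs no network calls, and the field names (`events`, `replyToken`, `message`, `messages`) follow the publicly documented webhook and reply formats as understood here; the sample payload and token are hypothetical.

```python
import json


def extract_text_events(webhook_body):
    """Return (reply_token, text) pairs for each text message in a webhook payload."""
    payload = json.loads(webhook_body)
    pairs = []
    for event in payload.get("events", []):
        message = event.get("message", {})
        # Only text messages are of interest; stickers, images, etc. are skipped.
        if event.get("type") == "message" and message.get("type") == "text":
            pairs.append((event["replyToken"], message["text"]))
    return pairs


def build_reply(reply_token, answer):
    """Serialize the JSON body the bot server would send back to the Line Platform."""
    body = {
        "replyToken": reply_token,
        "messages": [{"type": "text", "text": answer}],
    }
    return json.dumps(body, ensure_ascii=False)


# Hypothetical incoming payload containing one Thai text message.
sample = json.dumps({
    "events": [{
        "type": "message",
        "replyToken": "dummy-token",
        "message": {"type": "text", "text": "มีไข้สูง"},
    }]
})

for token, text in extract_text_events(sample):
    print(build_reply(token, "received: " + text))
```

In a deployed bot the reply body would be posted to the platform's reply endpoint with the channel access token; that step is omitted here because it depends on deployment details not described in this paper.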

Chatbot development process
This research aims to develop a Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity. The working process of the chatbot is divided into 6 steps:
a. Tokenization is the process of splitting a text into smaller units [21], [22] in order to determine word boundaries. This study applied the longest word pattern matching technique, in which the longest matching word in a string is separated by comparing the string with a dictionary.
b. Stop word removal is the process of removing insignificant words, such as prepositions, pronouns, conjunctions, and interjections, from a text. Removing these words does not affect the meaning and helps reduce the size of the text [23]-[25].
c. Stemming is the process of replacing words that share the same root or the same meaning with a single token [26], [27].
d. Feature selection is the process of selecting the words that are significant to the text. In this research, feature selection was performed using TF-IDF.
e. Similarity measurement is the process of measuring the similarity between users' messages and the Aedes-borne disease database using Jaccard similarity.
f. Response is the process of displaying the processing output to users.

Bulletin of Electr Eng & Inf
ISSN: 2302-9285
A Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity (Benjamin Chanakot)
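Steps (a) and (b) of the process can be sketched as follows. This is a minimal illustration of longest word pattern matching under stated assumptions, not the implementation used in the study: the dictionary and stop word list are tiny hypothetical samples, and characters not covered by the dictionary fall back to single-character tokens. Stemming (step c) is not covered.

```python
# Toy dictionary and stop word list (hypothetical; a real system would load a full Thai lexicon).
DICTIONARY = {"ฉัน", "มี", "ไข้", "ไข้สูง", "ปวด", "ปวดหัว", "และ"}
STOP_WORDS = {"ฉัน", "และ"}  # a pronoun and a conjunction


def longest_match_tokenize(text, dictionary):
    """Split text left to right, always taking the longest dictionary word that fits."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            # n == 1 is the fallback for characters the dictionary does not cover.
            if candidate in dictionary or n == 1:
                tokens.append(candidate)
                i += n
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens that carry little meaning for the analysis."""
    return [token for token in tokens if token not in stop_words]


message = "ฉันมีไข้สูงและปวดหัว"  # "I have a high fever and a headache"
tokens = longest_match_tokenize(message, DICTIONARY)
print(tokens)
print(remove_stop_words(tokens, STOP_WORDS))
```

Because the longest candidate is tried first, "ไข้สูง" (high fever) is kept as one token rather than being split at the shorter dictionary word "ไข้" (fever).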

Term frequency-inverse document frequency
TF-IDF is a mathematical algorithm used to calculate the weight of significant words in a text [28], [29]. It assumes that a word appearing frequently in a text has a high term frequency (TF), while a word that appears in few other texts has a low document frequency (DF) and therefore a high inverse document frequency (IDF). The word weight is calculated as in (1) [30], [31]:

$$w_{i,j} = tf_{i,j} \times idf_i \tag{1}$$

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{2}$$

where $n_{i,j}$ is the frequency of the word $t_i$ in statement $d_j$, and $\sum_k n_{k,j}$ is the sum of the frequencies of all words appearing in statement $d_j$ [30].

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{3}$$

where $|D|$ is the total number of statements in the corpus and $|\{j : t_i \in d_j\}|$ is the number of statements in the corpus in which the word $t_i$ appears [32].
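The weighting can be illustrated with a short sketch, assuming TF is the within-statement relative frequency, IDF is the logarithm of the number of statements divided by the number of statements containing the word, and the logarithm is natural (the base is an assumption). The function name and the toy statements are hypothetical.

```python
import math
from collections import Counter


def tf_idf(statements):
    """Given a list of token lists, return one {word: weight} dict per statement."""
    total = len(statements)
    # Document frequency: in how many statements each word appears at least once.
    doc_freq = Counter()
    for tokens in statements:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in statements:
        counts = Counter(tokens)
        n_words = sum(counts.values())
        weights.append({
            word: (count / n_words) * math.log(total / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy corpus of already-tokenized statements (hypothetical data).
corpus = [["fever", "rash"], ["fever", "headache"]]
for statement, weight in zip(corpus, tf_idf(corpus)):
    print(statement, weight)
```

In this toy corpus "fever" appears in every statement, so its IDF and weight are zero, while "rash" and "headache" each receive a positive weight. This is the behaviour that makes TF-IDF useful for selecting distinguishing features.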

Jaccard similarity
Jaccard similarity is a statistical measure of the similarity between sets. It is calculated by dividing the size of the intersection of sets A and B by the size of their union, giving a result between 0 and 1, where 0 means the sets have nothing in common and 1 means they are identical [33]-[35]. Jaccard similarity is calculated using (4):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{4}$$
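As a rough illustration, the measure and its use for picking the closest database entry might look like the sketch below. The database contents, the entry names, and the `best_match` helper are hypothetical; the dengue features are taken from the symptoms listed in the introduction, and the second entry is only a placeholder.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical feature sets standing in for the Aedes-borne disease database.
DISEASE_FEATURES = {
    "dengue": {"fever", "headache", "rash", "nausea"},
    "placeholder_disease": {"fever", "cough"},
}


def best_match(user_features, database):
    """Return (name, score) for the database entry most similar to the user's features."""
    scored = ((name, jaccard(user_features, features)) for name, features in database.items())
    return max(scored, key=lambda pair: pair[1])


user_features = {"fever", "headache", "rash"}
print(best_match(user_features, DISEASE_FEATURES))
```

Here the user's features share 3 of 4 distinct words with the dengue entry (similarity 0.75) and 1 of 4 with the placeholder entry (similarity 0.25), so the dengue entry is selected.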

Measuring the chatbot efficiency
Chatbot efficiency measurement refers to evaluating the accuracy of the chatbot in order to examine whether its interaction with users is effective enough to meet their needs. In this research, purposive sampling was applied to select 10 experts, comprising 5 information technology experts and 5 medical specialists. The experts asked 120 questions to measure the chatbot efficiency, which is calculated using (5) [36].
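Assuming (5) takes the common form of correct responses divided by the number of questions asked, expressed as a percentage, the calculation could be sketched as follows. The form of the formula is an assumption, and the count of 102 correct responses is a hypothetical value chosen because 102 of 120 corresponds to 85.00%; it is not a reported figure.

```python
def accuracy_percent(correct, total):
    """Share of correctly answered questions, as a percentage (assumed form of the metric)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * correct / total


# 102 is a hypothetical count used only for illustration; 120 is the number of questions asked.
print(f"{accuracy_percent(102, 120):.2f}%")
```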

The chatbot usability assessment
The chatbot usability was tested using the system usability scale (SUS), a questionnaire for evaluating the usability of an application, developed by John Brooke in 1986 [37]. It comprises the 10 items shown in Table 1. Each item is rated on a 5-level Likert scale [38], as shown in Table 2, where strongly agree scores 5 and strongly disagree scores 1 [39], [40]. In this research, the 10 experts who responded to the questionnaire were selected by purposive sampling and consisted of 5 information technology experts and 5 medical specialists. The score from the questionnaire is calculated using (6):

$$SUS = (X + Y) \times 2.5 \tag{6}$$

where X is the sum of the scores of all odd-numbered questions minus 5, and Y is 25 minus the sum of the scores of all even-numbered questions.

Table 1. The 10 items of the SUS questionnaire
1. I would like to use this chatbot often.
2. I think chatbots shouldn't be unnecessarily complicated.
3. I think this chatbot is user-friendly.
4. I think technical support is required to use this chatbot.
5. I've found many functions of the chatbot work well.
6. I think some functions of this chatbot are inconsistent.
7. I think most people can quickly learn to use this chatbot.
8. I find using this chatbot very cumbersome.
9. I feel confident using this chatbot.
10. I need to learn a lot before I can use this chatbot.
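The scoring rule can be checked with a small sketch that follows the definitions of X and Y given for (6) and the standard SUS multiplier of 2.5. The function name and the sample responses are hypothetical and do not reproduce any expert's actual answers.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten Likert responses, each from 1 to 5."""
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("expected ten responses between 1 and 5")
    odd_sum = sum(responses[0::2])   # items 1, 3, 5, 7, 9 (positively worded)
    even_sum = sum(responses[1::2])  # items 2, 4, 6, 8, 10 (negatively worded)
    x = odd_sum - 5
    y = 25 - even_sum
    return (x + y) * 2.5


# Hypothetical responses from a single respondent, in item order 1..10.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))
```

For these sample responses the odd-numbered items sum to 22 (X = 17) and the even-numbered items sum to 8 (Y = 17), so the score is 34 × 2.5 = 85. A mean over all respondents is then taken to obtain the overall usability score.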

RESULTS AND DISCUSSION
The output of the Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity is shown in Figure 3. Figure 3(a) displays the Rich menu, a menu that facilitates use of the chatbot, and Figure 3(b) shows an example of a natural language interaction between a user and the chatbot.
The intents confusion matrix was employed to evaluate the intent accuracy of the chatbot's Aedes-borne disease analysis. The results, presented in Table 3, show that the chatbot achieved an intent accuracy of 85.00%. This indicates that the chatbot analyzes Aedes-borne diseases accurately and in line with users' intentions.
Figure 4 shows the results of the usability test of the chatbot through the SUS. The mean usability score was 89.75, which corresponds to excellent usability. This indicates that the chatbot was user-friendly and straightforward. Because the respondents were already familiar with using the Line application in their daily lives, they could easily understand the functionality of the chatbot without needing to learn how to use the program.

Figure 4. The usability testing result of the Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity through the SUS

CONCLUSION
This research focuses on developing a Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity. It utilizes natural language processing to understand the intentions of users interacting with the chatbot. Text features are selected using TF-IDF, and Jaccard similarity is then used to measure similarity against a database of mosquito-borne diseases so that appropriate responses can be provided to users. The chatbot is developed using PHP 7.2.34, with MySQL 5.7.32 for database management and Apache 2.2.29 operating as the bot server, and it incorporates the Line Messaging API for communication with users via the Line application. The research findings indicated that the Thai-language chatbot achieved an intent accuracy of 85.00%, accurately capturing user intentions. The SUS assessment also indicated a high usability score of 89.75, demonstrating that the chatbot was user-friendly.
Therefore, it can be concluded that the Thai-language chatbot, which analyzes mosquito-borne diseases using Jaccard similarity, provides accurate and user-friendly interactions. The research found that the chatbot's conversations and responses align with users' intentions. This performance relies on tokenization to define appropriate boundaries for morphemes, as the Thai language lacks clear word boundaries. When appropriate boundaries are assigned to the words, the term weighting used for feature selection with TF-IDF and the measurement of Jaccard similarity both become more accurate.

Figure 1. Line Messaging API working processes

Figure 2. The architecture of the chatbot

Figure 3. A Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity: (a) an example of a Rich menu and (b) an example user-chatbot interaction

Table 2. Description of the score from the SUS calculated from (6)

Table 3. The intents confusion matrix of the Thai-language chatbot analyzing mosquito-borne diseases using Jaccard similarity