A latent semantic analysis method for ranking the results of human disease search engine

ABSTRACT


INTRODUCTION
The growth of the internet and of web ontologies has made medical and disease data increasingly large [1], [2]. Many disease search engines have appeared to make these data sources easier to access [1], [3], [4]. In particular, human disease search engines and medical search engines based on disease factors or symptoms help people conveniently self-diagnose their diseases [1], [5]. The disease results returned by such a search engine must therefore be not only accurate but also reasonably ranked, so that users can see which disease they are most likely to have [6].
In information retrieval, a common method for ranking results is term frequency-inverse document frequency (TF-IDF) [7], [8]. This method ranks results by calculating the importance of the query words to each result document [9]. However, it does not address the relationship between the words in the query and the words in the result document. Another method of ranking disease results uses the Bayesian algorithm [10]. It calculates the probability of each disease result from the superclass of the disease and the number of diseases belonging to that superclass. Its limitation is that when the superclass contains only one or very few diseases, it assigns a very low probability to the disease result, which is incorrect when the result contains several disease factors that closely fit the disease factors in the query.
In this paper, we propose a method for ranking the disease results of search engines using the latent semantic analysis (LSA) technique. This method exploits the relationship between the disease factors in the query and those in the disease results to rank the results of the human disease search engine more accurately and to avoid the limitations of ranking with the Bayesian method or the common TF-IDF method.

Disease ontology
The data of the human disease search engine were extracted from the disease ontology (DO). DO is an internet resource for disease knowledge [11]. It was created in 2003 using the ninth revision of the international classification of diseases (ICD-9). The DO was then reorganized based on unified medical language system (UMLS) disease concepts [12]. The DO terms continue to be improved and extended. DO has a single structure for disease classification and provides a clear definition for each disease. A disease has a label, definition, subclass, superclass, and property. The disease property, or disease factor, includes symptom, cause, and location (the position where the symptoms occur). Figure 1 shows the hierarchy of DO.
The method proposed in this paper uses the LSA technique. LSA, also called latent semantic indexing (LSI), is a statistical method created in the late 1980s at Bellcore/Bell Laboratories by Landauer and his team. They defined LSA as a "theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" [13]. Research suggests that LSA resembles the way the human brain derives meaning from text and that LSA can infer deeper relationships in text data [14], [15]. LSA starts by creating a term-document matrix; the columns stand for words or terms, and the rows stand for documents. Each entry in the term-document matrix is the TF-IDF score of the word in the document. Next, the singular value decomposition (SVD) technique is applied to the term-document matrix; this step is the key feature of LSA. The matrices produced by this step have reduced dimensions, and the hidden meaning in the document text can be exploited from them [16].

METHOD
Figure 2 shows how the LSA method is applied to the human disease search engine. This section describes each part of Figure 2 in detail. The human disease search engine uses a MySQL database whose data were extracted from DO. The search engine can access the MySQL database faster and more flexibly than it can access DO directly. The disease database includes many tables that store information about all diseases, disease superclasses and subclasses, and all disease factors. The search engine works mostly with the disease definition table, which contains all diseases and their definitions. The definition of a disease includes information about the disease and its factors. A disease factor can be a symptom, cause, or location (the position where the symptoms occur). Figure 3 shows a disease definition.
The disease definition table consists of the columns "diseaseID", "disease label", and "definition". The "definition" column is extracted into a data frame for tokenization. Each row of the data frame is treated as a document. Tokenization removes stopwords and punctuation and converts words to lowercase. Each document is thereby processed into an array of words, and the TF-IDF algorithm is applied to these arrays.
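The tokenization step can be illustrated with a minimal Python sketch. The stopword list below is a small hypothetical one (the paper does not say which list it uses), and keeping underscores so that DO relation names such as has_symptom survive as single tokens is likewise an assumption.

```python
import string

# Hypothetical, deliberately tiny stopword list; the paper does not name the one it uses.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "or", "which", "is", "that", "by", "with"}

# Map every punctuation character except "_" to a space (assumption: DO relation
# names such as "has_symptom" should remain single tokens).
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation if ch != "_"})


def tokenize(definition: str) -> list[str]:
    """Lowercase a disease definition, strip punctuation, and drop stopwords."""
    cleaned = definition.lower().translate(_PUNCT_TABLE)
    return [word for word in cleaned.split() if word not in STOPWORDS]


if __name__ == "__main__":
    print(tokenize("The infection has_symptom fever, has_symptom sore throat."))
```

Each definition in the table would pass through a function like this once, and the resulting word arrays are what the TF-IDF step consumes.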

Query
The human disease search engine offers hint suggestions as users type a query in the search box (Figure 4). This increases the interaction between the user and the search engine [17]. When a user enters a keyword in the search box, the search engine suggests similar keywords, and the user can choose the one that suits their intent. These keywords are the disease factors in the engine database, so the returned results are more precise. Hint suggestions also help users who remember only part of a keyword: they can type that part in the search box, and the search engine suggests the full keyword along with other similar keywords [17]. Hints are necessary because ordinary users, who do not have much medical knowledge [18], [19], may not be able to enter keywords that match the medical terminology in the engine database, which leads to inaccurate results. For medical experts, hint suggestions are useful when they remember only part of a keyword [18]: the search engine suggests the full keyword, and they can consult other similar keywords in the engine.
Figure 5 shows the hint suggestion process of the search engine. The process starts when the user enters keywords into the search engine. After each space key press event, the search engine looks up similar keywords in the cache to suggest. If the cache does not yet contain matching keywords, the search engine queries the engine database for similar keywords, returns them to the user, and stores them in the cache for later use.
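The cache-first lookup can be sketched as follows. This is an illustration only: the class name HintSuggester, the in-memory list standing in for the MySQL database, and the substring rule used for "similar" keywords are all assumptions, since the paper does not specify them.

```python
class HintSuggester:
    """Cache-first keyword suggestions, loosely following the process in Figure 5."""

    def __init__(self, disease_factors):
        # Stand-in for the engine database: a plain list of disease-factor keywords.
        self._database = list(disease_factors)
        self._cache = {}
        self.database_queries = 0  # counts cache misses, for illustration only

    def _query_database(self, fragment):
        # Hypothetical similarity rule: case-insensitive substring match.
        self.database_queries += 1
        return sorted(k for k in self._database if fragment in k.lower())

    def suggest(self, typed_text):
        """Return suggestions for the text typed so far (called after a space key press)."""
        fragment = typed_text.strip().lower()
        if not fragment:
            return []
        if fragment not in self._cache:
            # Cache miss: ask the database once and remember the answer.
            self._cache[fragment] = self._query_database(fragment)
        return self._cache[fragment]


if __name__ == "__main__":
    suggester = HintSuggester(["fever", "hay fever", "sore throat", "skin rash"])
    print(suggester.suggest("fev "))   # first call queries the stand-in database
    print(suggester.suggest("fev "))   # second call is served from the cache
    print(suggester.database_queries)
```

Keying the cache on the typed fragment means a repeated fragment never reaches the database again, which is the behavior the process above describes.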

TF-IDF disease definition
TF-IDF is a technique used in information retrieval to measure the importance of a word to a document in a collection of documents [20]. The TF-IDF of a word is calculated in (1).

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \tag{1}$$

where TF is the number of occurrences of the word in the document divided by the total number of words in the document. IDF is calculated in (2), given here in its standard form.

$$\text{IDF}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \tag{2}$$

where |D| is the number of all documents, d is a document, and t is a word or term [21]. In this paper, each disease definition is a document, and all diseases in the database form the document collection. The TF-IDF technique generates the term-document matrix, in which each word has a score for each document. The term-document matrix of the disease data has about 5,000 columns, so processing it directly is computationally expensive. The SVD algorithm is applied to this term-document matrix to reduce its dimension and to exploit the relationships between words in the documents.
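As a small worked illustration of the TF-IDF scoring described above, the following sketch builds a term-document matrix from already tokenized documents. It assumes the plain, unsmoothed form of IDF with the natural logarithm; the paper's exact variant may differ, and the function name tf_idf_matrix is hypothetical.

```python
import math
from collections import Counter


def tf_idf_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF of word j in document i.

    Assumes unsmoothed IDF, log(|D| / number of documents containing the word).
    """
    vocabulary = sorted({word for doc in documents for word in doc})
    n_docs = len(documents)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))

    matrix = []
    for doc in documents:
        counts = Counter(doc)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[word])
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix


if __name__ == "__main__":
    docs = [["fever", "headache"], ["fever", "paralysis"]]
    vocab, scores = tf_idf_matrix(docs)
    print(vocab)
    for row in scores:
        print(row)
```

In this toy input, "fever" occurs in every document, so its IDF is log(1) = 0 and it scores zero everywhere, while the rarer words receive positive scores.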

SVD term-document
SVD is a matrix factorization technique that splits a matrix into a product of three matrices. It is commonly used for dimensionality reduction, which makes data easier to visualize and the desired information easier to extract [22]. Dimension reduction reduces the number of features [23] and thus improves computational efficiency. It also helps to reduce the noise and sparsity of the raw features [24]. The SVD is calculated in (3).

$$A = U S V^{T} \tag{3}$$

where U is an m×r orthogonal left singular matrix, V^T is an r×n orthogonal right singular matrix, S is an r×r diagonal matrix, and A is the original m×n matrix [25]. The SVD algorithm reduces the m×n matrix A to an m×r matrix and an r×n matrix (r ≤ min(m, n)). In this paper, the original matrix is the term-document matrix. The number of components (r) used when applying SVD to the term-document matrix of the disease data is the total number of disease classes (superclasses and subclasses) in the disease database (number of disease classes < number of diseases < number of disease words). Applying SVD to the term-document matrix of the disease data produces a word-component matrix (the U matrix, Figure 6) and a component-disease matrix (the V^T matrix, Figure 7).
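The decomposition and its truncation to r components can be sketched with NumPy. The 4×3 matrix below is made-up toy data (rows as words, columns as diseases, an assumed orientation), and truncated_svd is a hypothetical helper, not the implementation used in the paper.

```python
import numpy as np


def truncated_svd(A, r):
    """Return (U_r, S_r, Vt_r): the factors of the rank-r truncation of A."""
    # full_matrices=False gives the compact SVD: U is m x k, Vt is k x n, k = min(m, n).
    U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
    # Singular values come back in descending order, so keep the first r of them.
    return U[:, :r], np.diag(singular_values[:r]), Vt[:r, :]


if __name__ == "__main__":
    # Toy TF-IDF scores: 4 words (rows) by 3 diseases (columns). Values are illustrative.
    A = np.array([
        [0.5, 0.0, 0.2],
        [0.0, 0.4, 0.1],
        [0.3, 0.3, 0.0],
        [0.0, 0.1, 0.6],
    ])
    U_r, S_r, Vt_r = truncated_svd(A, r=2)
    print(U_r.shape, S_r.shape, Vt_r.shape)  # (4, 2) (2, 2) (2, 3)
    # Rank-2 approximation of A; it is close to A but not identical.
    print(U_r @ S_r @ Vt_r)
```

Here U_r plays the role of the word-component matrix and Vt_r that of the component-disease matrix. Keeping all min(m, n) components reproduces A exactly, while a smaller r trades accuracy for a more compact representation.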

Retrieval algorithm
The human disease search engine first performs a full-text search on the search query to find the diseases that contain the disease factors in the query. It then sums, for each row of the word-component matrix, the scores of the words in the search query; the row with the highest total score corresponds to the component that best matches the search query. Algorithm 1 presents this step in detail.
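The component selection step can be sketched in plain Python. The nested dictionary stands in for the word-component dataframe (one entry per component, as in Algorithm 1), the scores are made-up toy values, and treating unknown query words as contributing zero is an assumption.

```python
def best_component(word_component, query_words):
    """Return the component whose summed scores for the query words are highest.

    word_component maps a component number to a {word: score} dictionary.
    """
    totals = {}
    for component, scores in word_component.items():
        # Assumption: words missing from the vocabulary contribute nothing.
        totals[component] = sum(scores.get(word, 0.0) for word in query_words)
    # The component with the maximum total is the best match for the query.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Toy scores, for illustration only.
    word_component = {
        0: {"fever": 0.40, "paralysis": 0.10},
        1: {"fever": 0.20, "paralysis": 0.45},
    }
    print(best_component(word_component, ["fever", "paralysis"]))  # 1
```

With these toy values, component 0 totals 0.50 and component 1 totals 0.65, so component 1 is selected.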

RESULTS AND DISCUSSION
To investigate the effectiveness of the proposed ranking method, we ran two tests on the human disease search engine. In the first test, we randomly selected the symptoms "fever" and "paralysis" and searched the engine for diseases matching them. Table 1 shows the results. Powassan encephalitis is ranked first because it contains not only the symptoms in the search query but also other symptoms relatively close in meaning to them. Rabies is ranked second: it contains characteristic symptoms of its own, such as "hydrophobia", "prickling or itching sensation at the site of bite", and "difficulty swallowing", but its other symptoms are also close in meaning to the symptoms in the search query. La Crosse encephalitis has fewer symptoms close in meaning to the query symptoms than Powassan encephalitis and rabies, so it is ranked last.
In the second test, we randomly selected the symptoms "fever", "sore throat", and "skin" and searched the engine for diseases matching them. Table 2 shows the results. Hand, foot, and mouth disease is ranked first because it not only contains the symptoms in the search query but also contains many other symptoms that are close in meaning to them. Chickenpox contains fewer symptoms close to those in the search query, so it is ranked second. The result tables indicate that the proposed ranking method gives reasonable rankings for these queries. The method relies not only on matches between the symptoms of the result diseases and the symptoms in the search query, but also on how close in meaning the other symptoms of the result diseases are to the symptoms in the query.

CONCLUSION
This paper uses the LSA method to rank the disease results of a human disease search engine. The method takes advantage of both the TF-IDF score and the implicit relationship between disease factors, which makes the ranking of disease results more reasonable. In future work, the method could be combined with deep learning techniques to adjust the results more accurately. It helps make self-diagnosis with the human disease search engine more effective.

Algorithm 1. Retrieval
Input: word-component dataframe, WCdf
Output: the component number that best matches the search query
For each row in WCdf do
    S = sum of the scores of the search query words on row
    Sumlist.add(row.index, S) // row index is the component number
Return Sumlist.getmax().index // index of the element with the maximum value in Sumlist

Finally, the search engine analyzes the component-disease matrix and browses the column of the component found in the previous step to be the most suitable for the search query. To this end, Algorithm 2 structures the data into the required format. The result diseases are ranked based on the scores of the result disease rows.

Algorithm 2. Structuring data
Input: component-disease dataframe, CDdf; the component number that best matches the search query, compnumber; the result disease id list from the full-text search, diseaseIDlist
Output: the ranked result disease id list
For each row in CDdf[['diseaseID', 'component' + compnumber]] do
    If row.diseaseID in diseaseIDlist then
        rankdiseaseIDlist.add(row.diseaseID.value, row.('component' + compnumber).value)
Return rankdiseaseIDlist.sort() // rankdiseaseIDlist sorted by the values of its elements
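The ranking step of Algorithm 2 can be sketched in plain Python. The dictionary stands in for the component-disease dataframe, the disease ids and scores are made-up toy values, and sorting in descending order (highest score first) is an assumption consistent with the ranks reported in the result tables.

```python
def rank_diseases(component_disease, component, disease_ids):
    """Rank the full-text search results by their score on the chosen component.

    component_disease maps a disease id to a {component number: score} dictionary.
    Returns a list of (disease id, score) pairs, highest score first.
    """
    wanted = set(disease_ids)
    scored = [
        (disease_id, scores[component])
        for disease_id, scores in component_disease.items()
        if disease_id in wanted  # keep only diseases returned by the full-text search
    ]
    # Assumption: a higher component score means a better rank.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical ids and toy scores, for illustration only.
    component_disease = {
        "D1": {0: 0.05, 1: 0.11},
        "D2": {0: 0.30, 1: 0.18},
        "D3": {0: 0.02, 1: 0.40},
    }
    print(rank_diseases(component_disease, component=1, disease_ids=["D1", "D2"]))
```

Disease D3 has the highest score on component 1 in this toy data but is excluded because the full-text search did not return it, which mirrors the filter in Algorithm 2.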

Table 1. Search engine results for diseases matching the symptoms "fever" and "paralysis"

Table 2. Search engine results for diseases matching the symptoms "fever", "sore throat", and "skin"
| Rank | Disease | Definition | Score |
|---|---|---|---|
| 1 | Hand, foot, and mouth disease | [...] disease that results_in infection located_in skin, has_material_basis_in human coxsackievirus A16 or has_material_basis_in human enterovirus 71, which are transmitted_by contaminated fomites, and transmitted_by contact with nose and throat secretions, saliva, blister fluid and stool of infected persons. The infection has_symptom fever, has_symptom poor appetite, has_symptom malaise, has_symptom sore throat, has_symptom painful sores in the mouth, and has_symptom skin rash on the palms of the hands and soles of the feet | 0.181783 |
| 2 | Chickenpox | A viral infectious disease that results_in infection located_in skin, has_material_basis_in human herpesvirus 3, which is transmitted_by direct contact with secretions from the rash or transmitted_by droplet spread of respiratory secretions. The infection has_symptom anorexia, has_symptom myalgia, has_symptom nausea, has_symptom fever, has_symptom headache, has_symptom sore throat, and has_symptom blisters | 0.114284 |