A framework for predicting lncRNAs expression in human dendritic cells in response to M. tuberculosis infection

,

INTRODUCTION Tuberculosis (TB) is one of the most deadly contagious disease in the world that caused by antibiotic resistant bacteria known as Mycobacterium tuberculosis (Mtb). It primarily affects human lung and could travel through the bloodstream to infect other part of human body. Even worse, the bacteria could travel to the meninges, which are the membranes surrounding the brain and spinal cord. Since the function of meninges is to protect human central nervous system, the infected meninges causes a rise in pressure within the skull, resulting in nerve and brain tissue damage, which is often severe. This life-threatening condition is known as meningeal tuberculosis (TB meningitis). It has been reported that 10-20% of the TB meningitis patients will suffer long-term after-effect such as severe brain damage, epilepsy, paralysis, hearing loss, and blindness [1]. The human immune system ables to protect human body against diseases by identifying, attacking and destroying threats from viruses, bacteria and parasites when function properly. As for TB, the immune response of a patient is critical whereby it could either help the body to fight the progression of the disease or it could exacerbate the bacteria infection when there is involvement of certain key molecules. Even worse, tuberculosis is said to co-evolve with human and the ability of Mtb to manipulate human immune system to destruct lung tissue has made it an ultimate pathogen [2]. M. bovis Bacillus Calmette-Guérin (BCG), the vaccine that was introduced and being widely used since 1921 does not work well in protecting human against TB due to ❒ ISSN: 2302-9285 evolvement and latent movement of Mtb. Developing an effective vaccine has been very challenging and difficult since the bacteria ables to evade immune system attack by co-opting the mechanisms of the immune system itself to its own advantage [3], [4]. The immune response against Mtb involves a network of innate and adaptive immune responses, where dendritic cells (DCs) are the key cells that bridge them. DCs is one of the main types of professional antigen-presenting cells (APCs) of the immune system [5]. Though the Mtb primarily resides in DCs and ables to interfere with their functions as it has the ability to impair host innate and adaptive immune responses, their interactions are less well understood [6]. Furthermore, existing research findings shows that the interaction of DCs with Mtb is contradictory [7]. Long non-coding ribonukleat acids (lncRNAs) are RNA molecules with length exceeding more than 200 nucleotides, which do not code for proteins. Although lncRNAs are less understood, they do have crucial roles in diverse biological, pathological processes, and could cause prominent implications to various human diseases. They are expressed in a tissue-specific context and responsible for regulating transcriptional control. Recent studies show that they are functionally associated with various cancers [8]- [10] and immune-mediated diseases [11], [12]. They are the regulators of various immune function, where they have large effects on adaptive and innate immune system [6], [13], [14]. Microarray technology [15] and traditional wet-lab experiments [16] had been applied for many years to uncover lncRNAs differential expression in patients infected with tuberculosis. However, RNA-seq is proven to provide better estimates of transcript expressions [17]. As shown in Figure 1, we propose a general framework for predicting lncRNAs being expressed in human DCs. We intend to conduct RNA-seq expression analysis with convulational neural networks (CNNs), a well-known deep learning (DL) technique to reveal the identification and characterization of lncRNAs found in DCs associated to TB infection for TB resistant patients who were identified to have non-infected and infected states as discussed in [18]. The RNA-seq datasets of resistant patients are obtained in a form of FASTQ file format from sequence read archive (SRA) database. They are single-end reads of next generation sequencing (NGS) library, using Illumina HISeq 4,000 instrument, which sequencing was performed by Gilad Lab, University of Chicago [18]. The experiment of transcriptomic response of DCs towards Mtb infection will take into consideration the infected and non-infected group of patients. There are 26 samples altogether whereby 13 samples from infected patients, and another 13 samples are from non-infected patients. Information on raw RNA-Seq datasets used in this study are shown in Tables 1 and 2

Data preprocessing and quality control
Initial quality control (QC) assessment will be performed on all data samples as an effort of checking whether these data has any problems before continuing further analysis. This process will be done using FASTQC (version 0.11.9) software, a simple graphical user interface (GUI) tool developed in Java by Babraham Bioinformatics group at the Babraham Institute. In a form of graph and tables (QC report), this tool provides a quick access and overview on problematic areas of raw sequence data coming from high throughput pipelines [19]. It has 12 analysis modules, which meet the quality control standard for raw reads.
FastQC program generates a QC report, which lead to the choice of preprocessing steps that need to be undertaken in order to fix the identified issues. A read trimming and filtering tool optimized for Illumina NGS data known as Trimmomatic [20] will be utilized to discard low-quality reads, trim adapter sequences and eliminate poor quality bases. In this work, we will use Trimmomatic to trim effected data samples that contains error in "per base sequence content" module as reported by QC report. After the effected reads were trimmed, FastQC will be run again to assess the quality of the trimmed reads. The QC report will be checked again to make sure that all the reads that previously have issue with per base sequence content are error free. Once the reads satisfied all the quality requirements, we will consider that these reads are good to go for the next step of RNA-seq data processes and analysis.

Reference-based RNA-seq read alignment
To infer which transcripts are expressed, identify genomic positions or estimate where the reads originated from, the sample RNA-Seq reads need to be aligned against reference genome. In this work, hierarchical indexing for spliced alignment of transcripts 2 (HISAT2) [21], a fast and sensitive graph-based alignment program, which is developed to map NGS reads to a single or population of human genomes will be used to align these reads against the selected human reference genome. HISAT2 ables to produce higher accuracy of sequencing reads alignment compared to original HISAT [22] system as it incorporates algorithmic improvements, where a hierarchical graph FM index (HGFM) is applied [23].
The strain of human reference genome used in this project is genome research consortium human build 38 (GRCh38). In order to run HISAT2, the reference genome need to be built in a form of indexes. Since the GRCh38 indexes are already available in HISAT2 website, we downloaded the HGFM index for reference plus transcripts directly from their website. These indexes use ensembl gene annotations, where it has many more transcripts compared to reference sequence (RefSeq) annotations [23]. The output of HISAT2 aligner is in sequence alignment map (SAM) format. Using Samtools, the SAM files will be sorted and converted into binary alignment map (BAM), which stores the same data but in a compressed binary representation for improved performance [24]. These BAM files will then be used as the input files in transcriptome assembly.

Transcriptome assembly
RNA-seq reads need to be reconstructed into a full length transcripts to allow for gene expression studies. For this purpose, we will be using StringTie, a transcript assembler and quantification tool for RNA-Seq, which was also developed by central for computational Biology (CCB) of John Hopkins University. This tool uses genome-guided transcriptome assembly together with de novo genome assembly approaches to improve transcript assembly. It applies network flow algorithm to estimate expression level for each transcripts ❒ ISSN: 2302-9285 [25]. StringTie is claimed to be better than other leading assembler such as cufflinks as it could produce more complete and accurate reconstructions of genes and better expression level estimation [26], [27]. Each alignment files produced by HISAT2, which then will be converted into BAM format are assembled using StringTie2. StringTie2 is the latest release of StringTie program, which include a new method in better high rate error handling and has the ability to work with full length super-reads that brings to better quality of short-reads assemblies [27]. The assembly process requires for reference genome annotation in .gtf format. Therefore, we downloaded human GRCh38 (version 21) comprehensive gene annotation file from genome ENCyclopedia Of DNA elements (GENCODE) [28] online database for this purpose.

Data filtering
Using in-house scripts, assembled transcripts will need to be checked and filtered to meet only criteria of a lncRNA.The selection criteria includes to select only transcripts with nucleotide sequence more than 200 base pairs, minimum coverage is at least 3x and the number of exon should be more than 2 as suggested in [29], [30]. All transcripts that do not meet these criteria will be discarded. Then, the lncRNA transcripts from the 26 samples will be merged into two different files according to patients group.

Differential expression analysis
One of the key steps in analyzing RNA-seq data is to perform the genes differential expression analysis between different biological conditions, where the summarized data will be assessed by statistical models [17]. To identify and quantify the changes in expression levels between the two groups of TB resistant patients, DEseq2 bioconductor software package will be used as the statistical tool to perform differential analysis of count data. As an improved package of DESeq [31], DESeq2 estimates and perform statistical inference on differential data based on negative binomial distribution. Using shrinkage estimators for the dispersion and fold changes in differential expression analysis allows DEseq2 to offer a sound, consistent performance and statistically well-founded solution to the wide dynamic range of RNA-seq experiments [32].

Deep learning method
In predicting non coding RNAs, it had been proven that the prediction performance of tools that apply deep learning methods exceeded other traditional machine learning methods in terms of identification reliability, ease of use and ability to utilize features not incorporated in the current knowledge [33], [34].

Convulational neural networks
The considerable success of CNNs in various visions and imaging tasks has given significant impact to nearly all scientific fields including bioinformatics. Alzubaidi et al. [35], CNNs is the most utilized deep learning network type that could automatically identifies relevant features without any human intervention and considered to be more powerful than recurrent neural networks (RNNs), another well-known deep learning algorithm capable of processing sequential data. Therefore, as shown in Figure 1, CNNs technique is proposed to be implemented in identifying and classifying between lncRNAs and mRNAs being expressed in human DCs. We intend to explore the capability of CNNs to extract information from one-dimensional biological sequences data as discussed by [36], [37].
We will enhance the lncRNAs prediction by comparing, mapping and consolidating the results yielded from CNNs technique with those discovered by differential expression analysis method using DESeq2. These processes will be done by running our own in-house scripts. We expect that the annotated and putative lncRNAs could be identified and any possible predicted mRNAs will be discarded.

Training datasets
Datasets of human lncRNAs from two well-known public databases, which are GENCODE and LNCipedia, and human mRNAs datasets from the RefSeq, an NCBI RefSeq database will be used as the training datasets to discover and learn RNAs data patterns. As reported by [38], GENCODE had suggested that there are more than 16,000 lncRNAs in human genome. It has comprehensive gene annotation of lncRNA genes on the reference chromosomes. While LNCipedia currently contains 127,802 transcripts with 107,039 are considered as high-confidence set of lncRNAs [39]. As for RefSeq, it contains curated, non-redundant collection of sequences representing genomes, transcripts and proteins [40].

CONCLUSION
One of the important research applications of RNA-seq these days is to discover the lncRNAs expression profiles using computational tools and pipelines. There are studies to uncover lncRNAs differential expressions in patients with tuberculosis infections using microarray technology and traditional wet-lab experiments. However, RNA-seq is proven to come up with better estimates of transcript expressions. The sophisticated high-throughput RNA sequencing technology allows researchers to characterize and quantify differential expressions with higher sensitivity, higher speed and higher dinamic range. To further enhance and improve the prediction results, we propose a framework to discover lncRNA transcripts being expressed in human DCs of two TB resistant patient groups by incorporating CNNs classification technique with existing RNA-Seq expression analysis.