What changes has deep learning brought to biology?
Deep learning research and its potential applications in the biomedical field

Deep learning has been applied successfully across a range of biological problems. In this section, we review the challenges and opportunities for deep learning in various research fields and, where available, the studies that have applied deep learning to these problems (Table 1). We first review fields important for biomarker development, including genomics, transcriptomics, proteomics, structural biology and chemistry. We then review the prospects for drug discovery and repurposing, including the use of multi-platform data.

Biomarkers. An important task of biomedicine is to translate biological data into effective biomarkers that reflect phenotypes and physical states such as disease. Biomarkers are important for evaluating the results of clinical trials, and identifying sensitive, specific biomarkers is a major challenge for modern translational medicine. Computational biology is central to biomarker development. In fact, almost any data source, from genomics to proteomics, can be used; these are discussed in the sections that follow.

Genomics. Next-generation sequencing (NGS) technology has enabled the generation of vast amounts of genomic data. Much of the analysis of these data can be carried out in silico with modern computational methods. This includes structural annotation of the genome, covering non-coding regulatory sequences, protein binding site prediction and splice sites.

An important branch of genomics is metagenomics, also known as environmental, ecological or community genomics. NGS technology has revealed the natural diversity of uncultured microorganisms, which had not previously been accessible to study.

Metagenomics poses several bioinformatics challenges, chief among them the functional analysis of sequence data and the analysis of species diversity. Deep belief networks and recurrent neural networks have been used for phenotypic classification of metagenomic pH data and human microbiome data. Compared with baseline methods, these approaches did not improve classification accuracy, but they provide the ability to learn hierarchical representations of the data sets.

Deep learning has also achieved some success in processing high-dimensional transcriptome data. In one approach, features were extracted from gene expression profiles and non-coding transcripts such as miRNAs; this was achieved by combining a deep belief network with active learning, in which the deep learning feature extractor reduced the dimensionality of six cancer data sets and outperformed basic feature selection methods [27]. Applying active learning to classification improved accuracy and allowed the selection of cancer-related features (improving cancer classification) beyond what gene expression profiles alone provide. Feature selection from miRNA data exploited the relationships with the target genes of the previously selected feature subset.

In another deep learning application, Fakoor et al. used autoencoder networks to generalize cancer classification across microarray gene expression data for different gene sets obtained from different microarray platforms (the Affymetrix family) [28]. They combined PCA with unsupervised nonlinear sparse feature learning (via autoencoders), using dimensionality reduction to construct features for the general classification of microarray data. Classification of cancer versus non-cancer cells showed important improvements, especially with supervised fine-tuning, which makes the features less general but yields higher classification accuracy even for data without cross-platform normalization. The generalization ability of autoencoders helps with data collected by different microarray technologies, so the approach may be promising for large-scale analysis of data in the public domain.
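
To make the pattern concrete, here is a minimal sketch in Python (PyTorch and scikit-learn) of the general pipeline described above: PCA followed by a sparse autoencoder whose hidden code feeds a downstream classifier. The dimensions, hyperparameters and data are illustrative placeholders, not those of Fakoor et al.

```python
# Sketch: PCA + sparse autoencoder features for expression-based classification.
# All data and dimensions below are placeholders for illustration.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 5000).astype(np.float32)  # placeholder expression matrix
y = np.random.randint(0, 2, 200)                  # placeholder cancer/non-cancer labels

X_pca = PCA(n_components=100).fit_transform(X).astype(np.float32)

class Autoencoder(nn.Module):
    def __init__(self, d_in=100, d_hidden=50):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(d_hidden, d_in)
    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xs = torch.from_numpy(X_pca)
for _ in range(200):  # unsupervised reconstruction training
    recon, h = model(xs)
    # L1 penalty on the hidden code encourages sparse features
    loss = nn.functional.mse_loss(recon, xs) + 1e-4 * h.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    features = model(xs)[1].numpy()  # learned low-dimensional features
clf = LogisticRegression(max_iter=1000).fit(features, y)  # supervised step
```

In this sketch, the supervised fine-tuning mentioned above would correspond to continuing to train the encoder jointly with the classifier rather than freezing the learned features.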

Image processing applications. Gene expression can also be captured in visual form as images, such as fluorescence signals from microarrays or fluorescence and radioactive signals from RNA in situ hybridization. In several applications, CNNs, well known for their excellent image processing performance, have shown the potential to improve these image analyses.

In microarray analysis, detecting signals and identifying fluorescent spots can be challenging because of variation in spot size, shape, position and signal intensity, where fluorescence intensity usually corresponds to the expression level of a gene or sequence. In one application of deep learning to this problem, a CNN was used to segment microarray images; it showed accuracy similar to the benchmark method, but training was simpler and required fewer computational resources [29].
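
As an illustration of the general idea (not the model of [29]), the following hedged sketch casts spot detection as patch classification with a tiny CNN; the architecture and patch size are assumptions chosen for brevity.

```python
# Sketch: label each pixel's neighborhood patch as "spot" vs "background",
# one simple way to cast microarray image segmentation as classification.
import torch
import torch.nn as nn

class SpotPatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                # 16x16 -> 8x8
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                # 8x8 -> 4x4
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 2),       # spot / background logits
        )
    def forward(self, patch):
        return self.net(patch)

model = SpotPatchCNN()
patches = torch.rand(32, 1, 16, 16)  # grayscale 16x16 patches around pixels
logits = model(patches)              # (32, 2); argmax gives per-patch label
```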

Another opportunity to apply CNNs to image-based gene expression data is RNA in situ hybridization, a laborious technique that, where it can be performed, allows gene expression to be localized and visualized in groups of cells, tissue slices or whole organisms. The method supports powerful longitudinal studies, clarifying how expression patterns change during development. It was used to construct the detailed Allen Developing Mouse Brain Atlas, which contains expression maps for more than 2,000 genes, each described in multiple brain sections. In the past, such annotation was manual: time-consuming, expensive and sometimes inaccurate. Recently, however, deep pre-trained CNNs have been used for automated annotation [30]. The neural network was trained on raw natural in situ hybridization images of developing mouse brains at different levels, without precise spatial (coordinate) information, and achieved excellent accuracy at multiple brain levels across four developmental stages.

Splicing. Another application field for deep learning is splicing, one of the main mechanisms by which eukaryotes generate protein diversity. Moreover, recent research has revealed connections between the "splicing code" and various diseases [31]. However, the mechanisms controlling splicing regulation are still not fully understood. Modern concepts of splicing regulation involve the transcription level, the presence of specific regulatory sequence elements (splicing enhancers or silencers), the structure of splice sites and the state of splicing factors (for example, phosphorylation of specific sites may change a splicing factor's activity). All of these factors complicate analysis, because the elements are numerous and interact in complex, nonlinear ways. Existing splicing prediction software requires high-throughput sequencing data as input and faces the problems that raw reads are shorter than typical genes, duplication levels in the genome are high and pseudogenes are present. Consequently, algorithms for analyzing splicing mechanisms are slow and computationally demanding, and deep learning may offer improvements here. In one application using five tissue-specific RNA-seq data sets, a DNN was developed with hidden variables derived from genomic sequence and tissue type; it proved superior to a Bayesian method ("splicing code" metrics) in predicting, within individual tissues and between tissue pairs, the change in the percentage of transcripts with an exon spliced in [32].
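
A minimal sketch of this kind of model follows, assuming hand-crafted genomic sequence features plus a one-hot tissue index as inputs and percent spliced in (PSI) as the output; the layer sizes and feature dimensions are illustrative, not those of [32].

```python
# Sketch: MLP mapping sequence features + tissue identity to predicted PSI.
import torch
import torch.nn as nn

N_SEQ_FEATURES, N_TISSUES = 300, 5   # assumed dimensions

model = nn.Sequential(
    nn.Linear(N_SEQ_FEATURES + N_TISSUES, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # PSI is a fraction in [0, 1]
)

seq_feats = torch.rand(8, N_SEQ_FEATURES)           # e.g. motif/structure features
tissue = nn.functional.one_hot(torch.randint(0, N_TISSUES, (8,)),
                               N_TISSUES).float()   # tissue identity
psi = model(torch.cat([seq_feats, tissue], dim=1))  # predicted PSI per exon
```

Conditioning on the tissue index is what lets a single model capture tissue-specific splicing behavior across the five data sets.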

Non-coding RNA. Non-coding RNA presents another problem in biology that requires sophisticated computational methods such as deep learning. Non-coding RNAs are very important, being involved in the regulation of transcription, translation and epigenetics [33], but it remains difficult to distinguish them from protein-coding RNAs. For short non-coding RNAs this task is well solved, but it remains quite challenging for lncRNAs. LncRNAs are heterogeneous and may contain putative open reading frames (ORFs) and short protein-like sequences. A new deep learning method called lncRNAMFDL was developed to identify lncRNAs using ORFs, k adjacent bases, secondary structure and predicted coding domain sequences. This method uses five separate features extracted from Gencode (lncRNA) and RefSeq (protein-coding mRNA) sequence data, and achieved a prediction accuracy of 97.1% on a human data set.
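
Two of the sequence-derived features mentioned above, ORF length and adjacent-base (k-mer) composition, are straightforward to compute. The sketch below shows one plain-Python way to do so; the exact feature definitions used by lncRNAMFDL may differ.

```python
# Sketch: two illustrative sequence features for coding vs non-coding RNA.
from itertools import product

def longest_orf_length(seq):
    """Length (nt) of the longest ATG..stop open reading frame on the + strand."""
    stops = {"TAA", "TAG", "TGA"}
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in stops and start is not None:
                best = max(best, i + 3 - start)
                start = None
    return best

def kmer_freqs(seq, k=3):
    """Normalized k-mer frequency vector over the 4**k possible k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in counts:
            counts[seq[i:i + k]] += 1
    total = max(1, len(seq) - k + 1)
    return [counts[km] / total for km in kmers]

seq = "ATGGCGTAACGTATGCCCTAA"
features = [longest_orf_length(seq)] + kmer_freqs(seq)  # input to a classifier
```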

Expression quantitative trait locus analysis. Finally, quantitative trait locus (QTL) analysis holds potential for further study. QTL analysis identifies loci containing polymorphisms that cause phenotypic variation in complex polygenic traits (such as body weight, drug response or immune response). One such "trait" showing genetic variation is the expression, or transcript abundance, of a given gene in a given tissue and/or condition. Expression QTLs (eQTLs) are genetic variants that affect transcript abundance. eQTL analysis has led to a deeper understanding of the regulation of human gene expression, but it faces many challenges. eQTLs that regulate expression locally (cis-eQTLs) are relatively easy to identify with a limited number of statistical tests, but trans-eQTLs, which regulate the expression of genes elsewhere in the genome, are much harder to detect. Recently, a deep learning method, MASSQTL [35], was proposed to solve the trans-eQTL prediction problem using various encoded biological features, such as physical protein interaction networks, gene annotations, evolutionary conservation, local sequence information and functional elements from the ENCODE project. The DNN outperformed other machine learning models and, using nine DNN models from their respective cross-validation folds, provided new insight into the regulatory architecture of gene expression. A deep decoding system was also used to cluster the trans-eQTL feature vectors, which were then visualized with the t-SNE dimensionality reduction technique.
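
The final visualization step is easy to reproduce in outline. The sketch below projects placeholder feature vectors to two dimensions with scikit-learn's t-SNE; the real MASSQTL features would replace the random matrix.

```python
# Sketch: 2-D t-SNE projection of high-dimensional trans-eQTL feature vectors.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.rand(500, 64)  # placeholder per-variant encoded features
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

plt.scatter(emb[:, 0], emb[:, 1], s=5)
plt.xlabel("t-SNE 1"); plt.ylabel("t-SNE 2")
plt.title("trans-eQTL feature vectors (illustrative)")
plt.show()
```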

Proteomics. Compared with transcriptomics, proteomics is a rather underdeveloped research field, with less data and fewer computational methods for analysis. Even though similar signal encoding and transmission mechanisms exist, the shortage of human proteomic data and the difficulty of translating results from model organisms to humans complicate the analysis.

Deep learning can benefit proteomics in many ways, because some methods do not require as many training cases as other machine learning algorithms. Other advantages of deep learning methods are that they build hierarchical representations of the data and learn general features from complex interactions, which benefits proteomics and protein network analysis. For example, using phosphorylation data, a bimodal deep belief network has been used to predict the cellular responses of rat cells to a given stimulus [36]. Compared with a traditional pipeline, the algorithm achieved considerably higher accuracy.

Structural biology and chemistry. Structural biology encompasses protein folding analysis, protein dynamics, molecular modeling and drug design. Secondary and tertiary structures are important features of protein and RNA molecules. For proteins, correct structure determination is critical for predicting enzyme function, the binding between catalytic centers and substrates, immune function (antigen binding), transcription factor function (DNA binding) and post-transcriptional modification (RNA binding). Loss of proper structure leads to loss of function and, in some cases, abnormal protein aggregation, which can cause neurodegenerative diseases such as Alzheimer's disease or Parkinson's disease [37].

Comparative modeling based on homology is one possible way to predict a protein's secondary structure, but it is limited by the number of well-annotated proteins. Machine learning-based ab initio prediction, by contrast, relies on recognizing patterns in proteins with well-characterized structures, but has not been accurate enough for practical use. Using an ab initio deep learning method, structure prediction was improved with protein sequence data [38]. Similarly, deep learning has been applied to predict contacts and orientations between secondary structural elements and amino acid residues, using data from the ASTRAL database and a complex three-stage approach [39]. The resulting method is an effective tool for analyzing biased and highly variable data.

The invariance of three-dimensional structure also matters for function. However, some protein classes lack a unique structure yet participate in basic biological processes such as cell-cycle control, regulation of gene expression and molecular signaling. Moreover, recent research has shown the importance of several disordered proteins [37]; many oncoproteins contain unstructured regions, and abnormal aggregation of misfolded proteins drives disease development [40]. Proteins without a fixed three-dimensional structure are called intrinsically disordered proteins (IDPs), and domains without a constant structure are called intrinsically disordered regions (IDRs).

Many parameters distinguish IDPs/IDRs from structured proteins, which makes prediction challenging. The problem can be addressed with deep learning algorithms, which can take many features into account. In 2013, Eickholt and Cheng published a sequence-based deep learning predictor, DNdisorder, which improved the prediction of disordered proteins compared with advanced predictors [41]. Later, in 2015, Wang et al. proposed a new method, DeepCNF, that accurately predicts multiple parameters, such as IDPs or proteins with IDRs, using experimental data from the Critical Assessment of protein Structure Prediction (CASP9 and CASP10). By exploiting many features, the DeepCNF algorithm performed better than baseline ab initio predictors [42].

Another important class of proteins is the RNA-binding proteins (RBPs), which bind single- or double-stranded RNA. These proteins participate in various post-transcriptional processes: splicing, editing, regulation of translation (protein synthesis) and polyadenylation. RNA molecules form various types of stems and loops, so it is necessary to identify the secondary and tertiary structures involved in RNA-protein contacts. The secondary and tertiary structures of RNA are predictable and have been used to model structural preferences and predict RBP binding sites with deep belief networks [43]. The framework was validated on real CLIP-seq (cross-linking immunoprecipitation with high-throughput sequencing) data sets, demonstrating its ability to extract hidden features from raw sequence and structure profiles and to accurately predict RBP binding sites.
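
As a rough illustration of combining sequence and structure inputs (ref. [43] used deep belief networks; a simpler 1-D CNN stands in here), one can stack a one-hot sequence encoding with per-base structure-context channels:

```python
# Sketch: sequence one-hot encoding + structure channels -> binding probability.
import torch
import torch.nn as nn

SEQ_LEN = 101                      # window centered on a candidate site
N_CHANNELS = 4 + 3                 # A/C/G/U one-hot + 3 structure channels
                                   # (e.g. paired/hairpin/loop probabilities)

model = nn.Sequential(
    nn.Conv1d(N_CHANNELS, 16, kernel_size=11, padding=5), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),       # strongest motif response along the window
    nn.Flatten(),
    nn.Linear(16, 1), nn.Sigmoid() # probability the RBP binds this window
)

x = torch.rand(4, N_CHANNELS, SEQ_LEN)  # batch of encoded RNA windows
p_bind = model(x)
```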

Drug discovery and repurposing. Computational pharmacology and biochemistry are used in almost every stage of drug discovery, development and repurposing. Over the past decades, research groups and companies around the world have developed a large number of computational methods for in silico drug discovery and target extension, aiming to reduce time and resource consumption. Although many methods exist [44], none is optimal (for example, they are limited either in screening throughput or by protein class). Several studies now suggest that deep learning is an approach worth serious consideration (Table 1).

One of the important tasks in drug discovery is predicting drug-target interactions. A target (protein) usually has one or more binding sites for substrates or regulatory molecules, and these can be used to build prediction models; however, including other parts of the protein may bias the analysis. Wang et al. used a pairwise input neural network, which accepts two vectors of features derived from the protein sequence and the ligand profile, to predict target-ligand interactions [45]. An advantage of this neural network is that it is more accurate than other representative target-ligand interaction prediction methods.
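
The two-branch idea is simple to sketch: each arm embeds one of the two input feature vectors and a shared head scores the pair. All dimensions below are assumptions for illustration, not those of Wang et al.

```python
# Sketch: a two-branch (pairwise input) network for target-ligand scoring.
import torch
import torch.nn as nn

class PairwiseNet(nn.Module):
    def __init__(self, d_prot=400, d_lig=200, d_hidden=64):
        super().__init__()
        self.prot_arm = nn.Sequential(nn.Linear(d_prot, d_hidden), nn.ReLU())
        self.lig_arm = nn.Sequential(nn.Linear(d_lig, d_hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * d_hidden, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())
    def forward(self, prot, lig):
        z = torch.cat([self.prot_arm(prot), self.lig_arm(lig)], dim=1)
        return self.head(z)   # predicted interaction probability

model = PairwiseNet()
score = model(torch.rand(8, 400), torch.rand(8, 200))
```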

Drug discovery and evaluation are expensive, time-consuming and risky; computational methods and predictive algorithms help reduce risk and save resources. One such risk is toxicity; for example, hepatotoxicity (liver toxicity) is a common reason for drug discontinuation, and predicting hepatotoxicity computationally may help rule out potentially hepatotoxic drug candidates. With deep learning, compound toxicity can be determined effectively from the raw chemical structure, without a complicated encoding process [46]. CNNs can also predict epoxidation and other properties that imply high reactivity and possible toxicity; Hughes et al. first achieved this using data on epoxidized molecules in the simplified molecular-input line-entry system (SMILES) format, with hydroxylated molecules as negative controls [47].
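
A hedged sketch of the general encoding idea follows: one-hot encode SMILES strings character by character and feed them to a 1-D convolutional classifier. The character vocabulary, architecture and example molecules are illustrative assumptions, not Hughes et al.'s pipeline.

```python
# Sketch: character-level one-hot SMILES encoding + 1-D CNN reactivity score.
import torch
import torch.nn as nn

VOCAB = sorted(set("CNOPSclBrFI()=#[]+-123456789%"))  # toy character set
CHAR2IDX = {c: i for i, c in enumerate(VOCAB)}
MAX_LEN = 80

def encode_smiles(s):
    x = torch.zeros(len(VOCAB), MAX_LEN)
    for i, ch in enumerate(s[:MAX_LEN]):
        if ch in CHAR2IDX:
            x[CHAR2IDX[ch], i] = 1.0
    return x

model = nn.Sequential(
    nn.Conv1d(len(VOCAB), 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

batch = torch.stack([encode_smiles("CC1CO1"),   # an epoxide
                     encode_smiles("CCO")])     # a hydroxylated negative
p_reactive = model(batch)
```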

Multi-platform data (multi-omics). The ability to use multi-platform data is a major advantage of deep learning algorithms. Because biological systems are complex, with many interrelated elements, the systematic integration of genomic, epigenomic and transcriptomic data is key to extracting the most valid and biologically meaningful results. The integration process is computationally non-trivial, but the payoff is greater biomarker specificity and sensitivity compared with single-source methods.

One of the main fields of computational biology that requires the analysis of combined data is computational epigenetics. Joint analysis of genomic, transcriptomic, methylome and histone modification features provides accurate epigenome prediction.

Several groups have developed deep learning methods that can analyze data from multiple sources (Table 1). DeepBind (tools.genes.toronto.edu/deepbind/), a deep learning-based method developed by Alipanahi et al., computes the ability of nucleotide sequences to bind transcription factors and RNA-binding proteins across various diseases, and characterizes the effect of single point mutations on binding properties. The DeepBind software is inspired by CNNs and is technology-agnostic: it is compatible with qualitatively different forms of data, from microarrays to sequencing. The CPU implementation also lets users parallelize the computation [48]. In another CNN-based application, Zhou and Troyanskaya designed the DeepSEA framework to predict chromatin features and evaluate disease-associated sequence variants. Unlike other computational methods, their algorithm captures large-scale contextual sequence information at each binding site for annotating de novo sequence variants [49]. A similar CNN pipeline was developed to reveal the influence of sequence variation on chromatin regulation, trained and tested on DNase-seq (DNase I sequencing) data [50]. This deep learning software, named Basset, outperformed the baseline method, achieving a mean AUC of 0.892 across all data sets. Finally, with the development of the deep feature selection model, deep learning has been used to identify active enhancers and promoters. This model exploits the ability of DNNs to model complex nonlinear interactions and to learn high-level generalized features [51]. The model selects features from multi-platform data and ranks them by importance. In these applications, deep learning methods are more sensitive and powerful predictors of chromatin features, which is also key to developing sophisticated biomarkers.
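
The variant-annotation idea behind tools like DeepSEA can be sketched as a difference in model outputs between the reference and the mutated sequence. The model below is an untrained stand-in used only to show the mechanics, not the published architecture.

```python
# Sketch: score a single-nucleotide variant as the change in a model's
# predicted chromatin-feature probability between reference and alternate.
import torch
import torch.nn as nn

BASES = "ACGT"

def one_hot(seq):
    x = torch.zeros(1, 4, len(seq))
    for i, b in enumerate(seq):
        x[0, BASES.index(b), i] = 1.0
    return x

model = nn.Sequential(                       # stand-in chromatin-feature model
    nn.Conv1d(4, 16, kernel_size=8), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),
    nn.Linear(16, 1), nn.Sigmoid(),
)

ref = "ACGTACGTACGTACGTACGTACGT"
alt = ref[:12] + "G" + ref[13:]              # single-nucleotide substitution
with torch.no_grad():
    effect = (model(one_hot(alt)) - model(one_hot(ref))).item()
print(f"predicted variant effect score: {effect:+.4f}")
```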

Cancer is a group of heterogeneous diseases, some of which are caused by gene mutations, so classifying cancers with multi-platform data can reveal the underlying pathology. Liang et al. developed a deep belief network model over multi-platform data to cluster cancer patients [52], using restricted Boltzmann machines to encode the features defined by each input modality. One advantage of this approach is that the deep belief network does not require normally distributed data, unlike many other clustering algorithms, and genetic (biological) data are typically not normally distributed.
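
In outline, this approach can be sketched with scikit-learn's BernoulliRBM: encode each modality separately, concatenate the hidden representations and cluster patients on the joint code. The data, dimensions and the use of k-means here are illustrative assumptions, not Liang et al.'s exact pipeline.

```python
# Sketch: per-modality RBM encodings concatenated for patient clustering.
# (sklearn's BernoulliRBM expects inputs scaled to [0, 1].)
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.random((100, 500))      # placeholder expression matrix (scaled)
meth = rng.random((100, 300))      # placeholder methylation matrix (scaled)

h_expr = BernoulliRBM(n_components=30, random_state=0).fit_transform(expr)
h_meth = BernoulliRBM(n_components=30, random_state=0).fit_transform(meth)

joint = np.hstack([h_expr, h_meth])          # joint patient representation
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(joint)
```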

Finally, from a natural language processing perspective, deep learning can test the plausibility of hypotheses by mining enormous unstructured data (research publications and patents) and structured data (annotated knowledge graphs such as Gene Ontology [53] or ChEMBL [54]). Together these databases constitute a huge multi-platform data set that would become even richer and more comprehensive if combined.

In summary, the sheer scale of modern biological data is too large and too complex for purely human-driven analysis. Machine learning, and in particular deep learning combined with human expertise, is the only way to comprehensively integrate multiple large multi-platform databases. Deep learning has enabled feats previously unimaginable: image recognition with millions of inputs, and speech recognition approaching human capability. Although deep learning, and especially unsupervised deep learning, is still in its infancy, particularly in biological applications, early research supports it as a promising approach that, despite its limitations and implementation challenges, can overcome some of the problems of biological data and yield new insight into the mechanisms and pathways of complex, interrelated diseases.