Introduction to bioinformatics
Contents: 1 Pinyin; 2 English reference; 3 Main research contents of bioinformatics; 3.1 Obtaining the complete genomes of humans and various organisms; 3.2 Discovering new genes and new single nucleotide polymorphisms; 3.3 Studying the structure and function of non-protein-coding regions of the genome; 3.4 Studying biological evolution at the genome level; 3.5 Comparative study of whole genomes; 3.6 From functional genomics to systems biology; 3.7 Protein structure simulation and drug design; 3.8 Research on the application and development of bioinformatics

1 Pinyin: shēng wù xìn xī xué

2 English reference: bioinformatics

Bioinformatics is a new interdisciplinary subject. Because it involves both biology and information science, many people assume it must be a very broad field. In fact, its meaning is specific and its scope is clear. Bioinformatics arose alongside genome research, so its research content has developed in close step with genome research.

Broadly speaking, bioinformatics deals with the acquisition, processing, storage, distribution, analysis and interpretation of biological information related to genome research. This definition has two parts: one is the collection, organization and provision of massive data, that is, the management of these data; the other is the discovery of new laws in them, that is, making good use of these data.

Specifically, bioinformatics takes the analysis of genomic DNA sequence information as its starting point, finding the regions in the genome sequence that code for proteins and RNA genes. At the same time, it seeks to clarify the informational nature of the large non-coding regions of the genome and to decipher the rules of the genetic language hidden in the DNA sequence. On this basis, it collates the transcriptome and proteome data related to the release and regulation of genomic genetic information, so as to understand the laws of metabolism, development, differentiation and evolution.

Bioinformatics also uses coding-region information in the genome to model the spatial structure of proteins and predict their function. It combines this information with physiological and biochemical information about organisms and life processes to clarify molecular mechanisms, and ultimately to carry out the molecular design of proteins and nucleic acids, drug design and the design of individualized medical care.

Genome informatics, protein structure calculation and simulation, and drug design are all closely tied to the central dogma of genetic information transfer, so the three must be organically linked.

Why does genome research need to rely on bioinformatics? First, as genome research has progressed, the related information has grown explosively, and there is an urgent need to process massive amounts of biological data. Since scientists deciphered the complete genome of Haemophilus influenzae in 1995, the whole genomes of about 60 microorganisms and several eukaryotes, such as yeast, nematodes, fruit flies and Arabidopsis thaliana, have been sequenced. By the spring of 2001, scientists had published most of the human genome sequence, the so-called working draft of the human genome. These achievements mean that genome research is entering a new stage of information extraction and data analysis. According to international database statistics, the number of DNA bases stood at 3 billion in December 1999 and 6 billion in April 2000, and has now reached 14 billion, doubling about every 14 months. Over the same period, the processing capacity of computer chips has been doubling about every 18 months. Computers can therefore manage and process these massive data effectively.
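The two doubling times quoted above can be compared with a few lines of arithmetic. The following is a minimal sketch that assumes simple exponential growth; the function name `growth_factor` is introduced here purely for illustration.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Factor by which a quantity grows after `months`, given its doubling time."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Doubling times taken from the text: 14 months for sequence data,
    # 18 months for chip processing capacity.
    for years in (1, 5, 10):
        months = 12 * years
        data = growth_factor(months, 14)
        chips = growth_factor(months, 18)
        print(f"{years:>2} yr: data x{data:.1f}, chips x{chips:.1f}, "
              f"ratio {data / chips:.2f}")
```

Under these assumptions the data volume pulls ahead of chip capacity over time, which is one reason better algorithms matter as well as faster hardware.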

However, the more fundamental reason is the complexity of genome data. The genome of an organism is the sum of all its genetic material. That genetic material is a biological macromolecule called deoxyribonucleic acid (DNA), built from four kinds of nucleotides joined in series and usually represented by the characters A, T, G and C. Broadly speaking, an organism's genetic code is a long linear chain of these four characters, and the chain is often very long. The human genetic code, for example, contains 3.2 billion characters. Printed out, it would form a "heavenly book" of more than 1 million pages with 3,000 characters per page. This "heavenly book" holds a great deal of information about the structure, function and life processes of the human body, yet it is written with only four letters and has no words, syntax or punctuation. Every page looks much the same, and how to read it is a major problem. Genome research is ultimately the process of turning biological problems into the analysis of digital symbols. Solving this problem requires new analytical theories, methods, techniques and tools, and it must rely on computer information processing.
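As a quick check of the scale involved, the page count can be worked out directly. The storage estimate is an added illustration that assumes a plain encoding of two bits per base, which is enough to distinguish four symbols.

```python
GENOME_CHARS = 3_200_000_000   # bases in the human genome, as quoted in the text
CHARS_PER_PAGE = 3_000         # characters per page, as quoted in the text

# Ceiling division: a partly filled last page still counts as a page.
pages = -(-GENOME_CHARS // CHARS_PER_PAGE)

# Assumption for illustration: 2 bits per base (A, T, G, C), no compression.
megabytes = GENOME_CHARS * 2 / 8 / 1_000_000

print(f"pages: {pages:,}")
print(f"storage at 2 bits per base: {megabytes:.0f} MB")
```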

Bioinformatics research rests on several foundations. First, it needs adequate computing power, including the corresponding software and hardware. It needs access to various databases, or effective links to international and domestic database systems, and a well-developed, stable Internet connection. Second, bioinformatics needs powerful and innovative algorithms and software; without algorithmic innovation, it cannot develop sustainably. Finally, it should establish broad and close links with experimental science, especially with automated, large-scale, high-throughput biological research methods and platform technologies. These technologies are both the main source of bioinformatics data and the key means of verifying bioinformatics results. People engaged in bioinformatics research must therefore have interdisciplinary knowledge.

The research and application of bioinformatics in China already has a certain foundation, so breakthrough results can be expected. This matters for strengthening China's basic research and for taking an internationally leading position in some areas. Applying bioinformatics results will also produce great social and economic benefits.

3 Main research contents of bioinformatics at present

3.1 Obtaining the complete genomes of humans and various organisms

The primary goal of genome research is to obtain the complete genetic code of human beings. The human genetic code has 3.2 billion bases, but current DNA sequencers can read only hundreds to thousands of bases per reaction. To obtain the entire human genetic code, the genome must therefore first be broken into fragments, the short sequences read, and the reads then reassembled.

However, it is easy to see that if a book is torn into pieces of equal size, the pieces can never be put back in the right order, because the context of the book is lost in the process. What can be done? We can take two identical copies of the book and tear them in different ways. By cross-referencing fragments and finding the words they share, the context of the book can be partly restored. The more copies are torn up, the more of the context can be recovered. To obtain a complete human genetic code, therefore, it is not enough to read the 3.2 billion bases once; they usually have to be read many times over. For example, the draft human genome reported in Nature and Science at the beginning of 2001 contains about 2.9 billion bases, with 96% physical map coverage and 94% sequence coverage. More than 90% of the contiguous sequences are longer than 100,000 bases, and about 25% are 10 million bases or longer. In these sequences, 30,000 to 40,000 protein-coding genes were found. Obtaining such a map is equivalent to sequencing the human genome about five times. To do so, tens of millions of small fragments have to be joined by comparison, a process usually called the assembly of genome sequence data.
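The torn-book analogy can be sketched as a toy greedy overlap assembler. This is a minimal illustration, not the strategy used by any genome center: it assumes error-free reads from one strand and no repeated sequence, and the names `overlap` and `greedy_assemble` and the example reads are made up for this sketch.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    unique = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop fragments that lie entirely inside another fragment.
    contigs = [r for r in unique if not any(r != s and r in s for s in unique)]
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Two copies of the same 18-base "book", torn in different places.
    copy_a = ["ATGCGT", "ACCTGA", "AGTCAC"]
    copy_b = ["ATGCGTACC", "TGAAGTCAC"]
    print(greedy_assemble(copy_a))           # one copy: pieces cannot be joined
    print(greedy_assemble(copy_a + copy_b))  # two copies: context is restored
```

With one copy the fragments share no overlap and stay separate; adding the second copy, torn differently, supplies the overlaps needed to rebuild the sequence. Repeats, which the text identifies as a central difficulty, would mislead this simple greedy rule.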

Every step of large-scale genome sequencing is closely tied to information analysis. From sampling and analysis of the sequencer's optical density signal, base calling, and vector identification and removal, through assembly and gap filling, to repeat identification, reading-frame prediction and gene annotation, every step depends on bioinformatics software and databases. Of these, sequence assembly and gap filling are the most critical and the first to be solved. The difficulty lies not only in the sheer volume of data but also in the presence of highly repetitive sequences. In this process it is therefore especially important to link experimental design with information analysis, and to develop appropriate algorithms and software for the requirements of each step. Many well-known genome research centers around the world have their own assembly strategies, all of which are run on supercomputers.

With a complete genome, human beings will understand themselves in more detail and more accurately. For example, only 1.1% of our genome actually encodes protein (the exons). The regions between exons (the introns) account for 24%, while the intergenic sequences account for 75%; in other words, the great majority of the human genome does not encode protein. Human protein-coding genes have been found to be more complex than those of other organisms, with richer patterns of splicing. Segmental duplication has been found to be very common in the genome, reflecting the complex evolutionary history of human beings. Human chromosome 13 has been found to be relatively stable, while chromosome 12 in males and chromosome 16 in females are variable, and so on.

3.2 Discovering new genes and new single nucleotide polymorphisms

The discovery of new genes is currently a hot topic in international genome research, and bioinformatics is an important means of finding them. For example, the complete genome of Saccharomyces cerevisiae contains about 6,000 genes, about 60% of which were identified by information analysis.

(1) Computer cloning of genes.

Using EST databases to discover new genes is also called computer cloning of genes. An EST (expressed sequence tag) is a short cDNA sequence from an expressed gene that carries the information of part of the complete gene. By October 2001, the EST database of GenBank held more than 3.8 million human EST sequences, covering about 90% of human genes.

As early as 1996, China began to search for new genes by computer cloning. The principle is simple: find all the EST fragments that belong to the same gene and join them together. Because EST sequences are generated at random by many laboratories around the world, the ESTs belonging to one gene inevitably share many short overlapping segments. Using these shared segments as markers, different ESTs can be joined until the full length is reached, at which point a gene can be said to have been found by computer cloning. If that gene was not previously known, a new gene has been discovered. The design of a computer cloning program is complicated, however, and the computation involved is enormous.
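The first step of this idea, grouping ESTs that share short segments, can be sketched as follows. This is a simplified illustration that assumes error-free sequences all read from the same strand; the function name `cluster_ests`, the k-mer length and the example sequences are assumptions made for the sketch, not part of any published pipeline.

```python
from collections import defaultdict


def cluster_ests(ests: list[str], k: int = 6) -> list[list[int]]:
    """Group ESTs (by index) that share at least one k-mer, directly or via a chain."""
    parent = list(range(len(ests)))  # union-find forest over EST indices

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen: dict[str, int] = {}  # k-mer -> index of first EST containing it
    for idx, seq in enumerate(ests):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # same gene: merge
            else:
                first_seen[kmer] = idx

    clusters: dict[int, list[int]] = defaultdict(list)
    for idx in range(len(ests)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    ests = [
        "ATGCGTACCTGA",  # shares ACCTGA with the next EST
        "ACCTGAAGTCAC",
        "GGGTTTCCCAAA",  # shares TTCCCA with the next EST
        "TTCCCAAAGGTT",
    ]
    print(cluster_ests(ests))
```

Each resulting cluster is a candidate gene; joining the members of a cluster into a full-length sequence is then an assembly problem. Real EST data contain sequencing errors and reads from both strands, which is part of why the text describes such programs as complicated and computationally heavy.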

(2) Predicting new genes from genomic DNA sequences.

Predicting new genes from genome sequence essentially means distinguishing the regions that encode protein from those that do not. The theoretical approach is to find the mathematical and physical characteristics that differ between coding and non-coding regions. New genes can also be found by comparing these sequences against databases of known genes.
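One of the simplest signals that separates candidate coding regions from the rest is an open reading frame (ORF): a start codon followed in the same frame by a stop codon. The sketch below scans only the forward strand and ignores introns, so it is a stand-in for illustration rather than the statistical methods the text alludes to; `find_orfs` and its parameters are names chosen here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAACCCGGGTAACC"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```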

The discovery of new genes will deepen our understanding of life. According to a report in Nature on December 2, 1999, 679 genes had been identified from the sequence of human chromosome 22, of which 55% were previously unknown. Thirty-five diseases are related to mutations on this chromosome, including immune system diseases, congenital heart disease and schizophrenia. Even so, integrating all human genes, their proteins and the related functions completely and correctly into a single index remains an important and arduous task. The International Human Genome Collaboration Group is working to build a complete "comprehensive gene index" and a related "comprehensive protein index".

(3) Discovering single nucleotide polymorphisms.

Some people smoke and drink yet live long, while others are sickly from childhood; the same anticancer drug works very well for some patients and not at all for others. Why? The answer lies in differences between their genomes. Many of these differences are single-base variations, that is, single nucleotide polymorphisms (SNPs).

The study of SNPs is generally regarded as an important step toward applying the Human Genome Project. SNPs provide a powerful tool for identifying high-risk groups, finding disease-related genes, designing and testing drugs, and basic biological research. SNPs are widely distributed in the genome; recent research shows that one occurs about every 300 base pairs in the human genome. The large number of SNP loci offers the chance to find genomic mutations associated with various diseases, including tumors. Experimentally, it is easier to find disease-related mutations through SNPs than through family studies. Some SNPs do not directly cause a disease gene to be expressed, but because they lie close to disease genes they serve as important markers. SNPs also play a major role in basic research. In recent years, analysis of Y-chromosome SNPs has produced a series of important results on human evolution and on the evolution and migration of human populations.
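At its simplest, finding single-base differences means comparing two aligned sequences position by position. The sketch below assumes the two sequences are already aligned and of equal length; a difference between two individuals is only a candidate, since calling it a polymorphism requires population data. The function name `find_snps` and the example sequences are illustrative.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List single-base differences as (1-based position, reference base, sample base).

    Both sequences must already be aligned and of equal length; positions
    holding anything other than A, C, G or T (for example gaps) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    return [
        (pos, r, s)
        for pos, (r, s) in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if r != s and r in bases and s in bases
    ]


if __name__ == "__main__":
    ref = "ATGCGTACCT"
    person = "ATGCATACCT"
    print(find_snps(ref, person))
```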

3.3 Studying the structure and function of non-protein-coding regions of the genome

Recent studies show that in bacteria and other microorganisms, non-protein-coding regions account for only 10% to 20% of the whole genome sequence. As organisms have evolved, non-coding regions have grown, and in higher organisms and the human genome non-coding sequences make up the vast majority of the genome. This suggests that these non-coding sequences must have important biological functions. It is generally believed that they are involved in regulating gene expression.

For the human genome, only the regions of DNA that encode protein (the genes) are really understood so far, and the latest data show that they account for only 1.1% of the genome. Research on this 1.1% coding fraction has already produced dozens of Nobel Prize winners, so the non-coding regions, some 98% of the genome, can be expected to yield considerable results as well. Finding the coding characteristics of these regions and the rules by which they regulate and express information will therefore be a hot topic for a long time to come, and a source of important results.

3.4 Studying biological evolution at the genome level

In recent years, with the massive increase in genome sequence data, the debate over the relationship between sequence differences and evolution has grown increasingly intense. First, phylogenetic trees reconstructed from different molecular sequences of the same group of species may differ. At the same time, the relationship between "vertical evolution" and "horizontal evolution" has attracted growing attention. The latter refers to the "lateral transfer" of genes discovered in recent years: genes can move between coexisting populations, which produces sequence differences that have nothing to do with vertical descent. Analysis of the human genome has even found dozens of human genes that are similar only to bacterial genes and have no counterparts in fruit flies or nematodes. Using these human gene sequences to study evolution would lead to absurd conclusions. In current studies of molecular evolution, molecules that have evolved vertically must therefore be chosen as samples. It is especially important in molecular evolution analysis to recognize that "similarity" and "homology" are two different concepts. Similarity simply describes how alike two sequences are and carries no evolutionary meaning. Homology refers to similarity that results from descent from a common ancestor.
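The point that different molecules can suggest different relationships can be illustrated with a plain similarity measure. The sketch below computes the fraction of differing sites (the p-distance) between aligned sequences and reports the most similar pair for each gene. The species labels and sequences are invented toy data, and the measure captures similarity only; it says nothing by itself about homology.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment: dict[str, str]) -> tuple[str, str]:
    """Return the two labels whose sequences are most similar."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]),
    )


if __name__ == "__main__":
    # Toy data: the same three species, two different genes.
    gene1 = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AATTTTAA"}
    gene2 = {"A": "GGGGGGGG", "B": "GGCCCCGG", "C": "GGGGGGGC"}
    print("gene 1 groups:", closest_pair(gene1))
    print("gene 2 groups:", closest_pair(gene2))
```

In this toy case gene 1 pairs A with B while gene 2 pairs A with C, the kind of conflict that arises when one of the genes has been laterally transferred.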

3.5 Comparative study of whole genomes

In the post-genome era, more and more whole-genome data are available. With these data, important biological questions can be analyzed: Where did life originate? How did life evolve? How did the genetic code originate? How many genes does the smallest independent living organism need? How do these genes bring an organism to life? And so on. These questions can only be answered at the genome level. For example, the mouse and human genomes are about the same size, each containing about 3 billion base pairs, with a similar number of genes, most of them homologous. Yet mice and humans are very different. Why? Similarly, some scientists estimate that the genomes of different human races differ by only 0.1%, and those of humans and apes by about 1%, yet the phenotypic differences are striking. These differences should therefore be attributed not only to genes and DNA sequence but also to differences in the organization of the whole genome and of chromosomes. This line of work gave rise to comparative genomics.

Scientists have found that all genes can be divided into a number of categories by function and phylogeny, including genes for replication, transcription, translation, molecular chaperones, energy production, ion transport and various kinds of metabolism. This work also offers a new approach to classifying proteins. By comparing several complete genomes, scientists have also calculated that sustaining life requires a minimum of about 250 genes. Likewise, a comparison of the mouse and human genomes shows that although genome size and gene number are similar, the organization of the two genomes is very different. For example, the genes found on mouse chromosome 1 are distributed across human chromosomes 1, 2, 5, 6, 8, 13 and 18. Studies have shown that differences in the order of certain ribosomal protein genes can reflect the relatedness of species: the closer the relationship, the more similar the gene order. The phylogenetic relationships between species can thus be studied by comparing gene order.
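One simple way to compare gene order is to count how many neighboring gene pairs two genomes have in common. The sketch below is an assumed illustration of that idea: it ignores gene orientation and treats each genome as a single linear list, and the gene names and orders are invented.

```python
def adjacencies(order: list[str]) -> set[frozenset[str]]:
    """Set of unordered neighboring gene pairs in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def shared_adjacency_fraction(a: list[str], b: list[str]) -> float:
    """Shared adjacencies divided by all adjacencies seen in either order."""
    adj_a, adj_b = adjacencies(a), adjacencies(b)
    union = adj_a | adj_b
    return len(adj_a & adj_b) / len(union) if union else 1.0


if __name__ == "__main__":
    species_x = ["g1", "g2", "g3", "g4", "g5"]
    species_y = ["g1", "g2", "g3", "g5", "g4"]  # one small rearrangement
    species_z = ["g3", "g1", "g5", "g2", "g4"]  # heavily rearranged
    print(shared_adjacency_fraction(species_x, species_y))
    print(shared_adjacency_fraction(species_x, species_z))
```

Under this measure a higher fraction means a more similar gene order, which the text relates to a closer relationship between species.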

Large-scale sequencing and analysis of microbial genomes has been carried out in China since 1998. The organisms covered so far include a thermophilic eubacterium and a thermophilic archaeon independently isolated in China, Shigella flexneri, the Lai strain of icterohaemorrhagic Leptospira, Staphylococcus epidermidis and Xanthomonas Chrysanthemi. Chinese scientists have also completed the sequencing of 1% of the human genome and recently completed a "working draft" of the rice genome, with 430 million base pairs. These data will provide the most direct material for China's research in this field.

3.6 From functional genomics to systems biology

The number of genes expressed varies greatly between tissues. The brain expresses the most, with about 30,000 to 40,000 transcripts, while some tissues express only dozens or hundreds of genes. At different stages of an individual's growth and development, the types and numbers of genes expressed in the same tissue also differ. Some genes are expressed in childhood, some in middle age and some in old age. We need to know not only the sequence of a gene but also its function, that is, its expression profile in different tissues at different times. This is what is commonly called functional genomics.

To obtain gene expression profiles, new technologies have been developed at both the nucleic acid and the protein level: gene chip (DNA chip) technology at the nucleic acid level, and large-scale protein separation and sequence identification, known as proteomics, at the protein level. Because sample spots are packed so densely, a single chip can carry hundreds of thousands of them. Mining expression profile data and discovering knowledge in it has become the key to success in this research. The development of biochip and proteomics technology depends increasingly on the theory, techniques and databases of bioinformatics. In the next step, functional genomics will move toward complex systems, exploring the interactions among the parts and levels of biological systems, and so enter the field of systems biology.
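A very basic form of expression data mining is to compare the signal for each gene between two conditions and flag the genes that change most. The sketch below uses the log2 fold change with a pseudocount; the gene names, intensities, pseudocount and threshold are all made-up values for illustration, and real chip analysis also involves normalization and statistical testing that are omitted here.

```python
import math


def log2_fold_changes(control: dict[str, float], treated: dict[str, float],
                      pseudocount: float = 1.0) -> dict[str, float]:
    """log2(treated / control) per gene, with a pseudocount to avoid dividing by zero."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def changed_genes(fold_changes: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Genes whose expression at least doubled or halved (for threshold = 1)."""
    return sorted(g for g, v in fold_changes.items() if abs(v) >= threshold)


if __name__ == "__main__":
    # Hypothetical spot intensities from two chips.
    control = {"geneA": 99, "geneB": 49, "geneC": 9}
    treated = {"geneA": 399, "geneB": 49, "geneC": 4}
    lfc = log2_fold_changes(control, treated)
    print(lfc)
    print(changed_genes(lfc))
```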

3.7 Protein structure simulation and drug design

Protein structure simulation and drug design has a history of 20 to 30 years. With the rapid progress of human genome research, this field faces a new situation: the base sequences of the 30,000 to 40,000 human genes will soon be known, and the amino acid sequences of their expression products will follow. Predicting the spatial structure of these proteins, and then carrying out targeted drug design, becomes an urgent task. This is also a large-scale computational problem.

3.8 Research on the application and development of bioinformatics

The results of bioinformatics research not only have important theoretical value but can also be applied directly to industrial and agricultural production and to medical practice. The analysis algorithms, application software and databases of bioinformatics therefore have considerable economic value; they will eventually become commercial products and deliver economic and social benefits.

(1) Disease-related gene information and related algorithm and software development.

Many diseases are related to gene mutations or gene polymorphisms. It is estimated that about 1,000 proto-oncogenes and 100 tumor suppressor genes are related to cancer. More than 6,000 human diseases are related to genetic changes of various kinds. Still more diseases result from the interaction between the environment (including pathogenic microorganisms) and human genes (gene products). As the Human Genome Project advances and we learn the chromosomal positions of all human genes, their sequence characteristics (including SNPs), their expression patterns and the properties of their products (RNA and protein), it will become possible to determine the molecular mechanisms of many diseases and to devise corresponding methods of diagnosis and treatment. Two bioinformatics tasks are therefore very important: one is to build databases of human gene information related to disease (including SNP databases); the other is to develop bioinformatics algorithms for analyzing genotyping data effectively, especially methods for calculating the association between SNP data and diseases or pathogenic factors.
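A standard starting point for testing whether a SNP allele is associated with a disease is the Pearson chi-square test on a 2x2 table of allele counts in cases and controls. The text does not name a method, so this choice is an assumption; the counts in the example are invented, and the sketch omits the continuity correction and the multiple-testing adjustment a genome-wide study would need.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    Rows: cases, controls. Columns: variant allele, reference allele.
    No continuity correction is applied.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical allele counts: cases 60 variant / 40 reference,
    # controls 40 variant / 60 reference.
    stat = chi_square_2x2(60, 40, 40, 60)
    # 3.841 is the 5% critical value of chi-square with 1 degree of freedom.
    print(f"chi-square = {stat:.2f}, significant at 5%: {stat > 3.841}")
```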

(2) Establish genome database related to animal and plant breeding, and develop molecular marker-assisted breeding technology.

Using the evolutionary distance between species and the homology of functional genes, it becomes easier to find the genes behind economically important traits of livestock and cash crops, and to understand the pathways and mechanisms of their development, growth and stress resistance. On this basis, the relevant genomic molecular markers can be used to speed up breeding and to modify varieties according to human needs.

(3) Developing drug design software and molecular biology technology based on biological information.

Human genome information provides new candidate molecules and new candidate drug target genes for drug development. At the same time, the design of the expression vectors, PCR primers and hybridization probes commonly used in molecular biology, and of various kits (including DNA chips), depends on nucleic acid sequence information. The wealth of information supplied by genome informatics gives this kind of technology broad scope for development.
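As a small example of how primer design depends on sequence information, the sketch below computes three quantities commonly checked for a candidate PCR primer: GC content, the reverse complement, and a rough melting temperature from the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius. The Wallace rule is a rule of thumb for short oligonucleotides only, and the primer sequence is invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees Celsius: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCGTACCTGAAGTCAC"  # hypothetical 18-base primer
    print("GC content:", gc_content(primer))
    print("Tm (Wallace rule):", wallace_tm(primer))
    print("reverse complement:", reverse_complement(primer))
```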