Unlike rule-based machine translation systems, which are built around a dictionary and a base of grammar rules, corpus-based machine translation systems take the corpus as their core resource and are built from segmented and annotated corpora. Corpus-based methods can be divided into statistical methods and example-based methods.

Statistical machine translation

The statistical machine translation method regards machine translation as a process of information transmission and uses a channel model to describe it. On this view, translating a source-language sentence into a target-language sentence is a probabilistic problem: any target-language sentence may be a translation of any source-language sentence, only with different probabilities, and the task of machine translation is to find the most probable one. Concretely, translation is treated as a decoding process that recovers the translation from the original text through a channel model. Statistical machine translation therefore decomposes into three problems: the modeling problem, the training problem, and the decoding problem. The modeling problem is to establish a probability model for machine translation, that is, to define how the translation probability from a source-language sentence to a target-language sentence is computed. The training problem is to estimate all the parameters of this model from a corpus. The decoding problem is to find the most probable translation for any input source-language sentence, given the model and its parameters.
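In the standard noisy-channel notation (a textbook formulation, not spelled out in the original: f is the source sentence, e a candidate target sentence), the three problems correspond to defining, estimating, and maximizing the following quantity, where P(f | e) is the translation model and P(e) the language model:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f)
        = \operatorname*{argmax}_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \operatorname*{argmax}_{e} P(f \mid e)\, P(e)
```

The denominator P(f) is constant for a given input, so decoding only needs to maximize the product of the two component models.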
In fact, the idea of solving machine translation by statistical methods was not new in the 1990s. W. Weaver had already proposed it in his 1949 memorandum on machine translation, but it was soon abandoned under criticism from N. Chomsky and others. The main objection was that language is infinite, so an empirical statistical description cannot meet the actual requirements of language.
In addition, the speed of computers at the time made statistical computation impractical. Computers have since improved greatly in speed and capacity, and small workstations or personal computers can now do what once required mainframes. Moreover, the successful application of statistical methods in speech recognition, character recognition, lexicography, and other fields has shown that this approach is highly effective in automatic language processing.
The mathematical models of the statistical machine translation method were put forward by researchers at International Business Machines Corporation (IBM). In the well-known article "The Mathematics of Statistical Machine Translation: Parameter Estimation", five word-to-word statistical models were proposed, known as IBM Model 1 through IBM Model 5. All five models derive from the source-channel model, and their parameters are estimated by maximum likelihood. Owing to the limited computing conditions of the time (1993), training on large-scale data could not be realized. Subsequently, the statistical model based on the hidden Markov model proposed by Stephan Vogel also received attention as a replacement for IBM Model 2. In these studies, the statistical models consider only linear relationships between words, not the structure of sentences, which may be ineffective when the word order of the two languages differs greatly. Taking syntactic or semantic structure into account in the language model and the translation model should yield better results.
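To make the word-to-word idea concrete, here is a minimal sketch of IBM Model 1 training with expectation-maximization. It follows the published model, but the corpus format, the NULL-word handling, and the fixed iteration count are simplifications for illustration, not the authors' implementation:

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """EM training for IBM Model 1.
    corpus: list of (source_tokens, target_tokens) sentence pairs.
    Returns t[(f, e)], an estimate of P(f | e) for word pairs."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)           # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)             # expected counts c(f, e)
        total = defaultdict(float)             # expected counts c(e)
        for fs, es in corpus:
            es = es + ["NULL"]                 # allow alignment to the empty word
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:                   # E-step: fractional counts
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():        # M-step: re-estimate probabilities
            t[(f, e)] = c / total[e]
    return dict(t)

# Toy usage: the model learns that "maison" aligns with "house".
pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]
probs = train_ibm_model1(pairs)
print(probs[("maison", "house")])
```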
Six years after the publication of that article, a group of researchers implemented the GIZA software package at the Johns Hopkins University machine translation summer workshop. Franz Josef Och then optimized the software to speed up training, especially for IBM Models 3 through 5, and at the same time proposed a more complex Model 6. The package Och released was named GIZA++, and to this day GIZA++ remains the cornerstone of most statistical machine translation systems. Several parallel versions of GIZA++ exist for large-scale corpus training.
However, the performance of word-based statistical machine translation is limited because the modeling unit is too small, so many researchers turned to phrase-based translation methods. Franz Josef Och's discriminative training method based on the maximum entropy model greatly improved the performance of statistical machine translation, and in the following years this approach stayed far ahead of the others. A year later, Och revised the optimization criterion of the maximum entropy method to optimize the objective evaluation metric directly, giving birth to minimum error rate training, which is still widely used today.
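In Och's log-linear (maximum entropy) formulation, a standard way to write the decision rule is as a weighted sum of feature functions h_m (for example, phrase translation and language model scores) with weights λ_m, which is precisely what minimum error rate training tunes:

```latex
\hat{e} = \operatorname*{argmax}_{e} \sum_{m=1}^{M} \lambda_m\, h_m(e, f)
```

The noisy-channel model is recovered as the special case with two equally weighted features, log P(f | e) and log P(e).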
Another important development that pushed statistical machine translation forward was the appearance of automatic, objective evaluation methods, which score translation results automatically and avoid tedious and expensive manual evaluation. The most important of these is the BLEU metric, and most researchers still use BLEU as the primary criterion for evaluating their results.
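As a concrete illustration, a minimal sentence-level BLEU computation (modified n-gram precision combined with a brevity penalty) might look like the following sketch; it follows the published formula but omits the smoothing that practical toolkits apply:

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """candidate: token list; references: list of token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        max_ref = Counter()                       # clip counts per n-gram
        for ref in references:
            ref_ngrams = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0                            # zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    c_len = len(candidate)                        # brevity penalty uses the
    r_len = min((abs(len(r) - c_len), len(r))     # closest reference length
                for r in references)[1]
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / max(c_len, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           ["the cat is on the mat".split()]))
```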
Moses is a well-maintained open-source machine translation system developed by researchers at the University of Edinburgh. Its release simplified what had previously been a complicated processing pipeline.
Google's online translation is well known, and the technology behind it is statistics-based machine translation. Its basic principle is to collect a large number of bilingual web pages as a corpus, have the computer automatically select the most frequent correspondences between words, and output the translation based on them. There is no denying that the technology Google uses is advanced, yet it still often produces "translation jokes". The reason is that statistical methods need large-scale bilingual corpora: the accuracy of the translation-model and language-model parameters depends directly on the amount of data, and translation quality depends mainly on the quality of the probability models and the coverage of the corpus. Although statistical methods do not rely on extensive hand-coded knowledge, since they resolve ambiguity and select translations directly from statistics and so sidestep many difficulties of language understanding, the selection and processing of the corpus is an enormous task. For this reason, general-domain machine translation systems have paid limited attention to purely statistical methods.

Example-based machine translation

Like the statistical method, example-based machine translation is a corpus-based method. Its basic idea was put forward by the well-known Japanese machine translation expert Makoto Nagao, who studied how beginners learn a foreign language and observed that they first memorize the most basic English sentences together with the corresponding Japanese sentences and then practice substitution. Modeling this learning process, he proposed example-based machine translation: translating without deep analysis, using only existing experience and knowledge, by the principle of analogy. The translation process is to decompose the source text into sentences and then into phrase fragments, translate those fragments into target-language phrases by analogy, and finally merge the phrases into long sentences. In an example-based system, the main source of knowledge is a bilingual example base; no dictionary or grammar rule base is needed. The core problem is to match the input against the bilingual example base by a maximal-similarity measure.
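A toy sketch of the retrieval step, the heart of the analogy idea: match the input against a bilingual example base with a string-similarity measure and reuse the stored translation. The similarity function here (difflib's ratio) and whole-sentence matching are illustrative stand-ins; real systems match at the fragment level with richer measures and then recombine the fragments:

```python
import difflib

def retrieve_example(source, example_base):
    """Return the (source, target) example most similar to the input,
    plus its similarity score.
    example_base: list of (source_sentence, target_sentence) pairs."""
    def similarity(ex):
        return difflib.SequenceMatcher(None, source, ex[0]).ratio()
    best = max(example_base, key=similarity)
    return best, similarity(best)

# Toy usage: the closest analogue is retrieved for adaptation.
examples = [("He plays tennis.", "Er spielt Tennis."),
            ("She reads a book.", "Sie liest ein Buch.")]
match, score = retrieve_example("He plays football.", examples)
print(match, score)
```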
Example-based machine translation works remarkably well on texts that are identical or similar to those in the example base, and its effectiveness grows as the example base grows. For texts already present in the example base, high-quality translations can be obtained directly; for texts very similar to stored examples, an approximate translation can be constructed through analogical reasoning and a few modifications to the stored translation.
The method was widely praised when first introduced, but problems soon appeared. Because it needs a large corpus as support, its actual demand for language data is enormous. Limited by corpus size, example-based machine translation finds it difficult to achieve a high matching rate, and the translation quality is acceptable only in narrow or specialized domains. To this day, few machine translation systems adopt a purely example-based method; instead, example-based translation is generally used as one of several translation engines to improve overall accuracy.