1. What is NLP?
NLP, short for natural language processing, is the field that enables computers to understand, analyze, and generate natural language. The general research process is: develop a model that can represent language ability; propose methods to keep improving the language model's ability; design application systems based on the language model; and continuously improve the language model.
NLP understands natural language in two ways:
1. Rule-based natural language understanding: formulate a series of rules and design a program around them, then solve natural language problems with that program. The input is rules and the output is a program;
2. Natural language understanding based on statistical machine learning: train a model on a large amount of data with a machine learning algorithm, then solve natural language problems with that model. The input is data plus expected results, and the output is a model. A toy contrast of the two approaches is sketched below.
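To make the contrast concrete, here is a minimal Python sketch on a hypothetical spam-detection task; the keyword rule, the training sentences, and the labels are all invented for illustration, and scikit-learn's Naive Bayes stands in as just one possible statistical algorithm:

```python
# A toy contrast between rule-based and statistics-based NLP.
# Assumption: a made-up spam-detection task with invented data.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Rule-based: the input is a hand-written rule, the output is a program.
def rule_based_is_spam(text: str) -> bool:
    return re.search(r"free money|click here", text, re.IGNORECASE) is not None

# Statistics-based: the input is data plus expected results, the output is a model.
texts = ["free money now", "click here to win", "meeting at noon", "see you at lunch"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

print(rule_based_is_spam("Click here for FREE MONEY"))          # True
print(model.predict(vectorizer.transform(["win free money"])))  # e.g. [1]
```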
Next, let's briefly go through the common tasks and applications of NLP.
2. What can NLP do?
1. Word segmentation
Chinese text can be divided into characters, words, phrases, sentences, paragraphs, and documents. A single character often cannot express a meaning by itself; at least a word is usually needed. Generally speaking, then, words are the basic units that make up phrases, sentences, paragraphs, and documents. For a computer, the problem is that Chinese does not separate words with spaces the way English does, so it cannot tell which words a text contains; the text must first be segmented into words. There are currently two common word segmentation approaches:
(1) Rule-based methods: heuristics, keyword tables
(2) Machine learning / statistical methods: HMM (hidden Markov model) and CRF (conditional random field)
(Note: the principles and implementation details of these methods are not covered here; interested readers can look them up themselves.)
Word segmentation technology is now quite mature, and segmentation accuracy has reached a usable level. There are also many third-party libraries available, such as jieba, so in practice we usually segment words with the "jieba + custom dictionary" approach, sketched below.
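A minimal sketch of this approach, assuming jieba is installed; the custom word and dictionary file below are made up for illustration:

```python
# Minimal jieba segmentation sketch; "user_dict.txt" is a hypothetical file
# with one domain-specific word per line.
import jieba

print(jieba.lcut("我喜欢你"))  # ['我', '喜欢', '你']

# Load a custom dictionary for domain words jieba's default dictionary misses:
# jieba.load_userdict("user_dict.txt")

# Single words can also be added at runtime:
jieba.add_word("产品经理大会")  # a made-up proper noun
print(jieba.lcut("我去参加产品经理大会"))  # the custom word stays in one piece
```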
2. Word encoding
Suppose the text 我喜欢你 ("I like you") has been segmented into three words: 我, 喜欢, 你. The computer cannot take these three words as input directly, so we have to convert them into a form it can understand; this is word encoding. Nowadays words are generally represented as word vectors, which serve as the input and representation space for machine learning. There are currently two kinds of representation spaces:
(1) Discrete representation:
A. One-hot representation
Suppose our corpus is:
我喜欢你，你对我有感觉吗 ("I like you. Do you have feelings for me?")
After segmentation, the dictionary is {我: 1, 喜欢: 2, 你: 3, 对: 4, 有: 5, 感觉: 6, 吗: 7}, seven dimensions in total.
Then the One-hot representations are:
我: [1, 0, 0, 0, 0, 0, 0]
喜欢: [0, 1, 0, 0, 0, 0, 0]
吗: [0, 0, 0, 0, 0, 0, 1]
That is, each word occupies one dimension of its own.
B. Bag of words: the vectors of all the words in a document are summed to give the document's vector.
So 我喜欢你 ("I like you") is represented as [1, 1, 1, 0, 0, 0, 0].
C. Bi-gram and N-gram (language models): to take word order into account, combinations of adjacent words are used as the units of the vector.
The idea behind these three approaches is that different words (or word combinations) occupy different dimensions: each "unit" (a word, a combination of words, and so on) is one dimension. A minimal sketch of all three follows.
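Here is a minimal sketch of these discrete representations over the seven-word dictionary above (indices shifted to start at 0, as is usual in Python):

```python
# One-hot, bag-of-words and bi-gram sketches over the dictionary above.
vocab = {"我": 0, "喜欢": 1, "你": 2, "对": 3, "有": 4, "感觉": 5, "吗": 6}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(words):
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab[w]] += 1  # sum the one-hot vectors of all the words
    return vec

print(one_hot("我"))                       # [1, 0, 0, 0, 0, 0, 0]
print(bag_of_words(["我", "喜欢", "你"]))   # [1, 1, 1, 0, 0, 0, 0]

# Bi-grams keep word order by treating adjacent word pairs as units:
words = ["我", "喜欢", "你"]
print(list(zip(words, words[1:])))         # [('我', '喜欢'), ('喜欢', '你')]
```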
(2) Distributed representation: word2vec, which represents words with vectors derived from a co-occurrence matrix. The idea behind it is that "a word can be represented by the words around it".
Discrete and distributed representation spaces each have their advantages and disadvantages; interested readers can look up the details, which I won't go into here. One problem is worth noting: the larger the corpus, the more words it contains and the higher the dimension of the word vectors, so storage and computation grow rapidly. Engineers therefore usually reduce the dimensionality of word vectors, which inevitably loses some information and thus affects the final result. So as a product manager following project development, you also need to understand whether an engineer's dimensionality reduction is reasonable. A minimal word2vec sketch follows.
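As a sketch of training a distributed representation, here is word2vec via the gensim library (assumed installed); the three-sentence corpus is far too small to produce meaningful vectors and only shows the API, and vector_size is exactly the dimension trade-off discussed above:

```python
# A minimal word2vec sketch with gensim; the toy corpus is invented.
from gensim.models import Word2Vec

sentences = [
    ["我", "喜欢", "你"],
    ["你", "对", "我", "有", "感觉", "吗"],
    ["我", "对", "你", "有", "感觉"],
]

# vector_size is the dimension the engineer chooses: larger keeps more
# information but costs more storage and computation.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["喜欢"][:5])                 # first 5 dimensions of one word vector
print(model.wv.most_similar("我", topn=2))  # nearest words in this toy space
```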
3. Automatic summarization
Automatic summarization means automatically extracting the key text or knowledge from an original document. Why do we need it? Two main reasons: (1) information overload, so we need to extract the most useful and valuable text from a large volume of material; (2) manual summarization is very expensive. There are currently two approaches: the first is extractive, which selects key sentences from the original text and assembles them into a summary; the second is abstractive, where the computer first understands the content of the original text and then re-expresses it in its own words. Automatic summarization is most widely used in the news field: in an era of information overload, it helps users learn the most valuable news in the shortest time. Beyond that, extracting structured knowledge from unstructured data will also be a major direction for question-answering bots. A toy extractive sketch follows.
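To illustrate the extractive idea only, here is a toy summarizer that scores sentences by word frequency; the example document is invented, and real systems (TextRank and the like) are considerably more sophisticated:

```python
# Toy extractive summarization: keep the sentences whose words are most frequent.
from collections import Counter

def extractive_summary(sentences, top_n=1):
    freq = Counter(w for s in sentences for w in s.split())  # document-wide counts
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.split()),
                    reverse=True)                            # highest-scoring first
    return scored[:top_n]

doc = [
    "NLP lets computers understand natural language",
    "word segmentation is a basic NLP task",
    "the weather is nice today",
]
print(extractive_summary(doc))  # picks an NLP-related sentence as the "summary"
```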
4. Entity recognition
Entity recognition means identifying entities of specific types in a text, such as person names, place names, numeric values, and proper nouns. It is widely used in information retrieval, automatic question answering, knowledge graphs, and other fields. The purpose of entity recognition is to tell the computer that a word belongs to a certain class of entity, which helps identify the user's intent. Take Baidu's knowledge graph as an example:
The entity of "How Old is Stephen Chow" is "Stephen Chow" (star entity), and the relationship is "age". The search system can know that the user is asking the age of a star, and then combine the data "Stephen Chow? Date of birth? 1June 22, 962 "and the current date, and display the results directly to the user instead of displaying the links of the candidate answers.
In addition, NLP's common tasks include topic recognition, machine translation, text classification, text generation, sentiment analysis, keyword extraction, text similarity and so on. I will give you a brief introduction later.
3. What are the current difficulties of NLP?
1. Language is not standardized and is highly flexible.
Natural language is not standardized. Although some basic rules can be found, natural language is too flexible: the same meaning can be expressed in many different ways. This makes rule-based understanding difficult, and it is also hard for machine learning to capture the inherent features of the data.
2. Typos
When processing text we encounter many typos. Getting the computer to understand the true meaning behind these typos is also a major difficulty for NLP.
3. New words
We live in an era of rapid Internet development, and large numbers of new words appear online every day. Discovering these new words quickly and making them understandable to computers is another difficulty for NLP.
4. There are still shortcomings in using word vectors to represent words.
We said above that computers can understand words through word vectors, but the word vector space is discrete, not continuous. Consider a sequence of positive words such as "good", "very good", "great", "excellent": in the word vector space you cannot find the words that lie between "good" and "very good"; there is no continuous path from one to the other. There are algorithms that compute continuous approximations between word vectors, but this necessarily comes with a loss of information. In short, word vectors are not the best way to represent words, and a better mathematical language is needed. Of course, it may also be that natural language itself is discontinuous, or that humans simply cannot create a "continuous" natural language.
Summary: Through the above, we now have a general understanding of what NLP is, what it can do, and what problems remain. As an AI product manager, understanding NLP technology improves our technical literacy, which helps greatly in understanding industry needs and advancing project development. In effect, it gives us the ability to connect: to connect requirements with engineers, and problems with solutions. Although AI technologies such as NLP still have many shortcomings, we need to keep a level mindset: AI applications have only just begun and are bound to be imperfect. Don't be a critic; be a promoter of the AI era.