What the model can say also depends on the selected corpus. If you already have the original chat data, you can use SQL to query conversations by keyword, that is, pick a small library out of a large library for training. Judging from the literature, many of the algorithmic gains happen at the data-preprocessing level. For example, the Mechanism-Aware Neural Machine for Dialogue Response Generation extracts small libraries from a large library and then fuses them to train distinctive dialogue styles.
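As a minimal sketch of that keyword filtering, assuming the raw chat logs sit in a SQLite table named conversations with question and answer columns (a hypothetical schema, not any project's actual one):

import sqlite3

conn = sqlite3.connect('chatlogs.db')  # hypothetical database file
# The keyword filter picks a small library out of the large one.
cursor = conn.execute(
    "SELECT question, answer FROM conversations WHERE question LIKE ?",
    ('%travel%',))
subcorpus = cursor.fetchall()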
For English, you need to know NLTK, which provides functions such as corpus loading, corpus normalization, corpus classification, part-of-speech tagging and semantic extraction.
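For instance, a minimal NLTK sketch covering tokenization and part-of-speech tagging (the download calls fetch the required models on first use):

import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("NLTK provides corpus loading, tagging and more.")
print(nltk.pos_tag(tokens))  # part-of-speech tags, e.g. [('NLTK', 'NNP'), ...]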
Another powerful toolkit is CoreNLP, an open-source tool from Stanford, which offers entity tagging and semantic extraction and supports multiple languages.
The following mainly covers two topics:
Chinese word segmentation
At present there are many Chinese word segmentation SDKs and algorithms, and many articles compare their performance. Sample code for Chinese word segmentation follows.
# coding: utf8
'''
Segmenter with Chinese
'''

import jieba
import langid


def segment_chinese_sentence(sentence):
    '''
    Return the segmented sentence.
    '''
    seg_list = jieba.cut(sentence, cut_all=False)
    seg_sentence = u" ".join(seg_list)
    return seg_sentence.strip().encode("utf8")


def process_sentence(sentence):
    '''
    Only segment Chinese sentences; leave others unchanged.
    '''
    if langid.classify(sentence)[0] == 'zh':
        return segment_chinese_sentence(sentence)
    return sentence

In the preprocessed training data, each pair of word-ID lists forms one question and answer.
Start training
cp config.sample.ini config.ini  # modify the keys
python deepqa2/train.py
config.ini is the configuration file and is modified according to config.sample.ini. The training time depends on the number of epochs, the learning rate, maxlength and the number of dialogue pairs.
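For illustration, reading those hyperparameters from config.ini could look like the following; the section and key names here are hypothetical, the real ones are defined in config.sample.ini:

import configparser  # ConfigParser on Python 2

cfg = configparser.ConfigParser()
cfg.read('config.ini')

# Hypothetical keys: use the names actually present in config.sample.ini.
num_epochs = cfg.getint('hyparams', 'num_epochs')
learning_rate = cfg.getfloat('hyparams', 'learning_rate')
maxlength = cfg.getint('hyparams', 'maxlength')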
deepqa2/train.py is about 100 lines: it loads the data dictionary, initializes TensorFlow's session, saver and writer, initializes the neural model, iterates over the epochs and saves the model to disk.
The session holds the network graph, which consists of placeholders, variables, cells, layers and outputs.
The saver saves the model and can also be used to restore it. A model is a session in which the variables have been instantiated.
The writer is a collector for the loss function and any other data the developer is interested in; the results are saved and can then be viewed with TensorBoard.
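A minimal sketch of that bookkeeping, assuming the TensorFlow 0.12-era API (not the project's exact code):

import tensorflow as tf

global_step = tf.Variable(0, name='global_step', trainable=False)

sess = tf.Session()                                       # runs the network graph
saver = tf.train.Saver()                                  # checkpoints and restores variables
writer = tf.summary.FileWriter('save/logs', sess.graph)   # summaries for TensorBoard

sess.run(tf.global_variables_initializer())
# ... training loop goes here ...
saver.save(sess, 'save/model.ckpt', global_step=global_step)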
Model
Inputs, state, softmax and outputs need to be considered when building the model.
Define the loss function and iterate with AdamOptimizer.
Finally, refer to the training loop.
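A rough sketch of those pieces under the old tf.nn.seq2seq API (the sizes are illustrative and this is not DeepQA2's actual model code):

import tensorflow as tf

vocab_size, embedding_size, hidden_size, max_length = 40000, 64, 256, 10

cell = tf.nn.rnn_cell.GRUCell(hidden_size)
encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(max_length)]
decoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(max_length)]
targets = decoder_inputs[1:] + [tf.zeros_like(decoder_inputs[0])]
weights = [tf.ones_like(t, dtype=tf.float32) for t in targets]

# Embedding seq2seq with a softmax over the vocabulary at each output step.
outputs, state = tf.nn.seq2seq.embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=vocab_size, num_decoder_symbols=vocab_size,
    embedding_size=embedding_size)

loss = tf.nn.seq2seq.sequence_loss(outputs, targets, weights)  # cross-entropy loss
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)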
Each training run stores the model under the save path; the folder name is generated from the machine's hostname and a timestamp.
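An illustrative reconstruction of how such a folder name can be built (not the project's exact code):

import os
import socket
import time

# e.g. save/deeplearning.cobra.vulcan.20170127.175256
save_dir = os.path.join('save', '%s.%s' % (socket.gethostname(), time.strftime('%Y%m%d.%H%M%S')))
os.makedirs(save_dir)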
Provide services
TensorFlow provides a standard serving module, TensorFlow Serving. But after studying it for a long time (I even picked up Essential C++ for it and still have not finished), I gave up; the community generally complains that TensorFlow Serving is not easy to learn or use. After training, use the following script to start the service. The serve part of DeepQA2 still calls TensorFlow's Python API.
cd DeepQA2/save/deeplearning.cobra.vulcan.20170127.175256/deepqa2/serve
cp db.sample.sqlite3 db.sqlite3
python manage.py runserver 0.0.0.0:8000
Test
POST /api/v1/question HTTP/1.1
Host: 127.0.0.1:8000
Content-Type: application/json
Authorization: Basic YWRtaW46cGFzc3dvcmQxMjM=
Cache-Control: no-cache

{"message": "Glad to know you"}
Response

{
  "rc": 0,
  "message": "Hello"
}
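The same request can be sent from Python; the credentials below are simply the decoded Basic auth header from the example above:

import requests

resp = requests.post(
    'http://127.0.0.1:8000/api/v1/question',
    json={'message': 'Glad to know you'},
    auth=('admin', 'password123'))  # same as 'Basic YWRtaW46cGFzc3dvcmQxMjM='
print(resp.json())  # e.g. {"rc": 0, "message": "Hello"}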
The core code of serve is in serve/api/chatbotmanager.py.
Use scripts
scripts/start_training.sh      # start training
scripts/start_tensorboard.sh   # start TensorBoard
scripts/start_serving.sh       # start the service
Model evaluation
At present the code is quite maintainable, which was the point of refactoring it from the DeepQA project: data preprocessing, training and serving are now clearly separated. New models can be added under deepqa2/models and then wired up in train.py and chatbotmanager.py.
What needs to be improved?
A. Create a new model, rnn2.py, that uses dropout; dropout is already used in DeepQA.
B. TensorFlow rc0.12.x already provides the seq2seq network; the code could be updated to that tf version.
C. Integrated training. At present the model has only one library; a new model should be designed that supports large and small libraries with different weights, as introduced in Mechanism-Aware Neural Machine for Dialogue Response Generation.
D. Make the code support multi-machine, multi-GPU training.
E. At present the training data consists of single QA pairs, but one question can have multiple answers.
F. There is currently no way to measure accuracy. One idea is to supply distractors during training, since at present only correct answers are provided; if wrong answers are also provided (the more the better), the recall_at_k method can be used for evaluation (see the sketch after this list).
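For item F, a minimal sketch of recall_at_k over ranked candidate answers (illustrative only; in practice the candidates would be ranked by the model's scores):

def recall_at_k(ranked_candidates, correct_answer, k):
    '''Return 1.0 if the correct answer is among the top-k ranked candidates.'''
    return 1.0 if correct_answer in ranked_candidates[:k] else 0.0

# One true answer mixed with distractors, sorted by model score (best first).
candidates = ['hello', 'goodbye', 'no idea']
print(recall_at_k(candidates, 'hello', k=1))    # 1.0
print(recall_at_k(candidates, 'goodbye', k=1))  # 0.0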
I hope what I have learned about chatbots will be useful to you.