The main function of this layer of algorithms is to mine the basic resources needed for Weibo recommendation, solve common technical problems in recommendation, and complete the data analysis that guides the recommendation business.
The commonly used algorithms and techniques in this part are as follows:
Word Segmentation and Core Word Extraction
This is the basis of Weibo's content recommendation: it transforms Weibo content into structured vectors, and covers word segmentation, word tagging, core word/entity word extraction, semantic dependency analysis, and so on.
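As a rough illustration of core word extraction (not Weibo's in-house tooling, and assuming text has already been segmented into tokens), a minimal TF-IDF scorer can pick out the words that best characterize one post against a background corpus:

```python
import math
from collections import Counter

def extract_core_words(doc_tokens, corpus, top_k=3):
    """Score each token in `doc_tokens` by TF-IDF against `corpus`
    (a list of token lists) and return the top_k highest-scoring words.
    A toy sketch: real systems add POS filtering, entity lexicons, etc."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)      # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1       # smoothed IDF
        scores[word] = (count / len(doc_tokens)) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Words that appear in every document (low IDF) are pushed down, so topical words surface even in very short posts.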
Classification and anti-spam
Used to analyze content recommendation candidates in Weibo, including content classification and identification of marketing/advertising and pornographic posts.
Content classification is implemented with a decision tree model over a *** 3-level category system with 148 categories; a mixed Naive Bayes and maximum entropy model is used to identify marketing/advertising and pornographic posts.
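To make the Bayesian half of that mixed model concrete (the maximum entropy part is omitted, and the training data here are invented), a minimal multinomial Naive Bayes spam classifier over bag-of-words features looks like this:

```python
import math
from collections import Counter

class NaiveBayesSpam:
    """Multinomial Naive Bayes with Laplace smoothing - a toy sketch of
    the Bayesian component of a marketing/spam identification model."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for c, n_c in self.class_counts.items():
            lp = math.log(n_c / total)                     # log prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in tokens:                               # log likelihoods
                lp += math.log((self.word_counts[c][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

In production such a model would be trained on large labeled corpora and combined with the maximum entropy model's score rather than used alone.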
Clustering technology
It is mainly used for hot-topic mining and provides resources for content-related recommendation. Word Vector Topic, a clustering technique developed in-house at Weibo, is designed around the content characteristics and propagation patterns of Weibo.
Communication model and user influence analysis
This covers research on Weibo's propagation model and analysis of users' network influence (including depth of influence, breadth of influence, and in-domain influence).
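As a simple illustration of the breadth/depth distinction (the graph and metric definitions here are illustrative, not Weibo's actual formulas), breadth can be read as how many users a post eventually reaches and depth as the longest repost chain, both computable with one BFS over a repost graph:

```python
from collections import deque

def influence_metrics(reposts, root):
    """Toy influence analysis on a repost graph.
    `reposts` maps a user to the users who reposted from them.
    Returns (breadth, depth): users reached, and longest repost chain."""
    seen = {root: 0}            # user -> distance from the original poster
    q = deque([root])
    while q:
        u = q.popleft()
        for v in reposts.get(u, []):
            if v not in seen:
                seen[v] = seen[u] + 1
                q.append(v)
    breadth = len(seen) - 1     # everyone reached except the root
    depth = max(seen.values())  # longest chain of reposts
    return breadth, depth
```

In-domain influence would restrict the same computation to users within one interest domain.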
Main recommendation algorithms
1. Graph-based recommendation algorithm
Weibo has the following characteristic: users contribute content and spread it through social relationships, which leads to explosive diffusion of information. We call our approach a graph-based recommendation algorithm, rather than the memory-based algorithms common in industry, mainly because:
Our recommendation algorithms are designed on top of the social network: the core idea is to start from the social graph, incorporate the information-diffusion model, and use all kinds of data together to give users the best recommendations. In many cases we are only one key link in the diffusion chain: if we apply the necessary recommendation intervention and alter the diffusion path, the subsequent spread continues naturally along the original network.
Feed stream recommendation (which we call trends) is our most important product, and its results must incorporate user relationships.
From a macro point of view, our goal is to build a higher-value user relationship network, promote the rapid spread of high-quality information, and improve feed quality. The important work here is key-node mining, content recommendation for key nodes, and user recommendation.
The algorithms in this part are summarized in the following table:
(Table omitted.)
The difficulty here lies in how to quantify and select the "edges" of the graph, how to compute composite scores over multiple "edges" and "nodes", and how to integrate these with the results of network mining.
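A minimal sketch of that composite scoring, assuming hypothetical edge signals and weights (the signal names and weights below are illustrative, not Weibo's actual ones):

```python
def edge_score(edges, weights):
    """Combine several "edge" signals between two users (e.g. follow,
    interaction, shared interests) into one weighted composite score."""
    return sum(weights.get(name, 0.0) * value for name, value in edges.items())

def rank_candidates(user_edges, weights, top_k=2):
    """Rank candidate nodes for one user by composite edge score.
    `user_edges` maps candidate -> {signal_name: signal_value}."""
    scored = [(cand, edge_score(e, weights)) for cand, e in user_edges.items()]
    return [c for c, _ in sorted(scored, key=lambda kv: -kv[1])[:top_k]]
```

The real difficulty the text describes, of course, is choosing and calibrating those weights from network-mining results rather than hard-coding them.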
The research and development of this part of the algorithm has produced the following data products:
(Figure omitted.)
2. Content-based recommendation algorithm
Content-based recommendation is the most commonly used and most basic recommendation algorithm in Weibo; its main technical work lies in structured content analysis and association computation over candidate sets.
It is where content-based methods are most widely applied, so we take it as an example and describe it briefly.
(Figure omitted.)
Many aspects of content analysis have been described above; two points deserve emphasis here:
Content quality analysis mainly combines Weibo exposure return with content information amount/readability. Exposure return measures content quality through aggregate user behavior. The information amount is computed simply, by iterating over the IDF of the post's keywords. For readability, we built a small classification model: a news corpus (good readability) and a colloquial corpus (poor readability) serve as training samples, and word-collocation features are extracted to estimate the probability that a new post reads well.
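A minimal sketch of how those three signals might be combined into one quality score; the weights, the normalization, and the exact combination below are assumptions for illustration, not Weibo's actual formula:

```python
import math

def content_quality(exposures, engagements, keyword_idfs, p_readable,
                    w_engage=0.5, w_info=0.3, w_read=0.2):
    """Toy composite quality score for a post:
    - exposure return: engagement per exposure (group-behavior signal)
    - information amount: sum of keyword IDFs, squashed into [0, 1)
    - readability: probability output by a readability classifier
    All weights are illustrative assumptions."""
    engage = engagements / exposures if exposures else 0.0
    info = 1.0 - math.exp(-sum(keyword_idfs))   # more/rarer keywords -> closer to 1
    return w_engage * engage + w_info * info + w_read * p_readable
```

The point of the sketch is only the structure: behavior, information amount, and readability enter as separate terms that can be reweighted independently.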
The effect of word expansion and content-based recommendation depends on the depth of content analysis. Weibo posts are short, so few key features can be extracted; in association computation the data are sparse, and it is difficult to balance recall and precision. We introduced word2vec to improve word expansion, then clustered words on top of it, which improved recall and precision simultaneously.
The technical points of relevance computation are vectorization and distance measurement. We usually use one of two methods: "tf*idf weighting + cosine distance" or "topic probabilities + KL divergence".
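Both distance measures are short to write down. A minimal sketch (sparse tf*idf vectors as dicts, topic distributions as lists; the smoothing epsilon is an implementation convenience, not part of the definition):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts
    (e.g. tf*idf-weighted term vectors). Higher means more similar."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kld(p, q, eps=1e-10):
    """KL divergence between two topic distributions (lists summing to 1).
    Smaller means more similar; note it is asymmetric: kld(p,q) != kld(q,p)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

In practice the symmetric variant (averaging both directions, or Jensen-Shannon divergence) is often preferred when KLD is used as a distance.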
3. Model-based recommendation algorithm
As the largest social media product in China, Weibo has a huge number of users and a huge volume of information. This poses two challenges to recommendation:
Source fusion and sorting
The variety of candidates means we have more choices, so our recommendation results are generated in two stages: a first pass by the individual recommendation algorithms, then selection by source fusion and ranking. To obtain more objective and accurate rankings, we introduce machine learning models that learn the patterns hidden behind aggregate user behavior.
Dynamic Content Classification and Semantic Relevance
Weibo's UGC content production model, together with the rapid spread and update of information, makes the old approach of manually labeling samples and training static classification models obsolete. We need a good clustering model to aggregate recent information into categories, and then establish semantic associations to complete recommendation.
Model-based algorithms solve the above problems. Here are our two most important machine learning tasks:
3.1 CTR/RPM (relationship conversion rate per thousand recommendations) prediction model. The basic algorithm used is logistic regression. Below is the overall architecture of our CTR prediction model:
(Figure omitted.)
This work includes sample selection, data cleaning, feature extraction and selection, model training, and online prediction and ranking. It is worth emphasizing that data cleaning and noise removal before model training matter greatly: data quality is the upper bound on algorithm effect, and we have been burned here before.
Logistic regression is a binary classification probability model.
(Figure omitted.)
The goal of the optimization is to maximize the product of the correct-classification probabilities over the samples (i.e. the likelihood); we use the vowpal_wabbit machine learning platform, developed at Yahoo, to solve for the model weights.
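Maximizing that product is the same as maximizing the sum of log-probabilities, which leads to a simple gradient-ascent update. A minimal pure-Python sketch of training and prediction (not vowpal_wabbit; learning rate and epochs are arbitrary toy choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(samples, n_features, lr=0.1, epochs=200):
    """Maximize the log-likelihood sum(log p(y|x)) by per-sample
    gradient ascent. `samples` is a list of (feature_vector, label)
    pairs with label in {0, 1}."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = y - p                                   # d(log-likelihood)/d(logit)
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
            b += lr * g
    return w, b

def predict_ctr(w, b, x):
    """Predicted click probability for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

A production trainer adds regularization, feature hashing, and online updates; the optimization objective, however, is exactly the likelihood described above.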
3.2 LFM (latent factor model): LDA and matrix factorization (SVD++, SVD-feature models).
LDA was a key project starting in early 2014, and it has since produced good results and been applied in online recommendation. LDA itself is an elegant and rigorous mathematical model. Below is an example of an LDA topic, for reference only:
(Figure omitted.)
As for matrix factorization, we tried it in 2013; the results were not particularly satisfactory, so we did not continue to invest in it.
The latent factor model is the single model with the highest recommendation accuracy; its difficulty is that computational efficiency becomes a bottleneck at large data scale. We have done some work here, and colleagues will introduce it in the future.
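For reference, the basic mechanics behind the SVD-style models mentioned above can be sketched in a few lines: learn low-dimensional user and item factors whose dot product approximates observed ratings. This is plain SGD matrix factorization, not SVD++ (which additionally models implicit feedback and biases); all hyperparameters are toy values:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=1000, seed=0):
    """Toy SGD matrix factorization: approximate r(u, i) ~ dot(P[u], Q[i]).
    `ratings` is a list of (user_index, item_index, rating) triples."""
    rnd = random.Random(seed)
    P = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred                        # prediction error on this cell
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)   # regularized SGD step
                Q[i][f] += lr * (e * pu - reg * qi)
    return P, Q
```

The efficiency bottleneck the text mentions is visible even here: the inner loop touches every observed rating every epoch, which is why large-scale deployments need parallel or online variants.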
Hybrid technology
Two heads are better than one: each method has its limitations, and combining their strengths is an extremely effective approach. Weibo's recommendation system mainly uses the following hybrid techniques:
Time series mixing:
That is, different recommendation algorithms are used at different stages of the recommendation process. Taking recommendation on a post page as an example: early in a post's exposure, results are generated by content-based methods plus CTR prediction; once enough trusted user click behavior has accumulated, results come from user-based collaborative filtering, as shown in the following figure:
(Figure omitted.)
In this way, content-based methods solve the cold-start problem, user-based CF is given full play, and a 1+1 > 2 effect is achieved.
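The switching logic itself is trivial once a "trusted click count" is defined; a minimal sketch (the threshold and the source functions are illustrative placeholders):

```python
def hybrid_recommend(item_clicks, content_based, user_cf, click_threshold=100):
    """Time-series hybrid: before an item accumulates enough trusted
    clicks, serve the content-based + CTR source; afterwards switch to
    user-based collaborative filtering. Threshold is an assumption."""
    if item_clicks < click_threshold:
        return content_based()   # cold-start phase
    return user_cf()             # enough behavior data collected
```

The interesting engineering questions sit outside this function: deciding when clicks count as "trusted", and making the handover invisible to users.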
Hierarchical model mixing:
In many cases a single model cannot achieve the desired effect, but layered combination often works better. Hierarchical model mixing means "taking the output of an upper model as a feature of a lower model, training the models together to complete the recommendation task". For example, in CTR prediction ranking for the right side of the Weibo homepage, we use a hierarchical logistic regression model to handle naturally missing features, differences in sample size, and effect deviations caused by exposure position across different products.
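The data flow of that layering can be sketched as two chained logistic models, where the upper model's probability becomes one input feature of the lower model (all weights and features below are hypothetical placeholders, not the homepage model's actual ones):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def upper_model(raw_features, w_upper):
    """Upper-layer model: scores a sample from its own (e.g. per-product)
    feature set, which other products may be missing entirely."""
    return sigmoid(sum(w * x for w, x in zip(w_upper, raw_features)))

def lower_model(upper_score, shared_features, w_lower):
    """Lower-layer model: consumes the upper model's output as one input
    feature alongside features shared across all products."""
    inputs = [upper_score] + shared_features
    return sigmoid(sum(w * x for w, x in zip(w_lower, inputs)))
```

Because the lower model only sees the upper model's probability, products with different raw feature sets or sample sizes can each have their own upper model while sharing one final ranking layer.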