What are the small technical points of query analysis of search engines?
Hello:

Data analysis of query

A query is the set of conditions a user enters into a search engine. In a general search engine, this usually means the keywords typed in. In industry-specific or vertical search engines, users can also choose categories, such as "movies" and "TV series" on Youku. On e-commerce websites, product brands, models, styles and prices are also common query conditions.

Word segmentation is an indispensable tool for analyzing the content of every term in a query. Word segmentation algorithms range from the simplest forward and backward maximum matching algorithms to more complex hidden Markov models (HMM) and conditional random field (CRF) models. A CRF is a machine learning method for sequence labeling. The key to a statistical word segmentation algorithm is obtaining a large enough corpus with accurate labels; sufficient training data is the basic condition for the model to succeed.
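As an illustration of the simplest end of that range, here is a minimal sketch of forward maximum matching. The function name, the toy dictionary and the unspaced English input are assumptions made for the example; a real segmenter would use a large dictionary and Chinese text.

```python
def forward_max_match(text, dictionary, max_len=7):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical). "we" and "be" are included to show that
# the longer words "weather" and "beijing" win.
TOY_DICT = {"weather", "beijing", "we", "be"}
print(forward_max_match("weatherbeijing", TOY_DICT))
```

Backward maximum matching works the same way but scans from the end of the string; comparing the two outputs is a common cheap way to spot ambiguous segmentations.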

Sort the queries by PV (page views) from high to low, then plot them with the query rank on the abscissa and the PV of each query on the ordinate. The resulting curve shows clearly that the PV distribution of queries is a long-tail distribution.

Query characteristics of each search engine

The queries of each search engine have their own characteristics, and it is necessary to design your own algorithms and products around them. For example, Baidu receives many queries of the form "how to get from A to B" and "how to ××". I believe Baidu studied these queries before launching products such as Tieba ("Post Bar"), Zhidao ("Know") and Baike ("Encyclopedia"). Queries on general search engines and on e-commerce websites are bound to differ a great deal. On Joyo and Dangdang, for example, a large share of queries must be book titles. On e-commerce websites in general, many queries take the form category + attribute. The challenge is to combine the input conditions, analyze the user's intention accurately, and ensure both the recall and the precision of the search results.

The 80/20 rule: query and cache

We found that the most popular 20% of queries account for 80% of PV traffic. If the analysis and ranking problems of this 20% of queries are solved, most of the traffic is taken care of.
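The share of traffic carried by the head of the distribution is easy to check against a query log. The sketch below assumes the log has already been aggregated into a mapping from query to PV; the function name and the numbers are made up for illustration.

```python
def top_share(pv_by_query, top_fraction=0.2):
    """Return the fraction of total PV contributed by the most popular
    `top_fraction` of distinct queries."""
    pvs = sorted(pv_by_query.values(), reverse=True)
    total = sum(pvs)
    if total == 0:
        return 0.0
    top_n = max(1, int(len(pvs) * top_fraction))
    return sum(pvs[:top_n]) / total


# Hypothetical aggregated log: 10 distinct queries, 1000 PV in total.
toy_pv = {
    "q1": 500, "q2": 300, "q3": 50, "q4": 40, "q5": 30,
    "q6": 30, "q7": 20, "q8": 15, "q9": 10, "q10": 5,
}
print(top_share(toy_pv))  # share of PV from the top 2 of 10 queries
```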

For this 20% of queries, we can optimize the index structure of the search engine and try to return the information users need directly. In the query analysis module, we can cache the results of word segmentation, part-of-speech tagging and query classification. In short, using memory efficiently as a cache greatly improves performance.
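One simple way to keep analysis results for popular queries in memory is an LRU cache in front of the analysis function. The sketch below uses Python's standard functools.lru_cache; analyze_query and its toy logic are hypothetical stand-ins for real segmentation and classification.

```python
from functools import lru_cache

# Counts how many times the expensive analysis actually runs.
CALLS = {"count": 0}


@lru_cache(maxsize=10000)
def analyze_query(query):
    """Stand-in for an expensive analysis step (segmentation,
    part-of-speech tagging, classification)."""
    CALLS["count"] += 1
    terms = tuple(query.lower().split())
    category = "weather" if "weather" in terms else "general"
    return terms, category


analyze_query("Beijing weather")  # computed and stored
analyze_query("Beijing weather")  # served from the cache
print(CALLS["count"], analyze_query.cache_info())
```

Because popular queries repeat so often, even a modest cache size can absorb a large share of the analysis work; the right size depends on the actual PV distribution.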

Query Classification and "Box Computing"

Query classification is a problem that general search engines must now solve. Entering "×× city weather" on Baidu or Google displays weather icons, temperatures and so on. Entering "PetroChina" directly displays PetroChina's share price; entering "flight" directly brings up a form for selecting the departure and arrival points. This is what Baidu calls "box computing": the analysis is completed in the search box itself, and the user is taken straight to the specific application.

How to classify?

Suppose the search engine has already classified its web pages. For each query, count the categories of the pages that users clicked, and rank the categories by their probability from high to low; this ranked list is the classification of the query. However, this method only works for queries with enough clicks.
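The click-based method can be sketched as follows. The log format (pairs of query and clicked URL), the page-to-category mapping and the min_clicks threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def classify_queries_by_clicks(click_log, page_category, min_clicks=3):
    """For each query, rank the categories of clicked pages by their share
    of clicks. Queries with fewer than `min_clicks` clicks are skipped."""
    counts = defaultdict(Counter)
    for query, url in click_log:
        category = page_category.get(url)
        if category is not None:
            counts[query][category] += 1

    result = {}
    for query, counter in counts.items():
        total = sum(counter.values())
        if total < min_clicks:
            continue  # not enough evidence for this query
        result[query] = [(c, n / total) for c, n in counter.most_common()]
    return result


# Hypothetical data.
toy_pages = {
    "a.example/iphone": "electronics",
    "a.example/mac": "electronics",
    "b.example/fruit": "food",
}
toy_clicks = [
    ("apple", "a.example/iphone"),
    ("apple", "a.example/iphone"),
    ("apple", "b.example/fruit"),
    ("apple", "a.example/mac"),
    ("rare query", "b.example/fruit"),
]
print(classify_queries_by_clicks(toy_clicks, toy_pages))
```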

Another way is to use a Bayesian method to infer, from the page classification, which categories each query is likely to belong to.
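The text does not say which Bayesian method is meant. One plausible reading is a naive Bayes classifier trained on the terms of already-classified pages and then applied to the terms of a query. The sketch below follows that assumption, with add-one smoothing and made-up training data.

```python
import math
from collections import Counter, defaultdict


def train(pages):
    """pages: iterable of (category, list_of_terms) from classified pages."""
    cat_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for category, terms in pages:
        cat_docs[category] += 1
        term_counts[category].update(terms)
        vocab.update(terms)
    return cat_docs, term_counts, vocab


def classify(query_terms, model):
    """Return (category, probability) pairs, most likely first, using
    P(c | q) proportional to P(c) * product of P(t | c)."""
    cat_docs, term_counts, vocab = model
    total_docs = sum(cat_docs.values())
    log_scores = {}
    for category, n_docs in cat_docs.items():
        score = math.log(n_docs / total_docs)
        denom = sum(term_counts[category].values()) + len(vocab)
        for term in query_terms:
            # Add-one smoothing so unseen terms do not zero the product.
            score += math.log((term_counts[category][term] + 1) / denom)
        log_scores[category] = score
    # Normalise the log scores into probabilities.
    best = max(log_scores.values())
    exp_scores = {c: math.exp(s - best) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return sorted(((c, v / z) for c, v in exp_scores.items()),
                  key=lambda cv: -cv[1])


# Hypothetical classified pages.
toy_model = train([
    ("weather", ["beijing", "weather", "forecast", "rain"]),
    ("weather", ["shanghai", "weather", "temperature"]),
    ("finance", ["petrochina", "share", "price"]),
    ("finance", ["stock", "price", "market"]),
])
print(classify(["shanghai", "weather"], toy_model))
```

Unlike the click-based count, this approach can assign categories to rare queries that have few or no clicks, as long as their terms appear in classified pages.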

Query navigation

Query classification is in fact a basic prerequisite for navigation. Navigation can really begin only once you have an accurate classification of the query and an accurate understanding of the part of speech of each term in it.

On e-commerce websites such as Amazon and JD.COM, precise navigation is essential.

Accurate navigation is only the first step. A further requirement is to surface relevant popular recommendations or personalized recommendations in the navigation, based on the user's input.

In Taobao product search, when a user enters keywords, the corresponding categories and attributes are suggested automatically. Popular categories and attributes are displayed first, while relatively unpopular ones are folded away, which makes the most of the limited display space on the page.
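The show-popular, fold-the-rest behavior can be sketched in a few lines. This is not Taobao's actual implementation; the function name, the popularity counts and the cut-off of three visible entries are assumptions for illustration.

```python
def split_for_display(facet_counts, visible=3):
    """Rank categories or attributes by popularity; return the ones to
    show up front and the ones to fold behind a "more" control."""
    ranked = sorted(facet_counts.items(), key=lambda kv: kv[1], reverse=True)
    shown = [name for name, _ in ranked[:visible]]
    folded = [name for name, _ in ranked[visible:]]
    return shown, folded


# Hypothetical click counts per category for one query.
toy_facets = {"phones": 900, "phone cases": 400, "chargers": 150,
              "screen film": 40, "stickers": 5}
print(split_for_display(toy_facets))
```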

Query suggestion

Query and personalization

Personalization inevitably involves collecting user data. The user's age, gender, preferences and so on are inferred from the user's behavior or settings. For example, the same search for "coffee shop" may return very different results in Beijing and in Shanghai.

This analysis data comes from each user's behavior log in the search engine. The search engine analyzes each user's searches and clicks, and stores the results in a distributed key-value database.

User behavior is not only useful for the individual users themselves. Large volumes of user behavior logs are widely used for data mining in recommendation systems. For example, the book recommendations shown to users on Dangdang and Joyo are derived from the purchase and browsing records of a large number of users. Recommendation systems have developed from common association rule analysis to various complex graph relationship analysis algorithms.
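The simplest form of association rule analysis on purchase records is counting which items appear together in the same order and ranking them by confidence. The sketch below assumes each order is a list of item identifiers; the names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(orders):
    """Count how often each unordered pair of items shares an order."""
    pair_counts = Counter()
    for items in orders:
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def also_bought(item, orders, top_n=3):
    """Rank other items by confidence of the rule item -> other, i.e.
    (orders containing both) / (orders containing item)."""
    item_orders = sum(1 for items in orders if item in items)
    scores = []
    for (a, b), n in co_purchase_counts(orders).items():
        if item in (a, b):
            other = b if a == item else a
            scores.append((other, n / item_orders))
    return sorted(scores, key=lambda s: (-s[1], s[0]))[:top_n]


# Hypothetical purchase records.
toy_orders = [
    ["book_a", "book_b"],
    ["book_a", "book_b", "book_c"],
    ["book_a", "book_b"],
    ["book_c", "book_d"],
]
print(also_bought("book_a", toy_orders))
```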

For more information, please refer to:

/subview/ 10083/ 1467006 1 . htm