How to get started with a Python crawler
"Getting started" is a fine motivation, but it works slowly. If you have a project in hand, or even just in mind, practice will keep you goal-driven instead of letting you plod through material module by module.

Besides, if every knowledge point in a body of knowledge is a node in a graph and every dependency is an edge, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B. So you don't need to learn how to "get started", because no such "starting point" exists! What you need to learn is how to build something reasonably large, and along the way you will quickly pick up whatever you need to know. Of course, you could argue that you need to know Python first, otherwise how could you use Python to write a crawler? But in fact you can learn Python while building the crawler :D. Many earlier answers cover the "technique" (which tools to crawl with), so let me cover both the "way" and the "technique": how a crawler works and how to implement one in Python.

Long story short, to summarize, you need to learn:

The basic working principle of a crawler

A basic distributed queue such as nvie/rq, and its combination with Scrapy: darkrho/scrapy-redis (GitHub)

Post-processing: page content extraction (grangier/python-goose, GitHub) and storage (MongoDB)

What follows is the long version: the story of crawling all of Douban with a cluster.

1) First, you need to understand how a crawler works.

Imagine you are a spider that has just been dropped onto the Internet, and you need to look at every web page. What do you do? No problem, just start somewhere - say, the home page of People's Daily. Call this the initial page and denote it by $.

On the home page of People's Daily you can see all the links that page points to, so you happily crawl over to the "domestic news" page. Great, you have now crawled two pages (the home page and domestic news)! For the moment, never mind how the crawled pages are processed; just imagine that you copy each page in full into an HTML file and carry it with you.

Suddenly you find a link back to the "home page" on the domestic news page. As a clever spider, you know you don't have to crawl back, because you have already seen it. So you need to remember the addresses of the pages you have already seen; every time you see a new link you might need to crawl, you first check whether you have already visited that address. If you have, don't go.

Well, in theory, if every page can be reached from the initial page, then it can be shown that you will eventually crawl every page.

So how do you implement this in Python? It's simple:

import Queue                      # Python 2 module; in Python 3 this is the lowercase queue module

initial_page = "http://www.renminribao.com"

url_queue = Queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # go on until the seas run dry and the rocks crumble
    if url_queue.qsize() > 0:
        current_url = url_queue.get()               # take the first url out of the queue
        store(current_url)                          # store the page this url points to
        for next_url in extract_urls(current_url):  # extract the urls this page links to
            if next_url not in seen:                # only enqueue urls we have not seen before
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break


It's already pseudocode.
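If you want something you can actually run, here is a minimal concrete version of the same idea in Python 3. Treat it as a sketch under assumptions: link extraction is a naive regex, there is no robots.txt handling or politeness delay, the People's Daily URL is just a placeholder, and "storing" a page only means printing its size.

import re
import queue
from urllib.request import urlopen

initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = {initial_page}
url_queue.put(initial_page)

pages_crawled = 0
while not url_queue.empty() and pages_crawled < 50:    # cap it so the demo stops politely
    current_url = url_queue.get()
    try:
        html = urlopen(current_url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        continue                                        # skip pages that fail to download
    pages_crawled += 1
    print(current_url, len(html))                       # "store" the page: here we just report its size
    for next_url in re.findall(r'href="(http[^"]+)"', html):
        if next_url not in seen:                        # the dedup check from the pseudocode above
            seen.add(next_url)
            url_queue.put(next_url)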

The backbone of every crawler is right here. Now let's analyze why a crawler is in fact a rather complicated thing: a search engine company usually has an entire team maintaining and developing it.

2) Efficiency

If you tidy up the code above a little and run it directly, it would take you a whole year to crawl all of Douban's content - never mind that a search engine like Google has to crawl the entire web.

Where is the problem? There are far too many pages to crawl, and the code above is far too slow. Suppose the whole web has N sites; then the deduplication cost alone is N*log(N), because every page has to be visited once and each membership check against the set costs log(N). OK, OK, I know Python's set is implemented as a hash table - but it is still too slow, or at least not memory-efficient.

So how is deduplication usually done? With a Bloom filter. Simply put, it is still a hashing approach, but its distinguishing feature is that it uses a fixed amount of memory (it does not grow with the number of URLs) to decide, in O(1) time, whether a URL is already in the set. Unfortunately there is no free lunch: its one problem is that if the URL is not in the set, the Bloom filter can say with 100% certainty that it has not been seen, but if the URL is in the set it will only tell you "this URL should have appeared already, though I'm about 2% uncertain." Note that this uncertainty becomes very small when you allocate enough memory. A simple tutorial: Bloom Filters by Example. Notice what this property means: if a URL has been seen, there is a small chance we will look at it again (which is fine - looking twice won't kill us); but if it has not been seen, it will definitely be visited (this is important, otherwise we would miss some pages!). [Important: there is something wrong with this paragraph, please skip it for now.]
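To make the idea concrete, here is a toy Bloom filter in plain Python. It is only a sketch: the bit-array size and hash count below are arbitrary, and a real crawler would use a tuned library. It uses fixed memory, answers "have I seen this URL?" in O(1), never gives a false negative, and gives false positives at a small rate.

import hashlib

class BloomFilter(object):
    def __init__(self, size_in_bits=2**24, num_hashes=5):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8)    # fixed memory, no matter how many urls go in

    def _positions(self, url):
        # derive num_hashes bit positions from salted md5 digests of the url
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + url).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True may occasionally be wrong (false positive); False is always right
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

bf = BloomFilter()
bf.add("http://www.renminribao.com")
print("http://www.renminribao.com" in bf)      # True
print("http://example.org/never-seen" in bf)   # almost certainly False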

OK, now we are close to the fastest way of handling deduplication. The next bottleneck: you only have one machine. No matter how big your bandwidth is, as long as the speed at which your machine downloads pages is the bottleneck, all you can do is speed that up. If one machine is not enough - use many! We assume, of course, that each machine is already running at maximum efficiency, i.e. using multithreading (which for Python effectively means multiple processes).
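Here is a minimal sketch of that "one machine at full speed" point, using a multiprocessing pool to download a batch of URLs in parallel. The batch contents and pool size are placeholders of mine, not part of the original design.

from multiprocessing import Pool
from urllib.request import urlopen

def fetch(url):
    # download a single page; return (url, html) so the caller can store it and extract links
    try:
        return url, urlopen(url, timeout=10).read()
    except Exception:
        return url, None

if __name__ == "__main__":
    batch = ["http://www.renminribao.com"]      # in a real crawler, a batch taken from url_queue
    with Pool(processes=8) as pool:             # 8 worker processes; tune to your machine
        for url, html in pool.imap_unordered(fetch, batch):
            print(url, "failed" if html is None else len(html))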

3) Crawling with a cluster

When crawling Douban, I used a total of 100-odd machines running around the clock for a month. Imagine doing it with a single machine - you'd have to run it for 100 months... So, suppose you now have 100 machines available: how do you implement a distributed crawling algorithm in Python?

Call 99 of the 100 machines, the ones with lower computing power, slaves, and the remaining more powerful machine the master. Now look back at the url_queue in the code above: if that queue can live on the master, then every slave can talk to the master over the network. Each time a slave finishes downloading a page, it asks the master for a new page to crawl; and each time a slave extracts a new link, it sends all the links on that page to the master's queue. Likewise, the Bloom filter also lives on the master, and the master only hands out URLs that have not yet been visited. The Bloom filter sits in the master's memory, while the visited URLs go into Redis running on the master, so every operation stays O(1) (amortized O(1), at least; see the Redis documentation for LINSERT for Redis access costs). Now consider how to implement this in Python:
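A minimal sketch of the master-side bookkeeping using redis-py follows. This is my own simplification, not the author's code: it uses a plain Redis set for the "seen" check where the paragraph above would keep a Bloom filter in the master's memory, and "master" as a hostname is just a placeholder.

import redis

r = redis.Redis(host="master", port=6379)       # Redis running on the master machine

def master_add(url):
    # called when a slave reports links it extracted
    if r.sadd("seen", url):                      # sadd returns 1 only the first time -> O(1) dedup
        r.lpush("url_queue", url)                # only unseen urls enter the shared queue

def master_get():
    # called when a slave asks for the next url to crawl; blocks until one is available
    _key, url = r.brpop("url_queue")
    return url.decode("utf-8")

master_add("http://www.renminribao.com")
print(master_get())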

Install Scrapy on every slave, so each machine becomes a slave capable of crawling, and install Redis and rq on the master to serve as the distributed queue.

The code is then written as:

#slave.py
current_url = request_from_master()                 # ask the master for a url to crawl
to_send = []
for next_url in extract_urls(current_url):          # extract the urls this page links to
    to_send.append(next_url)

store(current_url)                                  # store the page this url points to
send_to_master(to_send)                             # report the extracted links back to the master

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()

initial_pages = "http://www.renminribao.com"        # the People's Daily home page

while(True):
    if request == 'GET':
        if distributed_queue.size() > 0:
            send(distributed_queue.get())           # hand the next unvisited url to a slave
        else:
            break
    elif request == 'POST':
        bf.put(request.url)                         # record a link reported by a slave
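For the distributed queue itself, the rq library mentioned earlier can carry the same structure with very little code. The following is a rough sketch of mine, under assumptions: Redis runs on a host named "master", pages are simply written to disk on the slave, and deduplication again uses a plain Redis set rather than a Bloom filter.

# jobs.py - importable by both the master and every slave
import hashlib
import os
import re
from urllib.request import urlopen

from redis import Redis
from rq import Queue

redis_conn = Redis(host="master")
crawl_queue = Queue("crawl", connection=redis_conn)

def crawl(url):
    html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    os.makedirs("pages", exist_ok=True)
    name = hashlib.md5(url.encode("utf-8")).hexdigest()
    with open(os.path.join("pages", name + ".html"), "w") as f:   # "store" the page on the slave
        f.write(html)
    for next_url in re.findall(r'href="(http[^"]+)"', html):
        if redis_conn.sadd("seen", next_url):                     # unseen urls become new jobs
            crawl_queue.enqueue(crawl, next_url)

The master kicks things off with crawl_queue.enqueue(crawl, initial_page), and each slave simply runs an rq worker for the "crawl" queue pointed at the master's Redis.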


Well, as you can probably guess, someone has already written what you need: darkrho/scrapy-redis on GitHub.

4) Outlook and post-processing

Although the word "simple" appears a lot above, actually building a crawler of commercial scale is not an easy job. The code above is more or less fine for crawling an entire single website.

But if you also need these follow-up steps, such as:

Efficient storage (how should the database be designed?)

Efficient deduplication (I don't want to crawl both People's Daily and some "Great People's Daily" that plagiarizes it)

Efficient information extraction (for example, how to extract all the street addresses on a page, such as "China Road, Fenjin Road, Chaoyang District"); a search engine usually doesn't need to store all the information - why would I keep the images, for instance...

Timely updates (predicting how often a page will be updated)

As you can imagine, every single point here could occupy many researchers for years. Even so, "the road ahead is long and arduous, yet I will search high and low."