How to get around the request-frequency limit on a Douban crawler
Disguise the crawler with cookies and a browser User-Agent. With cookies set, Douban does not return 403 as long as you keep to a steady pace. Go too fast and it redirects to a CAPTCHA instead. You can simply binarize the CAPTCHA image, send it to a public OCR API, and then correct the result against English words (Douban CAPTCHAs are basically English words); the automatic recognition rate is generally above 30%. Find the highest request rate that this pacing tolerates, then crawl steadily at that rate. If that is not enough, spread the crawl across several proxy IPs.

A few months ago this is basically how I crawled Douban. First, roughly estimate the order of magnitude of the pages you need to crawl. At something like one page per second, crawling slowly for a few days is often enough to meet the demand. Only if that is still too slow do you need to turn to proxies.
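The approach above can be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not the exact code used back then: the header values, the helper names (`estimate_crawl_days`, `Throttle`, `fetch`, `correct_ocr_word`), the one-second default pace, and the 0.6 similarity cutoff are all assumptions chosen for the example. Binarizing the image and calling the OCR service are left out, since the answer does not say which API was used.

```python
import difflib
import time
import urllib.request

# Hypothetical values: substitute a real browser User-Agent string and
# the cookie from your own logged-in browser session.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Cookie": "name=PLACEHOLDER",
}


def estimate_crawl_days(page_count, seconds_per_page=1.0):
    """Days needed to fetch page_count pages one after another at a fixed pace."""
    return page_count * seconds_per_page / 86400


class Throttle:
    """Enforces a minimum interval between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle):
    """Fetch one page with the disguised headers, paced by the throttle."""
    throttle.wait()
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def correct_ocr_word(raw, vocabulary, cutoff=0.6):
    """Snap a noisy OCR result to the closest known English word, or None."""
    matches = difflib.get_close_matches(raw.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, `estimate_crawl_days(200_000)` gives a little over two days at one page per second, which is the kind of rough calculation worth doing before reaching for proxies. In `correct_ocr_word`, the `vocabulary` argument would be whatever English word list you have on hand; the function returns `None` when nothing is close enough, which you can treat as a failed recognition and retry.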