1. Learn the basics of Python and implement the basic crawling workflow.
Obtaining data generally follows three steps: send a request, receive the page response, then parse and store the data. This process simply simulates what we do when browsing pages by hand.
There are many crawler-related packages in Python: urllib, requests, bs4, scrapy, pyspider, etc. With them we can connect to a website, fetch the page returned for a request, and parse it with XPath to conveniently extract the data we need, as in the sketch below.
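Here is a minimal sketch of the three steps using requests and lxml; the URL and the XPath expression are placeholders and would need to be adapted to the real target page.

```python
import requests
from lxml import etree

# Hypothetical target page; replace with the site you actually want to crawl
url = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}  # present ourselves as a normal browser

# 1) send a request
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

# 2) receive the page response and parse it with XPath
tree = etree.HTML(response.text)
titles = tree.xpath("//h2/a/text()")  # placeholder XPath, adjust to the real page

# 3) store the data (here we just print it)
for title in titles:
    print(title.strip())
```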
2. Understand how to store unstructured data
The data a crawler collects is often loosely structured, so a traditional relational database is not always a good fit. MongoDB is recommended in the early stage.
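A minimal sketch with pymongo, assuming a local MongoDB instance; the database and collection names are placeholders.

```python
from pymongo import MongoClient

# Hypothetical local MongoDB instance, database, and collection
client = MongoClient("mongodb://localhost:27017/")
collection = client["crawler_db"]["articles"]

# Each document can have different fields, which suits irregular crawled data
collection.insert_one({
    "title": "Example title",
    "url": "https://example.com/post/1",
    "tags": ["python", "crawler"],
})

print(collection.count_documents({}))
```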
3. Master some common techniques for dealing with anti-crawler measures.
A proxy IP pool, packet capture, and OCR for verification codes (CAPTCHAs) can get around the anti-crawler strategies of most websites.
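As a small illustration of the proxy-pool idea, the sketch below rotates requests through a list of proxies; the proxy addresses are made-up placeholders and would normally come from a paid or self-built proxy service.

```python
import random
import requests

# Hypothetical proxy pool; in practice these come from a proxy provider or your own pool service
PROXY_POOL = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]

def fetch(url):
    # Pick a random proxy for each request to spread traffic across IPs
    proxy = random.choice(PROXY_POOL)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )

resp = fetch("https://example.com")
print(resp.status_code)
```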
4. Understand distributed crawling
Distributed crawling sounds intimidating, but it simply extends the idea of multithreading so that multiple crawlers work at the same time. The three tools to master are Scrapy, MongoDB, and Redis; a minimal configuration sketch follows.
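The sketch below shows how these pieces commonly fit together, assuming the scrapy-redis extension is installed (`pip install scrapy-redis`); the project name, pipeline class, and database names are hypothetical placeholders.

```python
# settings.py -- a minimal distributed-crawling sketch with scrapy-redis

# Let Redis hold the shared request queue so several crawler processes,
# possibly on different machines, pull URLs from the same place.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True          # keep the queue between runs
REDIS_URL = "redis://localhost:6379"

# Store the scraped items in MongoDB through a custom pipeline
# (MongoPipeline is a hypothetical pipeline class you would write yourself).
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "crawler_db"
```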