Seven tips for getting started quickly with Python web crawlers.
1. Basic webpage crawling
GET method:
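A minimal sketch of a GET request with Python 2's urllib2 (the module this article is built around); the URL is only a placeholder:

    import urllib2

    url = "http://example.com"  # placeholder URL
    response = urllib2.urlopen(url)
    print response.read()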
POST method:
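A matching POST sketch; the URL and the form fields are placeholders. Passing a data argument to urllib2.Request is what turns the request into a POST:

    import urllib
    import urllib2

    url = "http://example.com/login"            # placeholder URL
    form = {"name": "abc", "password": "1234"}  # placeholder form fields
    form_data = urllib.urlencode(form)

    # Supplying data makes urllib2 send a POST request
    request = urllib2.Request(url, form_data)
    response = urllib2.urlopen(request)
    print response.read()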
2. Using a proxy IP
While developing a crawler, your IP often gets blocked, so you need a proxy IP. The urllib2 package contains the ProxyHandler class, through which you can set a proxy and visit web pages, as in the following code snippet:
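A minimal sketch; the proxy address and target URL are placeholders:

    import urllib2

    # Placeholder proxy address; substitute a proxy that actually works
    proxy = urllib2.ProxyHandler({"http": "127.0.0.1:8087"})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    # All subsequent urlopen calls now go through the proxy
    response = urllib2.urlopen("http://example.com")
    print response.read()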
3. Cookie processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify users and track sessions. Python provides the cookielib module for processing cookies; its main job is to supply objects that can store cookies, so that it can cooperate with the urllib2 module to access Internet resources.
Code snippet:
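A minimal sketch; the URL is a placeholder:

    import urllib2
    import cookielib

    # CookieJar stores cookies in memory; HTTPCookieProcessor wires it into urllib2
    cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
    opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    content = urllib2.urlopen("http://example.com").read()  # placeholder URL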
The key is CookieJar(), an object used to manage HTTP cookie values, store cookies generated by HTTP requests, and add cookies to outgoing HTTP requests. All cookies are kept in memory, and they are lost once the CookieJar instance is garbage-collected; none of these steps needs to be handled manually.
Adding a cookie manually:
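For instance (the cookie string below is a made-up placeholder; in practice you would copy it from a real browser session):

    import urllib2

    request = urllib2.Request("http://example.com")  # placeholder URL
    # Placeholder cookie string copied from a browser session
    cookie = "PHPSESSID=xxxxxxxxxx; kmsign=xxxxxxx"
    request.add_header("Cookie", cookie)
    response = urllib2.urlopen(request)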
4. Pretend to be a browser
Some websites dislike visits from crawlers, so they reject all crawler requests. As a result, HTTP Error 403: Forbidden often occurs when you visit a website directly with urllib2.
Pay special attention to certain headers, which the server checks:
1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.
2. Content-Type: when a REST interface is used, the server checks this value to determine how the content in the HTTP body should be parsed.
Both can be dealt with by modifying the headers of the HTTP request. The code snippet is as follows:
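A minimal sketch; the URL is a placeholder and the User-Agent string is just one example of a typical desktop browser value:

    import urllib2

    headers = {
        # A typical desktop browser User-Agent string
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    request = urllib2.Request(
        url="http://example.com/page",  # placeholder URL
        headers=headers
    )
    print urllib2.urlopen(request).read()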
5. Verification code processing
Some simple verification codes can be recognized programmatically (I have only ever handled simple ones myself). But for anti-human verification codes, such as the ones on 12306, you can have them solved manually through a captcha-solving platform; this, of course, costs money.
6. gzip compression
Have you ever come across web pages that stay garbled no matter how you transcode them? Haha, that means you don't yet know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This applies especially to XML web services, because XML data can achieve a very high compression ratio. But a typical server will not send you compressed data unless you tell it that you can process compressed data.
Therefore, you need to modify the code like this:
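A sketch of the request side (Python 2's urllib2; the URL is a placeholder):

    import urllib2

    request = urllib2.Request("http://example.com")  # placeholder URL
    # Tell the server we can handle gzip-compressed responses
    request.add_header("Accept-Encoding", "gzip")
    opener = urllib2.build_opener()
    f = opener.open(request)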
This is the key: create a Request object and add an Accept-Encoding header to tell the server that you can accept gzip-compressed data.
Then decompress the data:
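Continuing the sketch above (a robust client would first check the response's Content-Encoding header; that check is omitted here for brevity):

    import StringIO
    import gzip

    compressed_data = f.read()  # f is the response opened above
    compressed_stream = StringIO.StringIO(compressed_data)
    gzipper = gzip.GzipFile(fileobj=compressed_stream)
    print gzipper.read()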
7. Multi-threaded concurrent crawling
If a single thread is too slow, you need multiple threads. Here is a simple thread pool template:
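A minimal sketch of such a pool in Python 2, using a Queue plus daemon worker threads; the per-task sleep is only there to make the concurrency visible:

    from threading import Thread
    from Queue import Queue
    from time import sleep

    q = Queue()   # task queue shared by all workers
    NUM = 2       # number of concurrent worker threads
    JOBS = 10     # number of tasks to process

    def do_something_using(arguments):
        # Stand-in for real crawling work: just print the task
        print arguments

    def working():
        # Each worker loops forever, pulling tasks off the queue
        while True:
            arguments = q.get()
            do_something_using(arguments)
            sleep(1)
            q.task_done()

    # Start NUM daemon worker threads
    for i in range(NUM):
        t = Thread(target=working)
        t.setDaemon(True)
        t.start()

    # Queue up the tasks 1..10
    for i in range(1, JOBS + 1):
        q.put(i)

    # Block until every task has been processed
    q.join()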
This program simply prints the numbers 1 to 10, but you can see that it does so concurrently. Python's multithreading is notoriously of limited value (the GIL prevents true parallelism), but for a network-heavy workload like crawling it can still improve efficiency to some extent.