
How does Python's urllib2 handle HTTP requests?

Let's take a detailed look at how urllib2 handles HTTP requests in Python.


Important: urllib2 vs. urllib (Python 2) vs. urllib (Python 3)

This is a common point of confusion. You mentioned urllib2, which is the standard library name in Python 2.

  • Python 2: The library was split into two modules: urllib2 and urllib.

    • urllib2: For opening URLs (HTTP, HTTPS, FTP). This is what you're asking about.
    • urllib: For URL encoding/decoding (urllib.urlencode, urllib.quote, urllib.unquote).
  • Python 3: The modules were merged and reorganized into urllib.request and urllib.parse (a compatibility import sketch follows this list).

    • urllib.request: The direct successor to urllib2. It handles opening URLs.
    • urllib.parse: The successor to the old urllib. It handles URL parsing and encoding.
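
To make the mapping concrete, here is a minimal compatibility sketch showing the same names imported on both versions (only urlopen, Request, the error classes, and urlencode are shown; other names follow the same pattern):

# Minimal sketch: the Python 2 names and their Python 3 homes.
# urllib2.urlopen / urllib2.Request  ->  urllib.request
# urllib2.URLError / HTTPError       ->  urllib.error
# urllib.urlencode / quote / unquote ->  urllib.parse
try:
    # Python 3
    from urllib.request import urlopen, Request
    from urllib.error import URLError, HTTPError
    from urllib.parse import urlencode
except ImportError:
    # Python 2
    from urllib2 import urlopen, Request, URLError, HTTPError
    from urllib import urlencode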

Recommendation: For any new project, you should use the requests library. It's far more intuitive, powerful, and easier to use than the built-in urllib. However, understanding urllib2 is essential for reading legacy Python 2 code.


urllib2 in Python 2: Core Concepts

The main entry point for urllib2 is the urlopen() function. It can handle both simple requests and more complex ones that require custom headers, authentication, or cookies.

Making a Simple GET Request

This is the most basic use case: fetching the content of a webpage.

import urllib2
# The URL we want to fetch
url = "http://httpbin.org/get"
try:
    # urlopen opens the URL and returns a file-like object
    response = urllib2.urlopen(url)
    # We can read the content of the response
    html = response.read()
    # The response object also has useful headers
    print "Response Code:", response.getcode()
    print "Headers:"
    print response.info()
    print "\n--- Content ---"
    print html
except urllib2.URLError as e:
    print "Failed to open URL:", e.reason

Explanation:

  • urllib2.urlopen(url): Opens the URL and returns an addinfourl object (a file-like object).
  • response.read(): Reads the entire content of the response as a string (a JSON-parsing sketch follows this list).
  • response.getcode(): Returns the HTTP status code (e.g., 200 for OK, 404 for Not Found).
  • response.info(): Returns an httplib.HTTPMessage object (a mimetools.Message subclass) containing the response headers.
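
Since httpbin.org responds with JSON, a typical next step with urllib2 is to parse the body yourself with the standard json module; a minimal sketch (the field names are just what httpbin happens to return):

import json
import urllib2

response = urllib2.urlopen("http://httpbin.org/get")
# urllib2 has no built-in JSON support, so decode the raw body manually
data = json.loads(response.read())
print "URL echoed back:", data["url"]
print "Headers httpbin saw:", data["headers"]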

Adding Custom Headers (e.g., User-Agent)

Many websites block default Python user-agents. To avoid this, you can add a User-Agent header. For this, you need to create a Request object.

import urllib2
url = "http://httpbin.org/user-agent"
# Create a dictionary of headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'application/json'
}
# Create a Request object with the URL and headers
request = urllib2.Request(url, headers=headers)
try:
    # Pass the Request object to urlopen
    response = urllib2.urlopen(request)
    response_data = response.read()
    print response_data
except urllib2.URLError as e:
    print "Failed to open URL:", e.reason

Explanation:

  • urllib2.Request(url, headers=...): Instead of passing a string directly, we create a Request object. This allows us to specify headers, data (for POST), and other request-level parameters (an add_header() alternative is sketched after this list).
  • urllib2.urlopen(request): We pass the Request object to urlopen.
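
Headers can also be attached one at a time with the Request object's add_header() method; a minimal sketch of the same request built that way:

import urllib2

request = urllib2.Request("http://httpbin.org/user-agent")
# add_header() sets a single header on an existing Request object
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
request.add_header('Accept', 'application/json')
response = urllib2.urlopen(request)
print response.read()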

Making a POST Request

To send data to a server, you use a POST request. You pass the data as the second argument to the Request object.

import urllib2
import urllib # Note: We need the old urllib for urlencode
url = "http://httpbin.org/post"
# The data to send. This must be in a specific format.
# We use urllib.urlencode to convert a dictionary to a query string.
post_data = urllib.urlencode({
    'username': 'test_user',
    'password': 'secure_password123'
})
# In Python 2, urlencode() already returns a byte string (str),
# so it can be passed to Request() directly.
# Create a Request object, passing the URL and the data
request = urllib2.Request(url, data=post_data)
try:
    response = urllib2.urlopen(request)
    response_data = response.read()
    print "Response from POST request:"
    print response_data
except urllib2.URLError as e:
    print "Failed to open URL:", e.reason

Explanation:

  • urllib.urlencode(): This function (from the old urllib module) is crucial. It takes a dictionary and converts it into a format suitable for an HTTP request body (key1=value1&key2=value2). A sketch of using it for GET query strings follows this list.
  • Request(url, data=post_data): By providing the data argument, urllib2 automatically changes the request method from GET to POST.
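
urlencode() is just as useful for GET requests, where the encoded pairs go into the URL as a query string instead of the request body; a minimal sketch (the parameter names are just examples):

import urllib
import urllib2

params = urllib.urlencode({'q': 'python', 'page': 2})
# For GET, append the encoded parameters to the URL itself
url = "http://httpbin.org/get?" + params
response = urllib2.urlopen(url)
print response.read()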

Handling HTTP Errors (e.g., 404 Not Found)

If a URL returns an error status code (4xx or 5xx), urlopen raises an HTTPError.

import urllib2
url = "http://httpbin.org/status/404"
try:
    response = urllib2.urlopen(url)
    print "Success! Code:", response.getcode()
except urllib2.HTTPError as e:
    # This block catches HTTP errors like 404, 500, etc.
    print "HTTP Error occurred!"
    print "Error Code:", e.code
    print "Error Reason:", e.reason
    print "Error Headers:", e.headers
except urllib2.URLError as e:
    # This block catches other URL-related errors (e.g., no network)
    print "URL Error occurred:", e.reason

Handling Cookies

urllib2 has a built-in HTTPCookieProcessor to handle cookies automatically. You need to build an "opener" to use it.

import urllib2
import cookielib # The old cookie library
# Create a cookie jar to store cookies
cookie_jar = cookielib.CookieJar()
# Create an opener that will handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
# A URL that sets a cookie
url_set_cookie = "http://httpbin.org/cookies/set?test_cookie=12345"
# A URL that expects the cookie to be sent back
url_check_cookie = "http://httpbin.org/cookies"
# Install the opener. Now all calls to urlopen will use it.
urllib2.install_opener(opener)
print "--- Visiting URL to set cookie ---"
response = urllib2.urlopen(url_set_cookie)
print response.read()
print "\n--- Visiting URL to check cookie ---"
response = urllib2.urlopen(url_check_cookie)
print response.read()
# You can inspect the cookies in the jar
print "\n--- Cookies in the jar ---"
for cookie in cookie_jar:
    print cookie
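
If the cookies need to survive between runs of the script, cookielib also offers file-backed jars; a minimal sketch using MozillaCookieJar (the cookies.txt filename is just an example):

import urllib2
import cookielib

# A file-backed jar: cookies can be saved to disk and reloaded later
cookie_jar = cookielib.MozillaCookieJar("cookies.txt")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.open("http://httpbin.org/cookies/set?persisted=1")
# ignore_discard=True also keeps session cookies that lack an expiry
cookie_jar.save(ignore_discard=True)
# In a later run, load the saved cookies back before making requests
cookie_jar.load(ignore_discard=True)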

The Modern Alternative: The requests Library

As promised, here's how you'd do the same tasks with the requests library (it works on Python 3, and on Python 2 after pip install requests). The syntax is much cleaner.

Installation

pip install requests

requests Examples

import requests
# 1. Simple GET Request
response = requests.get('http://httpbin.org/get')
print("Status Code:", response.status_code)
print("JSON Response:", response.json()) # .json() parses the JSON response
# 2. Adding Headers
headers = {'User-Agent': 'MyCoolAgent/1.0'}
response = requests.get('http://httpbin.org/user-agent', headers=headers)
print(response.json())
# 3. Making a POST Request
payload = {'username': 'test_user', 'password': 'secure_password123'}
response = requests.post('http://httpbin.org/post', data=payload)
print("POST Response JSON:", response.json())
# 4. Handling Errors (Requests does this for you)
try:
    response = requests.get('http://httpbin.org/status/404', timeout=5)
    response.raise_for_status() # Raises an HTTPError for bad responses (4xx or 5xx)
except requests.exceptions.HTTPError as err:
    print("HTTP Error:", err)
except requests.exceptions.RequestException as err:
    print("Other Error:", err)
# 5. Handling Cookies (Automatically!)
# Session objects persist cookies across requests
with requests.Session() as session:
    # Get a cookie
    session.get('http://httpbin.org/cookies/set/test_cookie/12345')
    # The cookie is now attached to the session
    response = session.get('http://httpbin.org/cookies')
    print("Cookies from session:", response.json())

Summary: urllib2 vs. requests

| Feature | urllib2 (Python 2) | requests (Modern Python) |
| --- | --- | --- |
| Ease of Use | Verbose; requires manual handling of many cases. | Simple, intuitive, "Pythonic" API. |
| JSON Handling | Manual: json.loads(response.read()) | Automatic: response.json() |
| POST Data | Requires urllib.urlencode(). | Pass a dictionary directly. |
| Sessions/Cookies | Requires building an opener with HTTPCookieProcessor. | Built-in Session object handles cookies automatically. |
| Error Handling | Requires catching HTTPError and URLError separately. | raise_for_status() plus a single requests.exceptions.RequestException base class. |
| Recommendation | Only for maintaining legacy Python 2 code. | The standard for all new Python projects. |