杰瑞科技汇

How to Use urllib2 Correctly in Python 2.7

This guide covers how to use urllib2 in Python 2.7.


Important Note: Python 2.7 is End-of-Life

Before we dive in, it's crucial to understand that Python 2.7 reached its official end-of-life on January 1, 2020. It no longer receives security updates, bug fixes, or other improvements. Using it in production is strongly discouraged.

However, understanding urllib2 is still valuable for:

  • Maintaining legacy Python 2.7 applications.
  • Reading and understanding older codebases.
  • Appreciating the history of Python's standard library.

For any new project, you should be using Python 3, where the urllib library has been reorganized and improved into urllib.request, urllib.parse, and urllib.error.


urllib2 Overview

urllib2 is a Python 2.7 standard library module for fetching URLs (opening and reading them). It's a powerful tool that can handle:

  • HTTP and HTTPS URLs.
  • Different HTTP methods (GET, POST).
  • Custom headers (e.g., User-Agent, Authorization).
  • Cookies.
  • Authentication (Basic, Digest).
  • Following redirects.

The primary objects you'll work with are:

  • urllib2.urlopen(): The main function for opening a URL.
  • urllib2.Request: A class for building a request with custom headers, data, etc.
  • urllib2.build_opener() and urllib2.install_opener(): To build a custom opener (for cookies, proxies, etc.) and install it as the default for urlopen().

Basic GET Request

The simplest use case is fetching the content of a webpage.

import urllib2
# The URL you want to fetch
url = "http://www.example.com"
response = None
try:
    # Open the URL and read the response
    response = urllib2.urlopen(url)
    # The response object is like a file object
    html = response.read()
    # Print the first 500 characters
    print(html[:500])
except urllib2.URLError as e:
    print("Failed to open URL: " + str(e))
except Exception as e:
    print("An error occurred: " + str(e))
finally:
    # It's good practice to close the response
    if response is not None:
        response.close()

Explanation:

  • urllib2.urlopen(url) sends a simple GET request to the URL.
  • It returns a file-like object (response).
  • response.read() reads the entire content of the response as a string.
  • response.getcode() returns the HTTP status code (e.g., 200 for OK, 404 for Not Found).
  • response.headers contains the response headers.
  • Always close the response to free up resources. In Python 2.7 the object returned by urllib2.urlopen() does not support the with statement directly, but wrapping the call in contextlib.closing() gives the same effect.
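
The wrapping mentioned in the last point can be done with contextlib.closing from the standard library. The following is a minimal sketch rather than the only way to do it: the helper name fetch and its max_bytes parameter are hypothetical, and the import fallback is there only so the sketch also runs on Python 3, where urllib2 became urllib.request.

```python
import contextlib

try:
    import urllib2 as request_lib          # Python 2.7
except ImportError:
    import urllib.request as request_lib   # Python 3 fallback


def fetch(url, max_bytes=500):
    """Return up to max_bytes of the response body (hypothetical helper)."""
    # closing() calls response.close() when the block exits, even on error
    with contextlib.closing(request_lib.urlopen(url)) as response:
        return response.read(max_bytes)
```

Usage would look like print(fetch("http://www.example.com")). Because the close happens inside the helper, callers do not need their own finally block.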

Adding Custom Headers (e.g., User-Agent)

Many websites block the default urllib2 User-Agent (Python-urllib/2.7). You can customize your request headers using the Request object.

import urllib2
url = "http://httpbin.org/user-agent" # A site that echoes back your User-Agent
# Create a Request object instead of calling urlopen directly
request = urllib2.Request(url)
# Add a custom User-Agent header
request.add_header('User-Agent', 'My-Cool-App/1.0 (Windows NT 10.0; Win64; x64)')
response = None
try:
    response = urllib2.urlopen(request)
    html = response.read()
    print(html)
except urllib2.URLError as e:
    print("Error: " + str(e))
finally:
    if response is not None:
        response.close()

Explanation:

  • Instead of urllib2.urlopen(url), we first create a Request object.
  • request.add_header() lets you add any HTTP header you need.
  • You can also set headers when creating the request: urllib2.Request(url, headers={'User-Agent': '...'}).
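
To check what a Request will carry without touching the network, the headers can be inspected on the object itself. This is a small sketch: the URL and header values are made up for illustration, and the import fallback is there only so it also runs on Python 3. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
try:
    from urllib2 import Request          # Python 2.7
except ImportError:
    from urllib.request import Request   # Python 3 fallback

url = "http://httpbin.org/headers"  # illustrative URL; nothing is sent here

# Headers passed to the constructor...
request = Request(url, headers={"User-Agent": "My-Cool-App/1.0"})
# ...and one added afterwards; both end up in the same header table
request.add_header("Accept", "application/json")

# Names are stored via str.capitalize(), so look them up in that form
user_agent = request.get_header("User-agent")
accept = request.get_header("Accept")
print(sorted(request.header_items()))
```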

Making a POST Request

To send data in the body of a request (e.g., submitting a form), you pass the data argument to urllib2.Request. For a standard form submission, the data must be a string in application/x-www-form-urlencoded format.

import urllib2
import urllib # Note: We need the 'urllib' module for urlencode
url = "http://httpbin.org/post" # A site that echoes back POST data
# The data to send (a dictionary)
post_data = {
    'username': 'test_user',
    'password': 's3cr3t_p@ss'
}
# URL-encode the dictionary into a string like 'username=test_user&password=s3cr3t_p%40ss'
encoded_data = urllib.urlencode(post_data)
# Create a Request object with the URL and the encoded data
# Note: The data argument makes the request a POST request
request = urllib2.Request(url, data=encoded_data)
response = None
try:
    response = urllib2.urlopen(request)
    html = response.read()
    print(html)
except urllib2.URLError as e:
    print("Error: " + str(e))
finally:
    if response is not None:
        response.close()

Explanation:

  • urllib.urlencode() is essential for converting a Python dictionary into a format suitable for a POST request.
  • When you pass the data argument to Request, urllib2 automatically sets the Content-Type header to application/x-www-form-urlencoded and changes the request method to POST.
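
The switch from GET to POST can be observed without sending anything. The sketch below is an illustration under a few assumptions: the field values are placeholders, a list of pairs is used instead of a dictionary so the output order is fixed, and the import fallback plus the encode step exist only so the same code also runs on Python 3, where the body must be bytes.

```python
try:
    from urllib import urlencode            # Python 2.7
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode      # Python 3 fallback
    from urllib.request import Request

url = "http://httpbin.org/post"  # nothing is sent in this sketch

# A list of pairs keeps the field order stable; '@' is escaped as %40
fields = [("username", "test_user"), ("password", "s3cr3t_p@ss")]
encoded = urlencode(fields)

get_request = Request(url)
# encode("ascii") leaves this str unchanged on Python 2 and yields bytes on Python 3
post_request = Request(url, data=encoded.encode("ascii"))

print(encoded)
print(get_request.get_method())
print(post_request.get_method())
```

get_method() reports the verb that urlopen() would use, which makes it a convenient check before a request is actually sent.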

Handling HTTP Errors (404, 500, etc.)

By default, urllib2.urlopen() will raise an exception for HTTP error status codes (4xx, 5xx). You can catch these exceptions to handle them gracefully.

import urllib2
url = "http://httpbin.org/status/404" # This URL returns a 404 Not Found error
try:
    response = urllib2.urlopen(url)
    # Use string formatting: print("a", b) prints a tuple in Python 2
    print("Success! Status code: %d" % response.getcode())
    print(response.read())
    response.close()
except urllib2.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    # This block catches HTTP errors like 404, 500, etc.
    print("HTTP Error occurred!")
    print("Error code: %d" % e.code)
    print("Error reason: %s" % e.reason)
    # You can even read the error page content
    print("Error page: %s" % e.read())
except urllib2.URLError as e:
    # This catches other URL-related errors (e.g., network down, invalid domain)
    print("URL Error occurred!")
    print("Reason: %s" % e.reason)
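
The order of the two except clauses matters because HTTPError is a subclass of URLError; if URLError came first, it would also swallow 404s and 500s. Here is a minimal sketch of that relationship, with a hypothetical describe helper and hand-built error objects standing in for real failures (the import fallback is only there so the sketch also runs on Python 3):

```python
try:
    from urllib2 import HTTPError, URLError        # Python 2.7
except ImportError:
    from urllib.error import HTTPError, URLError   # Python 3 fallback


def describe(error):
    """Classify an exception in the same order as the except clauses."""
    # Check the more specific class first
    if isinstance(error, HTTPError):
        return "HTTP error %d" % error.code
    if isinstance(error, URLError):
        return "URL error: %s" % error.reason
    return "other error"


# Hand-built instances for illustration; in real code urlopen() raises these
not_found = HTTPError("http://example.com/missing", 404, "Not Found", {}, None)
unreachable = URLError("connection refused")

print(describe(not_found))
print(describe(unreachable))
print(issubclass(HTTPError, URLError))
```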

Following Redirects

urllib2 automatically follows HTTP redirects (status codes 301, 302, etc.) by default. You can disable this behavior.

import urllib2
url = "http://httpbin.org/redirect/1" # Redirects once to /get
# To disable redirects, subclass HTTPRedirectHandler and refuse every redirect.
# (Passing the stock HTTPRedirectHandler would still follow them.)
class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None # Returning None means "do not follow"
opener = urllib2.build_opener(NoRedirectHandler)
# opener.open() will now raise an exception on redirect
try:
    # This will raise an HTTPError because we're not following redirects
    response = opener.open(url)
    print("Should not be here")
except urllib2.HTTPError as e:
    print("Caught redirect because we disabled it.")
    print("Status code: %d" % e.code)
    print("Redirect URL was: %s" % e.headers.get('Location'))
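
Whether an opener will really refuse redirects can be checked offline by looking at its handler list. The sketch below is one possible approach, not the only one: the class name NoRedirectHandler is hypothetical, nothing is sent over the network, and the import fallback exists only so it also runs on Python 3. It relies on two documented behaviours: redirect_request() returning None means the redirect is not followed, and build_opener() skips a default handler when a subclass of it is passed in.

```python
try:
    from urllib2 import HTTPRedirectHandler, build_opener          # Python 2.7
except ImportError:
    from urllib.request import HTTPRedirectHandler, build_opener   # Python 3 fallback


class NoRedirectHandler(HTTPRedirectHandler):
    """Hypothetical handler that refuses every redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built
        return None


# Passing a subclass makes build_opener skip the default HTTPRedirectHandler
opener = build_opener(NoRedirectHandler)

redirect_handlers = [h for h in opener.handlers
                     if isinstance(h, HTTPRedirectHandler)]
print([type(h).__name__ for h in redirect_handlers])
```

With such an opener, opener.open(url) on a redirecting URL raises HTTPError carrying the 3xx status code, and the target is available from e.headers.get('Location').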

Handling Cookies

For websites that require a login, you need to handle cookies. The best way is to use an HTTPCookieProcessor.

import urllib # Needed for urlencode below
import urllib2
import cookielib # Note: 'cookielib' is the Python 2 module name
# Create a cookie jar to store cookies
cookie_jar = cookielib.CookieJar()
# Build an opener that will handle cookies for us
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
# Install this opener as the default opener for all future calls
urllib2.install_opener(opener)
# Now, any request made with urlopen will automatically handle cookies
# Example: Log in to a site (this is a hypothetical example)
login_url = "http://example.com/login"
login_data = urllib.urlencode({'username': 'user', 'password': 'pass'})
# First, make a POST request to log in. The cookies will be stored.
response = urllib2.urlopen(login_url, login_data)
print("Logged in. Cookie jar now contains:")
for cookie in cookie_jar:
    print(cookie.name + " = " + cookie.value)
# Now, make a request to a protected page. The cookies are sent automatically.
protected_page_url = "http://example.com/dashboard"
response = urllib2.urlopen(protected_page_url)
print("\nContent of protected page:")
print(response.read()[:200])
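
install_opener() changes the behaviour of every later urlopen() call in the process. When that is too broad, an opener can be kept local and used through opener.open() instead. A minimal offline sketch of that alternative follows; the variable names and the jar_of helper are illustrative, and the import fallback is only there so it also runs on Python 3.

```python
try:
    import cookielib                                        # Python 2.7
    from urllib2 import HTTPCookieProcessor, build_opener
except ImportError:
    import http.cookiejar as cookielib                      # Python 3 fallback
    from urllib.request import HTTPCookieProcessor, build_opener

# Two independent sessions, each with its own cookie jar
jar_a = cookielib.CookieJar()
jar_b = cookielib.CookieJar()
opener_a = build_opener(HTTPCookieProcessor(jar_a))
opener_b = build_opener(HTTPCookieProcessor(jar_b))


def jar_of(opener):
    """Return the cookie jar attached to an opener, or None (hypothetical helper)."""
    for handler in opener.handlers:
        if isinstance(handler, HTTPCookieProcessor):
            return handler.cookiejar
    return None


print(jar_of(opener_a) is jar_a)
print(jar_of(opener_b) is jar_b)
print(len(jar_a))  # no requests have been made, so the jar holds no cookies
```

Requests made with opener_a.open(url) would store and send cookies from jar_a only, while plain urlopen() calls stay unaffected because nothing was installed globally.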

The Modern Python 3 Alternative

For context, here is how you would perform the same tasks in Python 3. The library was split into submodules.

# Python 3 Example
import urllib.request
import urllib.parse
import urllib.error
# 1. Basic GET Request
try:
    with urllib.request.urlopen('http://www.example.com') as response:
        html = response.read()
        print(html[:500].decode('utf-8')) # Decode bytes to string
except urllib.error.URLError as e:
    print("Error:", e.reason)
# 2. Custom Headers & POST Request
url = 'http://httpbin.org/post'
data = {'username': 'test_user', 'password': 's3cr3t_p@ss'}
# Encode data and create request in one go
req = urllib.request.Request(url, data=urllib.parse.urlencode(data).encode('utf-8'), headers={'User-Agent': 'My-Python-3-App/1.0'})
try:
    with urllib.request.urlopen(req) as response:
        print(response.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    print("HTTP Error:", e.code, e.reason)
# 3. Cookies
import http.cookiejar
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
# ... rest of the cookie logic is similar ...

Summary: Key Differences (Python 2 urllib2 vs. Python 3 urllib)

Feature        | Python 2.7 (urllib2)                    | Python 3 (urllib.request)
Module Name    | urllib2                                 | urllib.request
URL Encoding   | urllib.urlencode()                      | urllib.parse.urlencode()
Error Handling | urllib2.URLError, urllib2.HTTPError     | urllib.error.URLError, urllib.error.HTTPError
Cookies        | cookielib.CookieJar()                   | http.cookiejar.CookieJar()
Import         | import urllib2                          | import urllib.request
Open URL       | urllib2.urlopen(url)                    | urllib.request.urlopen(url)
Request Object | urllib2.Request(url)                    | urllib.request.Request(url)
Data Encoding  | urllib.urlencode(dict) returns a string | urllib.parse.urlencode(dict).encode('utf-8') returns bytes