Python urllib2 API

Of course! Here is a comprehensive guide to using the urllib2 API in Python.


Important Note: urllib2 vs. urllib in Python 3

First, a crucial point for modern Python users:

  • In Python 2: The library was called urllib2.
  • In Python 3: The urllib2 module was merged and reorganized into urllib. The functionality is now split across three modules:
    • urllib.request: For opening and reading URLs (the direct replacement for urllib2).
    • urllib.error: Contains exception classes for errors raised by urllib.request.
    • urllib.parse: For parsing URLs (in Python 2 this functionality was split between the urlparse and urllib modules).

This guide will focus on the Python 2 urllib2 API, as that's what your question specifically asked for. I will include a "Python 3 Equivalent" section for each major concept.
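
If you need a script that runs under both versions, a common pattern is to try the Python 3 import locations first and fall back to the Python 2 modules. This is only a minimal sketch; adapt the imported names to the pieces you actually use:

# Minimal compatibility sketch: try the Python 3 locations first,
# then fall back to the Python 2 modules.
try:
    from urllib.request import urlopen, Request          # Python 3
    from urllib.error import URLError, HTTPError
    from urllib.parse import urlencode
except ImportError:
    from urllib2 import urlopen, Request, URLError, HTTPError  # Python 2
    from urllib import urlencode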


Core Concepts of urllib2

The urllib2 library is a standard library for fetching URLs. Its main components are:

  1. urllib2.urlopen(url[, data[, timeout]]): The primary function for opening a URL. It returns a file-like object.
  2. urllib2.Request: A class that represents a request to a URL. It allows you to add headers, set data, and use different HTTP methods.
  3. urllib2.build_opener([handler, ...]): A function to create an "opener" object, which is more flexible than urlopen. It allows you to install handlers for things such as cookies, HTTP authentication, and proxies.
  4. urllib2.install_opener(opener): Installs the opener you built as the default opener for the module.
  5. urllib2.HTTPError & urllib2.URLError: Exception classes for handling network and HTTP errors.
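
As a small illustration of how build_opener and install_opener fit together, you can build an opener whose HTTPHandler has debuglevel set to 1, which dumps the raw request and response headers to stdout. This is a minimal sketch (the exact debug output format depends on your Python build):

import urllib2

# Build an opener whose HTTP handler prints request/response headers to stdout
debug_handler = urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(debug_handler)

# Install it so every subsequent urllib2.urlopen() call uses it
urllib2.install_opener(opener)

response = urllib2.urlopen("http://www.example.com")
print "Fetched", len(response.read()), "bytes"
response.close()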

Basic URL Fetching (urlopen)

This is the simplest way to get the content of a webpage. urlopen returns a file-like object, so you can read from it just like a file.

import urllib2
try:
    # The URL you want to open
    url = "http://www.example.com"
    # Open the URL
    response = urllib2.urlopen(url)
    # Get the HTTP response status code (e.g., 200 for OK)
    print "Response Status Code:", response.getcode()
    # Read the content of the response
    # .read() returns the entire content as a string
    html_content = response.read()
    # Print the first 200 characters of the HTML
    print "Content (first 200 chars):", html_content[:200]
    # It's good practice to close the response
    response.close()
except urllib2.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print "The server couldn't fulfill the request."
    print "Error code:", e.code
    print "Page:", e.read() # You can read the error page's content
except urllib2.URLError as e:
    print "Failed to reach a server."
    print "Reason:", e.reason

Python 3 Equivalent:

# In Python 3, you use urllib.request
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
try:
    url = "http://www.example.com"
    with urlopen(url) as response: # The 'with' statement handles closing
        html_content = response.read()
        print("Response Status Code:", response.status)
        print("Content (first 200 chars):", html_content[:200].decode('utf-8')) # Decode bytes to string
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print("The server couldn't fulfill the request.")
    print("Error code:", e.code)
    print("Page:", e.read().decode('utf-8'))
except URLError as e:
    print("Failed to reach a server.")
    print("Reason:", e.reason)

Adding Headers to a Request

Many websites require specific headers, like a User-Agent, to allow requests from scripts.

You must create a Request object to add headers.

import urllib2
url = "http://www.example.com"
# Create a Request object
req = urllib2.Request(url)
# Add a User-Agent header
# This makes the request look like it's coming from a standard browser
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
# Add another header, e.g., for Accept-Language
req.add_header('Accept-Language', 'en-US,en;q=0.9')
# Now open the Request object, not the URL directly
response = urllib2.urlopen(req)
html_content = response.read()
print "User-Agent header added successfully."
print "Content (first 200 chars):", html_content[:200]
response.close()

Python 3 Equivalent:

from urllib.request import Request, urlopen
url = "http://www.example.com"
req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
with urlopen(req) as response:
    html_content = response.read()
    print("User-Agent header added successfully.")
    print("Content (first 200 chars):", html_content[:200].decode('utf-8'))

Sending Data (POST Request)

To send data to a server (like a form submission), you pass a data argument, either to urlopen directly or to a Request object. The data must be encoded as application/x-www-form-urlencoded, and the urllib.urlencode function handles that for you.

import urllib2
import urllib # Note: we need the 'urllib' module for urlencode
url = "http://www.example.com/submit-form"
# The data to be sent, as a dictionary
form_data = {
    'username': 'john_doe',
    'password': 'secret123'
}
# Encode the data into the correct format
# e.g., "username=john_doe&password=secret123"
encoded_data = urllib.urlencode(form_data)
# Note: When a data argument is supplied, the request is sent as a POST
# (with no data, urllib2 issues a GET)
req = urllib2.Request(url, data=encoded_data)
# You can still add headers
req.add_header('User-Agent', 'MyCoolScript/1.0')
try:
    response = urllib2.urlopen(req)
    response_content = response.read()
    print "POST request successful."
    print "Response:", response_content[:200]
    response.close()
except urllib2.HTTPError as e:
    print "Error during POST:", e.code, e.reason

Python 3 Equivalent:

from urllib.request import Request, urlopen
from urllib.error import HTTPError # needed for the except clause below
from urllib.parse import urlencode # urlencode is in urllib.parse in Python 3
url = "http://www.example.com/submit-form"
form_data = {
    'username': 'john_doe',
    'password': 'secret123'
}
encoded_data = urlencode(form_data).encode('utf-8') # Data must be bytes in Python 3
req = Request(url, data=encoded_data, method='POST') # Explicitly setting method is good practice
req.add_header('User-Agent', 'MyCoolScript/1.0')
try:
    with urlopen(req) as response:
        response_content = response.read()
        print("POST request successful.")
        print("Response:", response_content[:200].decode('utf-8'))
except HTTPError as e:
    print("Error during POST:", e.code, e.reason)

Handling Cookies

urllib2 has built-in support for cookies. The HTTPCookieProcessor handler manages them.

import urllib2
import cookielib # The module for cookie handling
# Create a cookie jar to store cookies
cookie_jar = cookielib.CookieJar()
# Create an opener that will handle cookies for us
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
# Install this opener as the default opener for the module
urllib2.install_opener(opener)
# Now, any request made with urlopen will automatically handle cookies
url = "http://www.example.com/login-page" # A hypothetical login page
response = urllib2.urlopen(url)
print "Visited login page. Cookies received:", cookie_jar
# Now visit another page that requires a session
url2 = "http://www.example.com/protected-page"
response2 = urllib2.urlopen(url2)
print "Visited protected page. Cookies are now in the jar."
print "Content:", response2.read()[:200]
response.close()
response2.close()

Python 3 Equivalent:

from urllib.request import build_opener, install_opener, urlopen, HTTPCookieProcessor
from urllib.error import HTTPError
from http.cookiejar import CookieJar # CookieJar is in http.cookiejar
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))
install_opener(opener)
url = "http://www.example.com/login-page"
try:
    with urlopen(url) as response:
        print("Visited login page. Cookies received:", cookie_jar)
    url2 = "http://www.example.com/protected-page"
    with urlopen(url2) as response2:
        print("Visited protected page. Cookies are now in the jar.")
        print("Content:", response2.read()[:200].decode('utf-8'))
except HTTPError as e:
    print("Error accessing protected page:", e.code, e.reason)

Handling HTTP Authentication (Basic Auth)

For websites that require username/password authentication, you can use HTTPBasicAuthHandler.

import urllib2
# The handler below base64-encodes the username:password pair for us
# The URL that requires authentication
url = "http://www.example.com/protected"
# Create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password for the URL
# The realm can be None if you don't know it
top_level_url = "http://www.example.com/"
username = "user"
password = "pass"
password_mgr.add_password(None, top_level_url, username, password)
# Create the handler for basic authentication
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
# Build an opener with this handler
opener = urllib2.build_opener(auth_handler)
# Install the opener
urllib2.install_opener(opener)
# Now, the request will include the authentication headers
try:
    response = urllib2.urlopen(url)
    print "Authentication successful."
    print "Content:", response.read()[:200]
    response.close()
except urllib2.HTTPError as e:
    print "Authentication failed. Error:", e.code, e.reason

Python 3 Equivalent:

from urllib.request import (build_opener, install_opener, urlopen,
                            HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler)
from urllib.error import HTTPError
url = "http://www.example.com/protected"
top_level_url = "http://www.example.com/"
username = "user"
password = "pass"
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, top_level_url, username, password)
auth_handler = HTTPBasicAuthHandler(password_mgr)
opener = build_opener(auth_handler)
install_opener(opener)
try:
    with urlopen(url) as response:
        print("Authentication successful.")
        print("Content:", response.read()[:200].decode('utf-8'))
except HTTPError as e:
    print("Authentication failed. Error:", e.code, e.reason)

Summary Table: urllib2 vs. urllib.request (Python 3)

| Task | Python 2 (urllib2) | Python 3 (urllib.request) |
| --- | --- | --- |
| Open URL | urllib2.urlopen(url) | urllib.request.urlopen(url) |
| Create Request | urllib2.Request(url) | urllib.request.Request(url) |
| Add Header | req.add_header('Key', 'Val') | req.add_header('Key', 'Val') |
| Encode Data | urllib.urlencode(dict) | urllib.parse.urlencode(dict).encode() |
| Handle Cookies | cookielib.CookieJar, urllib2.HTTPCookieProcessor | http.cookiejar.CookieJar, urllib.request.HTTPCookieProcessor |
| Handle Auth | urllib2.HTTPBasicAuthHandler | urllib.request.HTTPBasicAuthHandler |
| Build Opener | urllib2.build_opener(handler) | urllib.request.build_opener(handler) |
| Install Opener | urllib2.install_opener(opener) | urllib.request.install_opener(opener) |
| HTTP Error | urllib2.HTTPError | urllib.error.HTTPError |
| URL Error | urllib2.URLError | urllib.error.URLError |