Of course. Let's dive into urllib in Python 3.2.
First, and most importantly, a critical piece of context:
Python 3.2 is a very old and unsupported version, released in 2011. It does not receive security updates, and its standard library, including urllib, is missing many features, bug fixes, and security patches present in modern Python.
The information below is for understanding legacy code or running on very old systems. For any new project, use a modern version of Python (3.6+) with its up-to-date urllib, or better yet the requests library covered at the end.
Overview of urllib in Python 3.2
In Python 3, urllib was reorganized into several submodules to provide a cleaner separation of concerns. In Python 3.2, you will primarily interact with these four:
- `urllib.request`: For opening and reading URLs (like `http:`, `ftp:`, `file:`).
- `urllib.error`: Contains exception classes raised by `urllib.request`.
- `urllib.parse`: For parsing URLs into components (scheme, netloc, path, etc.).
- `urllib.robotparser`: For parsing `robots.txt` files.
urllib.request - Making HTTP Requests
This is the core module for fetching data from the web.
Basic GET Request
The most common task is to download the content of a webpage.
```python
import urllib.request
import urllib.error

# The URL you want to fetch
url = 'http://example.com'

try:
    # urlopen() returns a file-like object
    with urllib.request.urlopen(url) as response:
        # Read the response content
        html = response.read()
        # The content is returned as bytes, so we decode it to a string
        html_string = html.decode('utf-8')
        # Note: f-strings require Python 3.6+, so Python 3.2 code uses str.format()
        print("Successfully fetched {0} characters from {1}".format(len(html_string), url))
        # print(html_string)  # Uncomment to see the HTML
except urllib.error.URLError as e:
    print("Failed to open URL: {0}".format(e.reason))
except Exception as e:
    print("An unexpected error occurred: {0}".format(e))
```
Key points:
- `urllib.request.urlopen(url)` opens the URL.
- It's best practice to use a `with` statement, as it automatically handles closing the connection.
- `response.read()` returns the entire content as a `bytes` object.
- You must explicitly decode the bytes into a string (e.g., using `.decode('utf-8')`).
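The bytes-vs-str distinction trips up many people coming from Python 2, and it can be demonstrated without a network call. The byte string below is just a stand-in for what `response.read()` might return:

```python
# A stand-in for the bytes that response.read() would return
raw = b'<html>caf\xc3\xa9</html>'

# Decoding converts the UTF-8 bytes into a str
text = raw.decode('utf-8')
print(text)  # <html>café</html>
```

Calling string methods like `.split('<')` on `raw` without decoding first would require bytes arguments (`b'<'`), which is a common source of `TypeError`s in ported code.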
Adding Headers (e.g., User-Agent)
Many websites block requests that don't have a proper User-Agent header. In Python 3.2, you do this by creating a Request object.
```python
import urllib.request
import urllib.error

url = 'http://httpbin.org/user-agent'  # A site that echoes back your headers

# Create a dictionary of headers
headers = {
    'User-Agent': 'MyCoolPythonScript/1.0 (http://mywebsite.com)',
    'Accept': 'text/html'
}

# Create a Request object with the URL and headers
req = urllib.request.Request(url, headers=headers)

try:
    with urllib.request.urlopen(req) as response:
        html = response.read().decode('utf-8')
        print(html)
        # Expected output: {"user-agent": "MyCoolPythonScript/1.0 (http://mywebsite.com)"}
except urllib.error.URLError as e:
    print("Failed to open URL: {0}".format(e.reason))
```
Making a POST Request
To send data via a POST request, you need to encode your data into bytes and pass it to the Request object.
```python
import urllib.request
import urllib.parse
import urllib.error

url = 'http://httpbin.org/post'  # A site that echoes back POST data

# Data to be sent in the POST request
# It must be a dictionary of string keys and string values
post_data = {
    'username': 'test_user',
    'message': 'Hello from Python 3.2!'
}

# Encode the data into bytes
# The 'utf-8' encoding is standard
encoded_data = urllib.parse.urlencode(post_data).encode('utf-8')

# Create a Request object, passing the encoded data.
# Note: the method='POST' keyword argument was only added in Python 3.3;
# in Python 3.2, supplying a data argument is what makes the request a POST.
req = urllib.request.Request(url, data=encoded_data)

try:
    with urllib.request.urlopen(req) as response:
        response_body = response.read().decode('utf-8')
        print("POST request successful!")
        # print(response_body)  # To see the server's response
except urllib.error.URLError as e:
    print("Failed to make POST request: {0}".format(e.reason))
```
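The same pattern covers non-form bodies. As a sketch (the httpbin.org URL and payload are illustrative), a JSON body is built by hand with the `json` module, since Python 3.2's urllib has no JSON support of its own; the request is only constructed here, not sent:

```python
import json
import urllib.request

payload = {'username': 'test_user', 'score': 42}

# Serialize to JSON and encode to bytes, as Request requires bytes data
body = json.dumps(payload).encode('utf-8')

req = urllib.request.Request(
    'http://httpbin.org/post',
    data=body,
    headers={'Content-Type': 'application/json'},
)

# Supplying data makes this a POST
print(req.get_method())  # POST
```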
urllib.error - Handling Errors
This module defines exceptions that urllib.request can raise.
- `urllib.error.URLError`: A general error. It has a `.reason` attribute that tells you what went wrong (e.g., "connection timed out", "not found").
- `urllib.error.HTTPError`: A more specific error for HTTP status codes like 404 (Not Found) or 500 (Server Error). It's a subclass of `URLError` and has additional attributes like `.code` (the status code) and `.headers` (the response headers).
```python
import urllib.request
import urllib.error

url = 'http://example.com/nonexistent-page'

try:
    with urllib.request.urlopen(url) as response:
        print(response.read())
except urllib.error.HTTPError as e:
    # Catch HTTPError first, since it is a subclass of URLError
    print("HTTP Error occurred: {0} {1}".format(e.code, e.reason))
    # You can access headers like this:
    # print(e.headers)
except urllib.error.URLError as e:
    print("URL Error occurred: {0}".format(e.reason))
```
urllib.parse - Parsing URLs
This module is for breaking down URLs into their components or building them from parts.
```python
import urllib.parse

url = 'http://www.example.com:80/path/to/page;params?query=name#fragment'

# Parse a URL into a 6-tuple (scheme, netloc, path, params, query, fragment)
parsed_url = urllib.parse.urlparse(url)

print("Scheme: {0}".format(parsed_url.scheme))
print("Netloc (domain + port): {0}".format(parsed_url.netloc))
print("Path: {0}".format(parsed_url.path))
print("Query (after ?): {0}".format(parsed_url.query))
print("Fragment (after #): {0}".format(parsed_url.fragment))

# --- Building a URL from components ---
# urlunparse() takes a 6-tuple
new_parts = ('https', 'newsite.com', '/search', '', 'q=python', '')
new_url = urllib.parse.urlunparse(new_parts)
print("\nNew URL: {0}".format(new_url))

# --- Encoding data for URLs ---
# Use urlencode for query parameters
query_params = {'q': 'python tutorial', 'page': '2'}
encoded_query = urllib.parse.urlencode(query_params)
print("\nEncoded Query String: {0}".format(encoded_query))
```
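Two more `urllib.parse` helpers that were already present in Python 3.2 are worth knowing: `quote()` percent-encodes unsafe characters for use in a URL, and `parse_qs()` reverses `urlencode()` by turning a query string back into a dictionary:

```python
import urllib.parse

# Percent-encode a path (by default '/' is treated as safe)
path = urllib.parse.quote('/search results/café')
print(path)  # /search%20results/caf%C3%A9

# Parse a query string back into a dict mapping names to lists of values
params = urllib.parse.parse_qs('q=python+tutorial&page=2')
print(params['q'])  # ['python tutorial']
```

Note that `parse_qs()` maps each name to a *list* of values, since a query string may repeat the same parameter.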
urllib.robotparser - Parsing robots.txt
This module helps you check if you are allowed to crawl a specific URL on a website.
```python
import urllib.robotparser

# Create a RobotFileParser object
rp = urllib.robotparser.RobotFileParser()

# Set the URL for the website's robots.txt file
rp.set_url('http://example.com/robots.txt')

# Read and parse the robots.txt file
rp.read()

# Now you can check if you can access a URL
user_agent = 'MyCoolCrawler'
url_to_check = 'http://example.com/some-page/'

can_fetch = rp.can_fetch(user_agent, url_to_check)

if can_fetch:
    print("'{0}' is allowed to fetch '{1}'".format(user_agent, url_to_check))
else:
    print("'{0}' is NOT allowed to fetch '{1}'".format(user_agent, url_to_check))

# Note: rp.crawl_delay(user_agent) is NOT available in Python 3.2;
# it was only added to urllib.robotparser in Python 3.6.
```
Critical Recommendation: Use Modern Python and requests
For any serious development, you should use a modern Python version (3.6+) and the requests library. It is vastly superior to urllib in terms of simplicity, readability, and features.
Example of the same tasks using requests:
```python
# First, install requests: pip install requests
import requests

# --- Basic GET Request ---
try:
    response = requests.get('http://example.com')
    response.raise_for_status()  # Raises an exception for bad status codes (4xx or 5xx)
    html = response.text  # Text is automatically decoded
    print(f"Successfully fetched {len(html)} characters from http://example.com")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

# --- POST Request with Headers ---
url = 'http://httpbin.org/post'
data = {'username': 'test_user', 'message': 'Hello from requests!'}
headers = {'User-Agent': 'MyCoolPythonScript/1.0'}

try:
    response = requests.post(url, data=data, headers=headers)
    response.raise_for_status()
    print("\nPOST request successful!")
    print(response.json())  # .json() automatically parses the JSON response
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
```
As you can see, requests handles encoding, headers, and JSON parsing automatically, making the code much cleaner and easier to write.
