Of course. In Python 3, urlparse is a function in the standard-library module urllib.parse (a standalone urlparse module existed only in Python 2). It breaks a Uniform Resource Locator (URL) into its constituent parts.
Since you specified Python 3.6, note that urllib.parse has been stable for a long time, and the core behavior described here is largely the same in later versions. The related helpers are already available in 3.6, including urllib.parse.urljoin for resolving relative URLs and urllib.parse.urlsplit, which works like urlparse but returns five components because it does not split params out of the path.
Here's a comprehensive guide to using urlparse in Python 3.6.
The Basics: urlparse()
The main function is urlparse(). It takes a URL string and returns a special object called a ParseResult.
How to Import
from urllib.parse import urlparse
What it Returns
A ParseResult object is a named tuple with six attributes:
scheme: The protocol (e.g., http, https, ftp).
netloc: The network location (domain name and port).
path: The path to the resource on the server.
params: Parameters for the last path element (rarely used).
query: The query string, which contains key-value pairs.
fragment: The fragment identifier, used to link to a specific part of a page.
Example: Parsing a Simple URL
Let's break down a common URL.
from urllib.parse import urlparse
# A typical URL
url = "https://www.example.com:8080/path/to/page?query=search#section1"
# Parse the URL
parsed_url = urlparse(url)
# The result is a ParseResult object
print(f"Full ParseResult object: {parsed_url}\n")
# You can access each part by its attribute name
print(f"Scheme: {parsed_url.scheme}")
print(f"Netloc: {parsed_url.netloc}")
print(f"Path: {parsed_url.path}")
print(f"Params: {parsed_url.params}") # Often empty
print(f"Query: {parsed_url.query}")
print(f"Fragment: {parsed_url.fragment}")
Output:
Full ParseResult object: ParseResult(scheme='https', netloc='www.example.com:8080', path='/path/to/page', params='', query='query=search', fragment='section1')
Scheme: https
Netloc: www.example.com:8080
Path: /path/to/page
Params:
Query: query=search
Fragment: section1
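Because ParseResult is a named tuple, it also supports positional access and unpacking, and it has a geturl() method that reassembles the string. The short sketch below reuses the URL from the example above to illustrate this.

```python
from urllib.parse import urlparse

url = "https://www.example.com:8080/path/to/page?query=search#section1"
parsed = urlparse(url)

# Positional access works because ParseResult is a tuple subclass.
print(parsed[0] == parsed.scheme)

# Unpack all six components at once.
scheme, netloc, path, params, query, fragment = parsed
print(scheme, netloc, path)

# geturl() reassembles the components into a URL string.
print(parsed.geturl())
```

Unpacking is handy when you need several components at once, while attribute access is usually easier to read.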
The netloc Attribute: Deeper Dive
The netloc (network location) part often contains more than just the domain name. You might need to separate the domain, port, and user info.
from urllib.parse import urlparse
url_with_auth = "ftp://user:pass@sub.domain.com:21/path/to/file"
parsed = urlparse(url_with_auth)
netloc = parsed.netloc
print(f"Full netloc: '{netloc}'")
print(f"Username: '{parsed.username}'") # A convenient attribute
print(f"Password: '{parsed.password}'") # A convenient attribute
print(f"Hostname: '{parsed.hostname}'") # A convenient attribute
print(f"Port: {parsed.port}") # Returns an integer or None
Output:
Full netloc: 'user:pass@sub.domain.com:21'
Username: 'user'
Password: 'pass'
Hostname: 'sub.domain.com'
Port: 21
Note: parsed.port converts the port from a string to an integer and returns None if no port is specified. In Python 3.6 and later, an out-of-range port raises ValueError when the attribute is read. parsed.hostname is always returned in lowercase.
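Since parsed.port is None when the URL has no explicit port, code that needs a concrete port usually has to supply its own fallback. Here is a minimal sketch of one way to do that; the effective_port helper and the DEFAULT_PORTS table are hypothetical names introduced for illustration, not part of urllib.parse.

```python
from urllib.parse import urlparse

# Assumed defaults for illustration only; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def effective_port(url):
    """Return the explicit port, else a default for the scheme, else None."""
    parsed = urlparse(url)
    if parsed.port is not None:
        return parsed.port
    return DEFAULT_PORTS.get(parsed.scheme)

print(effective_port("https://example.com/"))      # no explicit port
print(effective_port("http://example.com:8080/"))  # explicit port wins
```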
Working with the Query String (query)
The query string is a sequence of key=value pairs separated by &. The urllib.parse module provides parse_qs and parse_qsl to parse it.
parse_qs(): Parses into a Dictionary
It returns a dictionary where keys are the parameter names and values are lists of all values for that key (since a key can appear multiple times).
from urllib.parse import urlparse, parse_qs
url = "https://example.com/search?q=python&sort=desc&q=urlparse"
parsed = urlparse(url)
query_string = parsed.query
print(f"Original query string: '{query_string}'")
# Parse the query string into a dictionary
query_params = parse_qs(query_string)
print(f"Parsed query params: {query_params}")
# Accessing a specific value
# Note that the value is always a list!
search_terms = query_params['q']
print(f"\nThe value for 'q' is a list: {search_terms}")
print(f"The first search term is: {search_terms[0]}")
Output:
Original query string: 'q=python&sort=desc&q=urlparse'
Parsed query params: {'q': ['python', 'urlparse'], 'sort': ['desc']}
The value for 'q' is a list: ['python', 'urlparse']
The first search term is: python
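Two details tend to matter in practice: parse_qs drops parameters with empty values unless you pass keep_blank_values=True, and every value comes back as a list. The sketch below shows both; first_value is a hypothetical convenience helper, not a library function.

```python
from urllib.parse import parse_qs

query = "q=python&page=&sort=desc"

# By default, parameters with empty values are dropped.
default_params = parse_qs(query)
print(default_params)

# keep_blank_values=True retains them as empty strings.
all_params = parse_qs(query, keep_blank_values=True)
print(all_params)

def first_value(params, key, default=None):
    """Return the first value for key, or default if the key is absent."""
    values = params.get(key)
    return values[0] if values else default

print(first_value(all_params, "q"))
print(first_value(all_params, "missing", "n/a"))
```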
parse_qsl(): Parses into a List of Tuples
This is useful if you need to preserve the order of the parameters or if you prefer working with a simple list of (key, value) tuples.
from urllib.parse import urlparse, parse_qsl
url = "https://example.com/search?q=python&sort=desc&q=urlparse"
query_string = urlparse(url).query
# Parse the query string into a list of tuples
query_list = parse_qsl(query_string)
print(f"Parsed query list: {query_list}")
Output:
Parsed query list: [('q', 'python'), ('sort', 'desc'), ('q', 'urlparse')]
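The list of tuples from parse_qsl pairs naturally with urllib.parse.urlencode, which turns such a list back into a query string. The following is a small sketch of a round trip that edits the parameters in between; the query string and parameter names are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode

query_string = "q=hello+world&tag=a%26b&q=urlparse"

# Values are percent-decoded, and '+' is decoded to a space.
pairs = parse_qsl(query_string)
print(pairs)

# Drop one parameter and append another, keeping the original order.
edited = [(key, value) for key, value in pairs if key != "tag"]
edited.append(("page", "2"))

# urlencode re-applies the encoding when building the string.
rebuilt = urlencode(edited)
print(rebuilt)
```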
Reconstructing a URL (urlunparse)
The urlunparse function does the reverse of urlparse. It takes a ParseResult (or a 6-tuple) and reconstructs a URL string.
from urllib.parse import urlunparse
# Build a plain 6-tuple of components (a ParseResult would work as well)
# Note: The 'params' part is included but is often an empty string.
new_url_parts = (
'https', # scheme
'new.site.com', # netloc
'/api/v1/data', # path
'', # params
'id=123&format=json', # query
'results' # fragment
)
# Reconstruct the URL
reconstructed_url = urlunparse(new_url_parts)
print(reconstructed_url)
Output:
https://new.site.com/api/v1/data?id=123&format=json#results
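A common variation is to change one or two components of an existing URL rather than build a tuple from scratch. Since ParseResult is a named tuple, its _replace() method can be used for that. This is a minimal sketch with a made-up URL.

```python
from urllib.parse import urlparse, urlunparse

original = urlparse("http://example.com/api/v1/data?id=123#results")

# _replace() returns a new ParseResult; the original is left unchanged.
updated = original._replace(scheme="https", query="id=456&format=json")

new_url = urlunparse(updated)
print(new_url)
print(original.geturl())
```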
Common Pitfalls and Best Practices
a. Relative URLs
urlparse accepts relative URLs as well as absolute ones, but it does not resolve them. If you pass a relative URL like /path/to/page, the scheme and netloc will be empty strings.
from urllib.parse import urlparse
relative_url = "/path/to/page?query=1"
parsed = urlparse(relative_url)
print(parsed.scheme) # Output: ''
print(parsed.netloc) # Output: ''
print(parsed.path) # Output: '/path/to/page'
To resolve relative URLs against a base URL, use urllib.parse.urljoin().
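As a rough illustration of how urljoin resolves different kinds of references, here is a short sketch. The base URL and the relative references are invented for the example.

```python
from urllib.parse import urljoin

base = "https://example.com/docs/guide/index.html"

# A bare filename replaces the last path segment of the base.
sibling = urljoin(base, "intro.html")

# '..' climbs one directory before descending again.
parent = urljoin(base, "../api/ref.html")

# A leading slash replaces the whole path but keeps scheme and host.
rooted = urljoin(base, "/path/to/page?query=1")

# An absolute URL ignores the base entirely.
absolute = urljoin(base, "https://other.example/x")

for result in (sibling, parent, rooted, absolute):
    print(result)
```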
b. Malformed URLs
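urlparse does not validate its input. It splits almost any string without complaint and rarely raises an error (one exception is a ValueError for unmatched square brackets in the netloc). URLs that have no // after the scheme are also split in a way that can look surprising at first.
from urllib.parse import urlparse
# A valid URL with no '//' after the scheme, so there is no netloc
url = "mailto:someone@example.com?subject=Hello"
parsed = urlparse(url)
print(parsed)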
Output:
ParseResult(scheme='mailto', netloc='', path='someone@example.com', params='', query='subject=Hello', fragment='')
This is expected behavior. The scheme is mailto, the netloc is empty because there is no // after the scheme, the text up to the ? becomes the path, and the rest becomes the query.
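Because urlparse does no validation, any check has to be written by the caller. Below is a minimal sketch of one possible check for web URLs; looks_like_http_url is a hypothetical helper, and the rule it applies (http or https scheme plus a non-empty hostname) is an assumption about what counts as acceptable, not a complete validator.

```python
from urllib.parse import urlparse

def looks_like_http_url(candidate):
    """Hypothetical check: require an http(s) scheme and a non-empty host."""
    try:
        parsed = urlparse(candidate)
        return parsed.scheme in ("http", "https") and bool(parsed.hostname)
    except ValueError:
        # Raised for some malformed inputs, e.g. unmatched '[' in the netloc.
        return False

candidates = (
    "https://example.com/page",
    "www.example.com/page",        # no scheme, so everything lands in path
    "mailto:someone@example.com",  # valid URL, but not http(s)
)
for candidate in candidates:
    print(candidate, looks_like_http_url(candidate))
```

The second candidate shows a frequent surprise: without a scheme and //, the host name is parsed as part of the path.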
Summary Table
| Function | Purpose | Example |
|---|---|---|
| urlparse(url) | Breaks a URL string into 6 components. | p = urlparse("https://example.com/path?q=val") |
| p.scheme | Gets the protocol (e.g., https). | p.scheme -> 'https' |
| p.netloc | Gets the domain and port. | p.netloc -> 'example.com' |
| p.path | Gets the resource path. | p.path -> '/path' |
| p.query | Gets the query string. | p.query -> 'q=val' |
| p.fragment | Gets the fragment identifier. | p.fragment -> '' |
| parse_qs(p.query) | Parses query string into a dictionary. | parse_qs('q=val&q=val2') -> {'q': ['val', 'val2']} |
| parse_qsl(p.query) | Parses query string into a list of tuples. | parse_qsl('q=val&q=val2') -> [('q', 'val'), ('q', 'val2')] |
| urlunparse(parts) | Reconstructs a URL from a 6-tuple. | urlunparse(('https', 'ex.com', '/', '', '', '')) |
