杰瑞科技汇

How do you use find_all() in Python's bs4 (BeautifulSoup)?

Of course! Let's dive deep into using BeautifulSoup's find_all() method, which is one of the most fundamental and powerful tools for web scraping with Python.

What is find_all()?

In short, find_all() searches the entire HTML document for all tags that match the given criteria and returns them as a list of Tag objects.

Think of it like using "Ctrl+F" in your browser, but instead of finding a single occurrence, it finds all of them and lets you grab them for processing.
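Here is a minimal sketch of that idea, using a made-up two-link snippet: find_all() hands you a plain Python list, which you can loop over to process each match (for example, pulling out its text and an attribute):

```python
from bs4 import BeautifulSoup

# A tiny made-up document, just for illustration
soup = BeautifulSoup(
    '<a href="/a">First</a><a href="/b">Second</a>', 'html.parser'
)

# find_all() returns an ordinary list of Tag objects
links = soup.find_all('a')
print(len(links))  # 2

# Each Tag exposes its text via .get_text() and
# its attributes via dictionary-style access
for link in links:
    print(link.get_text(), link['href'])
# Output:
# First /a
# Second /b
```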


Basic Syntax

The basic syntax is straightforward:

soup.find_all(name, attrs, recursive, string, limit, **kwargs)

Don't be intimidated by all the arguments. You'll typically only use a few of them in your day-to-day scraping.


The name Argument: Finding by Tag Name

This is the most common use case. You pass the name of the HTML tag you're looking for as a string.

Example HTML: Let's use this simple HTML for our examples:

<html>
<head><title>A Simple Page</title>
</head>
<body>
    <h1>An H1 Header</h1>
    <p class="main-paragraph">This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <div>
        <p>A paragraph inside a div.</p>
    </div>
    <a href="https://example.com">A Link</a>
</body>
</html>

Python Code:

from bs4 import BeautifulSoup
html_doc = """
<html>
<head><title>A Simple Page</title>
</head>
<body>
    <h1>An H1 Header</h1>
    <p class="main-paragraph">This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <div>
        <p>A paragraph inside a div.</p>
    </div>
    <a href="https://example.com">A Link</a>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# 1. Find all <p> tags
all_paragraphs = soup.find_all('p')
print(f"Found {len(all_paragraphs)} <p> tags.")
# Output: Found 3 <p> tags
# 2. Find all <a> tags
all_links = soup.find_all('a')
print(f"Found {len(all_links)} <a> tags.")
# Output: Found 1 <a> tags
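One more detail about the name argument worth knowing: instead of a single string, you can also pass a list of tag names, and find_all() will match any of them, returned in document order. A quick self-contained sketch:

```python
from bs4 import BeautifulSoup

html_doc = """
<h1>An H1 Header</h1>
<p>A paragraph.</p>
<a href="https://example.com">A Link</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# A list of names matches any of the given tags, in document order
headers_and_links = soup.find_all(['h1', 'a'])
print([tag.name for tag in headers_and_links])
# Output: ['h1', 'a']
```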

The attrs Argument: Finding by Attributes

This is where the real power lies. You can filter tags based on their attributes like class, id, href, data-*, etc.

Finding by class

Important: Since class is a reserved keyword in Python, you must use class_ in BeautifulSoup.

Example HTML:

<p class="main-paragraph intro">This is a special paragraph.</p>
<p class="intro">This is another special one.</p>
<p>This is a normal paragraph.</p>

Python Code:

html_doc = """
<p class="main-paragraph intro">This is a special paragraph.</p>
<p class="intro">This is another special one.</p>
<p>This is a normal paragraph.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all tags with the class 'intro'
paragraphs_with_intro_class = soup.find_all(class_='intro')
print(paragraphs_with_intro_class)
# Output:
# [<p class="main-paragraph intro">This is a special paragraph.</p>,
#  <p class="intro">This is another special one.</p>]
# You can also pass a list of classes. BeautifulSoup will find tags
# that have ANY of the specified classes (not all of them).
# (Note: the second <p> matches because it has 'intro', even though
# it lacks 'main-paragraph')
main_or_intro = soup.find_all(class_=['main-paragraph', 'intro'])
print(main_or_intro)
# Output:
# [<p class="main-paragraph intro">This is a special paragraph.</p>,
#  <p class="intro">This is another special one.</p>]
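To be clear about the distinction: the list form of class_ matches tags carrying any of the given classes. If you instead need a tag that carries all of several classes at once, a CSS selector via select() is the simplest route. A small sketch:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="main-paragraph intro">Both classes.</p>
<p class="intro">Only intro.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# A list passed to class_ matches tags with ANY of the classes
any_match = soup.find_all(class_=['main-paragraph', 'intro'])
print(len(any_match))    # 2

# A CSS selector requires ALL of the listed classes on one tag
all_match = soup.select('p.main-paragraph.intro')
print(len(all_match))    # 1
```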

Finding by id

id attributes are supposed to be unique, so find_all() will return a list containing at most one item; for ids, find() (covered later) is usually the better fit.

Example HTML:

<div id="main-content">...</div>
<div id="sidebar">...</div>

Python Code:

html_doc = """
<div id="main-content">Main content goes here.</div>
<div id="sidebar">Sidebar content.</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the tag with id 'main-content'
main_content = soup.find_all(id='main-content')
print(main_content)
# Output: [<div id="main-content">Main content goes here.</div>]

Finding by Any Attribute

You can pass any attribute as a keyword argument, with one exception: attribute names that aren't valid Python identifiers, such as hyphenated ones like data-identifier, can't be used as keyword arguments. For those, pass an attrs dictionary instead, as shown below.

Example HTML:

<span data-identifier="123">Item 1</span>
<span data-identifier="456">Item 2</span>

Python Code:

html_doc = """
<span data-identifier="123">Item 1</span>
<span data-identifier="456">Item 2</span>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all spans with the data-identifier attribute
items = soup.find_all('span', attrs={'data-identifier': True})
print(items)
# Output:
# [<span data-identifier="123">Item 1</span>,
#  <span data-identifier="456">Item 2</span>]
# Or, find a specific value for the attribute
item_123 = soup.find_all('span', attrs={'data-identifier': '123'})
print(item_123)
# Output: [<span data-identifier="123">Item 1</span>]

The string Argument: Finding by Text Content

You can search for tags based on their text. Note that a plain string must match the tag's complete text exactly; for substring or prefix matching, use a regular expression as shown below.

Example HTML:

<p>Hello World</p>
<p>Goodbye World</p>
<p>Hello Python</p>

Python Code:

html_doc = """
<p>Hello World</p>
<p>Goodbye World</p>
<p>Hello Python</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# string matches the *complete* text, so string='Hello' alone finds nothing here
exact_paragraphs = soup.find_all('p', string='Hello World')
print(exact_paragraphs)
# Output: [<p>Hello World</p>]
# You can also use regular expressions for more flexible matching
import re
# Find all <p> tags whose text starts with "Hello"
hello_start_paragraphs = soup.find_all('p', string=re.compile('^Hello'))
print(hello_start_paragraphs)
# Output: [<p>Hello World</p>, <p>Hello Python</p>]
# Find all <p> tags whose text contains "World"
world_paragraphs = soup.find_all('p', string=re.compile('World'))
print(world_paragraphs)
# Output: [<p>Hello World</p>, <p>Goodbye World</p>]

Combining Arguments for Powerful Searches

The real magic happens when you combine these arguments.

Example HTML:

<a href="https://google.com">Google</a>
<a href="https://facebook.com">Facebook</a>
<a href="http://old-site.com" class="archived">Old Site</a>

Python Code:

html_doc = """
<a href="https://google.com">Google</a>
<a href="https://facebook.com">Facebook</a>
<a href="http://old-site.com" class="archived">Old Site</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all <a> tags that have the class 'archived'
archived_links = soup.find_all('a', class_='archived')
print(archived_links)
# Output: [<a href="http://old-site.com" class="archived">Old Site</a>]
# Find all <a> tags whose href attribute starts with 'https'
secure_links = soup.find_all('a', href=re.compile('^https'))
print(secure_links)
# Output:
# [<a href="https://google.com">Google</a>,
#  <a href="https://facebook.com">Facebook</a>]
# Find all tags that contain the text "Google"
google_tags = soup.find_all(string='Google')
print(google_tags)
# Output: ['Google']
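Beyond strings and regular expressions, find_all() also accepts a function as a filter: it is called on each tag and should return True for tags to keep. A small sketch, reusing the links HTML from above:

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="https://google.com">Google</a>
<a href="http://old-site.com" class="archived">Old Site</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# The function receives each Tag and returns True for tags to keep
def is_secure_link(tag):
    return tag.name == 'a' and tag.get('href', '').startswith('https')

secure = soup.find_all(is_secure_link)
print(secure)
# Output: [<a href="https://google.com">Google</a>]
```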

Other Useful Arguments

  • limit: If you only want the first N results, you can use limit. This is much more efficient than getting all results and then slicing the list.

    # (this example reuses the soup from the first HTML document above)
    # Find the first 2 <p> tags
    first_two_paragraphs = soup.find_all('p', limit=2)
    print(first_two_paragraphs)
    # Output: [<p class="main-paragraph">This is the first paragraph.</p>, <p>This is the second paragraph.</p>]
  • recursive: By default, find_all() searches the entire document (recursively). If you set recursive=False, it will only search the direct children of the current tag.

    # (again using the soup from the first HTML document)
    # Get the <body> tag
    body = soup.body
    # Find all <p> tags within the body (default behavior)
    all_p_in_body_recursive = body.find_all('p')
    print(f"Recursive search found {len(all_p_in_body_recursive)} <p> tags.")
    # Find all <p> tags that are direct children of the body only
    all_p_in_body_direct = body.find_all('p', recursive=False)
    print(f"Direct search found {len(all_p_in_body_direct)} <p> tags.")
    # Output:
    # Recursive search found 3 <p> tags.
    # Direct search found 2 <p> tags.

The find() Method vs. find_all()

It's crucial to understand the difference:

                     find_all()                       find()
What it does         Finds all matching tags.         Finds the first matching tag.
Return type          A list of Tag objects.           A single Tag object.
Use case             Multiple matches to loop over.   One specific, unique element (e.g., by id).
If nothing found     An empty list [].                None.
# Using find() to get the first <p> tag
first_p_tag = soup.find('p')
print(first_p_tag)
# Output: <p class="main-paragraph">This is the first paragraph.</p>
# Trying to find a tag that doesn't exist
non_existent_tag = soup.find('xyz')
print(non_existent_tag)
# Output: None

Summary and Best Practices

  1. Start Broad, Then Narrow Down: First, find a unique parent tag (e.g., div with a specific id), then use find_all() on that parent. This makes your scraper more resilient to changes in the rest of the page.

    # Good practice
    product_list = soup.find('div', id='product-list')
    products = product_list.find_all('div', class_='product-item')
  2. Be Specific: Use combinations of tag names, classes, and other attributes to make your selectors as specific as possible. Avoid relying on the position of an element (e.g., "the third paragraph").

  3. Handle Errors Gracefully: Since find() returns None and find_all() returns an empty list when nothing is found, always check for this before trying to access attributes or text.

    link = soup.find('a', class_='main-link')
    if link:
        print(link['href'])
    else:
        print("Could not find the main link.")
  4. Use CSS Selectors: For very complex queries, BeautifulSoup also supports CSS selectors via the select() and select_one() methods, which can be more concise and powerful.

    # Find all <p> tags with class 'intro' inside a <div> with id 'content'
    # Using find_all
    p_tags = soup.find('div', id='content').find_all('p', class_='intro')
    # Using CSS selectors (often more readable)
    p_tags_css = soup.select('div#content p.intro')