杰瑞科技汇

How do you use find_all() in Python's bs4 (BeautifulSoup)?

Of course! Let's dive deep into using BeautifulSoup's find_all() method, which is one of the most fundamental and powerful tools for web scraping with Python.

What is find_all()?

In short, find_all() searches the entire HTML document for all tags that match the given criteria and returns them as a list of Tag objects.

Think of it like using "Ctrl+F" in your browser, but instead of finding a single occurrence, it finds all of them and lets you grab them for processing.
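Here is a minimal sketch of that idea, using a made-up two-link snippet: find_all() hands you a plain Python list, which you can loop over to process each match (for example, pulling out its text and an attribute):

```python
from bs4 import BeautifulSoup

# A tiny made-up document, just for illustration
soup = BeautifulSoup(
    '<a href="/a">First</a><a href="/b">Second</a>', 'html.parser'
)

# find_all() returns an ordinary list of Tag objects
links = soup.find_all('a')
print(len(links))  # 2

# Each Tag exposes its text via .get_text() and
# its attributes via dictionary-style access
for link in links:
    print(link.get_text(), link['href'])
# Output:
# First /a
# Second /b
```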


Basic Syntax

The basic syntax is straightforward:

soup.find_all(name, attrs, recursive, string, limit, **kwargs)

Don't be intimidated by all the arguments. You'll typically only use a few of them in your day-to-day scraping.


The name Argument: Finding by Tag Name

This is the most common use case. You pass the name of the HTML tag you're looking for as a string.

Example HTML: Let's use this simple HTML for our examples:

<html>
<head><title>A Simple Page</title>
</head>
<body>
    <h1>An H1 Header</h1>
    <p class="main-paragraph">This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <div>
        <p>A paragraph inside a div.</p>
    </div>
    <a href="https://example.com">A Link</a>
</body>
</html>

Python Code:

from bs4 import BeautifulSoup
html_doc = """
<html>
<head><title>A Simple Page</title>
</head>
<body>
    <h1>An H1 Header</h1>
    <p class="main-paragraph">This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <div>
        <p>A paragraph inside a div.</p>
    </div>
    <a href="https://example.com">A Link</a>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# 1. Find all <p> tags
all_paragraphs = soup.find_all('p')
print(f"Found {len(all_paragraphs)} <p> tags.")
# Output: Found 3 <p> tags
# 2. Find all <a> tags
all_links = soup.find_all('a')
print(f"Found {len(all_links)} <a> tags.")
# Output: Found 1 <a> tags
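One more detail about the name argument worth knowing: instead of a single string, you can also pass a list of tag names, and find_all() will match any of them, returned in document order. A quick self-contained sketch:

```python
from bs4 import BeautifulSoup

html_doc = """
<h1>An H1 Header</h1>
<p>A paragraph.</p>
<a href="https://example.com">A Link</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# A list of names matches any of the given tags, in document order
headers_and_links = soup.find_all(['h1', 'a'])
print([tag.name for tag in headers_and_links])
# Output: ['h1', 'a']
```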

The attrs Argument: Finding by Attributes

This is where the real power lies. You can filter tags based on their attributes like class, id, href, data-*, etc.

Finding by class

Important: Since class is a reserved keyword in Python, you must use class_ in BeautifulSoup.

Example HTML:

<p class="main-paragraph intro">This is a special paragraph.</p>
<p class="intro">This is another special one.</p>
<p>This is a normal paragraph.</p>

Python Code:

html_doc = """
<p class="main-paragraph intro">This is a special paragraph.</p>
<p class="intro">This is another special one.</p>
<p>This is a normal paragraph.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all tags with the class 'intro'
paragraphs_with_intro_class = soup.find_all(class_='intro')
print(paragraphs_with_intro_class)
# Output:
# [<p class="main-paragraph intro">This is a special paragraph.</p>,
#  <p class="intro">This is another special one.</p>]
# You can also pass a list of classes. BeautifulSoup will find tags
# that have ANY of the specified classes (not all of them).
# (Note: the second <p> matches because it has 'intro', even though
# it lacks 'main-paragraph')
main_or_intro = soup.find_all(class_=['main-paragraph', 'intro'])
print(main_or_intro)
# Output:
# [<p class="main-paragraph intro">This is a special paragraph.</p>,
#  <p class="intro">This is another special one.</p>]
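To be clear about the distinction: the list form of class_ matches tags carrying any of the given classes. If you instead need a tag that carries all of several classes at once, a CSS selector via select() is the simplest route. A small sketch:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="main-paragraph intro">Both classes.</p>
<p class="intro">Only intro.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# A list passed to class_ matches tags with ANY of the classes
any_match = soup.find_all(class_=['main-paragraph', 'intro'])
print(len(any_match))    # 2

# A CSS selector requires ALL of the listed classes on one tag
all_match = soup.select('p.main-paragraph.intro')
print(len(all_match))    # 1
```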

Finding by id

id attributes are supposed to be unique, so find_all() will return a list containing at most one item; for ids, find() (covered later) is usually the better fit.

Example HTML:

<div id="main-content">...</div>
<div id="sidebar">...</div>

Python Code:

html_doc = """
<div id="main-content">Main content goes here.</div>
<div id="sidebar">Sidebar content.</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the tag with id 'main-content'
main_content = soup.find_all(id='main-content')
print(main_content)
# Output: [<div id="main-content">Main content goes here.</div>]

Finding by Any Attribute

You can pass any attribute as a keyword argument, with one exception: attribute names that aren't valid Python identifiers, such as hyphenated ones like data-identifier, can't be used as keyword arguments. For those, pass an attrs dictionary instead, as shown below.

Example HTML:

<span data-identifier="123">Item 1</span>
<span data-identifier="456">Item 2</span>

Python Code:

html_doc = """
<span data-identifier="123">Item 1</span>
<span data-identifier="456">Item 2</span>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all spans with the data-identifier attribute
items = soup.find_all('span', attrs={'data-identifier': True})
print(items)
# Output:
# [<span data-identifier="123">Item 1</span>,
#  <span data-identifier="456">Item 2</span>]
# Or, find a specific value for the attribute
item_123 = soup.find_all('span', attrs={'data-identifier': '123'})
print(item_123)
# Output: [<span data-identifier="123">Item 1</span>]

The string Argument: Finding by Text Content

You can search for tags based on their text. Note that a plain string must match the tag's complete text exactly; for substring or prefix matching, use a regular expression as shown below.

Example HTML:

<p>Hello World</p>
<p>Goodbye World</p>
<p>Hello Python</p>

Python Code:

html_doc = """
<p>Hello World</p>
<p>Goodbye World</p>
<p>Hello Python</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# string matches the *complete* text, so string='Hello' alone finds nothing here
exact_paragraphs = soup.find_all('p', string='Hello World')
print(exact_paragraphs)
# Output: [<p>Hello World</p>]
# You can also use regular expressions for more flexible matching
import re
# Find all <p> tags whose text starts with "Hello"
hello_start_paragraphs = soup.find_all('p', string=re.compile('^Hello'))
print(hello_start_paragraphs)
# Output: [<p>Hello World</p>, <p>Hello Python</p>]
# Find all <p> tags whose text contains "World"
world_paragraphs = soup.find_all('p', string=re.compile('World'))
print(world_paragraphs)
# Output: [<p>Hello World</p>, <p>Goodbye World</p>]

Combining Arguments for Powerful Searches

The real magic happens when you combine these arguments.

Example HTML:

<a href="https://google.com">Google</a>
<a href="https://facebook.com">Facebook</a>
<a href="http://old-site.com" class="archived">Old Site</a>

Python Code:

html_doc = """
<a href="https://google.com">Google</a>
<a href="https://facebook.com">Facebook</a>
<a href="http://old-site.com" class="archived">Old Site</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all <a> tags that have the class 'archived'
archived_links = soup.find_all('a', class_='archived')
print(archived_links)
# Output: [<a href="http://old-site.com" class="archived">Old Site</a>]
# Find all <a> tags whose href attribute starts with 'https'
secure_links = soup.find_all('a', href=re.compile('^https'))
print(secure_links)
# Output:
# [<a href="https://google.com">Google</a>,
#  <a href="https://facebook.com">Facebook</a>]
# Find all tags that contain the text "Google"
google_tags = soup.find_all(string='Google')
print(google_tags)
# Output: ['Google']
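Beyond strings and regular expressions, find_all() also accepts a function as a filter: it is called on each tag and should return True for tags to keep. A small sketch, reusing the links HTML from above:

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="https://google.com">Google</a>
<a href="http://old-site.com" class="archived">Old Site</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# The function receives each Tag and returns True for tags to keep
def is_secure_link(tag):
    return tag.name == 'a' and tag.get('href', '').startswith('https')

secure = soup.find_all(is_secure_link)
print(secure)
# Output: [<a href="https://google.com">Google</a>]
```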

Other Useful Arguments

  • limit: If you only want the first N results, you can use limit. This is much more efficient than getting all results and then slicing the list.

    # (this example reuses the soup from the first HTML document above)
    # Find the first 2 <p> tags
    first_two_paragraphs = soup.find_all('p', limit=2)
    print(first_two_paragraphs)
    # Output: [<p class="main-paragraph">This is the first paragraph.</p>, <p>This is the second paragraph.</p>]
  • recursive: By default, find_all() searches the entire document (recursively). If you set recursive=False, it will only search the direct children of the current tag.

    # (again using the soup from the first HTML document)
    # Get the <body> tag
    body = soup.body
    # Find all <p> tags within the body (default behavior)
    all_p_in_body_recursive = body.find_all('p')
    print(f"Recursive search found {len(all_p_in_body_recursive)} <p> tags.")
    # Find all <p> tags that are direct children of the body only
    all_p_in_body_direct = body.find_all('p', recursive=False)
    print(f"Direct search found {len(all_p_in_body_direct)} <p> tags.")
    # Output:
    # Recursive search found 3 <p> tags.
    # Direct search found 2 <p> tags.

The find() Method vs. find_all()

It's crucial to understand the difference:

                     find_all()                       find()
What it does         Finds all matching tags.         Finds the first matching tag.
Return type          A list of Tag objects.           A single Tag object.
Use case             Multiple matches to loop over.   One specific, unique element (e.g., by id).
If nothing found     An empty list [].                None.
# Using find() to get the first <p> tag
first_p_tag = soup.find('p')
print(first_p_tag)
# Output: <p class="main-paragraph">This is the first paragraph.</p>
# Trying to find a tag that doesn't exist
non_existent_tag = soup.find('xyz')
print(non_existent_tag)
# Output: None

Summary and Best Practices

  1. Start Broad, Then Narrow Down: First, find a unique parent tag (e.g., div with a specific id), then use find_all() on that parent. This makes your scraper more resilient to changes in the rest of the page.

    # Good practice
    product_list = soup.find('div', id='product-list')
    products = product_list.find_all('div', class_='product-item')
  2. Be Specific: Use combinations of tag names, classes, and other attributes to make your selectors as specific as possible. Avoid relying on the position of an element (e.g., "the third paragraph").

  3. Handle Errors Gracefully: Since find() returns None and find_all() returns an empty list when nothing is found, always check for this before trying to access attributes or text.

    link = soup.find('a', class_='main-link')
    if link:
        print(link['href'])
    else:
        print("Could not find the main link.")
  4. Use CSS Selectors: For very complex queries, BeautifulSoup also supports CSS selectors via the select() and select_one() methods, which can be more concise and powerful.

    # Find all <p> tags with class 'intro' inside a <div> with id 'content'
    # Using find_all
    p_tags = soup.find('div', id='content').find_all('p', class_='intro')
    # Using CSS selectors (often more readable)
    p_tags_css = soup.select('div#content p.intro')