杰瑞科技汇

Python executemany如何高效批量插入数据?

Of course! Let's dive deep into executemany() in Python, specifically when used with the sqlite3 module. This is a fundamental concept for efficient database operations.

Python executemany如何高效批量插入数据?-图1
(图片来源网络,侵删)

What is executemany()?

executemany() is a method provided by database cursor objects in Python's database API (like sqlite3, psycopg2 for PostgreSQL, mysql-connector-python for MySQL, etc.). Its primary purpose is to execute a single SQL statement multiple times, with different parameters for each execution.

This is the standard, high-performance way to insert, update, or delete multiple rows in a database.


The Core Problem: Why Not Use a for Loop with execute()?

Imagine you want to insert 1,000 users into a database. You could do this:

# The INEFFICIENT way
import sqlite3
conn = sqlite3.connect('mydatabase.db')
cursor = conn.cursor()
users = [
    ('Alice', 30),
    ('Bob', 25),
    ('Charlie', 35),
    # ... 997 more users
]
for user in users:
    cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", user)
conn.commit()
conn.close()

This approach works, but it's very slow for a few key reasons:

Python executemany如何高效批量插入数据?-图2
(图片来源网络,侵删)
  1. Network Overhead (Client-Server): If your database is on a different server, each execute() call involves a round-trip over the network. For 1,000 users, that's 1,000 network trips, which is extremely slow.
  2. Server Overhead (Local DB): Even with a local database like SQLite, the database server has to parse, compile, and plan the execution of the SQL statement 1,000 separate times. This creates significant overhead.

The Solution: executemany()

executemany() solves these problems by batching the operations. It sends the data to the database server in a single operation.

The syntax is:

cursor.executemany(sql, seq_of_parameters)
  • sql: The SQL statement string. It must contain placeholders for the data.
  • seq_of_parameters: A sequence (like a list or tuple) of sequences (like tuples or lists) where each inner sequence contains the parameters for one execution.

Example: Inserting Multiple Users with executemany()

Let's use the same example of inserting users, but this time the correct and efficient way.

Setup the Database

First, let's create a database table to work with.

import sqlite3
# Connect to the database (it will be created if it doesn't exist)
conn = sqlite3.connect('users.db')
cursor = conn.cursor()
# Create a table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        age INTEGER
    )
''')
# Commit the schema change
conn.commit()
print("Table 'users' is ready.")

Use executemany() to Insert Data

Now, let's insert a list of users efficiently.

# Data to be inserted
users_to_insert = [
    ('Alice', 30),
    ('Bob', 25),
    ('Charlie', 35),
    ('David', 28),
    ('Eve', 40)
]
# The SQL statement with placeholders
sql_insert = "INSERT INTO users (name, age) VALUES (?, ?)"
# Use executemany to insert all users at once
try:
    cursor.executemany(sql_insert, users_to_insert)
    conn.commit()
    print(f"{cursor.rowcount} records inserted successfully.")
except sqlite3.Error as e:
    print(f"An error occurred: {e}")
finally:
    # Close the connection
    conn.close()

Key Takeaways from this example:

  • Efficiency: The entire executemany operation is typically handled in a single communication with the database.
  • cursor.rowcount: After the operation, cursor.rowcount tells you how many rows were affected. In this case, it will be 5.
  • Placeholder Style: We used which is the standard placeholder style for sqlite3. Other database drivers may use different styles, like %s (which is a common convention, though not a Python string formatting directive in this context) or named placeholders (name).

executemany() vs. executemany() with a Transaction

For maximum performance, especially with a very large number of inserts, you should wrap the executemany() call in a transaction. A transaction groups multiple database operations into a single unit.

By default, sqlite3 runs in "auto-commit" mode for DML statements (like INSERT), meaning each executemany() is its own transaction. We can manually control this for better performance.

import sqlite3
import time
# --- Setup (same as before) ---
conn = sqlite3.connect('users_fast.db')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        age INTEGER
    )
''')
conn.commit()
# --- End Setup ---
# Generate a large amount of dummy data
print("Generating 10,000 dummy records...")
large_dataset = [(f'User_{i}', i % 100) for i in range(10000)]
print("Generation complete.")
# --- Method 1: executemany with auto-commit (slower) ---
print("\n--- Method 1: executemany with auto-commit ---")
start_time = time.time()
cursor.executemany("INSERT INTO users (name, age) VALUES (?, ?)", large_dataset)
conn.commit() # This commit is for the single executemany call
end_time = time.time()
print(f"Inserted {cursor.rowcount} records in {end_time - start_time:.4f} seconds.")
# --- Method 2: executemany inside a manual transaction (faster) ---
print("\n--- Method 2: executemany inside a manual transaction ---")
start_time = time.time()
# Start a transaction
conn.execute("BEGIN TRANSACTION")
# Perform the operation
cursor.executemany("INSERT INTO users (name, age) VALUES (?, ?)", large_dataset)
# Commit the transaction
conn.commit()
end_time = time.time()
print(f"Inserted {cursor.rowcount} records in {end_time - start_time:.4f} seconds.")
conn.close()

When you run this, you will see a significant performance boost with Method 2. This is because the database engine can optimize the entire batch of inserts as a single atomic unit, drastically reducing disk I/O and transaction log overhead.


Advanced: Customizing executemany with a Generator

What if your data isn't in a neat list of tuples? executemany is flexible. You can pass any iterable, including a generator function.

Let's say you have a function that yields user data:

def generate_users(num_users):
    """A generator function that yields user data."""
    for i in range(num_users):
        yield (f'GeneratedUser_{i}', i % 50)
# Use the generator with executemany
conn = sqlite3.connect('users_generator.db')
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, age INTEGER)")
# executemany will consume the generator one item at a time
cursor.executemany("INSERT INTO users (name, age) VALUES (?, ?)", generate_users(100))
conn.commit()
print(f"Inserted {cursor.rowcount} records from a generator.")
conn.close()

This pattern is very powerful for processing large datasets from files or other sources without loading everything into memory at once.

Summary

Feature execute() in a loop executemany()
Performance Slow. High network/server overhead. Fast. Batches operations into a single call.
Use Case For running many different SQL statements. For running the same SQL statement with many different parameters.
Code Style More verbose. More concise and readable for bulk operations.
Transactions Each execute() is often a separate transaction. Can be wrapped in a single transaction for maximum speed.
Flexibility Can handle any SQL string in each iteration. Requires the SQL statement to be the same for all executions.

In short, whenever you need to perform the same SQL operation (especially INSERT, UPDATE, DELETE) on multiple sets of data, always prefer executemany(). It is the idiomatic, efficient, and professional way to handle bulk database operations in Python.

分享:
扫描分享到社交APP
上一篇
下一篇