杰瑞科技汇

How to read and write data efficiently with Python pyexcel-io

pyexcel-io is a foundational library within the pyexcel ecosystem, and understanding it is key to understanding how the whole system works.


What is pyexcel-io?

At its core, pyexcel-io is a low-level library for reading and writing data from/to various sources in a uniform way. It acts as the engine that handles the actual I/O operations, abstracting away the complexities of different file formats.

Think of it this way:

  • pyexcel (the main package): The high-level, user-friendly interface. You use this for most everyday tasks. It's like the steering wheel and pedals of a car.
  • pyexcel-io: The engine and transmission. It does the hard work of converting data into a standardized format and writing it to the disk, or reading from the disk and converting it back. You rarely interact with it directly unless you're creating a new source or format.

The main idea is separation of concerns:

  1. Data Representation: pyexcel represents tabular data as a sequence of rows (a list of lists), which it wraps in a pyexcel.Sheet object.
  2. I/O Handling: pyexcel-io is responsible for taking this data and serializing it to a specific format (like CSV, XLSX) or deserializing it from a format into the data representation.

Key Concepts: Sources and Targets

The most important concepts in pyexcel-io are sources and targets.

  • A Source: Anything you can read data from. This could be a file on your disk, a URL, a string in memory, or even a database connection.
  • A Target: Anything you can write data to. This is typically a file or a stream in memory.
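
To make the two terms concrete, here is a small standard-library sketch (not pyexcel-io's own code) in which any text stream can serve as a source or a target. The helper names read_rows and write_rows are hypothetical.

```python
import csv
import io


def read_rows(source):
    """Read rows from any text stream: an open file or an in-memory buffer."""
    return [row for row in csv.reader(source)]


def write_rows(target, rows):
    """Write rows to any text stream that has a write() method."""
    csv.writer(target, lineterminator="\n").writerows(rows)


# An in-memory string acts as the source...
source = io.StringIO("Name,Age\nAlice,30\nBob,25\n")
rows = read_rows(source)

# ...and another in-memory stream acts as the target.
target = io.StringIO()
write_rows(target, rows)
print(target.getvalue())
```

Because both helpers only rely on the stream interface, swapping io.StringIO for open("data.csv", newline="") changes the source without touching the parsing code.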

pyexcel-io provides a registry system where different source and target types are registered with specific "reader" and "writer" classes.
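
The registry idea can be sketched in a few lines of plain Python. This is an illustration of the pattern only, under the assumption that a registry is a mapping from file type to reader; READERS, register, and load are hypothetical names, not pyexcel-io's API.

```python
import csv
import io
import json

# Hypothetical registry: maps a file type to the function that reads it.
READERS = {}


def register(file_type):
    """Decorator that records a reader function under a file type."""
    def decorator(func):
        READERS[file_type] = func
        return func
    return decorator


@register("csv")
def read_csv(stream):
    return [row for row in csv.reader(stream)]


@register("json")
def read_json(stream):
    return json.load(stream)


def load(stream, file_type):
    """Look up the reader for file_type and use it to parse the stream."""
    try:
        reader = READERS[file_type]
    except KeyError:
        raise ValueError(f"no reader registered for {file_type!r}") from None
    return reader(stream)


csv_rows = load(io.StringIO("a,b\n1,2\n"), "csv")
json_rows = load(io.StringIO('[["a", "b"], [1, 2]]'), "json")
print(csv_rows, json_rows)
```

Adding a new format then means registering one more function; the load function itself never changes.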


The Relationship: pyexcel -> pyexcel-io

When you use the main pyexcel library, it uses pyexcel-io behind the scenes.

Example: Reading a file with pyexcel

import pyexcel as p
# 1. User calls a high-level pyexcel function
# pyexcel looks at the file extension ".csv"
sheet = p.get_sheet(file_name="my_data.csv")
# 2. pyexcel-io is invoked
# - It looks up the ".csv" extension in its registry.
# - It finds the registered CSV source reader.
# - It uses that reader to parse the file content.
# - It returns the data to pyexcel in its standard format (a Sheet object).
print(sheet)
# Output:
# pyexcel sheet:
# name:my_data.csv
# +---------+---------+
# | Name    | Age     |
# +---------+---------+
# | Alice   | 30      |
# | Bob     | 25      |
# +---------+---------+

When Would You Use pyexcel-io Directly?

You would typically use pyexcel-io directly if you want to:

  1. Create a custom data source: For example, read data from an API response, a specific database table, or a log file that isn't a standard spreadsheet format.
  2. Create a custom data target: Write data to a specific database, a compressed stream, or a custom binary format.
  3. Understand the internals of how pyexcel works.
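
As an illustration of the first case, the sketch below turns a log file into rows using only the standard library. The log format, the LOG_LINE pattern, and the log_rows helper are all assumptions made for this example.

```python
import io
import re

# Hypothetical log format: "<YYYY-MM-DD> <HH:MM:SS> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) ([A-Z]+) (.*)$")


def log_rows(stream):
    """Yield a header row, then one row per well-formed log line."""
    yield ["date", "time", "level", "message"]
    for line in stream:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:  # lines that do not fit the format are skipped
            yield list(match.groups())


sample = io.StringIO(
    "2024-01-05 10:00:01 INFO service started\n"
    "not a log line\n"
    "2024-01-05 10:00:07 ERROR disk full\n"
)
table = list(log_rows(sample))
for row in table:
    print(row)
```

The result is a plain list of rows, the same shape as the array argument that save_as accepts later in this article.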

Practical Example: Creating a Custom Source

Let's create a simple custom source that reads data from a Python dictionary. We'll register it with pyexcel-io so that the main pyexcel library can use it.

Step 1: Define Your Custom Source Reader

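You need two classes: a reader that implements the IReader interface and yields the rows of each sheet, and a source that hands that reader to pyexcel-io. Treat the pyexcel-io names used in this walkthrough (manager.Source, register_reader) as a sketch of the idea rather than a verified API: plugin registration has changed between pyexcel-io releases, so check the plugin documentation for the version you have installed.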

import pyexcel as p
import pyexcel_io
from pyexcel_io import manager
from pyexcel_io.plugin_api import IReader
# Our data source
data_from_dict = {
    "Sheet 1": [
        ["Name", "Age"],
        ["Alice", 30],
        ["Bob", 25],
    ]
}
# This class tells pyexcel-io how to read from our custom source
class DictReader(IReader):
    def __init__(self, file_content, **keywords):
        # file_content will be our dictionary
        self._file_content = file_content
        self._sheet_names = list(file_content.keys())
    def read_sheet(self, sheet_index):
        # This method is called for each sheet
        sheet_name = self._sheet_names[sheet_index]
        # The reader must return a generator of lists (rows)
        return (row for row in self._file_content[sheet_name])
    def get_sheet_stream(self):
        # This is the main entry point for the reader
        # It should return a generator of (sheet_name, sheet_reader)
        for index, name in enumerate(self._sheet_names):
            yield name, self.read_sheet(index)
# This class registers our reader with a specific "type"
class DictSource(manager.Source):
    def __init__(self, file_type, file_content, **keywords):
        super().__init__(file_type, file_content, **keywords)
    def get_reader(self):
        # This method is called by pyexcel-io to get an instance of our reader
        return DictReader(self._file_content, **self._keywords)
# The "type" can be anything, e.g., "dict", "my_api"
CUSTOM_SOURCE_TYPE = "dict"
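
The data flow of the reader can be checked without pyexcel-io at all. The following standalone sketch, with hypothetical names book and sheet_stream, reproduces only the contract described in the comments above: a generator of (sheet name, row generator) pairs that a consumer materializes.

```python
# Standalone sketch: no pyexcel-io imports, only the data flow.
book = {
    "Sheet 1": [["Name", "Age"], ["Alice", 30], ["Bob", 25]],
    "Sheet 2": [["City"], ["Paris"]],
}


def sheet_stream(content):
    """Yield (sheet_name, row_generator) pairs, one per sheet."""
    for name, rows in content.items():
        # Rows are produced lazily, only when the consumer iterates.
        yield name, (list(row) for row in rows)


# A consumer, standing in for the library, materializes each sheet.
loaded = {name: list(rows) for name, rows in sheet_stream(book)}
print(loaded)
```

Yielding generators instead of full lists means a large sheet is never copied until the consumer asks for its rows.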

Step 2: Register Your Custom Source

Now, we tell pyexcel-io that when it sees a source of type "dict", it should use our DictSource class.

# Register the source
pyexcel_io.register_reader(CUSTOM_SOURCE_TYPE, DictSource)

Step 3: Use It with the Main pyexcel Library

Now, you can use pyexcel.get_sheet() as if it were a native file type!

# Use the custom source with the main pyexcel library
sheet = p.get_sheet(
    file_type=CUSTOM_SOURCE_TYPE,
    file_content=data_from_dict
)
print(sheet)
# Output:
# pyexcel sheet:
# name:Sheet 1
# +-------+-----+
# | Name  | Age |
# +-------+-----+
# | Alice | 30  |
# | Bob   | 25  |
# +-------+-----+
# You can now use all pyexcel features on this data
print("Alice's age:", sheet[1, 1])
# Output:
# Alice's age: 30
# You can even write it to a real file
p.save_as(
    dest_file_name="from_dict_output.xlsx",
    dest_sheet_name="New Data",
    array=sheet.to_array()
)

Summary: pyexcel vs. pyexcel-io

| Feature | pyexcel (Main Package) | pyexcel-io (Engine) |
|---|---|---|
| Purpose | High-level, easy-to-use API for data manipulation. | Low-level I/O handling engine. |
| Abstraction | Hides file formats and sources. You just pass a filename or URL. | Manages sources and targets. It knows about CSV, XLSX, etc. |
| Primary Users | End-users, data analysts, scientists. | Library authors, developers extending pyexcel. |
| Key Functions | get_sheet(), save_as(), get_book() | get_data(), save_data(); reader and writer plugin classes. |
| Data Flow | Accepts/returns Sheet and Book objects. | Accepts/returns raw data (generators of lists) and streams. |

Installation

To install pyexcel-io, simply use pip:

pip install pyexcel-io

Note: You don't usually need to install pyexcel-io by itself. It is a dependency of pyexcel, so installing pyexcel pulls it in.

# Extra formats come from plugin packages, for example:
pip install pyexcel pyexcel-xlsx pyexcel-xls pyexcel-ods3
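
To confirm what ended up installed, a short check with the standard library's importlib.metadata can help. This is a minimal sketch; the installed_version helper is a hypothetical name introduced here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("pyexcel", "pyexcel-io"):
    print(name, installed_version(name) or "not installed")
```

Knowing the exact pyexcel-io version matters when writing plugins, since the extension interfaces differ between releases.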