杰瑞科技汇

Python如何用xlrd读取Linux下的Excel文件?

Of course! Here's a comprehensive guide to using the xlrd library in Python on a Linux system.

Python如何用xlrd读取Linux下的Excel文件?-图1
(图片来源网络,侵删)

Overview

xlrd is a Python library for reading data and formatting information from Excel files (.xls and .xlsx). It's a fast and efficient library, but it's important to note its main limitation:

xlrd can only READ .xls files (Excel 97-2003) and .xlsx files (Excel 2007+) in "read-only" mode. It cannot create or modify Excel files.

For writing to Excel files, you would typically use a library like openpyxl or xlsxwriter.


Installation

First, you need to have Python and pip (Python's package installer) installed on your Linux system. If you don't, you can usually install them with your distribution's package manager (e.g., sudo apt-get install python3 python3-pip for Debian/Ubuntu).

Python如何用xlrd读取Linux下的Excel文件?-图2
(图片来源网络,侵删)

Then, install xlrd using pip:

# It's best practice to use pip3 for Python 3
pip3 install xlrd

If you need to install it for a specific virtual environment, activate the environment first:

source /path/to/your/venv/bin/activate
pip install xlrd

Basic Usage: Reading an Excel File

Let's assume you have an Excel file named data.xlsx with the following content in the first sheet:

A B C
1 Name Age City
2 Alice 30 New York
3 Bob 25 London
4 Charlie 35 Paris

Here is the Python code to read this file.

import xlrd
# Define the path to your Excel file
file_path = 'data.xlsx'
try:
    # Open the workbook
    # The 'on_demand=True' parameter can be used to load only specific sheets,
    # which is memory-efficient for large files.
    workbook = xlrd.open_workbook(file_path, on_demand=False)
    # Get the first sheet (by index or name)
    # sheet = workbook.sheet_by_index(0)
    sheet = workbook.sheet_by_name('Sheet1') # Make sure the sheet name is correct
    print(f"Sheet Name: {sheet.name}")
    print(f"Number of Rows: {sheet.nrows}")
    print(f"Number of Columns: {sheet.ncols}")
    print("-" * 20)
    # --- Method 1: Iterate through all rows and cells ---
    print("Iterating through all rows:")
    for row_idx in range(sheet.nrows):
        # Each row is a list of cell values
        row = sheet.row_values(row_idx)
        print(f"Row {row_idx}: {row}")
    print("-" * 20)
    # --- Method 2: Access specific cells by coordinates (row, col) ---
    # Note: Row and column indices are 0-based.
    print("Accessing specific cells:")
    # Get the value from cell A2 (row 1, col 0)
    name = sheet.cell_value(1, 0)
    print(f"Name from A2: {name}")
    # Get the value from cell B3 (row 2, col 1)
    age = sheet.cell_value(2, 1)
    print(f"Age from B3: {age}")
    # --- Method 3: Get a full row or column as a list ---
    print("\nGetting full rows and columns:")
    header_row = sheet.row_values(0)
    print(f"Header Row: {header_row}")
    # Get all values from the 'City' column (column index 2)
    city_column = sheet.col_values(2)
    print(f"City Column: {city_column}")
except xlrd.XLRDError as e:
    print(f"Error opening or reading the Excel file: {e}")
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")

Explanation of Key Methods:

  • xlrd.open_workbook(file_path): Opens the Excel file.
  • workbook.sheet_by_index(0): Gets a sheet by its position (0 is the first sheet).
  • workbook.sheet_by_name('Sheet1'): Gets a sheet by its name.
  • sheet.nrows: The total number of rows in the sheet.
  • sheet.ncols: The total number of columns in the sheet.
  • sheet.row_values(row_idx): Returns all cell values in a specific row as a list.
  • sheet.col_values(col_idx): Returns all cell values in a specific column as a list.
  • sheet.cell_value(row_idx, col_idx): Returns the value of a single cell.

Handling Cell Types

Excel stores different types of data (text, numbers, dates, etc.). xlrd provides a way to check the type of a cell.

  • xlrd.XL_CELL_TEXT (0)
  • xlrd.XL_CELL_NUMBER (1)
  • xlrd.XL_CELL_DATE (3)
  • xlrd.XL_CELL_BOOLEAN (4)
  • xlrd.XL_CELL_BLANK (5)

Important: Dates are tricky. xlrd stores them as a floating-point number (the number of days since 1900-01-01). You need to convert it to a Python datetime object.

Here's an example of handling different cell types:

import xlrd
from datetime import datetime, date
file_path = 'data_with_types.xlsx' # Assume this file has mixed data types
workbook = xlrd.open_workbook(file_path)
sheet = workbook.sheet_by_index(0)
# Iterate through rows, skipping the header
for row_idx in range(1, sheet.nrows):
    for col_idx in range(sheet.ncols):
        cell_value = sheet.cell_value(row_idx, col_idx)
        cell_type = sheet.cell_type(row_idx, col_idx)
        if cell_type == xlrd.XL_CELL_TEXT:
            print(f"Cell ({row_idx},{col_idx}): Text -> '{cell_value}'")
        elif cell_type == xlrd.XL_CELL_NUMBER:
            print(f"Cell ({row_idx},{col_idx}): Number -> {cell_value}")
        elif cell_type == xlrd.XL_CELL_DATE:
            # Convert Excel date number to a Python datetime object
            # The `xlrd.xldate_as_datetime` function is very helpful
            date_obj = xlrd.xldate_as_datetime(cell_value, workbook.datemode)
            print(f"Cell ({row_idx},{col_idx}): Date -> {date_obj} (Formatted: {date_obj.strftime('%Y-%m-%d')})")
        elif cell_type == xlrd.XL_CELL_BOOLEAN:
            print(f"Cell ({row_idx},{col_idx}): Boolean -> {bool(cell_value)}")
        elif cell_type == xlrd.XL_CELL_BLANK:
            print(f"Cell ({row_idx},{col_idx}): Blank")
        print("-")

Common Pitfalls and Best Practices

  1. xlrd Cannot Write to .xlsx Files: This is the most common mistake. If you try to use xlrd to create or modify an .xlsx file, it will fail silently or raise an error. Use openpyxl for writing.
  2. Date Handling: As shown above, dates are not handled out-of-the-box. You must use xlrd.xldate_as_datetime() to convert them correctly.
  3. File Not Found: Always wrap your file operations in a try...except FileNotFoundError block to handle cases where the Excel file doesn't exist.
  4. Sheet Name is Case-Sensitive: sheet_by_name('Sheet1') is different from sheet_by_name('sheet1'). Be precise with the name.
  5. Large Files: For very large Excel files, loading the entire workbook into memory can be slow. Use the on_demand=True parameter when opening the workbook. This allows you to load sheets one by one as needed, saving memory.
# Memory-efficient way for large files
workbook = xlrd.open_workbook('huge_file.xlsx', on_demand=True)
# You can't get all sheet names immediately with on_demand=True
# You have to access them by index first
sheet = workbook.sheet_by_index(0)
print(f"Loaded sheet: {sheet.name}")
# ... do work on the sheet ...
# When you're done with a sheet, you can release its memory
workbook.unload_sheet(0)

Alternatives to xlrd

While xlrd is great for reading, the modern Python ecosystem offers other powerful libraries.

Library Can Read .xls? Can Read .xlsx? Can Write? Notes
xlrd Fast and lightweight for reading. The classic choice.
openpyxl The modern standard. Slower than xlrd for reading but excellent for both reading and writing .xlsx files.
pandas The best choice for data analysis. Uses xlrd or openpyxl under the hood. df = pd.read_excel('file.xlsx') is all you need.

Example using pandas:

import pandas as pd
# This is the easiest way to read an Excel file into a data structure
# pandas will automatically use xlrd or openpyxl as needed.
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print("DataFrame from Excel:")
print(df)
# Accessing data is very intuitive
print("\nNames column:")
print(df['Name'])

Summary

  • Use xlrd when you only need to read data from Excel files and performance is a critical concern.
  • Use openpyxl when you need to both read and write .xlsx files.
  • Use pandas when your goal is data analysis, manipulation, or scientific computing. It's the most convenient and powerful tool for this purpose.
分享:
扫描分享到社交APP
上一篇
下一篇