杰瑞科技汇

Python Pandas如何高效读取FTP文件数据?

Of course! Using Python with Pandas to interact with an FTP server is a common task for downloading data files (like CSVs or Excel files) for analysis. The process involves two main steps:

Python Pandas如何高效读取FTP文件数据?-图1
(图片来源网络,侵删)
  1. Connect to the FTP server and download the file. This is done using an FTP client library. The most standard one is Python's built-in ftplib.
  2. Read the downloaded file into a Pandas DataFrame. This is done using pd.read_csv(), pd.read_excel(), etc.

Here’s a complete guide, from basic to more advanced scenarios.


Step 1: Prerequisites

You need to have pandas installed. If you don't, you can install it via pip:

pip install pandas

You don't need to install ftplib as it comes pre-installed with Python.


Step 2: Basic Workflow - Download a CSV and Read it into Pandas

This is the most straightforward scenario. We'll connect to a public FTP server, download a file, and then load it.

Python Pandas如何高效读取FTP文件数据?-图2
(图片来源网络,侵删)

Let's use a sample public FTP server for this example. The process will be:

  1. Import necessary libraries.
  2. Connect to the FTP server.
  3. Download the file to a temporary location in memory or to your disk.
  4. Close the FTP connection.
  5. Read the file from that location into a Pandas DataFrame.
  6. Clean up (if you downloaded to a temporary file).

Method A: Download to a Temporary File (Recommended for large files)

This approach is memory-efficient because it doesn't load the entire file into RAM before Pandas starts reading it.

import pandas as pd
import ftplib
import io # Used for in-memory handling
import os # Used for file operations
# --- Configuration ---
FTP_HOST = "ftp.example.com" # Replace with a real FTP host
FTP_USER = "anonymous"
FTP_PASS = "anonymous@example.com"
FILE_TO_DOWNLOAD = "data/sales_report.csv"
LOCAL_PATH = "sales_report_downloaded.csv"
# --- 1. Connect to FTP Server ---
try:
    print(f"Connecting to {FTP_HOST}...")
    ftp = ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS)
    print("Connection successful.")
    ftp.cwd(os.path.dirname(FILE_TO_DOWNLOAD)) # Change to the directory of the file
    # --- 2. Download the file ---
    # Get the file in binary mode and write it to a local file
    with open(LOCAL_PATH, 'wb') as local_file:
        ftp.retrbinary(f"RETR {os.path.basename(FILE_TO_DOWNLOAD)}", local_file.write)
    print(f"File '{FILE_TO_DOWNLOAD}' downloaded successfully as '{LOCAL_PATH}'.")
    # --- 3. Read the downloaded file into a Pandas DataFrame ---
    df = pd.read_csv(LOCAL_PATH)
    print("\nDataFrame Head:")
    print(df.head())
except ftplib.all_errors as e:
    print(f"FTP error: {e}")
except FileNotFoundError:
    print(f"Error: The file '{FILE_TO_DOWNLOAD}' was not found on the server.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    # --- 4. Close the FTP connection ---
    if 'ftp' in locals() and ftp.sock is not None:
        ftp.quit()
        print("\nFTP connection closed.")
    # --- 5. Clean up the downloaded file ---
    if os.path.exists(LOCAL_PATH):
        os.remove(LOCAL_PATH)
        print(f"Temporary file '{LOCAL_PATH}' deleted.")

Method B: Download into Memory (Great for small files)

This method avoids creating a temporary file on your disk. The entire file content is held in a BytesIO buffer, which Pandas can read directly.

import pandas as pd
import ftplib
import io
# --- Configuration ---
FTP_HOST = "ftp.example.com" # Replace with a real FTP host
FTP_USER = "anonymous"
FTP_PASS = "anonymous@example.com"
FILE_TO_DOWNLOAD = "data/sales_report.csv"
# --- 1. Connect to FTP Server ---
try:
    print(f"Connecting to {FTP_HOST}...")
    ftp = ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS)
    print("Connection successful.")
    ftp.cwd(os.path.dirname(FILE_TO_DOWNLOAD))
    # --- 2. Download the file into memory ---
    # Create an in-memory binary buffer
    memory_file = io.BytesIO()
    # Use retrbinary to write the file content directly into the buffer
    ftp.retrbinary(f"RETR {os.path.basename(FILE_TO_DOWNLOAD)}", memory_file.write)
    print(f"File '{FILE_TO_DOWNLOAD}' downloaded into memory.")
    # --- 3. Read the in-memory file into a Pandas DataFrame ---
    # IMPORTANT: Reset the buffer's position to the beginning before reading
    memory_file.seek(0)
    # Pandas can read directly from the in-memory buffer
    df = pd.read_csv(memory_file)
    print("\nDataFrame Head:")
    print(df.head())
except ftplib.all_errors as e:
    print(f"FTP error: {e}")
except FileNotFoundError:
    print(f"Error: The file '{FILE_TO_DOWNLOAD}' was not found on the server.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    # --- 4. Close the FTP connection ---
    if 'ftp' in locals() and ftp.sock is not None:
        ftp.quit()
        print("\nFTP connection closed.")

Step 3: Advanced Scenarios

Scenario 1: Listing Files on the FTP Server

Before downloading, you might want to see what files are available in a directory.

import ftplib
FTP_HOST = "ftp.example.com"
FTP_USER = "anonymous"
FTP_PASS = "anonymous@example.com"
DIRECTORY_PATH = "/pub/data"
try:
    ftp = ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS)
    ftp.cwd(DIRECTORY_PATH)
    print(f"Files in directory '{DIRECTORY_PATH}':")
    # dir() returns a list of filenames in the current directory
    file_list = ftp.dir()
    print(file_list)
    # A more structured way to get the list
    # lines = []
    # ftp.retrlines('LIST', lines.append)
    # for line in lines:
    #     print(line)
except ftplib.all_errors as e:
    print(f"FTP error: {e}")
finally:
    if 'ftp' in locals() and ftp.sock is not None:
        ftp.quit()

Scenario 2: Downloading Multiple Files

You can easily extend the logic to download all files of a certain type (e.g., all .csv files) from a directory.

import pandas as pd
import ftplib
import os
FTP_HOST = "ftp.example.com"
FTP_USER = "anonymous"
FTP_PASS = "anonymous@example.com"
FTP_DIR = "/pub/data"
LOCAL_DOWNLOAD_DIR = "ftp_downloads"
FILE_EXTENSION = ".csv"
# Create a local directory to save files
if not os.path.exists(LOCAL_DOWNLOAD_DIR):
    os.makedirs(LOCAL_DOWNLOAD_DIR)
try:
    ftp = ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS)
    ftp.cwd(FTP_DIR)
    # Get a list of all files in the directory
    files = ftp.nlst() # nlst() is like 'ls' and returns a list of filenames
    csv_files_to_download = [f for f in files if f.endswith(FILE_EXTENSION)]
    if not csv_files_to_download:
        print(f"No {FILE_EXTENSION} files found in {FTP_DIR}.")
    else:
        print(f"Found {len(csv_files_to_download)} CSV files. Starting download...")
        for file_name in csv_files_to_download:
            local_path = os.path.join(LOCAL_DOWNLOAD_DIR, file_name)
            print(f"Downloading {file_name}...")
            with open(local_path, 'wb') as local_file:
                ftp.retrbinary(f"RETR {file_name}", local_file.write)
            print(f"Successfully downloaded to {local_path}")
            # Now read it into a DataFrame (example for the first file)
            if file_name == csv_files_to_download[0]:
                df = pd.read_csv(local_path)
                print("\n--- First File DataFrame Info ---")
                print(df.info())
except ftplib.all_errors as e:
    print(f"FTP error: {e}")
finally:
    if 'ftp' in locals() and ftp.sock is not None:
        ftp.quit()
        print("\nFTP connection closed.")

Important Considerations

  • Security: For production environments, avoid using anonymous FTP. Use credentials with the necessary permissions and prefer SFTP (SSH File Transfer Protocol) or FTPS (FTP over SSL/TLS) for encrypted connections. Libraries like paramiko (for SFTP) are excellent choices for secure transfers.
  • Error Handling: Network connections can be unreliable. Always wrap your FTP logic in try...except blocks to handle connection errors, timeouts, and file-not-found errors gracefully.
  • Performance: For very large files, downloading to a temporary file (Method A) is generally better for memory management than loading the entire file into memory (Method B).
  • Context Managers: You can use with ftplib.FTP(...) as ftp: for automatic connection closing, which is a cleaner pattern than a try...finally block.
分享:
扫描分享到社交APP
上一篇
下一篇