杰瑞科技汇

How to Work Efficiently with Python DataFrames

The DataFrame is a fundamental concept in data analysis and manipulation in Python.


Let's break it down from the basics to more advanced usage.

What is a DataFrame?

At its core, a DataFrame is a two-dimensional, labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.

It's the most important and commonly used data structure in the Pandas library, which is the go-to library for data manipulation in Python.

Key Characteristics of a DataFrame:

  • Labeled Axes: Both rows and columns have labels (an "index" for rows, and "column names" for columns).
  • Heterogeneous Data: Columns can hold different data types (e.g., integers, floats, strings, dates).
  • Size-Mutable: You can add or remove columns and rows.
  • Handles Missing Data: It has built-in support for representing missing data (typically as NaN or NaT).
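
To make these characteristics concrete, here is a minimal sketch. The people table, its column names, and its row labels are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration
people = pd.DataFrame(
    {
        "name": ["Ann", "Ben", "Cara"],
        "age": [31, 45, 27],
        "height_m": [1.62, np.nan, 1.75],  # one missing value, stored as NaN
    },
    index=["a", "b", "c"],  # custom row labels instead of 0, 1, 2
)

# Labeled axes: row labels and column names
print(people.index.tolist())
print(people.columns.tolist())

# Heterogeneous data: each column carries its own dtype
print(people.dtypes)

# Missing data: count the NaN entries per column
print(people.isna().sum())

# Size-mutable: add a column in place, then build a copy without row "b"
people["is_adult"] = people["age"] >= 18
smaller = people.drop(index="b")
print(people.shape, smaller.shape)
```

Note that drop returns a new DataFrame and leaves people unchanged, while the column assignment modifies people in place.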

Why are DataFrames so Popular?

DataFrames make data analysis in Python incredibly easy and efficient. They provide a powerful set of tools to:

  • Load data from files like CSV, Excel, SQL databases, and JSON.
  • Clean data by handling missing values, filtering rows, and correcting data types.
  • Explore and summarize data with descriptive statistics (mean, median, std, etc.).
  • Transform and reshape data using grouping, pivoting, and merging.
  • Visualize data by integrating with libraries like Matplotlib and Seaborn.

How to Use DataFrames (with Pandas)

Here is a step-by-step guide with code examples.

Installation

First, you need to install the Pandas library. If you don't have it, open your terminal or command prompt and run:

pip install pandas

Creating a DataFrame

You can create a DataFrame from various sources, like a Python dictionary.

import pandas as pd
import numpy as np # Often used for creating sample data
# Create a dictionary to hold the data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 120000, 75000, 90000]
}
# Create the DataFrame from the dictionary
df = pd.DataFrame(data)
# Display the DataFrame
print(df)

Output:

      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago  120000
3    David   28      Houston   75000
4      Eva   32      Phoenix   90000
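
A dictionary of columns is only one possible source. As a rough sketch, the same constructor also accepts a list of row dictionaries or a 2-D NumPy array. The variable names and the x/y column names below are made up for the example.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_from_records = pd.DataFrame(records)

# From a 2-D NumPy array: column names are supplied separately
array = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(array, columns=["x", "y"])

print(df_from_records)
print(df_from_array)
```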

Basic DataFrame Operations

Here are the most common things you'll do with a DataFrame.

Viewing Data

# See the first 5 rows (default)
print(df.head())
# See the last 3 rows
print(df.tail(3))
# Get a concise summary of the DataFrame (data types, non-null values, memory usage)
# .info() prints its report directly and returns None, so it needs no print()
df.info()
# Get descriptive statistics for numerical columns
print(df.describe())

Selecting Data

# Select a single column (returns a Pandas Series)
ages = df['Age']
print(ages)
# Select multiple columns (returns a new DataFrame)
person_info = df[['Name', 'City']]
print(person_info)
# Select rows by label (index)
# .loc is primarily label-based indexing
first_person = df.loc[0]
print(first_person)
# Select rows by integer position
# .iloc is primarily integer-position based indexing
first_two_rows = df.iloc[0:2]
print(first_two_rows)
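
With the default 0, 1, 2, ... index, .loc[0] and .iloc[0] happen to return the same row, which can hide the difference between the two. The sketch below uses a hypothetical DataFrame with string row labels to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default integers
df_named = pd.DataFrame(
    {"Age": [25, 30, 35]},
    index=["alice", "bob", "charlie"],
)

by_label = df_named.loc["bob"]    # the row whose label is "bob"
by_position = df_named.iloc[1]    # the second row, whatever its label

# Slices behave differently:
# .loc includes the end label, .iloc excludes the end position
loc_slice = df_named.loc["alice":"bob"]
iloc_slice = df_named.iloc[0:1]

print(by_label)
print(by_position)
print(len(loc_slice), len(iloc_slice))
```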

Filtering Data (Conditional Selection)

This is one of the most powerful features. You can filter rows based on a condition.

# Get all people older than 30
older_than_30 = df[df['Age'] > 30]
print(older_than_30)
# Get people from New York with a salary greater than 75000
ny_high_earners = df[(df['City'] == 'New York') & (df['Salary'] > 75000)]
print(ny_high_earners)

Note: Use & for "AND" and | for "OR". You must wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.
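
The example above only combines conditions with "AND". Here is a small sketch of "OR", negation with ~, and a membership test with .isin(). It rebuilds a reduced copy of the sample data so that it runs on its own. The name df_people is chosen just for this example.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 28, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

# OR: younger than 28, or living in Chicago
young_or_chicago = df_people[(df_people["Age"] < 28) | (df_people["City"] == "Chicago")]

# NOT: everyone who does not live in New York
not_ny = df_people[~(df_people["City"] == "New York")]

# Membership: rows whose city is one of several values
in_two_cities = df_people[df_people["City"].isin(["Houston", "Phoenix"])]

print(young_or_chicago)
print(not_ny)
print(in_two_cities)
```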

Adding/Modifying Columns

# Add a new column for 'Bonus' (10% of salary)
df['Bonus'] = df['Salary'] * 0.10
# Add a derived column (salary in thousands, for readability); the original 'Salary' column is left unchanged
df['Salary (in thousands)'] = df['Salary'] / 1000
print(df)

Handling Missing Data

Real-world data is often messy. Let's introduce some missing values.

# Introduce a missing value
df.loc[2, 'Salary'] = np.nan
# Check for missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
# Fill missing values with a specific number (e.g., the mean salary)
mean_salary = df['Salary'].mean()
df_filled = df.fillna({'Salary': mean_salary})
print("\nDataFrame after filling missing values:")
print(df_filled)
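
Dropping every row that has any gap can discard more data than intended. As a hedged sketch, dropna accepts a subset of columns to check, and fillna accepts a different replacement per column. The df_gaps frame and the choice of the median for 'Age' and 0 for 'Salary' are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two different columns
df_gaps = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "Salary": [70000.0, 80000.0, np.nan],
})

# Drop a row only when 'Salary' is missing; a missing 'Age' is tolerated
has_salary = df_gaps.dropna(subset=["Salary"])

# Fill each column with its own replacement value
filled = df_gaps.fillna({"Age": df_gaps["Age"].median(), "Salary": 0})

print(has_salary)
print(filled)
```

Which replacement is appropriate depends on the analysis. Filling a salary with 0, for instance, will pull down any average computed afterwards.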

A Complete Workflow Example

Let's put it all together in a common scenario: loading data, cleaning it, and performing an analysis.

from io import StringIO
import pandas as pd
# 1. Load data from a CSV file
# (In practice you would read a file such as 'employees.csv';
# for this example, we'll create the CSV in memory.)
csv_data = """Name,Age,City,Salary
Alice,25,New York,70000
Bob,30,Los Angeles,80000
Charlie,35,Chicago,120000
David,28,Houston,75000
Eva,32,Phoenix,90000
Frank,40,New York,110000
Grace,29,Los Angeles,85000
"""
df = pd.read_csv(StringIO(csv_data))
print("--- Original DataFrame ---")
print(df)
# 2. Clean the data
# Let's say we found out Charlie's salary was a typo and should be 125000
df.loc[df['Name'] == 'Charlie', 'Salary'] = 125000
# 3. Analyze the data
# Calculate the average salary
average_salary = df['Salary'].mean()
print(f"\nAverage Salary: ${average_salary:,.2f}")
# Find the employee with the highest salary
highest_paid_employee = df.loc[df['Salary'].idxmax()]
print(f"\nHighest Paid Employee:\n{highest_paid_employee}")
# Group data by city and calculate the average salary per city
salary_by_city = df.groupby('City')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by City:")
print(salary_by_city)
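
The grouping step above computes a single statistic. If several statistics per group are needed, one option is named aggregation via .agg(). The sketch below uses a reduced, self-contained copy of the employee data. The output column names (avg_salary, max_salary, headcount) are chosen here for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Alice", "Bob", "Frank", "Grace"],
    "City": ["New York", "Los Angeles", "New York", "Los Angeles"],
    "Salary": [70000, 80000, 110000, 85000],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = employees.groupby("City").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Name", "count"),
)

print(summary)
```

The result is a DataFrame indexed by city, so a single value can be looked up with summary.loc['New York', 'avg_salary'].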

Key DataFrame Methods and Attributes

Here's a quick cheat sheet of essential commands:

Category      Method/Attribute             Description
Viewing       .head(n)                     First n rows.
              .tail(n)                     Last n rows.
              .info()                      Summary of DataFrame.
              .describe()                  Descriptive statistics.
              .shape                       Tuple of (rows, columns).
Selection     df['col']                    Select a single column.
              df[['col1', 'col2']]         Select multiple columns.
              df.loc[label]                Select by row label.
              df.iloc[position]            Select by row integer position.
Filtering     df[df['col'] > value]        Filter rows based on a condition.
Modification  df['new_col'] = ...          Add a new column.
              df.drop('col', axis=1)       Drop a column.
              df.drop(index)               Drop rows by index.
              df.sort_values('col')        Sort DataFrame by a column.
Grouping      df.groupby('col').agg(...)   Group data and perform aggregation.
Missing Data  .isnull()                    Check for missing values (returns True/False).
              .dropna()                    Drop rows with missing values.
              .fillna(value)               Fill missing values with a specific value.
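
Several cheat-sheet entries (.shape, .drop, .sort_values) do not appear in the earlier examples. The following is a minimal sketch on a small frame invented for this purpose. Each of these methods returns a new DataFrame by default and leaves the original untouched.

```python
import pandas as pd

df_small = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

without_city = df_small.drop("City", axis=1)   # drop a column
without_first = df_small.drop(0)               # drop the row labeled 0
oldest_first = df_small.sort_values("Age", ascending=False)

print(df_small.shape)   # (rows, columns) of the unchanged original
print(without_city)
print(without_first)
print(oldest_first)
```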

DataFrame vs. Other Structures

  • DataFrame vs. NumPy Array: NumPy arrays are powerful for numerical computation, but they are homogeneous (all elements share one type) and lack labeled axes. DataFrames are built on top of NumPy, allow a different type per column, and carry row and column labels, which makes them much better suited to tabular data.
  • DataFrame vs. Python List: A list is a simple one-dimensional collection. A DataFrame is a two-dimensional, labeled, and highly optimized structure for data analysis. Trying to perform data analysis on a list of lists is much more cumbersome and less efficient.
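
The homogeneity difference is easy to see in a short experiment. This is a sketch with made-up values: a NumPy array built from mixed inputs coerces everything to one common type, whereas a DataFrame keeps a separate dtype per column and can still hand a single column back to NumPy.

```python
import numpy as np
import pandas as pd

# Mixed inputs in one array are coerced to a single common type (here: strings)
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)

# A DataFrame keeps one dtype per column
table = pd.DataFrame({
    "count": [1, 2],
    "ratio": [0.5, 2.5],
    "label": ["a", "b"],
})
print(table.dtypes)

# A single numeric column converts back to a NumPy array for computation
values = table["ratio"].to_numpy()
print(values.sum())
```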

In summary, if you are working with structured, tabular data in Python, Pandas DataFrames are the essential tool you need to learn.
