How to Work with Python DataFrames Efficiently? - 杰瑞科技汇

DataFrames are a fundamental concept in data analysis and manipulation in Python.

Let's break it down from the basics to more advanced usage.

What is a DataFrame?

At its core, a DataFrame is a two-dimensional, labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.

It's the most important and commonly used data structure in the Pandas library, which is the go-to library for data manipulation in Python.

Key Characteristics of a DataFrame:

Labeled Axes: Both rows and columns have labels (an "index" for rows, and "column names" for columns).
Heterogeneous Data: Columns can hold different data types (e.g., integers, floats, strings, dates).
Size-Mutable: You can add or remove columns and rows.
Handles Missing Data: It has built-in support for representing missing data (typically as NaN or NaT).
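
A minimal sketch of these four characteristics, using a small made-up table (the variable names, column names, and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "height_m": [1.65, np.nan, 1.80],  # one missing value
    },
    index=["a", "b", "c"],  # custom row labels
)

# Labeled axes: row labels (index) and column names
row_labels = list(people.index)
column_names = list(people.columns)

# Heterogeneous data: each column has its own dtype
dtypes = people.dtypes.astype(str).to_dict()

# Size-mutable: add a column, then drop a row
people["is_adult"] = people["age"] >= 18
smaller = people.drop(index="c")

# Missing data: count NaN values per column
missing_counts = people.isna().sum()

print(row_labels, column_names)
print(dtypes)
print(smaller.shape)
print(missing_counts)
```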

Why are DataFrames so Popular?

DataFrames cover most of a typical data analysis workflow in Python. They provide tools to:

Load data from files like CSV, Excel, SQL databases, and JSON.
Clean data by handling missing values, filtering rows, and correcting data types.
Explore and summarize data with descriptive statistics (mean, median, std, etc.).
Transform and reshape data using grouping, pivoting, and merging.
Visualize data by integrating with libraries like Matplotlib and Seaborn.
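
Merging and pivoting, two of the transformations listed above, can be sketched as follows. The tables and column names are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical tables for illustration
employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "dept_id": [1, 2, 1],
})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "dept_name": ["Engineering", "Sales"],
})

# Merge: SQL-style left join on the shared key column
merged = employees.merge(departments, on="dept_id", how="left")

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 110],
})

# Pivot: one row per region, one column per quarter
wide = sales.pivot_table(
    index="region", columns="quarter", values="revenue", aggfunc="sum"
)

print(merged)
print(wide)
```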

How to Use DataFrames (with Pandas)

Here is a step-by-step guide with code examples.

Installation

First, you need to install the Pandas library. If you don't have it, open your terminal or command prompt and run:

pip install pandas

Creating a DataFrame

You can create a DataFrame from various sources, like a Python dictionary.

import pandas as pd
import numpy as np # Often used for creating sample data
# Create a dictionary to hold the data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 120000, 75000, 90000]
}
# Create the DataFrame from the dictionary
df = pd.DataFrame(data)
# Display the DataFrame
print(df)

Output:

      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago  120000
3    David   28      Houston   75000
4      Eva   32      Phoenix   90000
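
A dictionary is only one possible source. As a rough sketch, the same constructor also accepts a list of row dictionaries or a 2-D NumPy array (the variable names below are illustrative):

```python
import numpy as np
import pandas as pd

# From a list of dictionaries (one dict per row)
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_records = pd.DataFrame(records)

# From a 2-D NumPy array, supplying column names explicitly
array = np.array([[1, 2], [3, 4], [5, 6]])
df_array = pd.DataFrame(array, columns=["x", "y"])

print(df_records)
print(df_array)
```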

Basic DataFrame Operations

Here are the most common things you'll do with a DataFrame.

Viewing Data

# See the first 5 rows (default)
print(df.head())
# See the last 3 rows
print(df.tail(3))
# Get a concise summary of the DataFrame (data types, non-null values, memory usage)
# info() prints the summary itself and returns None, so don't wrap it in print()
df.info()
# Get descriptive statistics for numerical columns
print(df.describe())

Selecting Data

# Select a single column (returns a Pandas Series)
ages = df['Age']
print(ages)
# Select multiple columns (returns a new DataFrame)
person_info = df[['Name', 'City']]
print(person_info)
# Select rows by label (index)
# .loc is primarily label-based indexing
first_person = df.loc[0]
print(first_person)
# Select rows by integer position
# .iloc is primarily integer-position based indexing
first_two_rows = df.iloc[0:2]
print(first_two_rows)
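
With the default 0, 1, 2, ... index, .loc and .iloc can look interchangeable. A small sketch with string row labels (hypothetical data) makes the difference visible:

```python
import pandas as pd

# Hypothetical data with string row labels
df_labeled = pd.DataFrame(
    {"Age": [25, 30, 35]},
    index=["alice", "bob", "charlie"],
)

# .loc uses labels; a label slice includes BOTH endpoints
by_label = df_labeled.loc["alice":"bob"]

# .iloc uses integer positions; the stop position is excluded
by_position = df_labeled.iloc[0:1]

# Select a single cell by row label and column name
bob_age = df_labeled.loc["bob", "Age"]

print(by_label)
print(by_position)
print(bob_age)
```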

Filtering Data (Conditional Selection)

This is one of the most powerful features. You can filter rows based on a condition.

# Get all people older than 30
older_than_30 = df[df['Age'] > 30]
print(older_than_30)
# Get people from New York with a salary greater than 75000
ny_high_earners = df[(df['City'] == 'New York') & (df['Salary'] > 75000)]
print(ny_high_earners)

Note: Use & for "AND" and | for "OR". You must wrap each condition in parentheses ().
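
The "OR" case, along with negation (~) and membership tests (isin), can be sketched like this. The data mirrors the earlier example, and the variable names are illustrative:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 28, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

# OR: younger than 28, or living in Chicago
young_or_chicago = people[(people["Age"] < 28) | (people["City"] == "Chicago")]

# NOT: everyone who does not live in New York
not_ny = people[~(people["City"] == "New York")]

# isin: membership test against several values at once
coastal = people[people["City"].isin(["New York", "Los Angeles"])]

print(young_or_chicago)
print(not_ny)
print(coastal)
```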

Adding/Modifying Columns

# Add a new column for 'Bonus' (10% of salary)
df['Bonus'] = df['Salary'] * 0.10
# Add a derived column (salary in thousands for readability);
# the original 'Salary' column is left unchanged
df['Salary (in thousands)'] = df['Salary'] / 1000
print(df)
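
Changing an existing column in place works the same way as adding one: assign to the existing name. A minimal sketch, using a hypothetical staff table and an assumed 5% raise:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [70000, 80000, 120000],
})

# Overwrite an existing column: apply a 5% raise to everyone
staff["Salary"] = staff["Salary"] * 1.05

# Conditional update: flag only the rows that match a condition
staff["Senior"] = False
staff.loc[staff["Salary"] > 100000, "Senior"] = True

# Rename a column (returns a new DataFrame)
staff = staff.rename(columns={"Salary": "Salary_USD"})

print(staff)
```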

Handling Missing Data

Real-world data is often messy. Let's introduce some missing values.

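# Introduce a missing value
# NaN is a float, so convert the integer column first; recent pandas versions
# warn when NaN is written into an integer column
df['Salary'] = df['Salary'].astype(float)
df.loc[2, 'Salary'] = np.nan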
# Check for missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
# Fill missing values with a specific number (e.g., the mean salary)
mean_salary = df['Salary'].mean()
df_filled = df.fillna({'Salary': mean_salary})
print("\nDataFrame after filling missing values:")
print(df_filled)
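
Dropping or filling everything at once is often too coarse. A sketch of more targeted options, using hypothetical sensor readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps in two columns
readings = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "temp": [20.0, np.nan, 22.0, np.nan],
    "note": ["ok", "ok", None, "ok"],
})

# Drop rows only when a specific column is missing
has_temp = readings.dropna(subset=["temp"])

# Fill different columns with different values
filled = readings.fillna({
    "temp": readings["temp"].median(),
    "note": "unknown",
})

# Forward-fill: carry the last known value down the column
carried = readings["temp"].ffill()

print(has_temp)
print(filled)
print(carried)
```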

A Complete Workflow Example

Let's put it all together in a common scenario: loading data, cleaning it, and performing an analysis.

from io import StringIO
import pandas as pd
# 1. Load data from a CSV file
# (With a file on disk, you would call pd.read_csv('employees.csv') instead.)
# For this example, we'll build the CSV in memory.
csv_data = """Name,Age,City,Salary
Alice,25,New York,70000
Bob,30,Los Angeles,80000
Charlie,35,Chicago,120000
David,28,Houston,75000
Eva,32,Phoenix,90000
Frank,40,New York,110000
Grace,29,Los Angeles,85000
"""
df = pd.read_csv(StringIO(csv_data))
print("--- Original DataFrame ---")
print(df)
# 2. Clean the data
# Let's say we found out Charlie's salary was a typo and should be 125000
df.loc[df['Name'] == 'Charlie', 'Salary'] = 125000
# 3. Analyze the data
# Calculate the average salary
average_salary = df['Salary'].mean()
print(f"\nAverage Salary: ${average_salary:,.2f}")
# Find the employee with the highest salary
highest_paid_employee = df.loc[df['Salary'].idxmax()]
print(f"\nHighest Paid Employee:\n{highest_paid_employee}")
# Group data by city and calculate the average salary per city
salary_by_city = df.groupby('City')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by City:")
print(salary_by_city)

Key DataFrame Methods and Attributes

Here's a quick cheat sheet of essential commands:

Category	Method/Attribute	Description
Viewing	`.head(n)`	First `n` rows.
	`.tail(n)`	Last `n` rows.
	`.info()`	Summary of DataFrame.
	`.describe()`	Descriptive statistics.
	`.shape`	Tuple of (rows, columns).
Selection	`df['col']`	Select a single column.
	`df[['col1', 'col2']]`	Select multiple columns.
	`df.loc[label]`	Select by row label.
	`df.iloc[position]`	Select by row integer position.
Filtering	`df[df['col'] > value]`	Filter rows based on a condition.
Modification	`df['new_col'] = ...`	Add a new column.
	`df.drop('col', axis=1)`	Drop a column.
	`df.drop(index)`	Drop rows by index.
	`df.sort_values('col')`	Sort DataFrame by a column.
Grouping	`df.groupby('col').agg(...)`	Group data and perform aggregation.
Missing Data	`.isnull()`	Check for missing values (returns True/False).
	`.dropna()`	Drop rows with missing values.
	`.fillna(value)`	Fill missing values with a specific value.
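
A few of the cheat-sheet entries that have no example elsewhere in this guide (agg, sort_values, drop) can be sketched together. The table below is a hypothetical subset of the employee data:

```python
import pandas as pd

df_emp = pd.DataFrame({
    "Name": ["Alice", "Frank", "Bob", "Grace"],
    "City": ["New York", "New York", "Los Angeles", "Los Angeles"],
    "Salary": [70000, 110000, 80000, 85000],
})

# Group and aggregate: several statistics per city in one call
# (named aggregation: output_column=(input_column, function))
summary = df_emp.groupby("City").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Name", "count"),
)

# Sort by a column, highest salary first
ranked = df_emp.sort_values("Salary", ascending=False)

# Drop a column (returns a new DataFrame; df_emp is unchanged)
without_city = df_emp.drop("City", axis=1)

print(summary)
print(ranked)
print(without_city)
```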

DataFrame vs. Other Structures

DataFrame vs. NumPy Array: NumPy arrays are powerful for numerical computation but are homogeneous (all elements must be the same type) and lack labeled axes. DataFrames are heterogeneous, have labels, and are built on top of NumPy, making them much more suitable for tabular data.
DataFrame vs. Python List: A list is a simple one-dimensional collection. A DataFrame is a two-dimensional, labeled, and highly optimized structure for data analysis. Trying to perform data analysis on a list of lists is much more cumbersome and less efficient.
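
The contrast can be seen by loading the same rows into each structure. This is a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

rows = [["Alice", 25], ["Bob", 30]]

# NumPy array: mixed input is coerced to a single dtype (here, strings)
arr = np.array(rows)

# DataFrame: each column keeps its own dtype and gets a name
df_rows = pd.DataFrame(rows, columns=["Name", "Age"])

# List of lists: the mean age needs manual positional indexing
mean_from_list = sum(row[1] for row in rows) / len(rows)

# DataFrame: the same calculation by column name
mean_from_df = df_rows["Age"].mean()

print(arr.dtype)
print(df_rows.dtypes)
print(mean_from_list, mean_from_df)
```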

In summary, if you are working with structured, tabular data in Python, Pandas DataFrames are the essential tool you need to learn.
