
How can Python DataFrames be manipulated and processed efficiently?

Let's take a deep dive into Python's DataFrame. It's one of the most fundamental and powerful tools for data analysis, and it sits at the heart of the Pandas library.


What is a DataFrame?

Think of a DataFrame as a spreadsheet or a SQL table in Python. It's a two-dimensional, labeled data structure with columns of potentially different types.

You can visualize it like this:

Name     Age  City      Country
Alice    24   New York  USA
Bob      27   London    UK
Charlie  22   Paris     France

Key features of a DataFrame:

  • 2D Structure: It has rows and columns.
  • Labeled Axes: Both rows and columns have labels (an index for rows, and column names).
  • Heterogeneous Data: Columns can contain different data types (e.g., integers, floats, strings, dates).
  • Size-Mutable: You can add or remove columns and rows.
  • Rich in Functionality: It comes with a massive set of built-in functions for data manipulation, filtering, grouping, and analysis.
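These properties can be checked directly in code. A minimal sketch (the column names and values here are illustrative, not from the examples below):

```python
import pandas as pd

# Heterogeneous data: columns of different types coexist in one DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],    # strings -> object dtype
    'Age': [24, 27],             # integers -> int64
    'Score': [88.5, 91.0],       # floats -> float64
})
print(df.dtypes)

# Size-mutable: columns can be added and removed
df['Passed'] = df['Score'] > 90      # new boolean column
df = df.drop(columns=['Score'])      # remove a column
print(df.columns.tolist())           # ['Name', 'Age', 'Passed']
```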

Getting Started: Installation and Import

First, you need to have the Pandas library installed. If you don't, open your terminal or command prompt and run:

pip install pandas

Now, in your Python script or Jupyter Notebook, you can import it. The standard convention is to import it with the alias pd.

import pandas as pd
import numpy as np # Often used alongside Pandas

Creating a DataFrame

You can create a DataFrame in several ways. The most common is from a dictionary.

From a Dictionary of Lists

Each key in the dictionary becomes a column name, and the corresponding list becomes the column's data.

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
    'Country': ['USA', 'UK', 'France', 'Japan']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City Country
0    Alice   24  New York     USA
1      Bob   27    London      UK
2  Charlie   22     Paris  France
3    David   32     Tokyo   Japan

From a List of Dictionaries

This is useful when your data is structured as a collection of records.

data_list = [
    {'Name': 'Alice', 'Age': 24, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 27, 'City': 'London'},
    {'Name': 'Charlie', 'Age': 22, 'City': 'Paris'}
]
df_from_list = pd.DataFrame(data_list)
print(df_from_list)

From a CSV File (Most Common in Practice)

This is how you'll usually load data. Pandas makes it incredibly easy.

# Assume you have a file 'data.csv' with the same data
# df = pd.read_csv('data.csv')
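To try this without a file on disk, note that read_csv accepts any file-like object, so you can feed it an in-memory string. A runnable sketch (the CSV text here is illustrative):

```python
import io
import pandas as pd

# Simulate the contents of 'data.csv' with an in-memory buffer
csv_text = """Name,Age,City,Country
Alice,24,New York,USA
Bob,27,London,UK
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 4)
```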

Exploring and Inspecting a DataFrame

Once you have a DataFrame, the first thing you want to do is understand its contents.

# View the first 5 rows (default)
print(df.head())
# View the last 3 rows
print(df.tail(3))
# Get a concise summary of the DataFrame
# Shows column names, non-null counts, and data types
print(df.info())
# Get descriptive statistics for numerical columns
print(df.describe())
# Get the dimensions (rows, columns) of the DataFrame
print(df.shape) # Output: (4, 4)
# Get the column names
print(df.columns)
# Get the row index labels
print(df.index)

Selecting Data (Indexing and Slicing)

This is a core operation. Pandas offers several ways to select data.

Selecting a Single Column

This returns a Pandas Series (a 1D labeled array).

ages = df['Age']
print(ages)
print(type(ages)) # <class 'pandas.core.series.Series'>

Selecting Multiple Columns

Pass a list of column names. This returns a new DataFrame.

subset = df[['Name', 'City']]
print(subset)

Selecting Rows by Label (.loc)

.loc is label-based indexing. You use it to select data using row and column labels.

# Select the row with index label 1
print(df.loc[1])
# Select a specific cell: row '1', column 'City'
print(df.loc[1, 'City'])
# Select a slice of rows and columns
print(df.loc[0:2, ['Name', 'City']])

Note: The slice 0:2 with .loc is inclusive of the stop index (2).

Selecting Rows by Integer Position (.iloc)

.iloc is position-based indexing. It works like standard Python list indexing (0-based, and the stop index is exclusive).

# Select the row at integer position 1
print(df.iloc[1])
# Select a specific cell by position: row 1, column 2
print(df.iloc[1, 2])
# Select a slice of rows and columns by position
print(df.iloc[0:2, 0:2])

Note: The slice 0:2 with .iloc is exclusive of the stop index (it gets rows 0 and 1).
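The two notes above can be verified side by side. A small sketch using the same Name/Age data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
})

# .loc slices by label and INCLUDES the stop label
assert len(df.loc[0:2]) == 3   # rows 0, 1 and 2

# .iloc slices by position and EXCLUDES the stop position
assert len(df.iloc[0:2]) == 2  # rows 0 and 1
```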


Filtering Data (Boolean Indexing)

This is how you select rows based on a condition. It's one of the most powerful features.

# Get all people older than 25
older_than_25 = df[df['Age'] > 25]
print(older_than_25)
# Combine multiple conditions using & (and) or | (or)
# IMPORTANT: Use parentheses () around each condition!
adults_in_london = df[(df['Age'] >= 25) & (df['City'] == 'London')]
print(adults_in_london)
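Two more filtering idioms worth knowing are .isin() for membership tests and ~ for negating a condition. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

# Keep rows whose City is in a given set
europe = df[df['City'].isin(['London', 'Paris'])]
print(europe['Name'].tolist())      # ['Bob', 'Charlie']

# ~ negates a boolean condition
outside = df[~df['City'].isin(['London', 'Paris'])]
print(outside['Name'].tolist())     # ['Alice', 'David']
```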

Adding and Modifying Columns

It's very easy to add new columns or modify existing ones.

# Add a new column based on an existing one
df['Age_in_5_Years'] = df['Age'] + 5
# Add a new column with a constant value
df['Status'] = 'Active'
# Modify a column using a function (e.g., make city names uppercase)
df['City'] = df['City'].str.upper()
print(df)
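For columns whose value depends on a condition, numpy.where is a common vectorized pattern. A sketch (the threshold and labels are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [24, 27, 22, 32]})

# Pick one of two labels per row, depending on the condition
df['Group'] = np.where(df['Age'] >= 25, 'Senior', 'Junior')
print(df['Group'].tolist())  # ['Junior', 'Senior', 'Junior', 'Senior']
```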

Handling Missing Data

In real-world data, you'll often encounter missing values, represented as NaN (Not a Number).

# Create a DataFrame with missing values
data_with_nan = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df_nan = pd.DataFrame(data_with_nan)
# Check for missing values
print(df_nan.isnull())
# Drop rows with any missing values
df_dropped = df_nan.dropna()
print("\nDropped rows with NaN:")
print(df_dropped)
# Fill missing values with a specific number (e.g., 0)
df_filled = df_nan.fillna(0)
print("\nFilled NaN with 0:")
print(df_filled)
# Fill missing values with the mean of the column
df_filled_mean = df_nan.fillna(df_nan.mean())
print("\nFilled NaN with column mean:")
print(df_filled_mean)
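fillna also accepts a dict mapping column names to fill values, so each column can get its own strategy. A sketch reusing the same df_nan:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'A': [1, 2, np.nan],
                       'B': [5, np.nan, np.nan],
                       'C': [1, 2, 3]})

# Different fill value per column: 0 for A, the column mean for B
filled = df_nan.fillna({'A': 0, 'B': df_nan['B'].mean()})
print(filled)
```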

Grouping and Aggregating Data (.groupby)

This is equivalent to SQL's GROUP BY. It allows you to split your data into groups based on some criteria, apply a function to each group, and then combine the results.

Let's add a 'Department' column to our original DataFrame to demonstrate.

df['Department'] = ['HR', 'Engineering', 'Engineering', 'Marketing']
# Group by 'Department' and calculate the mean age for each department
avg_age_by_dept = df.groupby('Department')['Age'].mean()
print("\nAverage age by department:")
print(avg_age_by_dept)
# You can also use the .agg() method for multiple aggregations
dept_stats = df.groupby('Department').agg(
    Avg_Age=('Age', 'mean'),
    Max_Age=('Age', 'max'),
    Employee_Count=('Name', 'count')
)
print("\nDepartment statistics:")
print(dept_stats)
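Note that groupby results are indexed by the group keys. Calling .reset_index() turns the keys back into an ordinary column, which is often what you want before further processing. A sketch with the same departments:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Engineering', 'Marketing'],
    'Age': [24, 27, 22, 32],
})

# 'Department' becomes a regular column again, not the index
avg = df.groupby('Department')['Age'].mean().reset_index()
print(avg)
```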

Combining DataFrames

You can combine DataFrames using concat (for stacking) or merge (for joining, like SQL).

pd.concat() - Stacking

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
# Concatenate along rows (axis=0)
concatenated_rows = pd.concat([df1, df2])
print(concatenated_rows)
# Concatenate along columns (axis=1)
df3 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})
concatenated_cols = pd.concat([df1, df3], axis=1)
print(concatenated_cols)
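One gotcha when stacking rows: each frame keeps its original index, so labels repeat (0, 1, 0, 1). Pass ignore_index=True to get a fresh 0..n-1 index. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']})
df2 = pd.DataFrame({'A': ['A2', 'A3']})

# Without ignore_index, the result's index would be [0, 1, 0, 1]
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```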

pd.merge() - Joining

This is for combining DataFrames based on a common key or keys.

# Left DataFrame
left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                        'A': ['A0', 'A1', 'A2']})
# Right DataFrame
right_df = pd.DataFrame({'key': ['K0', 'K1', 'K3'],
                         'B': ['B0', 'B1', 'B3']})
# Inner join (only keeps keys that exist in BOTH DataFrames)
inner_merged = pd.merge(left_df, right_df, on='key', how='inner')
print("\nInner Merge:")
print(inner_merged)
# Left join (keeps all keys from the LEFT DataFrame)
left_merged = pd.merge(left_df, right_df, on='key', how='left')
print("\nLeft Merge:")
print(left_merged)
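An outer join keeps keys from both sides, filling gaps with NaN; passing indicator=True adds a _merge column recording which side each row came from. A sketch with the same frames:

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                        'A': ['A0', 'A1', 'A2']})
right_df = pd.DataFrame({'key': ['K0', 'K1', 'K3'],
                         'B': ['B0', 'B1', 'B3']})

# K2 only exists on the left, K3 only on the right
outer = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)
print(outer)
```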

Summary of Key Concepts

Concept       | Common Method(s)                               | Description
Creation      | pd.DataFrame(), pd.read_csv()                  | Create from dictionaries, lists, or load from files.
Inspection    | .head(), .tail(), .info(), .describe()         | Understand the data's structure and content.
Selection     | df['col'], df[['col1', 'col2']], .loc, .iloc   | Select columns and rows.
Filtering     | df[df['col'] > value]                          | Select rows based on conditions.
Modification  | df['new_col'] = ...                            | Add or change columns.
Missing Data  | .isnull(), .dropna(), .fillna()                | Handle NaN values.
Grouping      | .groupby().agg()                               | Split data into groups and apply aggregate functions.
Combining     | pd.concat(), pd.merge()                        | Join or stack DataFrames together.

This covers the absolute essentials of working with Pandas DataFrames. Mastering these operations will allow you to perform a vast range of data analysis tasks efficiently.
