杰瑞科技汇

Python如何用PivotTable做数据透视?

Of course! Using Python for pivot tables is a common and powerful task, primarily accomplished with the Pandas library. Pandas' pivot_table() function is a direct equivalent to the pivot table feature found in spreadsheet software like Excel or Google Sheets.

Python如何用PivotTable做数据透视?-图1
(图片来源网络,侵删)

Here’s a comprehensive guide covering everything from the basics to more advanced use cases.

The Core Concept: What is a Pivot Table?

A pivot table is a data summarization tool. It allows you to:

  • Reorganize (pivot) your data.
  • Aggregate (sum, average, count, etc.) your data based on specific criteria.
  • Analyze and gain insights from large datasets quickly.

Imagine a spreadsheet of sales data. A pivot table can instantly show you the total sales for each product in each region, without you having to write complex formulas.


Setup: Installing and Importing Libraries

First, you need to have Pandas installed. If you don't, you can install it using pip or conda.

Python如何用PivotTable做数据透视?-图2
(图片来源网络,侵删)
# Using pip
pip install pandas
# Using conda
conda install pandas

Then, in your Python script or notebook, import the necessary libraries.

import pandas as pd
import numpy as np # Often used for creating sample data

The pivot_table() Function Syntax

The key function is pd.pivot_table(). Here are its most important parameters:

  • data: The DataFrame you want to pivot.
  • values: The column(s) whose values you want to aggregate (e.g., 'sales', 'profit').
  • index: The column(s) that will become the new rows in the pivot table.
  • columns: The column(s) that will become the new columns in the pivot table.
  • aggfunc: The aggregation function to use (e.g., 'sum', 'mean', 'count', 'max'). You can also pass a list of functions.
  • fill_value: The value to use when the aggregation results in a missing value (NaN).
  • margins: If True, adds all rows/columns (subtotals/grand totals).
  • dropna: If True, excludes columns with all NaN values.

Step-by-Step Examples

Let's create a sample DataFrame to work with.

# Sample DataFrame
data = {
    'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Salesperson': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob', 'Charlie', 'David'],
    'Sales': [500, 450, 600, 700, 550, 480, 650, 520],
    'Units': [10, 9, 12, 14, 11, 9, 13, 10]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:

   Region Product Salesperson  Sales  Units
0    East       A      Alice    500     10
1    West       B        Bob    450      9
2    East       A    Charlie    600     12
3   South       C      David    700     14
4    West       B      Alice    550     11
5   South       A        Bob    480      9
6    East       C    Charlie    650     13
7    West       B      David    520     10

Example 1: Basic Pivot Table

Goal: Find the total Sales for each Region and Product.

  • values: 'Sales' (the numbers we want to sum)
  • index: 'Region' (the new rows)
  • columns: 'Product' (the new columns)
  • aggfunc: 'sum' (the operation to perform)
# Basic Pivot Table: Total Sales by Region and Product
pivot1 = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum'
)
print("\nPivot Table 1: Total Sales by Region and Product")
print(pivot1)

Result:

Pivot Table 1: Total Sales by Region and Product
Product     A     B      C
Region                    
East      1100   NaN   650
South      480   NaN   700
West        NaN  1520    NaN

Notice the NaN (Not a Number) values. This is because, for example, the 'East' region didn't sell any 'B' products.

Example 2: Handling Missing Values (fill_value)

Let's fill those NaNs with 0 to make the table cleaner.

# Pivot Table with fill_value=0
pivot2 = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum',
    fill_value=0
)
print("\nPivot Table 2: With fill_value=0")
print(pivot2)

Result:

Pivot Table 2: With fill_value=0
Product     A     B      C
Region                    
East      1100    0   650
South      480    0   700
West         0  1520     0

Example 3: Multiple Aggregation Functions

Goal: Find both the total Sales and the average Units sold.

  • values: Now we need a list: ['Sales', 'Units']
  • aggfunc: Now we need a list: ['sum', 'np.mean'] (using NumPy's mean function)
# Pivot Table with multiple values and aggfuncs
pivot3 = pd.pivot_table(
    df,
    values=['Sales', 'Units'],
    index='Region',
    columns='Product',
    aggfunc={'Sales': 'sum', 'Units': np.mean}, # Different aggs for different values
    fill_value=0
)
print("\nPivot Table 3: Sum of Sales and Mean of Units")
print(pivot3)

Result:

Pivot Table 3: Sum of Sales and Mean of Units
         Sales          Units          
Product       A     B      C         A    B     C
Region                                      
East       1100     0    650  11.000000  0.0  13.0
South       480     0    700   9.000000  0.0  14.0
West          0  1520      0  10.000000  10.0   0.0

Example 4: Adding Subtotals and Grand Totals (margins)

Goal: Get the same table as Example 2, but with subtotals for each region and a grand total.

# Pivot Table with margins (subtotals and grand total)
pivot4 = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum',
    fill_value=0,
    margins=True # This adds 'All' rows and columns
)
print("\nPivot Table 4: With Margins (Subtotals and Grand Total)")
print(pivot4)

Result:

Pivot Table 4: With Margins (Subtotals and Grand Total)
Product     A     B      C    All
Region                          
East      1100     0   650  1750
South      480     0   700  1180
West         0  1520     0  1520
All       1580  1520  1350  4450

The All row/column shows the total across all other categories.

Example 5: Using Multiple Index and Columns Levels

You can pass a list to index or columns to create a multi-level (hierarchical) pivot table.

Goal: Find total Sales by Region and Salesperson for each Product.

  • index: ['Region', 'Salesperson'] (hierarchical rows)
  • columns: ['Product']
# Pivot Table with multiple index levels
pivot5 = pd.pivot_table(
    df,
    values='Sales',
    index=['Region', 'Salesperson'],
    columns='Product',
    aggfunc='sum',
    fill_value=0
)
print("\nPivot Table 5: Multi-level Index")
print(pivot5)

Result:

Pivot Table 5: Multi-level Index
Product        A     B      C
Region Salesperson               
East   Alice     500     0     0
       Charlie   600     0   650
South  Bob       480     0     0
       David       0     0   700
West   Alice       0   550     0
       Bob         0   450     0
       David       0   520     0

Comparison with pivot() and groupby()

It's helpful to understand how pivot_table relates to other Pandas functions.

  • df.pivot(): This is for reshaping data, not for aggregation. It's a more direct analogue to Excel's "Pivot" feature, where you expect one value for each (index, column) pair. If you have multiple values, it will raise an error.

    # This would fail because there are multiple sales for Alice in the East region
    # df.pivot(index='Region', columns='Product', values='Sales')
    # ValueError: Index contains duplicate entries, cannot reshape
  • df.groupby(): This is the more fundamental "split-apply-combine" tool in Pandas. pivot_table is essentially a convenient, high-level wrapper around groupby. You can achieve the same result as a pivot table with groupby.

    # Achieving the same result as Example 1 using groupby
    grouped = df.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)
    print("\nEquivalent result using groupby:")
    print(grouped)

    This shows that pivot_table is just a more user-friendly way to perform a common groupby operation.


When to Use a Pivot Table

Use a pivot table when you need to:

  • Summarize large datasets into a more digestible format.
  • Quickly compare values across different categories.
  • Answer questions like "What were the total sales per quarter for each product line?" or "What was the average score for each student in each subject?".
  • Create a structured summary that is easy to read and share.

Summary of Key Functions

Function Purpose Primary Use Case
pd.pivot_table() Aggregation and Reshaping Summarizing data by calculating sums, means, counts, etc., across categories. This is the most common pivot table use case.
df.pivot() Reshaping Only Reformatting data from "long" to "wide" when you have exactly one value for each combination of index and column. No aggregation.
df.groupby() Aggregation The fundamental tool for splitting data into groups, applying a function, and combining the results. More flexible but can be more verbose for simple pivot-like tasks.
分享:
扫描分享到社交APP
上一篇
下一篇