Python如何创建交叉表？-杰瑞科技汇

Of course! In Python, a "crosstable" is most commonly known as a contingency table or a cross-tabulation. It's a table that displays the frequency distribution of variables. It's used to see how different categorical variables interact with each other.

（图片来源网络，侵删）

The most popular and powerful library for creating crosstabs in Python is Pandas. The pandas.crosstab() function is specifically designed for this.

Here's a comprehensive guide, from the basics to more advanced examples.

Setup: Sample Data

First, let's create some sample data to work with. We'll use a Pandas DataFrame.

import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'AgeGroup': ['18-25', '26-35', '18-25', '36-45', '26-35', '18-25', '36-45', '18-25', '26-35', '36-45'],
    'City': ['New York', 'London', 'New York', 'Paris', 'London', 'New York', 'Paris', 'London', 'New York', 'Paris'],
    'Purchased': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:

（图片来源网络，侵删）

   Gender AgeGroup     City Purchased
0    Male    18-25  New York       Yes
1  Female    26-35    London        No
2  Female    18-25  New York       Yes
3    Male    36-45     Paris       Yes
4  Female    26-35    London        No
5    Male    18-25  New York        No
6    Male    36-45     Paris       Yes
7  Female    18-25    London       Yes
8  Female    26-35    London        No
9    Male    36-45     Paris       Yes

Basic Crosstab: Two Variables

The simplest crosstab shows the frequency of combinations between two categorical columns.

# Create a crosstab of Gender vs. Purchased
crosstab_basic = pd.crosstab(df['Gender'], df['Purchased'])
print("\n--- Basic Crosstab ---")
print(crosstab_basic)

Output:

--- Basic Crosstab ---
Purchased  No  Yes
Gender           
Female     3    3
Male       2    4

This table shows that there are 3 females who did not purchase and 3 who did, and 2 males who did not purchase and 4 who did.

Adding Margins (Row and Column Totals)

To get a better overview, you can add row and column totals using the margins=True argument.

# Create a crosstab with margins
crosstab_margins = pd.crosstab(df['Gender'], df['Purchased'], margins=True, margins_name="Total")
print("\n--- Crosstab with Margins ---")
print(crosstab_margins)

Output:

--- Crosstab with Margins ---
Purchased  No  Yes  Total
Gender                  
Female     3    3      6
Male       2    4      6
Total      5    7     11

The "Total" row and column give you the grand totals for each category.

Normalizing: Percentages Instead of Counts

Sometimes, you're more interested in proportions than raw counts. You can normalize the results.

a) Normalize by Index (Rows)

This shows the percentage distribution within each row.

# Normalize by rows (index)
crosstab_normalize_index = pd.crosstab(df['Gender'], df['Purchased'], normalize='index')
print("\n--- Normalized by Index (Rows) ---")
print(crosstab_normalize_index)

Output:

--- Normalized by Index (Rows) ---
Purchased        No       Yes
Gender                     
Female    0.500000  0.500000
Male      0.333333  0.666667

Interpretation: 50% of females did not purchase, while 66.7% of males did purchase.

b) Normalize by Columns

This shows the percentage distribution within each column.

# Normalize by columns
crosstab_normalize_cols = pd.crosstab(df['Gender'], df['Purchased'], normalize='columns')
print("\n--- Normalized by Columns ---")
print(crosstab_normalize_cols)

Output:

--- Normalized by Columns ---
Purchased        No       Yes
Gender                     
Female    0.600000  0.428571
Male      0.400000  0.571429

Interpretation: 60% of "No" purchases were made by females, while 57.1% of "Yes" purchases were made by males.

c) Normalize All Values (Grand Total)

This shows each cell's value as a percentage of the total number of observations.

# Normalize by all values (grand total)
crosstab_normalize_all = pd.crosstab(df['Gender'], df['Purchased'], normalize='all')
print("\n--- Normalized by All Values (Grand Total) ---")
print(crosstab_normalize_all)

Output:

--- Normalized by All Values (Grand Total) ---
Purchased         No        Yes
Gender                         
Female     0.272727  0.272727
Male       0.181818  0.363636

Interpretation: 27.3% of all people surveyed were females who did not purchase.

Multi-dimensional Crosstabs

You can easily extend crosstabs to more than two variables by passing lists to the index and columns parameters.

# Crosstab with Gender as index and AgeGroup/Purchased as columns
crosstab_multi = pd.crosstab(
    index=df['Gender'],
    columns=[df['AgeGroup'], df['Purchased']]
)
print("\n--- Multi-dimensional Crosstab ---")
print(crosstab_multi)

Output:

--- Multi-dimensional Crosstab ---
AgeGroup   18-25       26-35       36-45
Purchased    No  Yes       No  Yes       No  Yes
Gender                                   
Female       1   2       2   0       0   1
Male         1   1       0   0       1   3

This gives a much more detailed breakdown. For example, we can see that there is 1 female in the 18-25 age group who did not purchase.

Adding Values (Aggregation)

You can also use crosstab to perform aggregations. For example, let's say we have a Spend column and we want to see the total amount spent by each gender/purchase combination.

# Add a 'Spend' column to our DataFrame
df['Spend'] = np.random.randint(10, 100, size=len(df))
print("\n--- DataFrame with Spend Column ---")
print(df)
# Create a crosstab that sums the 'Spend' for each combination
crosstab_agg = pd.crosstab(
    index=df['Gender'],
    columns=df['Purchased'],
    values=df['Spend'],  # The values to aggregate
    aggfunc='sum'        # The aggregation function (sum, mean, count, etc.)
)
print("\n--- Aggregated Crosstab (Sum of Spend) ---")
print(crosstab_agg)

Output (will vary due to random Spend data):

--- DataFrame with Spend Column ---
   Gender AgeGroup     City Purchased  Spend
0    Male    18-25  New York       Yes     45
1  Female    26-35    London        No     12
2  Female    18-25  New York       Yes     88
3    Male    36-45     Paris       Yes     67
4  Female    26-35    London        No     34
5    Male    18-25  New York        No     91
6    Male    36-45     Paris       Yes     23
7  Female    18-25    London       Yes     56
8  Female    26-35    London        No     78
9    Male    36-45     Paris       Yes     50
--- Aggregated Crosstab (Sum of Spend) ---
Purchased   No   Yes
Gender             
Female     124   144
Male       182   140

This table shows the total spending for males and females, broken down by whether they purchased or not.

Visualization: Heatmaps

A crosstab is perfect for visualizing with a heatmap. The seaborn library makes this incredibly easy.

import seaborn as sns
import matplotlib.pyplot as plt
# Use the basic crosstab from before
crosstab_vis = pd.crosstab(df['Gender'], df['Purchased'])
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(crosstab_vis, annot=True, cmap="YlGnBu", fmt="d") # annot=True adds numbers, fmt="d" formats as integer'Crosstab Heatmap: Gender vs. Purchased')
plt.xlabel('Purchased')
plt.ylabel('Gender')
plt.show()

This will generate a color-coded heatmap where darker colors represent higher counts, making it easy to spot patterns at a glance.

Summary: `pd.crosstab()` Arguments

Argument	Description
`index`	Array-like or list of arrays. The values to group by in the rows.
`columns`	Array-like or list of arrays. The values to group by in the columns.
`values`	Array-like. Optional. An array of values to aggregate. If not provided, counts are used.
`rownames`	List. Optional. Names for the index levels.
`colnames`	List. Optional. Names for the column levels.
`aggfunc`	Function or list of functions. Optional. The function to use for aggregating `values` (e.g., `'sum'`, `'mean'`, `'count'`). Default is `count`.
`margins`	Boolean. Default `False`. Add row/column margins (subtotals).
`margins_name`	String. Default `'All'`. The name of the row/column that will contain the totals when margins is True.
`normalize`	`{'all', 'index', 'columns'}` or `0`, `1`. Optional. Normalize the crosstab over the entire table (`all`), the rows (`index`), or the columns (`columns`).

Python如何创建交叉表？

Setup: Sample Data

Basic Crosstab: Two Variables

Adding Margins (Row and Column Totals)

Normalizing: Percentages Instead of Counts

a) Normalize by Index (Rows)

b) Normalize by Columns

c) Normalize All Values (Grand Total)

Multi-dimensional Crosstabs

Adding Values (Aggregation)

Visualization: Heatmaps

Summary: `pd.crosstab()` Arguments

99ANYc3cd6

PHP微信支付视频教程怎么学？

contextlib如何简化资源管理？

Python oripackage是什么？

Python properties如何使用？

Python failunless怎么用？

Java Socket编程实例具体怎么实现？

Proe工程图视频教程怎么学？

信息系统分析与设计教程

哪里能找到Python教程PDF资源？

Python中如何解析日期？

Java JSONObject如何使用？

Python交互模式如何高效使用？

Java decompiler如何反编译class文件？

Winhex数据恢复教程怎么用？

Spring MVC如何实现Java MVC开发？

小米2如何刷入MIUI6？

Python如何创建交叉表？

Setup: Sample Data

Basic Crosstab: Two Variables

Adding Margins (Row and Column Totals)

Normalizing: Percentages Instead of Counts

a) Normalize by Index (Rows)

b) Normalize by Columns

c) Normalize All Values (Grand Total)

Multi-dimensional Crosstabs

Adding Values (Aggregation)

Visualization: Heatmaps

Summary: pd.crosstab() Arguments

相关推荐

Java Socket编程实例具体怎么实现？

Summary: `pd.crosstab()` Arguments