杰瑞科技汇

Python如何创建交叉表?

Of course! In Python, a "crosstable" is most commonly known as a contingency table or a cross-tabulation. It's a table that displays the frequency distribution of variables. It's used to see how different categorical variables interact with each other.

Python如何创建交叉表?-图1
(图片来源网络,侵删)

The most popular and powerful library for creating crosstabs in Python is Pandas. The pandas.crosstab() function is specifically designed for this.

Here's a comprehensive guide, from the basics to more advanced examples.

Setup: Sample Data

First, let's create some sample data to work with. We'll use a Pandas DataFrame.

import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'AgeGroup': ['18-25', '26-35', '18-25', '36-45', '26-35', '18-25', '36-45', '18-25', '26-35', '36-45'],
    'City': ['New York', 'London', 'New York', 'Paris', 'London', 'New York', 'Paris', 'London', 'New York', 'Paris'],
    'Purchased': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:

Python如何创建交叉表?-图2
(图片来源网络,侵删)
   Gender AgeGroup     City Purchased
0    Male    18-25  New York       Yes
1  Female    26-35    London        No
2  Female    18-25  New York       Yes
3    Male    36-45     Paris       Yes
4  Female    26-35    London        No
5    Male    18-25  New York        No
6    Male    36-45     Paris       Yes
7  Female    18-25    London       Yes
8  Female    26-35    London        No
9    Male    36-45     Paris       Yes

Basic Crosstab: Two Variables

The simplest crosstab shows the frequency of combinations between two categorical columns.

# Create a crosstab of Gender vs. Purchased
crosstab_basic = pd.crosstab(df['Gender'], df['Purchased'])
print("\n--- Basic Crosstab ---")
print(crosstab_basic)

Output:

--- Basic Crosstab ---
Purchased  No  Yes
Gender           
Female     3    3
Male       2    4

This table shows that there are 3 females who did not purchase and 3 who did, and 2 males who did not purchase and 4 who did.


Adding Margins (Row and Column Totals)

To get a better overview, you can add row and column totals using the margins=True argument.

# Create a crosstab with margins
crosstab_margins = pd.crosstab(df['Gender'], df['Purchased'], margins=True, margins_name="Total")
print("\n--- Crosstab with Margins ---")
print(crosstab_margins)

Output:

--- Crosstab with Margins ---
Purchased  No  Yes  Total
Gender                  
Female     3    3      6
Male       2    4      6
Total      5    7     11

The "Total" row and column give you the grand totals for each category.


Normalizing: Percentages Instead of Counts

Sometimes, you're more interested in proportions than raw counts. You can normalize the results.

a) Normalize by Index (Rows)

This shows the percentage distribution within each row.

# Normalize by rows (index)
crosstab_normalize_index = pd.crosstab(df['Gender'], df['Purchased'], normalize='index')
print("\n--- Normalized by Index (Rows) ---")
print(crosstab_normalize_index)

Output:

--- Normalized by Index (Rows) ---
Purchased        No       Yes
Gender                     
Female    0.500000  0.500000
Male      0.333333  0.666667

Interpretation: 50% of females did not purchase, while 66.7% of males did purchase.

b) Normalize by Columns

This shows the percentage distribution within each column.

# Normalize by columns
crosstab_normalize_cols = pd.crosstab(df['Gender'], df['Purchased'], normalize='columns')
print("\n--- Normalized by Columns ---")
print(crosstab_normalize_cols)

Output:

--- Normalized by Columns ---
Purchased        No       Yes
Gender                     
Female    0.600000  0.428571
Male      0.400000  0.571429

Interpretation: 60% of "No" purchases were made by females, while 57.1% of "Yes" purchases were made by males.

c) Normalize All Values (Grand Total)

This shows each cell's value as a percentage of the total number of observations.

# Normalize by all values (grand total)
crosstab_normalize_all = pd.crosstab(df['Gender'], df['Purchased'], normalize='all')
print("\n--- Normalized by All Values (Grand Total) ---")
print(crosstab_normalize_all)

Output:

--- Normalized by All Values (Grand Total) ---
Purchased         No        Yes
Gender                         
Female     0.272727  0.272727
Male       0.181818  0.363636

Interpretation: 27.3% of all people surveyed were females who did not purchase.


Multi-dimensional Crosstabs

You can easily extend crosstabs to more than two variables by passing lists to the index and columns parameters.

# Crosstab with Gender as index and AgeGroup/Purchased as columns
crosstab_multi = pd.crosstab(
    index=df['Gender'],
    columns=[df['AgeGroup'], df['Purchased']]
)
print("\n--- Multi-dimensional Crosstab ---")
print(crosstab_multi)

Output:

--- Multi-dimensional Crosstab ---
AgeGroup   18-25       26-35       36-45
Purchased    No  Yes       No  Yes       No  Yes
Gender                                   
Female       1   2       2   0       0   1
Male         1   1       0   0       1   3

This gives a much more detailed breakdown. For example, we can see that there is 1 female in the 18-25 age group who did not purchase.


Adding Values (Aggregation)

You can also use crosstab to perform aggregations. For example, let's say we have a Spend column and we want to see the total amount spent by each gender/purchase combination.

# Add a 'Spend' column to our DataFrame
df['Spend'] = np.random.randint(10, 100, size=len(df))
print("\n--- DataFrame with Spend Column ---")
print(df)
# Create a crosstab that sums the 'Spend' for each combination
crosstab_agg = pd.crosstab(
    index=df['Gender'],
    columns=df['Purchased'],
    values=df['Spend'],  # The values to aggregate
    aggfunc='sum'        # The aggregation function (sum, mean, count, etc.)
)
print("\n--- Aggregated Crosstab (Sum of Spend) ---")
print(crosstab_agg)

Output (will vary due to random Spend data):

--- DataFrame with Spend Column ---
   Gender AgeGroup     City Purchased  Spend
0    Male    18-25  New York       Yes     45
1  Female    26-35    London        No     12
2  Female    18-25  New York       Yes     88
3    Male    36-45     Paris       Yes     67
4  Female    26-35    London        No     34
5    Male    18-25  New York        No     91
6    Male    36-45     Paris       Yes     23
7  Female    18-25    London       Yes     56
8  Female    26-35    London        No     78
9    Male    36-45     Paris       Yes     50
--- Aggregated Crosstab (Sum of Spend) ---
Purchased   No   Yes
Gender             
Female     124   144
Male       182   140

This table shows the total spending for males and females, broken down by whether they purchased or not.


Visualization: Heatmaps

A crosstab is perfect for visualizing with a heatmap. The seaborn library makes this incredibly easy.

import seaborn as sns
import matplotlib.pyplot as plt
# Use the basic crosstab from before
crosstab_vis = pd.crosstab(df['Gender'], df['Purchased'])
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(crosstab_vis, annot=True, cmap="YlGnBu", fmt="d") # annot=True adds numbers, fmt="d" formats as integer'Crosstab Heatmap: Gender vs. Purchased')
plt.xlabel('Purchased')
plt.ylabel('Gender')
plt.show()

This will generate a color-coded heatmap where darker colors represent higher counts, making it easy to spot patterns at a glance.

Summary: pd.crosstab() Arguments

Argument Description
index Array-like or list of arrays. The values to group by in the rows.
columns Array-like or list of arrays. The values to group by in the columns.
values Array-like. Optional. An array of values to aggregate. If not provided, counts are used.
rownames List. Optional. Names for the index levels.
colnames List. Optional. Names for the column levels.
aggfunc Function or list of functions. Optional. The function to use for aggregating values (e.g., 'sum', 'mean', 'count'). Default is count.
margins Boolean. Default False. Add row/column margins (subtotals).
margins_name String. Default 'All'. The name of the row/column that will contain the totals when margins is True.
normalize {'all', 'index', 'columns'} or 0, 1. Optional. Normalize the crosstab over the entire table (all), the rows (index), or the columns (columns).
分享:
扫描分享到社交APP
上一篇
下一篇