How to Use the rank() Function in Python Pandas - 杰瑞科技汇

The rank() method in pandas ranks data within a DataFrame or Series. It is more than a simple "1st, 2nd, 3rd" counter: it can handle ties, rank in descending order, and apply several different ranking methods.

Here’s a comprehensive guide to using pandas.DataFrame.rank() and pandas.Series.rank().

The Basic Idea

At its core, rank() assigns a rank to each value in a group of values. The default behavior is to assign the rank 1 to the smallest value.

Let's start with a simple example.

import pandas as pd
import numpy as np
# Create a simple Series
data = pd.Series([10, 50, 20, 50, 40])
print("Original Data:")
print(data)

Original Data:

0    10
1    50
2    20
3    50
4    40
dtype: int64

Now, let's rank the values. The smallest value (10) gets rank 1, the next (20) gets rank 2, and so on.

# Rank the series (default is 'average' for ties)
ranked_data = data.rank()
print("\nRanked Data (default method):")
print(ranked_data)

Ranked Data (default method):

0    1.0
1    4.5
2    2.0
3    4.5
4    3.0
dtype: float64

Notice the result:

10 is the smallest, so it gets rank 1.
20 is next, so it gets rank 2.
40 is next, so it gets rank 3.
The two 50s are tied for the highest value. The default method='average' gives them the average of the ranks they would have occupied (ranks 4 and 5). So, (4 + 5) / 2 = 4.5.
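
One way to see where the 4.5 comes from is to rebuild the 'average' ranks by hand. The sketch below is only an illustration (the variable names positions and manual_average are made up here): it first ranks with method='first' so that every value gets a distinct position, then averages those positions within each group of equal values.

```python
import pandas as pd

data = pd.Series([10, 50, 20, 50, 40])

# Give every value a distinct position; ties are broken by order of appearance
positions = data.rank(method='first')

# Average the positions within each group of equal values
manual_average = positions.groupby(data).transform('mean')

print(manual_average.tolist())                    # [1.0, 4.5, 2.0, 4.5, 3.0]
print((manual_average == data.rank()).all())      # True
```

The two 50s occupy positions 4 and 5, and their mean is the 4.5 that rank() reports by default.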

Key Parameters of the `rank()` Method

The rank() method has several important parameters that control its behavior.

`method`: How to Handle Ties

This is the most crucial parameter. It determines how to assign ranks when values are identical.

Method	Description	Example for `[10, 50, 20, 50, 40]`
`'average'` (Default)	Assigns the average of the ranks.	`[1.0, 4.5, 2.0, 4.5, 3.0]`
`'min'`	Assigns the minimum of the ranks.	`[1.0, 4.0, 2.0, 4.0, 3.0]`
`'max'`	Assigns the maximum of the ranks.	`[1.0, 5.0, 2.0, 5.0, 3.0]`
`'first'`	Assigns the rank based on the order they appear in the data.	`[1.0, 4.0, 2.0, 5.0, 3.0]` (The first 50 gets rank 4)
`'dense'`	Like `'min'`, but the rank always increases by 1 between distinct values, so no ranks are skipped after a tie.	`[1.0, 4.0, 2.0, 4.0, 3.0]` (same as `'min'` here, because the tie is at the top)

Demonstration:

print("Original Data:", data.values)
# Using different methods for ties
print("'min' method:     ", data.rank(method='min').values)
print("'max' method:     ", data.rank(method='max').values)
print("'first' method:   ", data.rank(method='first').values)
print("'dense' method:   ", data.rank(method='dense').values)

Output:

Original Data: [10 50 20 50 40]
'min' method:      [1. 4. 2. 4. 3.]
'max' method:      [1. 5. 2. 5. 3.]
'first' method:    [1. 4. 2. 5. 3.]
'dense' method:    [1. 4. 2. 4. 3.]
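
The sample data has its only tie at the very top, so it cannot show how 'min' and 'dense' differ for values that come after a tie. A small sketch with a hypothetical Series (scores, not part of the example above) that has a tie in the middle makes the difference visible:

```python
import pandas as pd

# Hypothetical scores with a tie in the middle of the ordering
scores = pd.Series([10, 20, 20, 30])

min_ranks = scores.rank(method='min')
dense_ranks = scores.rank(method='dense')

print("'min' method:  ", min_ranks.tolist())    # [1.0, 2.0, 2.0, 4.0]
print("'dense' method:", dense_ranks.tolist())  # [1.0, 2.0, 2.0, 3.0]
```

With 'min', the value 30 is ranked 4th because two values share rank 2 and rank 3 is skipped. With 'dense', no rank is skipped, so 30 is ranked 3rd.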

`ascending`: Rank Order

By default (ascending=True), the smallest value gets rank 1. You can set ascending=False to rank in descending order (the largest value gets rank 1).

# Rank in descending order (largest value is rank 1)
print("Descending Rank (average method):")
print(data.rank(ascending=False))

Descending Rank (average method):

0    5.0
1    1.5
2    4.0
3    1.5
4    3.0
dtype: float64

The two 50s are the largest values, so they share rank 1.5 (the average of ranks 1 and 2).
40 is next, so it gets rank 3.
20 follows with rank 4.
10 is the smallest, so it gets the last rank, 5.
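
As a quick sanity check on descending ranks, the following sketch compares them with the ascending ranks. It relies on an assumption worth stating: with the default 'average' method and no missing values, each descending rank equals n + 1 minus the ascending rank, where n is the number of values. The name mirrored is just illustrative.

```python
import pandas as pd

data = pd.Series([10, 50, 20, 50, 40])

ascending_ranks = data.rank()
descending_ranks = data.rank(ascending=False)

# With method='average' and no NaN, the two rankings mirror each other
mirrored = len(data) + 1 - ascending_ranks

print(descending_ranks.tolist())              # [5.0, 1.5, 4.0, 1.5, 3.0]
print((descending_ranks == mirrored).all())   # True
```

This mirror relationship does not hold in general for 'min', 'max', or 'first', where ties are resolved in one direction.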

`axis`: Rank Along Rows or Columns

axis=0 (default): Ranks values within each column.
axis=1: Ranks values within each row.

# Create a DataFrame
df = pd.DataFrame({
    'Score_A': [88, 92, 85, 92],
    'Score_B': [75, 92, 80, 85]
})
print("Original DataFrame:")
print(df)
# Rank within each column (axis=0)
print("\nRanking by columns (axis=0):")
print(df.rank(axis=0))

Original DataFrame:

   Score_A  Score_B
0       88       75
1       92       92
2       85       80
3       92       85

Ranking by columns (axis=0):

   Score_A  Score_B
0      2.0      1.0
1      3.5      4.0
2      1.0      2.0
3      3.5      3.0

Score_A column: 85 (1st), 88 (2nd), 92 (tied for 3rd/4th -> 3.5).
Score_B column: 75 (1st), 80 (2nd), 85 (3rd), 92 (4th). This column has no ties, so every rank is a whole number.

# Rank within each row (axis=1)
print("\nRanking by rows (axis=1):")
print(df.rank(axis=1))

Ranking by rows (axis=1):

   Score_A  Score_B
0      2.0      1.0
1      1.5      1.5
2      2.0      1.0
3      2.0      1.0

Row 0: 75 < 88, so Score_B is 1st and Score_A is 2nd.
Row 1: 92 == 92, so the two values are tied. The default 'average' method gives both the average of ranks 1 and 2, which is 1.5.
Rows 2 and 3: Score_B is lower than Score_A in both rows, so Score_B is 1st and Score_A is 2nd.
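
Row-wise ranking combines with the other parameters in the usual way. As a sketch (the name row_ranks is illustrative), the same DataFrame can be ranked across each row with the highest score first and tied scores sharing the better rank:

```python
import pandas as pd

df = pd.DataFrame({
    'Score_A': [88, 92, 85, 92],
    'Score_B': [75, 92, 80, 85]
})

# Rank across each row, highest score first; tied scores share the better rank
row_ranks = df.rank(axis=1, ascending=False, method='min')
print(row_ranks)
```

Here Score_A receives rank 1 in every row, and Score_B receives rank 2 everywhere except row 1, where the two 92s tie and both get rank 1.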

`na_option`: Handling Missing Values

What should rank() do with NaN (Not a Number) values?

Option	Description
`'keep'` (Default)	Leaves `NaN` values unranked; the result is `NaN` at those positions.
`'top'`	Ranks `NaN` values first, so they receive the lowest ranks.
`'bottom'`	Ranks `NaN` values last, so they receive the highest ranks.

df_with_nan = pd.DataFrame({'col': [10, np.nan, 20, np.nan, 5]})
print("DataFrame with NaNs:")
print(df_with_nan)
print("\nRank with 'keep' (default):")
print(df_with_nan.rank(na_option='keep'))
print("\nRank with 'top':")
print(df_with_nan.rank(na_option='top'))

DataFrame with NaNs:

    col
0  10.0
1   NaN
2  20.0
3   NaN
4   5.0

Rank with 'keep' (default):

   col
0  2.0
1  NaN
2  3.0
3  NaN
4  1.0

Rank with 'top':

   col
0  4.0
1  1.5  # the two NaNs tie for ranks 1 and 2
2  5.0
3  1.5  # so each gets the average, 1.5
4  3.0
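
The 'bottom' option is not demonstrated above. A minimal sketch using the same values in a Series (the names values and bottom_ranks are illustrative) shows the NaNs moving to the end of the ranking:

```python
import numpy as np
import pandas as pd

values = pd.Series([10, np.nan, 20, np.nan, 5])

# NaNs are ranked last; the two NaNs tie for ranks 4 and 5
bottom_ranks = values.rank(na_option='bottom')
print(bottom_ranks.tolist())  # [2.0, 4.5, 3.0, 4.5, 1.0]
```

Because the default method is 'average', the two NaNs share rank 4.5. Passing method='first' alongside na_option would give them distinct ranks instead.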

`pct`: Rank as a Percentage

If pct=True, each rank is divided by the number of ranked values, so the result is a fraction greater than 0.0 and at most 1.0 instead of an absolute rank. This is useful for computing percentile ranks.

# Rank as a percentage of the total number of values
print("Rank as percentage:")
print(data.rank(pct=True))

Rank as percentage:

0    0.20  # 1 out of 5 values
1    0.90  # 4.5 out of 5
2    0.40  # 2 out of 5
3    0.90  # 4.5 out of 5
4    0.60  # 3 out of 5
dtype: float64
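
To make the percentage calculation concrete, the sketch below reproduces pct=True by hand. It assumes the default settings, under which each rank is divided by the number of non-missing values; the names manual_pct and with_nan are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([10, 50, 20, 50, 40])

# Divide each rank by the number of non-missing values
manual_pct = data.rank() / data.count()
print(np.allclose(manual_pct, data.rank(pct=True)))  # True

# With the default na_option='keep', a NaN is not counted in the divisor
with_nan = pd.Series([10, np.nan, 20])
print(with_nan.rank(pct=True).tolist())  # [0.5, nan, 1.0]
```

In the second Series only two values are ranked, so the ranks 1 and 2 become 0.5 and 1.0.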

Practical Example: Ranking Students

Let's put it all together with a more realistic example.

# Student scores DataFrame
students = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Math_Score': [88, 92, 85, 92, 78, 85],
    'Physics_Score': [75, 92, 80, 85, 95, np.nan]
})
print("Student Scores:")
print(students)
# 1. Rank students by Math Score (descending, highest is rank 1)
students['Math_Rank'] = students['Math_Score'].rank(ascending=False, method='min')
print("\nRanking by Math Score (highest is 1):")
print(students[['Student', 'Math_Score', 'Math_Rank']].sort_values('Math_Rank'))
# 2. Rank students by Physics Score, treating NaNs as last
students['Physics_Rank'] = students['Physics_Score'].rank(ascending=False, na_option='bottom')
print("\nRanking by Physics Score (highest is 1, NaNs last):")
print(students[['Student', 'Physics_Score', 'Physics_Rank']].sort_values('Physics_Rank'))
# 3. Get the overall rank (average of Math and Physics ranks)
students['Overall_Rank'] = students[['Math_Rank', 'Physics_Rank']].mean(axis=1)
students['Overall_Rank'] = students['Overall_Rank'].rank(method='min').astype(int) # Rank the average ranks
print("\nFinal DataFrame with Overall Rank:")
print(students.sort_values('Overall_Rank'))

Output:

Student Scores:
   Student  Math_Score  Physics_Score
0    Alice          88           75.0
1      Bob          92           92.0
2  Charlie          85           80.0
3    David          92           85.0
4      Eve          78           95.0
5    Frank          85            NaN

Ranking by Math Score (highest is 1):
   Student  Math_Score  Math_Rank
1      Bob          92        1.0
3    David          92        1.0
0    Alice          88        3.0
2  Charlie          85        4.0
5    Frank          85        4.0
4      Eve          78        6.0

Ranking by Physics Score (highest is 1, NaNs last):
   Student  Physics_Score  Physics_Rank
4      Eve           95.0           1.0
1      Bob           92.0           2.0
3    David           85.0           3.0
2  Charlie           80.0           4.0
0    Alice           75.0           5.0
5    Frank            NaN           6.0

Final DataFrame with Overall Rank:
   Student  Math_Score  Physics_Score  Math_Rank  Physics_Rank  Overall_Rank
1      Bob          92           92.0        1.0           2.0             1
3    David          92           85.0        1.0           3.0             2
4      Eve          78           95.0        6.0           1.0             3
0    Alice          88           75.0        3.0           5.0             4
2  Charlie          85           80.0        4.0           4.0             4
5    Frank          85            NaN        4.0           6.0             6

The averaged ranks are Bob 1.5, David 2.0, Eve 3.5, Alice 4.0, Charlie 4.0, and Frank 5.0. Alice and Charlie tie, so with method='min' both get overall rank 4 and Frank gets rank 6.

Common Pitfalls & Best Practices

Method Choice Matters: The method parameter is critical. 'average' is common, but 'min' or 'dense' might be more appropriate depending on your use case (e.g., competition rankings). Be deliberate about your choice.
Axis Confusion: Remember axis=0 is for columns and axis=1 is for rows. It's easy to mix them up.
NaN Handling: Always decide what to do with missing values. The default 'keep' is often safe, but in some cases, you might want to treat them as the best or worst performers.
Data Types: rank() returns floats, even if the ranks are whole numbers. This accommodates averaged ties and NaN. You can convert the result to integers when every rank is a whole number and none is NaN, for example after method='min', 'first', or 'dense' (e.g., df['rank'].astype(int)).
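
When missing values are kept, astype(int) fails because NaN has no integer representation. One option, sketched here with a hypothetical scores Series, is pandas' nullable 'Int64' dtype, which stores whole-number ranks as integers and keeps missing ranks as <NA>:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([88, 92, np.nan, 92])

# 'min' produces whole-number ranks even for ties; the NaN stays missing
ranks = scores.rank(ascending=False, method='min')

# The nullable 'Int64' dtype keeps the missing rank as <NA>
int_ranks = ranks.astype('Int64')
print(int_ranks)
```

This only works when the non-missing ranks are whole numbers, so it is not suitable for the default 'average' method when ties are present.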
