杰瑞科技汇

How do you use the rank() function in Python Pandas?

The rank() method in pandas is a versatile tool for ranking data within a DataFrame or Series. It's much more than a simple "1st, 2nd, 3rd" counter: it can handle ties, rank in descending order, and apply different ranking methodologies.

Here’s a comprehensive guide to using pandas.DataFrame.rank() and pandas.Series.rank().


The Basic Idea

At its core, rank() assigns a rank to each value in a group of values. The default behavior is to assign the rank 1 to the smallest value.

Let's start with a simple example.

import pandas as pd
import numpy as np
# Create a simple Series
data = pd.Series([10, 50, 20, 50, 40])
print("Original Data:")
print(data)

Original Data:

0    10
1    50
2    20
3    50
4    40
dtype: int64

Now, let's rank the values. The smallest value (10) gets rank 1, the next (20) gets rank 2, and so on.

# Rank the series (default is 'average' for ties)
ranked_data = data.rank()
print("\nRanked Data (default method):")
print(ranked_data)

Ranked Data (default method):

0    1.0
1    4.5
2    2.0
3    4.5
4    3.0
dtype: float64

Notice the result:

  • 10 is the smallest, so it gets rank 1.
  • 20 is next, so it gets rank 2.
  • 40 is next, so it gets rank 3.
  • The two 50s are tied for the highest value. The default method='average' gives them the average of the ranks they would have occupied (ranks 4 and 5). So, (4 + 5) / 2 = 4.5.
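The averaging rule for ties can be reproduced by hand. The sketch below is illustrative only: it sorts the values, numbers the sorted positions 1 to n, and gives tied values the mean of the positions they occupy. The helper name manual_average_rank is hypothetical and not part of pandas.

```python
import pandas as pd

def manual_average_rank(values: pd.Series) -> pd.Series:
    """Hypothetical helper: reproduce rank(method='average') by hand."""
    # Sort the values and number the sorted positions 1..n.
    ordered = values.sort_values(kind="stable")
    positions = pd.Series(
        range(1, len(ordered) + 1), index=ordered.index, dtype="float64"
    )
    # Tied values share the mean of the positions they occupy.
    averaged = positions.groupby(ordered).transform("mean")
    # Restore the original row order.
    return averaged.reindex(values.index)

data = pd.Series([10, 50, 20, 50, 40])
manual = manual_average_rank(data)
print(manual.tolist())                         # [1.0, 4.5, 2.0, 4.5, 3.0]
print(manual.tolist() == data.rank().tolist()) # True
```

This assumes the Series has no missing values; NaN handling is a separate concern covered by the na_option parameter.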

Key Parameters of the rank() Method

The rank() method has several important parameters that control its behavior.

method: How to Handle Ties

This is the most crucial parameter. It determines how to assign ranks when values are identical.

Method | Description | Example for [10, 50, 20, 50, 40]
'average' (default) | Assigns the average of the ranks the tied values would occupy. | [1.0, 4.5, 2.0, 4.5, 3.0]
'min' | Assigns the minimum of those ranks to every tied value. | [1.0, 4.0, 2.0, 4.0, 3.0]
'max' | Assigns the maximum of those ranks to every tied value. | [1.0, 5.0, 2.0, 5.0, 3.0]
'first' | Breaks ties by the order in which the values appear in the data. | [1.0, 4.0, 2.0, 5.0, 3.0] (the first 50 gets rank 4)
'dense' | Like 'min', but the rank always increases by 1 between groups, so no ranks are skipped after a tie. | [1.0, 4.0, 2.0, 4.0, 3.0]

Demonstration:

print("Original Data:", data.values)
# Using different methods for ties
print("'min' method:     ", data.rank(method='min').values)
print("'max' method:     ", data.rank(method='max').values)
print("'first' method:   ", data.rank(method='first').values)
print("'dense' method:   ", data.rank(method='dense').values)

Output:

Original Data: [10 50 20 50 40]
'min' method:      [1. 4. 2. 4. 3.]
'max' method:      [1. 5. 2. 5. 3.]
'first' method:    [1. 4. 2. 5. 3.]
'dense' method:    [1. 4. 2. 4. 3.]
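The difference between 'min' and 'dense' only becomes visible when some value ranks after a tie. Here is a small sketch with made-up scores (the scores Series is an assumption for illustration) where the tie sits in the middle of the ordering:

```python
import pandas as pd

# Hypothetical scores with a tie in the middle of the ordering.
scores = pd.Series([70, 85, 85, 90])

# Build one column per tie-breaking method so they can be compared side by side.
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: scores.rank(method=m) for m in methods})
comparison.insert(0, "score", scores)
print(comparison)
```

In the last row, 'min' skips from 2 to 4 because two values share second place, while 'dense' continues with 3.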

ascending: Rank Order

By default (ascending=True), the smallest value gets rank 1. You can set ascending=False to rank in descending order (the largest value gets rank 1).

# Rank in descending order (largest value is rank 1)
print("Descending Rank (average method):")
print(data.rank(ascending=False))

Descending Rank (average method):

0    5.0
1    1.5
2    4.0
3    1.5
4    3.0
dtype: float64
  • The two 50s are the largest values, so they share rank 1.5 (the average of ranks 1 and 2).
  • 40 is next, so it gets rank 3.
  • 10 is the smallest, so it gets the last rank, 5.
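One way to sanity-check a descending rank is to compare it against two equivalent formulations. The sketch below assumes numeric data with no missing values and the default 'average' method; under those assumptions, ranking the negated values, or mirroring the ascending ranks around n + 1, should give the same result.

```python
import pandas as pd

data = pd.Series([10, 50, 20, 50, 40])

descending = data.rank(ascending=False)

# Ranking the negated values in ascending order flips the ordering.
negated = (-data).rank()

# With the 'average' method, descending rank = n + 1 - ascending rank.
mirrored = len(data) + 1 - data.rank()

print(descending.tolist())
print(descending.tolist() == negated.tolist())
print(descending.tolist() == mirrored.tolist())
```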

axis: Rank Along Rows or Columns

  • axis=0 (default): Ranks values within each column.
  • axis=1: Ranks values within each row.
# Create a DataFrame
df = pd.DataFrame({
    'Score_A': [88, 92, 85, 92],
    'Score_B': [75, 92, 80, 85]
})
print("Original DataFrame:")
print(df)
# Rank within each column (axis=0)
print("\nRanking by columns (axis=0):")
print(df.rank(axis=0))

Original DataFrame:

   Score_A  Score_B
0       88       75
1       92       92
2       85       80
3       92       85

Ranking by columns (axis=0):

   Score_A  Score_B
0      2.0      1.0
1      3.5      4.0
2      1.0      2.0
3      3.5      3.0
  • Score_A column: 85 (1st), 88 (2nd), 92 (tied for 3rd/4th -> 3.5).
  • Score_B column: 75 (1st), 80 (2nd), 85 (3rd), 92 (4th). There are no ties in this column, so every rank is a whole number. Each column is ranked independently: the 92 in Score_B is not tied with the 92s in Score_A.
# Rank within each row (axis=1)
print("\nRanking by rows (axis=1):")
print(df.rank(axis=1))

Ranking by rows (axis=1):

   Score_A  Score_B
0      2.0      1.0
1      1.5      1.5
2      2.0      1.0
3      2.0      1.0
  • Row 0: 75 < 88, so Score_B is 1st and Score_A is 2nd.
  • Row 1: 92 == 92, so the two values are tied. The default 'average' method gives both the average of ranks 1 and 2, which is 1.5.
  • Rows 2 and 3: Score_B is lower than Score_A (80 < 85 and 85 < 92), so Score_B is 1st and Score_A is 2nd.
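If the axis argument feels abstract, it can help to see that ranking across rows is the same as transposing the DataFrame, ranking each column, and transposing back. A minimal sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Score_A': [88, 92, 85, 92],
    'Score_B': [75, 92, 80, 85]
})

# Rank across each row directly.
row_ranks = df.rank(axis=1)

# The long way: flip rows and columns, rank each column, flip back.
via_transpose = df.T.rank().T

print(row_ranks)
print((row_ranks.values == via_transpose.values).all())
```

In practice axis=1 is the more direct choice; the transposed version is shown only to make the direction of the ranking explicit.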

na_option: Handling Missing Values

What should rank() do with NaN (Not a Number) values?

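Option | Description
'keep' (default) | Leaves NaN in place: missing values get a NaN rank.
'top' | Gives NaN values the lowest ranks, as if they were the smallest values (when ascending).
'bottom' | Gives NaN values the highest ranks, as if they were the largest values (when ascending).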
df_with_nan = pd.DataFrame({'col': [10, np.nan, 20, np.nan, 5]})
print("DataFrame with NaNs:")
print(df_with_nan)
print("\nRank with 'keep' (default):")
print(df_with_nan.rank(na_option='keep'))
print("\nRank with 'top':")
print(df_with_nan.rank(na_option='top'))

DataFrame with NaNs:

    col
0  10.0
1   NaN
2  20.0
3   NaN
4   5.0

Rank with 'keep' (default):

   col
0  2.0
1  NaN
2  3.0
3  NaN
4  1.0

Rank with 'top':

   col
0  4.0
1  1.5  # the two NaNs tie for ranks 1 and 2
2  5.0
3  1.5  # so each gets the average, 1.5
4  3.0
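The 'bottom' option is not shown in the run above, so here is a short sketch of it on the same values. Printing the three options side by side makes it easier to see that the non-missing values keep their relative order and only the NaN rows move. The column labels in the result are chosen here for illustration.

```python
import numpy as np
import pandas as pd

col = pd.Series([10, np.nan, 20, np.nan, 5])

# One column per na_option setting, next to the original values.
options = pd.DataFrame({
    'value': col,
    'keep': col.rank(na_option='keep'),
    'top': col.rank(na_option='top'),
    'bottom': col.rank(na_option='bottom'),
})
print(options)
```

With 'bottom', the three real values take ranks 1 to 3 and the missing values are pushed to the end of the ranking.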

pct: Rank as a Percentage

If pct=True, each rank is divided by the number of non-missing values, so the result is a fraction between 0 and 1 (the largest value gets 1.0) instead of an absolute rank. This is useful for computing percentile ranks.

# Rank as a percentage of the total number of values
print("Rank as percentage:")
print(data.rank(pct=True))

Rank as percentage:

0    0.20  # 1 out of 5 values
1    0.90  # 4.5 out of 5
2    0.40  # 2 out of 5
3    0.90  # 4.5 out of 5
4    0.60  # 3 out of 5
dtype: float64
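To see where these fractions come from, the percentage rank can be rebuilt from the ordinary rank. This is a small check rather than a description of pandas internals, and it assumes the default 'average' method and no missing values:

```python
import pandas as pd

data = pd.Series([10, 50, 20, 50, 40])

pct_rank = data.rank(pct=True)

# Ordinary rank divided by the number of non-missing values.
manual = data.rank() / data.count()

print(pct_rank.tolist())
# Compare with a tolerance rather than exact float equality.
print((pct_rank - manual).abs().max() < 1e-12)
```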

Practical Example: Ranking Students

Let's put it all together with a more realistic example.

# Student scores DataFrame
students = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Math_Score': [88, 92, 85, 92, 78, 85],
    'Physics_Score': [75, 92, 80, 85, 95, np.nan]
})
print("Student Scores:")
print(students)
# 1. Rank students by Math Score (descending, highest is rank 1)
students['Math_Rank'] = students['Math_Score'].rank(ascending=False, method='min')
print("\nRanking by Math Score (highest is 1):")
print(students[['Student', 'Math_Score', 'Math_Rank']].sort_values('Math_Rank'))
# 2. Rank students by Physics Score, treating NaNs as last
students['Physics_Rank'] = students['Physics_Score'].rank(ascending=False, na_option='bottom')
print("\nRanking by Physics Score (highest is 1, NaNs last):")
print(students[['Student', 'Physics_Score', 'Physics_Rank']].sort_values('Physics_Rank'))
# 3. Get the overall rank (average of Math and Physics ranks)
students['Overall_Rank'] = students[['Math_Rank', 'Physics_Rank']].mean(axis=1)
students['Overall_Rank'] = students['Overall_Rank'].rank(method='min').astype(int) # Rank the average ranks
print("\nFinal DataFrame with Overall Rank:")
print(students.sort_values('Overall_Rank'))

Output:

Student Scores:
   Student  Math_Score  Physics_Score
0    Alice          88           75.0
1      Bob          92           92.0
2  Charlie          85           80.0
3    David          92           85.0
4      Eve          78           95.0
5    Frank          85            NaN
Ranking by Math Score (highest is 1):
  Student  Math_Score  Math_Rank
1      Bob          92        1.0
3    David          92        1.0
0    Alice          88        3.0
2  Charlie          85        4.0
5    Frank          85        4.0
4      Eve          78        6.0
Ranking by Physics Score (highest is 1, NaNs last):
   Student  Physics_Score  Physics_Rank
4      Eve           95.0           1.0
1      Bob           92.0           2.0
3    David           85.0           3.0
2  Charlie           80.0           4.0
0    Alice           75.0           5.0
5    Frank            NaN           6.0
Final DataFrame with Overall Rank:
   Student  Math_Score  Physics_Score  Math_Rank  Physics_Rank  Overall_Rank
1      Bob          92           92.0        1.0           2.0             1
3    David          92           85.0        1.0           3.0             2
4      Eve          78           95.0        6.0           1.0             3
0    Alice          88           75.0        3.0           5.0             4
2  Charlie          85           80.0        4.0           4.0             4
5    Frank          85            NaN        4.0           6.0             6

Common Pitfalls & Best Practices

  1. Method Choice Matters: The method parameter is critical. 'average' is common, but 'min' or 'dense' might be more appropriate depending on your use case (e.g., competition rankings). Be deliberate about your choice.
  2. Axis Confusion: axis=0 ranks down each column (rows are compared with one another), while axis=1 ranks across each row (columns are compared with one another). It's easy to mix them up.
  3. NaN Handling: Always decide what to do with missing values. The default 'keep' is often safe, but in some cases you might want to rank them first or last with na_option='top' or na_option='bottom'.
  4. Data Types: rank() returns floats, even when every rank is a whole number, so that averaged ties such as 4.5 can be represented. If you use a method that never yields fractions ('min', 'max', 'first', or 'dense') and the result contains no NaN, you can convert it to integers (e.g., df['rank'].astype(int)).
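As a sketch of the data-type point, the example below uses made-up scores with one missing value. A plain astype(int) would fail on the NaN rank, so this version assumes pandas' nullable 'Int64' dtype is acceptable for your use case:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value.
scores = pd.Series([88, 92, np.nan, 92, 78])

# 'min' never produces fractional ranks, so the only obstacle to an
# integer dtype is the NaN that na_option='keep' leaves in place.
ranks = scores.rank(method='min', ascending=False)

# Nullable Int64 keeps the missing rank as <NA> instead of raising.
int_ranks = ranks.astype('Int64')
print(int_ranks)
```

If the missing rank should not stay missing, choosing na_option='top' or na_option='bottom' before converting removes the NaN and makes a plain astype(int) possible.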