杰瑞科技汇

Python中searchsorted函数如何使用?

Of course! Let's dive deep into searchsorted, a powerful and highly useful function from the NumPy library.

Python中searchsorted函数如何使用?-图1
(图片来源网络,侵删)

What is searchsorted?

numpy.searchsorted(a, v, side='left', sorter=None) is a function that finds the insertion points for a set of values v into a sorted array a.

Think of it like this: You have a sorted bookshelf (array a), and you want to insert a new book (value v). searchsorted tells you exactly which position (index) on the shelf the new book should go in to keep the shelf sorted.


The Core Concept: Insertion Point

The key idea is the insertion point. For a given value v, the insertion point is the index in the sorted array a where v should be inserted to maintain the sorted order.

The behavior of searchsorted is controlled by the side parameter, which is crucial to understand.

Python中searchsorted函数如何使用?-图2
(图片来源网络,侵删)

side='left' (Default)

This finds the insertion point before any existing entries of v.

Rule: The returned index i is the smallest index such that a[i] >= v.

Example:

import numpy as np
a = np.array([1, 3, 4, 4, 6, 8])
Value v to Insert Sorted Array a Insertion Point (Index) Resulting Array
2 [1, **3**, 4, 4, 6, 8] 1 [1, 2, 3, ...]
4 [1, 3, **4**, 4, 6, 8] 2 [1, 3, 4, 4, ...]
9 [1, 3, 4, 4, 6, 8, **9**] 6 [1, 3, 4, 4, 6, 8, 9]
# For v = 2
print(np.searchsorted(a, 2))  # Output: 1
# For v = 4
print(np.searchsorted(a, 4))  # Output: 2 (the first position where a[i] >= 4)
# For v = 9
print(np.searchsorted(a, 9))  # Output: 6 (appended at the end)

side='right'

This finds the insertion point after any existing entries of v.

Python中searchsorted函数如何使用?-图3
(图片来源网络,侵删)

Rule: The returned index i is the largest index such that a[i] <= v.

Example: Using the same array a = np.array([1, 3, 4, 4, 6, 8]):

Value v to Insert Sorted Array a Insertion Point (Index) Resulting Array
2 [1, **3**, 4, 4, 6, 8] 1 [1, 2, 3, ...]
4 [1, 3, 4, 4, **6**, 8] 4 [1, 3, 4, 4, 4, 6, ...]
9 [1, 3, 4, 4, 6, 8, **9**] 6 [1, 3, 4, 4, 6, 8, 9]
# For v = 2
print(np.searchsorted(a, 2, side='right')) # Output: 1 (same as left)
# For v = 4
print(np.searchsorted(a, 4, side='right')) # Output: 4 (the position after the last 4)
# For v = 9
print(np.searchsorted(a, 9, side='right')) # Output: 6 (same as left)

Key Parameters

  • a: The 1D array that must be sorted in ascending order. If a is not sorted, the results will be incorrect. (NumPy does not check for sortedness for performance reasons).
  • v: The value(s) to find the insertion points for. This can be a single value or an array of values.
  • side: A string, either 'left' or 'right', as described above.
  • sorter: Optional. If a was sorted using a custom key (e.g., argsort), you can provide the array of indices that would sort a. This is useful if you don't want to actually sort a but know its sorted order.

Common Use Cases

Binning (Finding Bin Indices)

This is the most common and powerful use case. searchsorted is extremely fast for this, much faster than a manual loop or list comprehension.

Problem: You have data and you want to assign each data point to a bin. For example, converting ages into age groups.

import numpy as np
# Define the edges of our bins (age groups)
age_bins = [0, 18, 35, 50, 65, 100]
# A list of ages to be binned
ages = np.array([5, 17, 22, 35, 50, 60, 70, 99])
# Find the index of the bin for each age
# We use side='right' because the bins are defined as [0, 18), [18, 35), etc.
bin_indices = np.searchsorted(age_bins, ages, side='right')
print("Age Bins Edges:", age_bins)
print("Ages:", ages)
print("Bin Indices:", bin_indices)
# Output:
# Age Bins Edges: [0, 18, 35, 50, 65, 100]
# Ages: [ 5 17 22 35 50 60 70 99]
# Bin Indices: [1 1 2 2 3 4 5 5]

You can then use these indices to create the final age groups:

# A list of bin labels
bin_labels = ['Child', 'Teen', 'Adult', 'Middle Age', 'Senior', 'Elderly']
# Use the bin_indices to get the corresponding label
age_groups = [bin_labels[i] for i in bin_indices]
print("Age Groups:", age_groups)
# Output:
# Age Groups: ['Child', 'Teen', 'Adult', 'Adult', 'Middle Age', 'Senior', 'Elderly', 'Elderly']

Inserting Multiple Values

searchsorted is vectorized, meaning you can find insertion points for many values at once, which is very efficient.

import numpy as np
a = np.array([10, 20, 30, 40, 50])
values_to_insert = np.array([5, 15, 25, 35, 99])
# Find insertion points for all values at once
insertion_points = np.searchsorted(a, values_to_insert)
print("Original array:", a)
print("Values to insert:", values_to_insert)
print("Insertion points:", insertion_points)
# Output:
# Original array: [10 20 30 40 50]
# Values to insert: [ 5 15 25 35 99]
# Insertion points: [0 1 2 3 5]

Finding Duplicates

You can use side='left' and side='right' together to find the range of indices for a specific value, which is useful for counting duplicates.

import numpy as np
a = np.array([1, 3, 4, 4, 4, 6, 8])
v = 4
left_index = np.searchsorted(a, v, side='left')
right_index = np.searchsorted(a, v, side='right')
print(f"Array: {a}")
print(f"Value to find: {v}")
print(f"Left insertion index: {left_index}")
print(f"Right insertion index: {right_index}")
# The number of occurrences is the difference
num_duplicates = right_index - left_index
print(f"Number of occurrences of {v}: {num_duplicates}")
# Output:
# Array: [1 3 4 4 4 6 8]
# Value to find: 4
# Left insertion index: 2
# Right insertion index: 5
# Number of occurrences of 4: 3

Performance and Alternatives

  • Performance: searchsorted uses a binary search algorithm, which has a time complexity of O(log N) for a single value and O(M log N) for M values. This is extremely fast and much better than a brute-force O(M*N) approach.

  • Alternatives:

    • bisect module: Python's standard library has a bisect module that does the exact same thing. It's a great option if you don't want to depend on NumPy for this single task.
      • bisect.bisect_left(a, v) is equivalent to np.searchsorted(a, v, side='left')
      • bisect.bisect_right(a, v) is equivalent to np.searchsorted(a, v, side='right')
    • pandas: If you are working with pandas DataFrames or Series, the pandas.Series.searchsorted() method is available and works similarly, often integrating more seamlessly with pandas objects.

Summary

Feature Description
What it does Finds insertion points for values into a sorted array.
Core Concept Binary search for fast index lookup.
side='left' Finds the first position where a[i] >= v.
side='right' Finds the last position where a[i] <= v.
Primary Use Case Binning / Histograms - extremely fast for assigning data to categories.
Performance Very fast (O(log N) per value).
Precondition The input array a must be sorted.

searchsorted is an essential tool in any data scientist's or numerical programmer's toolkit, especially for tasks involving categorization and finding positions in sorted data.

分享:
扫描分享到社交APP
上一篇
下一篇