Of course! Let's dive deep into cache() in PySpark, a fundamental and powerful concept for optimizing performance.
What is cache()?
In simple terms, cache() is an optimization mechanism that tells Spark to keep a DataFrame or RDD in memory after it's computed for the first time.
Think of it like a memoization technique:
- Without cache(): Every time you perform an action (like count(), collect(), or show()) on a DataFrame, Spark re-computes the entire DataFrame from scratch.
- With cache(): The first time you perform an action, Spark computes the DataFrame and stores it in memory. Any subsequent action on the same DataFrame reads directly from this in-memory cache, saving a large amount of computation time.
How Does it Work? The Lazy Evaluation Caveat
This is the most important concept to understand about cache().
Spark uses lazy evaluation. This means that transformations (like .filter(), .select(), .groupBy()) are not executed immediately. Instead, they build up a plan of what needs to be done.
cache() is lazy, like a transformation rather than an action. Calling my_df.cache() by itself does not compute or store anything. It simply marks the DataFrame so that Spark will keep its data the first time it is actually computed by an action.
The actual caching (and computation) only happens when you trigger an action.
Example:
# Let's assume 'df' is a large DataFrame loaded from a file
# 1. Transformation (lazy)
filtered_df = df.filter(df["age"] > 30)
# 2. Transformation (lazy) - THIS IS WHERE CACHE() IS CALLED
cached_df = filtered_df.cache()
# 3. Action (triggers computation and caching)
print(f"Count of people over 30: {cached_df.count()}")
# 4. Another action (uses the cache!)
cached_df.show(5)
What happens step-by-step:
1. filtered_df = df.filter(...): Spark creates a plan to filter df. No data is processed.
2. cached_df = filtered_df.cache(): Spark adds a "cache the result of the filter" step to the plan. Still no data is processed.
3. cached_df.count(): This is the first action. Spark executes the plan:
   - It reads the source data (e.g., from a file).
   - It performs the filter operation.
   - It sees the cache() instruction and stores the resulting DataFrame in memory (and potentially on disk, depending on the storage level).
   - It then performs the count() action on the cached data.
4. cached_df.show(5): This is the second action. Spark executes the plan:
   - It sees the cache() instruction and checks whether the data is already in memory.
   - It finds the data is cached!
   - It skips the expensive data-reading and filter steps.
   - It reads the cached data directly and performs the show() action.
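You can inspect this from the driver. The sketch below assumes the cached_df from the example above; is_cached, storageLevel, and explain() are standard DataFrame members, and once a DataFrame is cached its physical plan typically shows an in-memory scan instead of the original read-and-filter steps.

# Minimal sketch: inspecting the caching status of 'cached_df' from above
print(cached_df.is_cached)      # True as soon as cache() is called, i.e. "marked for caching"
print(cached_df.storageLevel)   # the StorageLevel that will be used for the cached data
cached_df.explain()             # after caching, the plan reads from the in-memory cache

The actual materialization only happens after the first action; you can see the cached data under the "Storage" tab of the Spark UI.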
The Caching Hierarchy: persist()
cache() is actually a shortcut for a more general function called persist().
For RDDs, rdd.cache() is equivalent to rdd.persist(StorageLevel.MEMORY_ONLY). For DataFrames, df.cache() is equivalent to df.persist(StorageLevel.MEMORY_AND_DISK).
persist() allows you to specify exactly how you want to store the data. This is crucial because memory is a finite resource. You can trade memory for speed or vice-versa.
You import these from pyspark.storagelevel:
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Example DataFrame
df = spark.range(10_000_000)  # a large DataFrame

# --- Different persistence levels (pick one per DataFrame; to switch
# --- levels later, call unpersist() first) ---

# 1. MEMORY_ONLY: keep partitions in memory only. Partitions that don't
#    fit are recomputed when needed.
df.persist(StorageLevel.MEMORY_ONLY)

# 2. MEMORY_AND_DISK: keep partitions in memory and spill whatever doesn't
#    fit to disk (slower to read back, but never recomputed).
# df.persist(StorageLevel.MEMORY_AND_DISK)

# 3. DISK_ONLY: store all partitions on disk. Slow, but saves memory.
# df.persist(StorageLevel.DISK_ONLY)

# 4. MEMORY_ONLY_2 / MEMORY_AND_DISK_2: same as above, but each partition
#    is replicated on 2 worker nodes for fault tolerance.
# df.persist(StorageLevel.MEMORY_ONLY_2)

# Note: the serialized levels from Scala/Java (e.g. MEMORY_ONLY_SER) are not
# exposed in PySpark, which always stores data in serialized form on the Python side.
| Storage Level | Meaning | Use Case |
|---|---|---|
| MEMORY_ONLY | Partitions kept in memory only; anything that does not fit is recomputed. Default for RDD cache(). | Fastest access. Use when you have plenty of memory. |
| MEMORY_ONLY_SER (Scala/Java only) | Serialized objects in memory. | Slower access but saves memory. Good for large objects. Not exposed in PySpark. |
| MEMORY_AND_DISK | In memory, spill to disk if needed. Default for DataFrame cache(). | A good balance. Guarantees you won't recompute, but can be slow if data spills to disk. |
| DISK_ONLY | Data is stored only on disk. | When memory is extremely constrained and you can afford the slower I/O. |
| _2 suffixes (e.g., MEMORY_ONLY_2) | Replicates each partition on 2 nodes. | Higher fault tolerance: if one node fails, the other still has a copy. |
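As a small illustration of choosing a level explicitly, the sketch below reuses the df from the snippet above. One practical detail: to change the level of a DataFrame that is already persisted, release it with unpersist() first.

# Minimal sketch: assigning an explicit storage level to 'df' from above
from pyspark.storagelevel import StorageLevel

df.unpersist()                             # make sure no previous level is assigned
df.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if needed
print(df.storageLevel)                     # confirm the level currently in use

# Switching levels later: release the old one, then persist again
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)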
When to Use cache() or persist()
Good Candidates for Caching:
- DataFrames used in multiple, iterative actions: This is the classic use case. For example, in a machine learning pipeline where you cache the training feature set and use it for training, evaluation, and prediction.
- Small to medium-sized DataFrames: Caching a massive DataFrame that doesn't fit in memory can lead to "spilling to disk," which is very slow. It's often better to let it recompute than to constantly read from disk.
- Intermediate results in complex DAGs: If a complex calculation (a long chain of transformations) is used as input for multiple final steps, caching the intermediate result can save a lot of time.
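For instance, the "intermediate results" case might look like the sketch below, where one prepared DataFrame feeds two different aggregations. The input path and column names (user_id, country, amount) are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: an events table with 'user_id', 'country' and 'amount'
events = spark.read.parquet("/data/events")   # illustrative path

# Expensive shared preparation step, reused by both outputs below
active = events.filter(F.col("amount") > 0).cache()

per_country = active.groupBy("country").agg(F.sum("amount").alias("total"))
per_user = active.groupBy("user_id").count()

per_country.show()   # first action: computes and caches 'active'
per_user.show()      # second action: reuses the cached 'active'

active.unpersist()   # release the memory once both outputs are produced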
Bad Candidates for Caching:
- DataFrames used only once: If you read a DataFrame, perform one action on it (like write.save()), and then never use it again, caching is useless and wastes memory.
- Very large DataFrames: As mentioned, if the data is too big for memory, you may not get a performance benefit and could even slow down the entire cluster by tying up its resources.
Important Related Methods
unpersist()
Caching consumes cluster resources (memory, disk). When you are done with a cached DataFrame, you should explicitly release those resources.
# After you are done with cached_df
cached_df.unpersist()

# Pass blocking=True to wait until all cached blocks are actually removed
# before continuing (useful when you need the memory freed immediately)
cached_df.unpersist(blocking=True)
spark.catalog.clearCache()
If you want to clear everything cached through Spark SQL in the current session (cached DataFrames and tables), you can use this.
# Clears the cache for all DataFrames/tables in the current session
spark.catalog.clearCache()
Code Example: A Real-World Scenario
Let's simulate a scenario where caching provides a significant speedup.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit
import time
# 1. Initialize Spark Session
spark = SparkSession.builder \
.appName("CacheExample") \
.getOrCreate()
# 2. Create a large synthetic DataFrame (10 million rows, 3 columns)
# Built with spark.range so the rows are generated on the executors
# rather than as a huge Python list on the driver
print("Creating a large DataFrame...")
df = spark.range(10_000_000).select(
    col("id"),
    concat(lit("name_"), col("id").cast("string")).alias("name"),
    (col("id") % 100).alias("group_id"),
)
# 3. Define a complex transformation
print("Defining complex transformations...")
# A multi-step transformation that is expensive to compute
transformed_df = df.filter(col("group_id") > 50) \
    .select(col("id").alias("user_id"), col("group_id")) \
    .groupBy("group_id") \
    .count() \
    .filter(col("count") > 1)  # one row per group, so the final DF is small
# --- SCENARIO 1: WITHOUT CACHING ---
print("\n--- Running WITHOUT CACHE ---")
start_time = time.time()
# First action: compute and count
count_no_cache = transformed_df.count()
print(f"Count (without cache): {count_no_cache}")
# Second action: re-compute and show
transformed_df.show(5)
end_time = time.time()
print(f"Time taken without cache: {end_time - start_time:.2f} seconds")
# --- SCENARIO 2: WITH CACHING ---
print("\n--- Running WITH CACHE ---")
# Cache the result of the expensive transformations
transformed_df.cache()  # Lazy: nothing is computed or stored yet
start_time = time.time()
# First action: compute, cache, and count
count_with_cache = transformed_df.count()
print(f"Count (with cache): {count_with_cache}")
# Second action: read from cache and show
transformed_df.show(5)
end_time = time.time()
print(f"Time taken with cache: {end_time - start_time:.2f} seconds")
# 4. Clean up
print("\nCleaning up cache...")
transformed_df.unpersist() # IMPORTANT: Release the memory
spark.stop()
When you run this code, you should see a clear difference between the two scenarios. Without the cache, both actions pay the full computation cost. With the cache, the first count() still takes a while because it computes and caches the data, but the second show() is nearly instantaneous because it just reads the pre-computed result from memory.
