Of course! Here is a comprehensive guide to creating dendrograms in Python, covering the most popular libraries, their use cases, and a complete workflow.
What is a Dendrogram?
A dendrogram is a tree-like diagram that visualizes the arrangement of clusters produced by hierarchical clustering. It shows how each cluster is composed by drawing links between clusters based on their similarity.
The key to reading a dendrogram is the y-axis (distance):
- The vertical lines represent clusters.
- The height of a fusion (where two vertical lines meet) represents the distance (or dissimilarity) between the two clusters being joined. A higher fusion means the clusters were more dissimilar and joined later in the process.
- The horizontal lines just link clusters, and their length has no special meaning.
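These fusion heights can also be read numerically. A minimal sketch with made-up points: each row of scipy's linkage matrix records one merge as [cluster_i, cluster_j, distance, size_of_new_cluster], and the third column is exactly the height you read off the dendrogram's y-axis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

Z = linkage(X, method='single')
print(Z)
# The first two merges happen at height 1.0 (each nearby pair);
# the final merge joins the two distant pairs at a much larger height.
```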
Method 1: The Easiest Way with scipy and matplotlib
This is the most common and straightforward method for creating a dendrogram from a pre-computed distance matrix or data. It's perfect for understanding the clustering structure.
Key Libraries:
- scipy.cluster.hierarchy: Provides the core functions for hierarchical clustering (linkage) and dendrogram plotting (dendrogram).
- scipy.spatial.distance.pdist: Computes the pairwise distances between observations in a dataset.
- matplotlib.pyplot: Used for displaying the plot.
Step-by-Step Example:
Let's create a dendrogram for some sample data.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# 1. Generate some sample data
# Let's create 3 distinct groups of points
np.random.seed(42)
group1 = np.random.randn(10, 2) + [2, 2]
group2 = np.random.randn(10, 2) + [-2, -2]
group3 = np.random.randn(10, 2) + [2, -2]

# Combine them into a single array
X = np.vstack((group1, group2, group3))

# 2. Compute the linkage matrix
# 'pdist' calculates the pairwise distances.
# 'linkage' performs the hierarchical clustering using a method
# (e.g., 'ward', 'single', 'complete').
# 'ward' is a popular choice that minimizes the variance within clusters.
distance_matrix = pdist(X, metric='euclidean')
Z = linkage(distance_matrix, method='ward')

# 3. Plot the dendrogram
plt.figure(figsize=(10, 7))
plt.title("Dendrogram for Sample Data")
plt.xlabel("Sample index")
plt.ylabel("Distance (Ward)")

# The 'dendrogram' function can truncate the display for large datasets
dendrogram(
    Z,
    truncate_mode='lastp',   # Show only the last p merged clusters
    p=12,                    # Show the last 12 merged clusters
    show_leaf_counts=True,   # Show the number of points in each cluster
    leaf_rotation=90.,       # Rotate leaf labels
    leaf_font_size=12.,
    show_contracted=True     # Gives an impression of the distribution in truncated branches
)
plt.show()
```
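Beyond drawing, dendrogram() also returns a dictionary of plot data; with no_plot=True you can compute it without rendering anything. A small sketch (with made-up random data) that grabs the left-to-right leaf ordering, which is handy for reordering the rows of a heatmap to match the tree:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

np.random.seed(42)
X = np.random.randn(12, 2)

Z = linkage(X, method='ward')
info = dendrogram(Z, no_plot=True)   # compute only, draw nothing
leaf_order = info['leaves']          # left-to-right leaf indices
X_reordered = X[leaf_order]          # rows reordered to match the tree
print(leaf_order)
```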
Explanation of Key linkage Methods:
The choice of method in linkage drastically changes the shape of the dendrogram.
- 'ward': Minimizes the variance of the clusters being merged. A good default choice.
- 'single': Uses the minimum distance between any two points in the two clusters. Can lead to "chaining" (long, stringy clusters).
- 'complete': Uses the maximum distance between any two points in the two clusters. Tends to produce compact, spherical clusters.
- 'average': Uses the average of all pairwise distances between the two clusters.
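One way to compare these methods on your own data is the cophenetic correlation coefficient, which measures how faithfully each tree preserves the original pairwise distances (closer to 1 is better). A sketch using scipy's cophenet on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

np.random.seed(0)
X = np.random.randn(30, 2)
D = pdist(X)  # condensed pairwise distances

for method in ['ward', 'single', 'complete', 'average']:
    Z = linkage(D, method=method)
    c, _ = cophenet(Z, D)  # correlation between tree and original distances
    print(f"{method:>8}: cophenetic correlation = {c:.3f}")
```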
Method 2: The Integrated Way with scikit-learn and scipy
scikit-learn is the go-to library for machine learning. While its AgglomerativeClustering class is excellent for performing hierarchical clustering and getting flat cluster labels, it doesn't have a built-in dendrogram plotter. The standard practice is to use scikit-learn for the clustering logic and scipy for the visualization.
This approach is useful when you are already working within the scikit-learn ecosystem.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# 1. Generate sample data using scikit-learn
X, y_true = make_blobs(n_samples=50, centers=3, cluster_std=1.0, random_state=42)

# 2. Perform hierarchical clustering
# Note: scikit-learn's AgglomerativeClustering does not return a linkage matrix
# directly (though one can be rebuilt from its children_ attribute).
# For plotting, it's often easier to just compute the linkage matrix with scipy.
Z = linkage(X, method='ward', metric='euclidean')

# 3. Plot the dendrogram
plt.figure(figsize=(12, 6))
plt.title("Dendrogram from scikit-learn data")
dendrogram(Z, truncate_mode='lastp', p=10, show_leaf_counts=True)
plt.xlabel("Sample index")
plt.ylabel("Distance (Ward)")
plt.show()

# --- Bonus: How to get flat clusters from the dendrogram ---
# You can decide on a distance threshold or a number of clusters (k)
# and cut the dendrogram to get flat clusters.

# Option A: Cut by distance threshold
distance_threshold = 10
clusters = fcluster(Z, t=distance_threshold, criterion='distance')
print(f"Number of clusters for distance threshold {distance_threshold}: {len(np.unique(clusters))}")

# Option B: Cut by number of clusters (k)
k = 3
clusters_k = fcluster(Z, t=k, criterion='maxclust')
print(f"Cluster assignments for k={k}: {clusters_k}")

# Visualize the flat clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=clusters_k, cmap='viridis', s=50, alpha=0.8)
plt.title(f"Flat Clusters (k={k})")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
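The comment in the example above mentions that a linkage matrix can be rebuilt from AgglomerativeClustering's children_ attribute. For completeness, here is a sketch of that alternative, based on the counts recipe in the scikit-learn documentation; it assumes the model was fitted with distance_threshold=0 (or compute_distances=True) so that the distances_ attribute is populated:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# distance_threshold=0 forces the full tree and populates model.distances_
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

# Count the points under each internal node of the merge tree
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            current_count += 1                       # leaf node
        else:
            current_count += counts[child_idx - n_samples]  # earlier merge
    counts[i] = current_count

# Assemble [child_i, child_j, distance, count] rows, scipy's linkage format
Z = np.column_stack([model.children_, model.distances_, counts]).astype(float)
# Z can now be passed straight to scipy's dendrogram()
```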
Method 3: The "Publication-Ready" Way with plotly
For interactive and visually appealing dendrograms, especially for web-based applications or presentations, plotly is an excellent choice. It allows you to hover over points to see more information.
First, you'll need to install it:
```bash
pip install plotly
```
```python
import numpy as np
import plotly.figure_factory as ff
from scipy.cluster.hierarchy import linkage

# 1. Generate sample data
np.random.seed(42)
group1 = np.random.randn(20, 2) + [2, 2]
group2 = np.random.randn(20, 2) + [-2, -2]
group3 = np.random.randn(20, 2) + [2, -2]
X = np.vstack((group1, group2, group3))

# 2. Create the Plotly dendrogram
fig = ff.create_dendrogram(
    X,
    linkagefun=lambda x: linkage(x, method='ward', metric='euclidean'),
    color_threshold=10,  # Distance below which branches get per-cluster colors
    labels=[f'Point {i}' for i in range(X.shape[0])]  # Optional: custom labels
)

# 3. Update the layout for better appearance
fig.update_layout(
    width=800,
    height=500,
    title_text="Interactive Dendrogram with Plotly",
    xaxis_title="Sample Index",
    yaxis_title="Distance (Ward)"
)

# Show the plot
fig.show()
```
Summary: Which Method to Use?
| Method | Best For | Pros | Cons |
|---|---|---|---|
| scipy + matplotlib | Quick and simple visualization. Standard for data analysis and exploration. | Easy to use. No extra dependencies beyond the scientific Python stack. Highly customizable with matplotlib. | Static plots. Can be slow with very large datasets. |
| scikit-learn + scipy | Machine learning workflows, when you need to perform clustering and get labels for other ML tasks. | Integrates seamlessly with other scikit-learn tools (e.g., pipelines, GridSearchCV). Easy to get flat cluster labels (fcluster). | Requires two libraries for a full workflow. scikit-learn doesn't natively plot dendrograms. |
| plotly | Interactive dashboards and presentations, when you need to explore data by hovering and zooming. | Fully interactive (zoom, pan, hover). Modern, publication-quality aesthetics. Great for web-based applications. | Requires an extra library installation. Can be overkill for simple, static plots. |
