Of course! The scipy.stats module is a cornerstone of the scientific Python ecosystem. It provides a vast collection of probability distributions, statistical functions, and tools for statistical testing.
Here's a comprehensive guide covering the most important aspects of scipy.stats, from basic distributions to advanced statistical tests.
Core Concept: The rv_continuous and rv_discrete Framework
Nearly every distribution in scipy.stats is an instance of a class, either rv_continuous for continuous distributions (like Normal, Uniform) or rv_discrete for discrete distributions (like Binomial, Poisson).
This object-oriented approach is powerful because it gives you a consistent set of methods to work with any distribution.
Let's use the Normal distribution (norm) as our primary example.
Working with a Specific Distribution: The Normal Distribution (scipy.stats.norm)
First, you need to import it:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
A. Creating a "Frozen" Distribution Object
It's best practice to create a "frozen" distribution object. This means you fix the parameters (like mean loc and standard deviation scale) of the distribution, creating a reusable object.
# Create a normal distribution with mean (loc) = 0 and standard deviation (scale) = 1
# This is the standard normal distribution.
std_normal = stats.norm(loc=0, scale=1)

# Create another normal distribution with mean = 10 and std dev = 2
my_normal = stats.norm(loc=10, scale=2)
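Note that freezing is a convenience, not a requirement: every method can also be called on the unfrozen class by passing loc and scale each time. The two calls below are equivalent (each prints the peak density of a Normal(10, 2) distribution, about 0.1995):

# Equivalent: frozen object vs. passing parameters on every call
print(my_normal.pdf(10))                    # parameters stored in the frozen object
print(stats.norm.pdf(10, loc=10, scale=2))  # parameters passed explicitly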
B. Key Methods of a Distribution Object
Here are the most common methods you'll use, demonstrated with our std_normal object.
| Method | Description | Example (std_normal) |
|---|---|---|
| .rvs(size) | Random variates. Draw random samples from the distribution. | std_normal.rvs(size=5) -> e.g. array([-0.5, 1.2, -0.1, 0.8, -2.3]) |
| .pdf(x) | Probability Density Function. For continuous distributions. The height of the PDF at any value x has no direct probability meaning, but the area under the curve between two points does. | std_normal.pdf(0) -> ~0.3989 (the peak of the standard normal curve) |
| .pmf(k) | Probability Mass Function. For discrete distributions. The probability of the random variable taking the exact value k. | (Not for Normal; use for Binomial, etc.) |
| .cdf(x) | Cumulative Distribution Function. The probability that the random variable is less than or equal to x. | std_normal.cdf(0) -> 0.5 (50% of the area is to the left of the mean) |
| .ppf(q) | Percent Point Function. The inverse of the CDF. Finds the value x such that the CDF at x is q. | std_normal.ppf(0.5) -> 0 (the median) |
| .sf(x) | Survival Function. The probability that the random variable is greater than x. sf(x) = 1 - cdf(x). | std_normal.sf(1) -> ~0.1587 (chance of being more than 1 sigma above the mean) |
| .isf(q) | Inverse Survival Function. The inverse of the SF. | std_normal.isf(0.05) -> ~1.645 (the value with a 5% chance of being exceeded) |
| .mean(), .var(), .std() | The distribution's theoretical mean, variance, and standard deviation. | std_normal.mean() -> 0 |
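To make the table concrete, here is a short sanity check using std_normal (the sampled values will vary from run to run):

# cdf and ppf are inverses; sf complements cdf
print(std_normal.cdf(1.96))   # ~0.975
print(std_normal.ppf(0.975))  # ~1.96
print(std_normal.cdf(1) + std_normal.sf(1))  # exactly 1.0
# Area under the PDF between two points: P(-1 < X < 1) ~ 0.6827 (the 68% rule)
print(std_normal.cdf(1) - std_normal.cdf(-1))
samples = std_normal.rvs(size=1000)
print(samples.mean(), samples.std())  # close to the theoretical 0 and 1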
C. Visualizing Distributions
Visualizing PDFs and CDFs is a great way to understand them.
# Generate data for plotting
x = np.linspace(-5, 5, 1000)
pdf_values = std_normal.pdf(x)
cdf_values = std_normal.cdf(x)
# Plot PDF
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x, pdf_values, label='PDF')
plt.title('Probability Density Function')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Plot CDF
plt.subplot(1, 2, 2)
plt.plot(x, cdf_values, label='CDF', color='orange')
plt.title('Cumulative Distribution Function')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.legend()
plt.tight_layout()
plt.show()
Other Common Distributions
The pattern is the same for all distributions.
A. Discrete: Binomial Distribution (scipy.stats.binom)
Models the number of successes in n independent trials, each with a success probability p.
# Probability of getting 5 heads in 10 coin flips (p=0.5)
coin_flips = stats.binom(n=10, p=0.5)
print(f"P(X=5): {coin_flips.pmf(5):.4f}") # ~0.2461
# Probability of getting 5 or *fewer* heads
print(f"P(X<=5): {coin_flips.cdf(5):.4f}") # ~0.6230
B. Continuous: Uniform Distribution (scipy.stats.uniform)
Models a random variable with equal probability over an interval [a, b].
The parameters are loc (the start, a) and scale (the width, b-a).
# A uniform distribution between 2 and 5
uniform_dist = stats.uniform(loc=2, scale=3) # a=2, b=2+3=5
# Probability of drawing a number between 3 and 4
# P(3 < X < 4) = CDF(4) - CDF(3)
prob = uniform_dist.cdf(4) - uniform_dist.cdf(3)
print(f"P(3 < X < 4): {prob:.4f}") # 0.3333...
Hypothesis Testing (The "Stats" part of SciPy Stats)
This is where scipy.stats is incredibly useful for scientific analysis. It provides easy-to-use functions for common statistical tests.
A. T-Test (scipy.stats.ttest_ind)
Used to determine if there is a significant difference between the means of two independent groups.
Scenario: You have two groups of plant heights. Did the fertilizer have a significant effect?
# Sample data: heights of plants with and without fertilizer
np.random.seed(42)  # fix the seed so the results below are reproducible
group_no_fertilizer = np.random.normal(20, 5, 30)
group_with_fertilizer = np.random.normal(24, 5, 30)
# Perform an independent two-sample t-test
# The null hypothesis is that the means of the two groups are equal.
t_statistic, p_value = stats.ttest_ind(group_no_fertilizer, group_with_fertilizer)
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpret the p-value
alpha = 0.05 # Significance level
if p_value < alpha:
print("Result is significant: We reject the null hypothesis. The means are likely different.")
else:
print("Result is not significant: We fail to reject the null hypothesis.")
B. Chi-Squared Test for Independence (scipy.stats.chi2_contingency)
Used to determine if there is a significant association between two categorical variables.
Scenario: Is there a relationship between gender and preference for a new product?
# Create a contingency table
# Rows: Gender (Male, Female)
# Columns: Preference (Like, Dislike, Neutral)
data = [[50, 30, 20],
[20, 60, 20]]
chi2, p_value, dof, expected = stats.chi2_contingency(data)
print(f"Chi2 statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}") # This p-value will be extremely small
if p_value < alpha:
print("Result is significant: There is an association between gender and preference.")
else:
print("Result is not significant: No association found.")
C. ANOVA (scipy.stats.f_oneway)
Used to compare the means of three or more independent groups.
Scenario: Do three different teaching methods result in different exam scores?
method_a_scores = np.random.normal(85, 5, 20)
method_b_scores = np.random.normal(88, 5, 20)
method_c_scores = np.random.normal(92, 5, 20)
f_statistic, p_value = stats.f_oneway(method_a_scores, method_b_scores, method_c_scores)
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < alpha:
print("Result is significant: At least one group's mean is different from the others.")
else:
print("Result is not significant: No significant difference between group means.")
Summary Table of Key Functions
| Task | Common scipy.stats Function | Use Case |
|---|---|---|
| Distribution | scipy.stats.norm, binom, uniform, etc. | Define a probability distribution with specific parameters. |
| Random Sampling | dist.rvs(size=N) | Generate random numbers from a distribution. |
| PDF/PMF | dist.pdf(x), dist.pmf(k) | Get the probability density (continuous) or mass (discrete) at a point. |
| CDF | dist.cdf(x) | Get the cumulative probability up to a point. |
| Quantile (PPF) | dist.ppf(q) | Find the value corresponding to a cumulative probability. |
| T-Test | scipy.stats.ttest_ind(a, b) | Compare the means of two independent groups. |
| Chi-Squared Test | scipy.stats.chi2_contingency(table) | Test for independence between two categorical variables. |
| ANOVA | scipy.stats.f_oneway(*args) | Compare the means of three or more independent groups. |
| Correlation | scipy.stats.pearsonr(x, y) | Calculate the Pearson correlation coefficient and p-value. |
| Linear Regression | scipy.stats.linregress(x, y) | Perform a linear regression and return slope, intercept, r-value, etc. |
| Descriptive Stats | scipy.stats.describe(data) | Get a set of descriptive statistics (mean, variance, skew, kurtosis). |
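The last three rows have no example above, so here is a minimal sketch using synthetic, roughly linear data (the exact numbers will vary with the random draw):

# Correlation, regression, and descriptive stats on synthetic data
xs = np.random.normal(0, 1, 100)
ys = 2 * xs + np.random.normal(0, 0.5, 100) # roughly linear relationship
r, p = stats.pearsonr(xs, ys)
print(f"Pearson r: {r:.3f}, p-value: {p:.3g}") # r close to 1 here
result = stats.linregress(xs, ys)
print(f"slope: {result.slope:.3f}, intercept: {result.intercept:.3f}, r: {result.rvalue:.3f}")
print(stats.describe(ys)) # nobs, minmax, mean, variance, skewness, kurtosis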
When to Use scipy.stats vs. pandas vs. statsmodels
- scipy.stats: Your go-to for fundamental statistical calculations. Think of it as a mathematical statistics library. Use it for:
  - Working with known probability distributions.
  - Performing common hypothesis tests (t-test, chi-squared, etc.).
  - Calculating basic descriptive statistics on a NumPy array.
- pandas: The data manipulation and analysis library. Its .describe() method is excellent for a quick overview of a DataFrame's columns (mean, std, quartiles, etc.). Use it for:
  - Loading, cleaning, and transforming data.
  - Grouping data and calculating summary statistics within those groups (e.g., df.groupby('category')['value'].mean()).
  - Time series analysis.
- statsmodels: A more advanced library for statistical modeling and econometrics. It provides detailed statistical reports. Use it for:
  - Building complex linear regression models (OLS).
  - Time series analysis (ARIMA, VAR).
  - Generalized Linear Models (GLMs).
  - Getting detailed ANOVA tables for regression models.
In short: Use scipy.stats for the core statistical calculations and tests. Use pandas to manage and explore your data, and use statsmodels when you need to build and analyze sophisticated statistical models.