How do I perform statistical analysis with the SciPy stats module?

Of course! The scipy.stats module is a cornerstone of the scientific Python ecosystem. It provides a vast collection of probability distributions, statistical functions, and tools for statistical testing.

Here's a comprehensive guide covering the most important aspects of scipy.stats, from basic distributions to advanced statistical tests.


Core Concept: The rv_continuous and rv_discrete Framework

Nearly every distribution in scipy.stats is an instance of a class, either rv_continuous for continuous distributions (like Normal, Uniform) or rv_discrete for discrete distributions (like Binomial, Poisson).

This object-oriented approach is powerful because it gives you a consistent set of methods to work with any distribution.

Let's use the Normal distribution (norm) as our primary example.

Working with a Specific Distribution: The Normal Distribution (scipy.stats.norm)

First, you need to import it:

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

A. Creating a "Frozen" Distribution Object

It's best practice to create a "frozen" distribution object. This means you fix the parameters (like mean loc and standard deviation scale) of the distribution, creating a reusable object.

# Create a normal distribution with mean (loc) = 0 and standard deviation (scale) = 1
# This is the standard normal distribution.
std_normal = stats.norm(loc=0, scale=1)
# Create another normal distribution with mean = 10 and std dev = 2
my_normal = stats.norm(loc=10, scale=2)
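
If you don't want a frozen object, every method also accepts the distribution parameters directly on each call; the two styles below are equivalent:

# The frozen object fixes loc/scale once...
print(my_normal.pdf(10))                    # density at the mean, ~0.1995
# ...which is the same as passing them on every call:
print(stats.norm.pdf(10, loc=10, scale=2))  # same value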

B. Key Methods of a Distribution Object

Here are the most common methods you'll use, demonstrated with our std_normal object.

.rvs(size): Random Variates (sampling). Generates random numbers from the distribution. std_normal.rvs(size=5) -> an array of 5 random draws, e.g. array([-0.5, 1.2, -0.1, 0.8, -2.3]).
.pdf(x): Probability Density Function, for continuous distributions. The height of the PDF at a value x has no direct probability meaning, but the area under the curve between two points does. std_normal.pdf(0) -> ~0.3989 (the peak of the standard normal curve).
.pmf(k): Probability Mass Function, for discrete distributions. The probability that the random variable takes the exact value k. (Not defined for Normal; use with Binomial, Poisson, etc.)
.cdf(x): Cumulative Distribution Function. The probability that the random variable is less than or equal to x. std_normal.cdf(0) -> 0.5 (50% of the area lies to the left of the mean).
.ppf(q): Percent Point Function, the inverse of the CDF. Finds the value x such that the CDF at x equals q. std_normal.ppf(0.5) -> 0 (the median).
.sf(x): Survival Function. The probability that the random variable is greater than x; sf(x) = 1 - cdf(x). std_normal.sf(1) -> ~0.1587 (the chance of being more than 1 sigma above the mean).
.isf(q): Inverse Survival Function, the inverse of the SF. std_normal.isf(0.05) -> ~1.64 (the value with a 5% chance of being exceeded).
.mean(), .var(), .std(): The distribution's theoretical mean, variance, and standard deviation. std_normal.mean() -> 0.
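
A quick sanity check of these methods (outputs are approximate):

print(std_normal.pdf(0))   # ~0.3989, the peak of the standard normal
print(std_normal.cdf(0))   # 0.5
print(std_normal.ppf(0.5)) # 0.0, the median
print(std_normal.sf(1))    # ~0.1587
# The area under the PDF between two points is a real probability:
# P(-1 < X < 1) = cdf(1) - cdf(-1) -> ~0.6827 (the familiar "68%" rule)
print(std_normal.cdf(1) - std_normal.cdf(-1))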

C. Visualizing Distributions

Visualizing PDFs and CDFs is a great way to understand them.

# Generate data for plotting
x = np.linspace(-5, 5, 1000)
pdf_values = std_normal.pdf(x)
cdf_values = std_normal.cdf(x)
# Plot PDF
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x, pdf_values, label='PDF')
plt.title('Probability Density Function')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Plot CDF
plt.subplot(1, 2, 2)
plt.plot(x, cdf_values, label='CDF', color='orange')
plt.title('Cumulative Distribution Function')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.legend()
plt.tight_layout()
plt.show()

Other Common Distributions

The pattern is the same for all distributions.

A. Discrete: Binomial Distribution (scipy.stats.binom)

Models the number of successes in n independent trials, each with a success probability p.

# Probability of getting 5 heads in 10 coin flips (p=0.5)
coin_flips = stats.binom(n=10, p=0.5)
print(f"P(X=5): {coin_flips.pmf(5):.4f}") # ~0.2461
# Probability of getting 5 or *fewer* heads
print(f"P(X<=5): {coin_flips.cdf(5):.4f}") # ~0.6230

B. Continuous: Uniform Distribution (scipy.stats.uniform)

Models a random variable with equal probability over an interval [a, b].

The parameters are loc (the start, a) and scale (the width, b-a).

# A uniform distribution between 2 and 5
uniform_dist = stats.uniform(loc=2, scale=3) # a=2, b=2+3=5
# Probability of drawing a number between 3 and 4
# P(3 < X < 4) = CDF(4) - CDF(3)
prob = uniform_dist.cdf(4) - uniform_dist.cdf(3)
print(f"P(3 < X < 4): {prob:.4f}") # 0.3333...

Hypothesis Testing (The "Stats" part of SciPy Stats)

This is where scipy.stats is incredibly useful for scientific analysis. It provides easy-to-use functions for common statistical tests.

A. T-Test (scipy.stats.ttest_ind)

Used to determine if there is a significant difference between the means of two independent groups.

Scenario: You have two groups of plant heights. Did the fertilizer have a significant effect?

# Sample data: heights of plants with and without fertilizer
np.random.seed(42)  # seed for reproducible simulated data
group_no_fertilizer = np.random.normal(20, 5, 30)
group_with_fertilizer = np.random.normal(24, 5, 30)
# Perform an independent two-sample t-test
# The null hypothesis is that the means of the two groups are equal.
t_statistic, p_value = stats.ttest_ind(group_no_fertilizer, group_with_fertilizer)
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpret the p-value
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Result is significant: We reject the null hypothesis. The means are likely different.")
else:
    print("Result is not significant: We fail to reject the null hypothesis.")

B. Chi-Squared Test for Independence (scipy.stats.chi2_contingency)

Used to determine if there is a significant association between two categorical variables.

Scenario: Is there a relationship between gender and preference for a new product?

# Create a contingency table
# Rows: Gender (Male, Female)
# Columns: Preference (Like, Dislike, Neutral)
data = [[50, 30, 20],
        [20, 60, 20]]
chi2, p_value, dof, expected = stats.chi2_contingency(data)
print(f"Chi2 statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}") # This p-value will be extremely small
alpha = 0.05  # significance level
if p_value < alpha:
    print("Result is significant: There is an association between gender and preference.")
else:
    print("Result is not significant: No association found.")

C. ANOVA (scipy.stats.f_oneway)

Used to compare the means of three or more independent groups.

Scenario: Do three different teaching methods result in different exam scores?

np.random.seed(0)  # seed for reproducible simulated scores
method_a_scores = np.random.normal(85, 5, 20)
method_b_scores = np.random.normal(88, 5, 20)
method_c_scores = np.random.normal(92, 5, 20)
f_statistic, p_value = stats.f_oneway(method_a_scores, method_b_scores, method_c_scores)
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
alpha = 0.05  # significance level
if p_value < alpha:
    print("Result is significant: At least one group's mean is different from the others.")
else:
    print("Result is not significant: No significant difference between group means.")

Summary Table of Key Functions

Distribution: scipy.stats.norm, binom, uniform, etc. Define a probability distribution with specific parameters.
Random sampling: dist.rvs(size=N). Generate random numbers from a distribution.
PDF/PMF: dist.pdf(x), dist.pmf(k). Get the probability density (continuous) or mass (discrete) at a point.
CDF: dist.cdf(x). Get the cumulative probability up to a point.
Quantile (PPF): dist.ppf(q). Find the value corresponding to a cumulative probability.
T-test: scipy.stats.ttest_ind(a, b). Compare the means of two independent groups.
Chi-squared test: scipy.stats.chi2_contingency(table). Test for independence between two categorical variables.
ANOVA: scipy.stats.f_oneway(*args). Compare the means of three or more independent groups.
Correlation: scipy.stats.pearsonr(x, y). Calculate the Pearson correlation coefficient and p-value.
Linear regression: scipy.stats.linregress(x, y). Perform a simple linear regression and return slope, intercept, r-value, etc.
Descriptive stats: scipy.stats.describe(data). Get a set of descriptive statistics (mean, variance, skewness, kurtosis).
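
The last three entries aren't demonstrated above, so here is a minimal sketch on simulated data:

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # noisy linear relationship
# Pearson correlation coefficient and its p-value
r, p = stats.pearsonr(x, y)
print(f"Pearson r: {r:.3f}, p-value: {p:.3g}")
# Simple linear regression
reg = stats.linregress(x, y)
print(f"slope: {reg.slope:.3f}, intercept: {reg.intercept:.3f}, R^2: {reg.rvalue**2:.3f}")
# Descriptive statistics: nobs, minmax, mean, variance, skewness, kurtosis
print(stats.describe(y))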

When to Use scipy.stats vs. pandas vs. statsmodels

  • scipy.stats: Your go-to for fundamental statistical calculations. Think of it as a mathematical statistics library. Use it for:

    • Working with known probability distributions.
    • Performing common hypothesis tests (t-test, chi-squared, etc.).
    • Calculating basic descriptive statistics on a NumPy array.
  • pandas: The data manipulation and analysis library. Its .describe() method is excellent for a quick overview of a DataFrame's columns (mean, std, quartiles, etc.). Use it for:

    • Loading, cleaning, and transforming data.
    • Grouping data and calculating summary statistics within those groups (e.g., df.groupby('category')['value'].mean()).
    • Time series analysis.
  • statsmodels: A more advanced library for statistical modeling and econometrics. It provides detailed statistical reports. Use it for:

    • Building complex linear regression models (OLS).
    • Time series analysis (ARIMA, VAR).
    • Generalized Linear Models (GLMs).
    • Getting detailed ANOVA tables for regression models.

In short: Use scipy.stats for the core statistical calculations and tests. Use pandas to manage and explore your data, and use statsmodels when you need to build and analyze sophisticated statistical models.
