Of course! The scipy.stats module is a cornerstone of the scientific Python ecosystem. It provides a vast collection of probability distributions, statistical functions, and tools for statistical testing.
Here's a comprehensive guide covering the most important aspects of scipy.stats, from basic distributions to advanced statistical tests.
Core Concept: The rv_continuous and rv_discrete Framework
Nearly every distribution in scipy.stats is an instance of a class, either rv_continuous for continuous distributions (like Normal, Uniform) or rv_discrete for discrete distributions (like Binomial, Poisson).
This object-oriented approach is powerful because it gives you a consistent set of methods to work with any distribution.
Let's use the Normal distribution (norm) as our primary example.
Working with a Specific Distribution: The Normal Distribution (scipy.stats.norm)
First, you need to import it:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
A. Creating a "Frozen" Distribution Object
It's best practice to create a "frozen" distribution object. This means you fix the parameters (like mean loc and standard deviation scale) of the distribution, creating a reusable object.
# Create a normal distribution with mean (loc) = 0 and standard deviation (scale) = 1
# This is the standard normal distribution.
std_normal = stats.norm(loc=0, scale=1)

# Create another normal distribution with mean = 10 and std dev = 2
my_normal = stats.norm(loc=10, scale=2)
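Note that freezing is a convenience, not a requirement: every method can also be called on the unfrozen class by passing loc and scale each time. The two calls below are equivalent (each prints the peak density of a Normal(10, 2) distribution, about 0.1995):

# Equivalent: frozen object vs. passing parameters on every call
print(my_normal.pdf(10))                    # parameters stored in the frozen object
print(stats.norm.pdf(10, loc=10, scale=2))  # parameters passed explicitly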
B. Key Methods of a Distribution Object
Here are the most common methods you'll use, demonstrated with our std_normal object.
| Method | Description | Example (std_normal) |
|---|---|---|
| .rvs(size) | Random variates. Draw random samples from the distribution. | std_normal.rvs(size=5) -> e.g. array([-0.5, 1.2, -0.1, 0.8, -2.3]) |
| .pdf(x) | Probability Density Function. For continuous distributions. The height of the PDF at any value x has no direct probability meaning, but the area under the curve between two points does. | std_normal.pdf(0) -> ~0.3989 (the peak of the standard normal curve) |
| .pmf(k) | Probability Mass Function. For discrete distributions. The probability of the random variable taking the exact value k. | (Not for Normal; use for Binomial, etc.) |
| .cdf(x) | Cumulative Distribution Function. The probability that the random variable is less than or equal to x. | std_normal.cdf(0) -> 0.5 (50% of the area is to the left of the mean) |
| .ppf(q) | Percent Point Function. The inverse of the CDF. Finds the value x such that the CDF at x is q. | std_normal.ppf(0.5) -> 0 (the median) |
| .sf(x) | Survival Function. The probability that the random variable is greater than x. sf(x) = 1 - cdf(x). | std_normal.sf(1) -> ~0.1587 (chance of being more than 1 sigma above the mean) |
| .isf(q) | Inverse Survival Function. The inverse of the SF. | std_normal.isf(0.05) -> ~1.645 (the value with a 5% chance of being exceeded) |
| .mean(), .var(), .std() | The distribution's theoretical mean, variance, and standard deviation. | std_normal.mean() -> 0 |
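To make the table concrete, here is a short sanity check using std_normal (the sampled values will vary from run to run):

# cdf and ppf are inverses; sf complements cdf
print(std_normal.cdf(1.96))   # ~0.975
print(std_normal.ppf(0.975))  # ~1.96
print(std_normal.cdf(1) + std_normal.sf(1))  # exactly 1.0
# Area under the PDF between two points: P(-1 < X < 1) ~ 0.6827 (the 68% rule)
print(std_normal.cdf(1) - std_normal.cdf(-1))
samples = std_normal.rvs(size=1000)
print(samples.mean(), samples.std())  # close to the theoretical 0 and 1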
C. Visualizing Distributions
Visualizing PDFs and CDFs is a great way to understand them.
# Generate data for plotting
x = np.linspace(-5, 5, 1000)
pdf_values = std_normal.pdf(x)
cdf_values = std_normal.cdf(x)
# Plot PDF
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x, pdf_values, label='PDF')
plt.title('Probability Density Function')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Plot CDF
plt.subplot(1, 2, 2)
plt.plot(x, cdf_values, label='CDF', color='orange')
plt.title('Cumulative Distribution Function')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.legend()
plt.tight_layout()
plt.show()
Other Common Distributions
The pattern is the same for all distributions.
A. Discrete: Binomial Distribution (scipy.stats.binom)
Models the number of successes in n independent trials, each with a success probability p.
# Probability of getting 5 heads in 10 coin flips (p=0.5)
coin_flips = stats.binom(n=10, p=0.5)
print(f"P(X=5): {coin_flips.pmf(5):.4f}") # ~0.2461
# Probability of getting 5 or *fewer* heads
print(f"P(X<=5): {coin_flips.cdf(5):.4f}") # ~0.6230
B. Continuous: Uniform Distribution (scipy.stats.uniform)
Models a random variable with equal probability over an interval [a, b].
The parameters are loc (the start, a) and scale (the width, b-a).
# A uniform distribution between 2 and 5
uniform_dist = stats.uniform(loc=2, scale=3) # a=2, b=2+3=5
# Probability of drawing a number between 3 and 4
# P(3 < X < 4) = CDF(4) - CDF(3)
prob = uniform_dist.cdf(4) - uniform_dist.cdf(3)
print(f"P(3 < X < 4): {prob:.4f}") # 0.3333...
Hypothesis Testing (The "Stats" part of SciPy Stats)
This is where scipy.stats is incredibly useful for scientific analysis. It provides easy-to-use functions for common statistical tests.
A. T-Test (scipy.stats.ttest_ind)
Used to determine if there is a significant difference between the means of two independent groups.
Scenario: You have two groups of plant heights. Did the fertilizer have a significant effect?
# Sample data: heights of plants with and without fertilizer
np.random.seed(42)  # fix the seed so the results below are reproducible
group_no_fertilizer = np.random.normal(20, 5, 30)
group_with_fertilizer = np.random.normal(24, 5, 30)
# Perform an independent two-sample t-test
# The null hypothesis is that the means of the two groups are equal.
t_statistic, p_value = stats.ttest_ind(group_no_fertilizer, group_with_fertilizer)
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpret the p-value
alpha = 0.05 # Significance level
if p_value < alpha:
print("Result is significant: We reject the null hypothesis. The means are likely different.")
else:
print("Result is not significant: We fail to reject the null hypothesis.")
B. Chi-Squared Test for Independence (scipy.stats.chi2_contingency)
Used to determine if there is a significant association between two categorical variables.
Scenario: Is there a relationship between gender and preference for a new product?
# Create a contingency table
# Rows: Gender (Male, Female)
# Columns: Preference (Like, Dislike, Neutral)
data = [[50, 30, 20],
[20, 60, 20]]
chi2, p_value, dof, expected = stats.chi2_contingency(data)
print(f"Chi2 statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}") # This p-value will be extremely small
if p_value < alpha:
print("Result is significant: There is an association between gender and preference.")
else:
print("Result is not significant: No association found.")
C. ANOVA (scipy.stats.f_oneway)
Used to compare the means of three or more independent groups.
Scenario: Do three different teaching methods result in different exam scores?
method_a_scores = np.random.normal(85, 5, 20)
method_b_scores = np.random.normal(88, 5, 20)
method_c_scores = np.random.normal(92, 5, 20)
f_statistic, p_value = stats.f_oneway(method_a_scores, method_b_scores, method_c_scores)
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < alpha:
print("Result is significant: At least one group's mean is different from the others.")
else:
print("Result is not significant: No significant difference between group means.")
Summary Table of Key Functions
| Task | Common scipy.stats Function | Use Case |
|---|---|---|
| Distribution | scipy.stats.norm, binom, uniform, etc. | Define a probability distribution with specific parameters. |
| Random Sampling | dist.rvs(size=N) | Generate random numbers from a distribution. |
| PDF/PMF | dist.pdf(x), dist.pmf(k) | Get the probability density (continuous) or mass (discrete) at a point. |
| CDF | dist.cdf(x) | Get the cumulative probability up to a point. |
| Quantile (PPF) | dist.ppf(q) | Find the value corresponding to a cumulative probability. |
| T-Test | scipy.stats.ttest_ind(a, b) | Compare the means of two independent groups. |
| Chi-Squared Test | scipy.stats.chi2_contingency(table) | Test for independence between two categorical variables. |
| ANOVA | scipy.stats.f_oneway(*args) | Compare the means of three or more independent groups. |
| Correlation | scipy.stats.pearsonr(x, y) | Calculate the Pearson correlation coefficient and p-value. |
| Linear Regression | scipy.stats.linregress(x, y) | Perform a linear regression and return slope, intercept, r-value, etc. |
| Descriptive Stats | scipy.stats.describe(data) | Get a set of descriptive statistics (mean, variance, skew, kurtosis). |
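The last three rows have no example above, so here is a minimal sketch using synthetic, roughly linear data (the exact numbers will vary with the random draw):

# Correlation, regression, and descriptive stats on synthetic data
xs = np.random.normal(0, 1, 100)
ys = 2 * xs + np.random.normal(0, 0.5, 100) # roughly linear relationship
r, p = stats.pearsonr(xs, ys)
print(f"Pearson r: {r:.3f}, p-value: {p:.3g}") # r close to 1 here
result = stats.linregress(xs, ys)
print(f"slope: {result.slope:.3f}, intercept: {result.intercept:.3f}, r: {result.rvalue:.3f}")
print(stats.describe(ys)) # nobs, minmax, mean, variance, skewness, kurtosis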
When to Use scipy.stats vs. pandas vs. statsmodels
- scipy.stats: Your go-to for fundamental statistical calculations. Think of it as a mathematical statistics library. Use it for:
  - Working with known probability distributions.
  - Performing common hypothesis tests (t-test, chi-squared, etc.).
  - Calculating basic descriptive statistics on a NumPy array.
- pandas: The data manipulation and analysis library. Its .describe() method is excellent for a quick overview of a DataFrame's columns (mean, std, quartiles, etc.). Use it for:
  - Loading, cleaning, and transforming data.
  - Grouping data and calculating summary statistics within those groups (e.g., df.groupby('category')['value'].mean()).
  - Time series analysis.
- statsmodels: A more advanced library for statistical modeling and econometrics. It provides detailed statistical reports. Use it for:
  - Building complex linear regression models (OLS).
  - Time series analysis (ARIMA, VAR).
  - Generalized Linear Models (GLMs).
  - Getting detailed ANOVA tables for regression models.
In short: Use scipy.stats for the core statistical calculations and tests. Use pandas to manage and explore your data, and use statsmodels when you need to build and analyze sophisticated statistical models.