Of course! Working with Python and Excel is a very common task, especially for data analysis, reporting, and automation. Here’s a comprehensive guide covering the most popular libraries, from simple reading/writing to more advanced operations.
We'll focus on two main libraries:
openpyxl: Excellent for reading, writing, and modifying.xlsxfiles. It gives you fine-grained control over worksheets, cells, styles, and charts.pandas: The go-to library for data analysis. It's incredibly fast and powerful for loading an entire Excel sheet (or multiple sheets) directly into a DataFrame, performing operations, and then saving it back.
The Scenario: A Sample Excel File
Let's assume you have an Excel file named sales_data.xlsx with two sheets: January and February.
sales_data.xlsx Structure:
| January Sheet | |||
|---|---|---|---|
| Product | Sales | Region | Date |
| Laptop | 1200 | North | 2025-01-15 |
| Mouse | 25 | South | 2025-01-20 |
| Keyboard | 75 | East | 2025-01-25 |
| Monitor | 300 | West | 2025-01-30 |
| February Sheet | |||
|---|---|---|---|
| Product | Sales | Region | Date |
| Laptop | 1500 | North | 2025-02-10 |
| Mouse | 30 | South | 2025-02-12 |
| Keyboard | 80 | East | 2025-02-18 |
| Webcam | 50 | West | 2025-02-28 |
Method 1: openpyxl - For Detailed Workbook Manipulation
openpyxl is like a remote control for your Excel file. You can select specific cells, change formatting, insert rows/columns, and more.
Installation
pip install openpyxl
Example: Reading and Writing with openpyxl
from openpyxl import load_workbook
from openpyxl.styles import Font
from datetime import datetime
# --- READING DATA ---
# Load the existing workbook
wb = load_workbook('sales_data.xlsx')
# Select a specific sheet by name
sheet_jan = wb['January']
# Get the value of a specific cell (e.g., B2, which is '1200')
sales_value = sheet_jan['B2'].value
print(f"Sales value from B2: {sales_value}")
# Read all data from a sheet (ignoring the header)
data = []
for row in sheet_jan.iter_rows(min_row=2, values_only=True):
data.append(row)
print("\nData from January sheet (without header):")
print(data)
# --- WRITING DATA ---
# Select the February sheet
sheet_feb = wb['February']
# Modify a specific cell value
sheet_feb['B2'] = 1550 # Update Laptop sales for February
# Add a new row at the end
new_row_data = ['Headphones', 60, 'Central', datetime(2025, 2, 25).date()]
sheet_feb.append(new_row_data)
# Add a header to a new sheet
if 'Summary' not in wb.sheetnames:
wb.create_sheet('Summary')
sheet_summary = wb['Summary']
sheet_summary.append(['Total Sales', 'Total Products'])
sheet_summary['A2'] = '=SUM(January!B:B)' # You can even write Excel formulas!
# Apply styling (e.g., make the header bold)
for cell in sheet_jan[1]: # Iterate through the first row
cell.font = Font(bold=True)
# Save the workbook (this will overwrite the original file)
wb.save('sales_data_modified.xlsx')
print("\n'openpyxl' script finished. 'sales_data_modified.xlsx' has been created.")
Method 2: pandas - For Data Analysis
pandas is designed for data. It reads an entire sheet into a DataFrame, a powerful 2D table-like data structure, making it easy to perform calculations, filter, and group data.
Installation
pip install pandas openpyxl # pandas needs a library like openpyxl to read .xlsx files
Example: Reading and Analyzing with pandas
import pandas as pd
# --- READING DATA ---
# Read the 'January' sheet into a DataFrame
df_jan = pd.read_excel('sales_data.xlsx', sheet_name='January')
print("--- January Data ---")
print(df_jan)
print("\nData types:")
print(df_jan.dtypes)
# Read all sheets into a dictionary of DataFrames
all_sheets = pd.read_excel('sales_data.xlsx', sheet_name=None)
print("\n--- Available Sheets ---")
print(all_sheets.keys())
df_feb = all_sheets['February']
print("\n--- February Data ---")
print(df_feb)
# --- DATA ANALYSIS ---
# Calculate total sales for January
total_sales_jan = df_jan['Sales'].sum()
print(f"\nTotal Sales in January: {total_sales_jan}")
# Find the product with the highest sales in February
max_sales_feb = df_feb.loc[df_feb['Sales'].idxmax()]
print("\nProduct with the highest sales in February:")
print(max_sales_feb)
# Filter data: Get all sales from the 'North' region
north_sales = df_jan[df_jan['Region'] == 'North']
print("\nSales from the North region:")
print(north_sales)
# --- WRITING DATA ---
# Create a new summary DataFrame
summary_data = {
'Month': ['January', 'February'],
'Total Sales': [df_jan['Sales'].sum(), df_feb['Sales'].sum()]
}
df_summary = pd.DataFrame(summary_data)
# Write the new DataFrame to a new Excel file
# index=False prevents pandas from writing the DataFrame index as a column
df_summary.to_excel('sales_summary.xlsx', index=False)
print("\n'pandas' script finished. 'sales_summary.xlsx' has been created.")
Comparison: openpyxl vs. pandas
| Feature | openpyxl |
pandas |
|---|---|---|
| Primary Use | Reading, writing, and modifying existing Excel workbooks. | Data analysis, manipulation, and statistical operations. |
| Data Handling | Works with individual cells, rows, and columns. Loads data row-by-row or cell-by-cell. | Loads entire sheets into DataFrames, optimized for numerical data. |
| Performance | Slower for large datasets as it's not a data analysis library. | Extremely fast and efficient for large datasets. |
| Key Strength | Fine-grained control: Styles, charts, images, formulas, complex layouts. | Ease of analysis: Built-in functions for sum(), mean(), groupby(), merge(), etc. |
| Best For | - Automating reports with specific formatting. - Modifying templates. - Adding charts or images. |
- Cleaning and transforming data. - Performing calculations and aggregations. - Creating simple data exports. |
Which One Should I Use?
-
Use
pandasif:- Your main goal is to analyze the data in Excel.
- You need to perform calculations, filter, or group the data.
- You are exporting data from a program to Excel for others to view.
- You are working with large datasets.
-
Use
openpyxlif:- You need to preserve or modify the exact layout of an existing Excel file.
- You need to apply specific formatting (colors, fonts, borders).
- You need to insert charts, images, or formulas into the spreadsheet.
- You are building a complex, template-based report.
Bonus: Combining Both (The Best of Both Worlds)
You can use pandas for the heavy data lifting and openpyxl for final formatting.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, Alignment
# 1. Use pandas to create the data and write a draft file
df_summary = pd.DataFrame({
'Report': ['Q1 Sales Summary'],
'Total Sales': [15000],
'Generated By': ['Python Automation']
})
df_summary.to_excel('final_report_draft.xlsx', index=False)
# 2. Use openpyxl to load the draft and apply professional formatting
wb = load_workbook('final_report_draft.xlsx')
ws = wb.active
# Style the header row
for cell in ws[1]:
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
cell.alignment = Alignment(horizontal="center")
# Style the main title
ws['A1'].font = Font(size=16, bold=True, name="Calibri")
ws['A1'].alignment = Alignment(horizontal="center")
# Save the final, beautifully formatted report
wb.save('final_report_styled.xlsx')
print("\nCombined script finished. 'final_report_styled.xlsx' has been created.") 