NumPy — Numerical Computing
NumPy is the foundation of Python's data science ecosystem. It provides ndarray, a fast, memory-efficient array type that supports vectorized operations: instead of looping over elements, you operate on entire arrays at once, which is often 10-100x faster than the equivalent Python loop.
pip install numpy pandas matplotlib
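To see that speedup for yourself, here's a rough micro-benchmark (a sketch; exact numbers vary by machine and array size):
import timeit

import numpy as np

n = 1_000_000
data = list(range(n))
arr = np.arange(n)

# Pure-Python loop: square every element one at a time
loop_time = timeit.timeit(lambda: [x * x for x in data], number=10)

# Vectorized NumPy: one call, the loop runs in C
vec_time = timeit.timeit(lambda: arr * arr, number=10)

print(f"loop: {loop_time:.2f}s  vectorized: {vec_time:.2f}s  "
      f"speedup: {loop_time / vec_time:.0f}x")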
Arrays and Operations
import numpy as np
# Create arrays
a = np.array([1, 2, 3, 4, 5])
b = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
c = np.zeros((3, 3)) # 3x3 matrix of zeros
d = np.ones((2, 4)) # 2x4 matrix of ones
e = np.linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1
print(f"a: {a}")
print(f"b: {b}")
print(f"e: {e}")
# Vectorized operations — no loops needed!
print(f"\na * 2 = {a * 2}")
print(f"a + 10 = {a + 10}")
print(f"a ** 2 = {a ** 2}")
print(f"np.sqrt(a) = {np.sqrt(a)}")
# Statistics
print(f"\nmean: {a.mean()}, std: {a.std():.2f}, sum: {a.sum()}")
a: [1 2 3 4 5]
b: [0 2 4 6 8]
e: [0.   0.25 0.5  0.75 1.  ]

a * 2 = [ 2  4  6  8 10]
a + 10 = [11 12 13 14 15]
a ** 2 = [ 1  4  9 16 25]
np.sqrt(a) = [1.         1.41421356 1.73205081 2.         2.23606798]

mean: 3.0, std: 1.41, sum: 15
Matrix Operations
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix multiply:")
print(A @ B) # @ operator for matrix multiplication
print("\nElement-wise multiply:")
print(A * B)
print("\nTranspose:")
print(A.T)
print("\nReshape:")
flat = np.arange(12)
matrix = flat.reshape(3, 4)
print(matrix)
Matrix multiply:
[[19 22]
 [43 50]]

Element-wise multiply:
[[ 5 12]
 [21 32]]

Transpose:
[[1 3]
 [2 4]]

Reshape:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Pandas — Data Analysis
Pandas provides two main data structures: Series (1D, like a column) and DataFrame (2D, like a spreadsheet/SQL table). It's the go-to tool for data loading, cleaning, analysis, and transformation.
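A quick sketch to make the distinction concrete:
import pandas as pd

# Series: a labeled 1D array (a single column)
s = pd.Series([95000, 72000, 110000], name="salary")
print(s)

# DataFrame: a 2D table whose columns are Series
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
print(df["age"])        # selecting one column returns a Series
print(type(df["age"]))  # <class 'pandas.core.series.Series'>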
Creating and Exploring DataFrames
import pandas as pd
# Create from dictionary
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"age": [30, 25, 35, 28, 32],
"city": ["NYC", "LA", "NYC", "Chicago", "LA"],
"salary": [95000, 72000, 110000, 85000, 98000],
})
print(df)
print(f"\nShape: {df.shape}") # (5, 4) — rows, cols
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nStats:\n{df.describe()}")
name age city salary
0 Alice 30 NYC 95000
1 Bob 25 LA 72000
2 Charlie 35 NYC 110000
3 Diana 28 Chicago 85000
4 Eve 32 LA 98000
Shape: (5, 4)
Dtypes:
name object
age int64
city object
salary int64
Stats:
            age         salary
count   5.00000       5.000000
mean   30.00000   92000.000000
std     3.80789   14300.349646
min    25.00000   72000.000000
25%    28.00000   85000.000000
50%    30.00000   95000.000000
75%    32.00000   98000.000000
max    35.00000  110000.000000

Filtering and Selecting
# Select columns
print(df["name"]) # Single column (Series)
print(df[["name", "salary"]]) # Multiple columns (DataFrame)
# Filter rows
senior = df[df["age"] > 30]
print(f"\nAge > 30:\n{senior}")
nyc_high_salary = df[(df["city"] == "NYC") & (df["salary"] > 90000)]
print(f"\nNYC + salary > 90k:\n{nyc_high_salary}")
# Sort
by_salary = df.sort_values("salary", ascending=False)
print(f"\nBy salary (desc):\n{by_salary[['name', 'salary']]}")
Age > 30:
name age city salary
2 Charlie 35 NYC 110000
4 Eve 32 LA 98000
NYC + salary > 90k:
name age city salary
0 Alice 30 NYC 95000
2 Charlie 35 NYC 110000
By salary (desc):
name salary
2 Charlie 110000
4 Eve 98000
0 Alice 95000
3 Diana 85000
1 Bob 72000

GroupBy and Aggregation
# Group by city, calculate stats
city_stats = df.groupby("city").agg(
avg_salary=("salary", "mean"),
count=("name", "count"),
avg_age=("age", "mean"),
)
print(city_stats)
         avg_salary  count  avg_age
city
Chicago     85000.0      1     28.0
LA          85000.0      2     28.5
NYC        102500.0      2     32.5
Reading Real Data
import pandas as pd
# Read from CSV
df = pd.read_csv("data.csv")
# Read from Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
# Read from JSON
df = pd.read_json("data.json")
# Read from SQL
# import sqlite3
# conn = sqlite3.connect("mydb.sqlite")
# df = pd.read_sql("SELECT * FROM users", conn)
# Quick inspection
print(df.head()) # First 5 rows
df.info() # Column types and null counts (info() prints directly and returns None)
print(df.isnull().sum()) # Count missing values per column
Data Cleaning
Real-world data is messy. Pandas makes cleaning straightforward:
import pandas as pd
# Handle missing values. Each call returns a new object,
# so assign the result back (e.g. df = df.dropna())
df.dropna() # Drop rows with any NaN
df.fillna(0) # Replace NaN with 0
df["col"].fillna(df["col"].mean()) # Fill with column mean
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=["email"]) # Based on specific column
# Type conversion
df["date"] = pd.to_datetime(df["date"])
df["price"] = df["price"].astype(float)
# String operations
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()
# Rename columns
df = df.rename(columns={"old_name": "new_name"})
# Add computed column
df["tax"] = df["salary"] * 0.3
df["net_salary"] = df["salary"] - df["tax"]
Visualization with matplotlib
import matplotlib.pyplot as plt
import pandas as pd
# Simple line chart
df = pd.DataFrame({
"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
"revenue": [10000, 12000, 15000, 14000, 18000, 22000],
"expenses": [8000, 9000, 11000, 10000, 12000, 14000],
})
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(df["month"], df["revenue"], marker="o", label="Revenue")
ax.plot(df["month"], df["expenses"], marker="s", label="Expenses")
ax.set_title("Monthly Revenue vs Expenses")
ax.set_xlabel("Month")
ax.set_ylabel("Amount ($)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("revenue_chart.png", dpi=150)
# plt.show() # Display in window
Pandas also has built-in plotting that wraps matplotlib:
# Bar chart directly from DataFrame
df.plot(x="month", y=["revenue", "expenses"], kind="bar", figsize=(10, 6))
plt.title("Monthly Comparison")
plt.tight_layout()
plt.savefig("bar_chart.png")
# Other chart types
# df["revenue"].plot(kind="hist") # Histogram
# df.plot(kind="scatter", x="age", y="salary") # Scatter
# df["city"].value_counts().plot(kind="pie") # Pie chart
Matplotlib is just one option; a few other plotting libraries are worth knowing:

| Library | Best For | Notes |
|---|---|---|
| matplotlib | Publication-quality static plots | The foundation; most customizable |
| seaborn | Statistical visualization | Built on matplotlib; beautiful defaults |
| plotly | Interactive web charts | Zoom, hover, export; great for dashboards |
| altair | Declarative statistical charts | Grammar of graphics; concise API |
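As a taste of the alternatives, here's roughly how the earlier revenue-vs-expenses chart looks in seaborn (a sketch, assuming seaborn is installed and df is the monthly DataFrame from the matplotlib example above):
import matplotlib.pyplot as plt
import seaborn as sns

# Reshape to long format: one row per (month, category, amount)
long_df = df.melt(id_vars="month", var_name="category", value_name="amount")

sns.lineplot(data=long_df, x="month", y="amount", hue="category", marker="o")
plt.title("Monthly Revenue vs Expenses (seaborn)")
plt.tight_layout()
plt.savefig("seaborn_chart.png")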
🔍 Deep Dive: EDA Workflow
Exploratory Data Analysis (EDA) typically follows this pattern: (1) Load data with pd.read_csv(), (2) Inspect with .head(), .info(), .describe(), (3) Check for missing values with .isnull().sum(), (4) Visualize distributions with histograms, (5) Look at correlations with .corr() and heatmaps, (6) Group and aggregate to find patterns. Libraries like ydata-profiling (formerly pandas-profiling) automate this entire process with a single function call.
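A minimal sketch of those six steps (the file name and the region/revenue columns are hypothetical):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sales.csv")                  # (1) load (hypothetical file)

print(df.head())                               # (2) inspect
df.info()
print(df.describe())

print(df.isnull().sum())                       # (3) missing values per column

df.hist(figsize=(10, 6))                       # (4) distributions
plt.tight_layout()
plt.savefig("distributions.png")

print(df.corr(numeric_only=True))              # (5) pairwise correlations

print(df.groupby("region")["revenue"].mean())  # (6) group and aggregate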
⚠️ Common Mistake: Chained Assignment Warning
Wrong:
df[df["age"] > 30]["salary"] = 100000 # SettingWithCopyWarning!
Why: Chained indexing first extracts a copy, so the assignment modifies that copy, not the original DataFrame. Pandas warns you because the change is silently lost (and with copy-on-write, the planned default in pandas 3.0, chained assignment never takes effect).
Instead:
df.loc[df["age"] > 30, "salary"] = 100000 # Correct!