Data Science Essentials

Master NumPy arrays, Pandas DataFrames, data cleaning, analysis, and matplotlib visualization.

Advanced · 45 min read · 🐍 Python

NumPy — Numerical Computing

NumPy is the foundation of Python's data science ecosystem. It provides ndarray — a fast, memory-efficient array type that supports vectorized operations. Instead of looping over elements, you operate on entire arrays at once, which is 10-100x faster than Python loops.

pip install numpy pandas matplotlib

Arrays and Operations

import numpy as np

# Create arrays
a = np.array([1, 2, 3, 4, 5])
b = np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
c = np.zeros((3, 3))         # 3x3 matrix of zeros
d = np.ones((2, 4))          # 2x4 matrix of ones
e = np.linspace(0, 1, 5)    # 5 evenly spaced values from 0 to 1

print(f"a: {a}")
print(f"b: {b}")
print(f"e: {e}")

# Vectorized operations — no loops needed!
print(f"\na * 2 = {a * 2}")
print(f"a + 10 = {a + 10}")
print(f"a ** 2 = {a ** 2}")
print(f"np.sqrt(a) = {np.sqrt(a)}")

# Statistics
print(f"\nmean: {a.mean()}, std: {a.std():.2f}, sum: {a.sum()}")
Output
a: [1 2 3 4 5]
b: [0 2 4 6 8]
e: [0.   0.25 0.5  0.75 1.  ]

a * 2 = [ 2  4  6  8 10]
a + 10 = [11 12 13 14 15]
a ** 2 = [ 1  4  9 16 25]
np.sqrt(a) = [1.         1.41421356 1.73205081 2.         2.23606798]

mean: 3.0, std: 1.41, sum: 15

Matrix Operations

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("Matrix multiply:")
print(A @ B)      # @ operator for matrix multiplication

print("\nElement-wise multiply:")
print(A * B)

print("\nTranspose:")
print(A.T)

print("\nReshape:")
flat = np.arange(12)
matrix = flat.reshape(3, 4)
print(matrix)
Output
Matrix multiply:
[[19 22]
 [43 50]]

Element-wise multiply:
[[ 5 12]
 [21 32]]

Transpose:
[[1 3]
 [2 4]]

Reshape:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Key Takeaway: NumPy operations work on entire arrays at once (vectorization). This is 10-100x faster than Python for-loops because the heavy lifting happens in optimized C code.
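Vectorization also covers boolean masking and broadcasting, two patterns you will use constantly. A minimal sketch (the array values are illustrative):

```python
import numpy as np

temps = np.array([18.5, 21.0, 25.3, 30.1, 27.8])

# Boolean mask: element-wise comparison yields a boolean array
hot = temps > 25
print(temps[hot])            # select only the elements where the mask is True

# Broadcasting: a scalar is "stretched" to match the array's shape
fahrenheit = temps * 9 / 5 + 32
print(fahrenheit)

# Broadcasting a 1D array across a 2D matrix: temps is added to each row
matrix = np.zeros((3, 5)) + temps
print(matrix.shape)          # (3, 5)
```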

Pandas — Data Analysis

Pandas provides two main data structures: Series (1D, like a column) and DataFrame (2D, like a spreadsheet/SQL table). It's the go-to tool for data loading, cleaning, analysis, and transformation.
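Before the DataFrame examples below, a quick look at Series, which behaves like a labeled 1D array (values here are invented for illustration):

```python
import pandas as pd

# A Series: 1D values with an index of labels
s = pd.Series([95000, 72000, 110000], index=["Alice", "Bob", "Charlie"])

print(s["Alice"])       # label-based access
print(s.mean())         # vectorized stats, like NumPy
print(s[s > 80000])     # boolean filtering works too
```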

Creating and Exploring DataFrames

import pandas as pd

# Create from dictionary
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [30, 25, 35, 28, 32],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [95000, 72000, 110000, 85000, 98000],
})

print(df)
print(f"\nShape: {df.shape}")   # (5, 4) — rows, cols
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nStats:\n{df.describe()}")
Output
      name  age     city  salary
0    Alice   30      NYC   95000
1      Bob   25       LA   72000
2  Charlie   35      NYC  110000
3    Diana   28  Chicago   85000
4      Eve   32       LA   98000

Shape: (5, 4)

Dtypes:
name      object
age        int64
city      object
salary     int64

Stats:
             age         salary
count   5.000000       5.000000
mean   30.000000   92000.000000
std     3.807887   14300.349646
min    25.000000   72000.000000
25%    28.000000   85000.000000
50%    30.000000   95000.000000
75%    32.000000   98000.000000
max    35.000000  110000.000000

Filtering and Selecting

# Select columns
print(df["name"])            # Single column (Series)
print(df[["name", "salary"]]) # Multiple columns (DataFrame)

# Filter rows
senior = df[df["age"] > 30]
print(f"\nAge > 30:\n{senior}")

nyc_high_salary = df[(df["city"] == "NYC") & (df["salary"] > 90000)]
print(f"\nNYC + salary > 90k:\n{nyc_high_salary}")

# Sort
by_salary = df.sort_values("salary", ascending=False)
print(f"\nBy salary (desc):\n{by_salary[['name', 'salary']]}")
Output
Age > 30:
      name  age city  salary
2  Charlie   35  NYC  110000
4      Eve   32   LA   98000

NYC + salary > 90k:
      name  age city  salary
0    Alice   30  NYC   95000
2  Charlie   35  NYC  110000

By salary (desc):
      name  salary
2  Charlie  110000
4      Eve   98000
0    Alice   95000
3    Diana   85000
1      Bob   72000
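Bracket selection covers most cases, but .loc (label-based) and .iloc (position-based) give finer control. A sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [30, 25, 35, 28, 32],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [95000, 72000, 110000, 85000, 98000],
})

# .loc: rows and columns by label (index labels here are 0..4)
print(df.loc[2, "name"])                          # single cell
print(df.loc[df["city"] == "LA", ["name", "salary"]])  # mask + column subset

# .iloc: rows and columns by integer position
print(df.iloc[0])           # first row as a Series
print(df.iloc[1:3, 0:2])    # rows 1-2, first two columns
```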

GroupBy and Aggregation

# Group by city, calculate stats
city_stats = df.groupby("city").agg(
    avg_salary=("salary", "mean"),
    count=("name", "count"),
    avg_age=("age", "mean"),
)
print(city_stats)
Output
         avg_salary  count  avg_age
city
Chicago     85000.0      1     28.0
LA          85000.0      2     28.5
NYC        102500.0      2     32.5
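Beyond agg, groupby supports transform, which returns a result aligned to the original rows; this is handy for adding group-level columns. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [95000, 72000, 110000, 85000, 98000],
})

# Add each person's city-average salary as a new column (one value per row)
df["city_avg"] = df.groupby("city")["salary"].transform("mean")

# Compare each salary against its city average
df["above_avg"] = df["salary"] > df["city_avg"]
print(df)
```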

Reading Real Data

import pandas as pd

# Read from CSV
df = pd.read_csv("data.csv")

# Read from Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Read from JSON
df = pd.read_json("data.json")

# Read from SQL
# import sqlite3
# conn = sqlite3.connect("mydb.sqlite")
# df = pd.read_sql("SELECT * FROM users", conn)

# Quick inspection
print(df.head())         # First 5 rows
df.info()                # Column types and null counts (prints directly; returns None)
print(df.isnull().sum()) # Count missing values per column
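read_csv also accepts options that do cleaning work at load time. A sketch using an in-memory CSV standing in for a real file (the contents and column names are invented for illustration):

```python
import io
import pandas as pd

# A tiny in-memory CSV; a file path would work the same way
csv_text = """user_id,signup_date,plan,notes
007,2023-01-15,pro,ok
042,2023-02-20,free,N/A
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["signup_date"],     # parse this column as datetime on load
    dtype={"user_id": str},          # keep IDs as strings (preserves "007")
    na_values=["N/A"],               # treat this string as NaN
    usecols=["user_id", "signup_date", "plan"],  # load only needed columns
)

print(df.dtypes)
print(df)
```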

Data Cleaning

Real-world data is messy. Pandas makes cleaning straightforward:

import pandas as pd

# Handle missing values (these return new objects; assign to keep the result)
df = df.dropna()                               # Drop rows with any NaN
df = df.fillna(0)                              # ...or replace NaN with 0
df["col"] = df["col"].fillna(df["col"].mean()) # ...or fill with column mean

# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"])      # Based on specific column

# Type conversion
df["date"] = pd.to_datetime(df["date"])
df["price"] = df["price"].astype(float)

# String operations
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()

# Rename columns
df = df.rename(columns={"old_name": "new_name"})

# Add computed column
df["tax"] = df["salary"] * 0.3
df["net_salary"] = df["salary"] - df["tax"]
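Put together, a cleaning pass on a small messy dataset might look like this (the data is invented for illustration):

```python
import pandas as pd

raw = pd.DataFrame({
    "name": ["  alice ", "BOB", "bob", None],
    "email": ["A@X.COM", "b@x.com", "b@x.com", "c@x.com"],
    "salary": ["95000", "72000", "72000", None],
})

df = raw.copy()
df["name"] = df["name"].str.strip().str.title()       # tidy whitespace and case
df["email"] = df["email"].str.lower()
df["salary"] = df["salary"].astype(float)             # string -> numeric (None -> NaN)
df["salary"] = df["salary"].fillna(df["salary"].mean())
df = df.drop_duplicates(subset=["email"])             # keep one row per email
print(df)
```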

Visualization with matplotlib

import matplotlib.pyplot as plt
import pandas as pd

# Simple line chart
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [10000, 12000, 15000, 14000, 18000, 22000],
    "expenses": [8000, 9000, 11000, 10000, 12000, 14000],
})

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(df["month"], df["revenue"], marker="o", label="Revenue")
ax.plot(df["month"], df["expenses"], marker="s", label="Expenses")
ax.set_title("Monthly Revenue vs Expenses")
ax.set_xlabel("Month")
ax.set_ylabel("Amount ($)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("revenue_chart.png", dpi=150)
# plt.show()  # Display in window

Pandas also has built-in plotting that wraps matplotlib:

# Bar chart directly from DataFrame
df.plot(x="month", y=["revenue", "expenses"], kind="bar", figsize=(10, 6))
plt.title("Monthly Comparison")
plt.tight_layout()
plt.savefig("bar_chart.png")

# Other chart types
# df["revenue"].plot(kind="hist")     # Histogram
# df.plot(kind="scatter", x="age", y="salary")  # Scatter
# df["city"].value_counts().plot(kind="pie")     # Pie chart
| Library    | Best For                         | Notes                                     |
|------------|----------------------------------|-------------------------------------------|
| matplotlib | Publication-quality static plots | The foundation; most customizable         |
| seaborn    | Statistical visualization        | Built on matplotlib; beautiful defaults   |
| plotly     | Interactive web charts           | Zoom, hover, export; great for dashboards |
| altair     | Declarative statistical charts   | Grammar of graphics; concise API          |
🔍 Deep Dive: EDA Workflow

Exploratory Data Analysis (EDA) typically follows this pattern: (1) Load data with pd.read_csv(), (2) Inspect with .head(), .info(), .describe(), (3) Check for missing values with .isnull().sum(), (4) Visualize distributions with histograms, (5) Look at correlations with .corr() and heatmaps, (6) Group and aggregate to find patterns. Libraries like ydata-profiling (formerly pandas-profiling) automate this entire process with a single function call.
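The steps above can be sketched as one short helper (the file name in the commented line is illustrative):

```python
import pandas as pd

def quick_eda(df: pd.DataFrame) -> None:
    """Run the standard first-look checks on a DataFrame."""
    print(df.head())             # (2) peek at the data
    df.info()                    # (2) column types and non-null counts
    print(df.describe())         # (2) summary statistics
    print(df.isnull().sum())     # (3) missing values per column
    numeric = df.select_dtypes("number")
    print(numeric.corr())        # (5) pairwise correlations
    # (4) distributions: numeric.hist(); (6) grouping depends on your columns

# quick_eda(pd.read_csv("data.csv"))   # step (1): load, then explore
```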

⚠️ Common Mistake: Chained Assignment Warning

Wrong:

df[df["age"] > 30]["salary"] = 100000  # SettingWithCopyWarning!

Why: This modifies a copy of the data, not the original DataFrame. Pandas warns you because the change is silently lost.

Instead:

df.loc[df["age"] > 30, "salary"] = 100000  # Correct!
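A quick runnable check of the correct form on toy data: .loc selects and assigns in one step, so the change lands in the original DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35], "salary": [72000, 110000]})

# .loc assigns into the original DataFrame, not a copy
df.loc[df["age"] > 30, "salary"] = 100000
print(df)
```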