Data Science Essentials

Master NumPy arrays, Pandas DataFrames, data cleaning, analysis, and matplotlib visualization.

Advanced · 45 min read · 🐍 Python

NumPy — Numerical Computing

NumPy is the foundation of Python's data science ecosystem. It provides ndarray — a fast, memory-efficient array type that supports vectorized operations. Instead of looping over elements, you operate on entire arrays at once, which is 10-100x faster than Python loops.

pip install numpy pandas matplotlib

Arrays and Operations

import numpy as np

# Create arrays
a = np.array([1, 2, 3, 4, 5])
b = np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
c = np.zeros((3, 3))         # 3x3 matrix of zeros
d = np.ones((2, 4))          # 2x4 matrix of ones
e = np.linspace(0, 1, 5)    # 5 evenly spaced values from 0 to 1

print(f"a: {a}")
print(f"b: {b}")
print(f"e: {e}")

# Vectorized operations — no loops needed!
print(f"\na * 2 = {a * 2}")
print(f"a + 10 = {a + 10}")
print(f"a ** 2 = {a ** 2}")
print(f"np.sqrt(a) = {np.sqrt(a)}")

# Statistics
print(f"\nmean: {a.mean()}, std: {a.std():.2f}, sum: {a.sum()}")
Output
a: [1 2 3 4 5]
b: [0 2 4 6 8]
e: [0.   0.25 0.5  0.75 1.  ]

a * 2 = [ 2  4  6  8 10]
a + 10 = [11 12 13 14 15]
a ** 2 = [ 1  4  9 16 25]
np.sqrt(a) = [1.         1.41421356 1.73205081 2.         2.23606798]

mean: 3.0, std: 1.41, sum: 15

Matrix Operations

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("Matrix multiply:")
print(A @ B)      # @ operator for matrix multiplication

print("\nElement-wise multiply:")
print(A * B)

print("\nTranspose:")
print(A.T)

print("\nReshape:")
flat = np.arange(12)
matrix = flat.reshape(3, 4)
print(matrix)
Output
Matrix multiply:
[[19 22]
 [43 50]]

Element-wise multiply:
[[ 5 12]
 [21 32]]

Transpose:
[[1 3]
 [2 4]]

Reshape:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Key Takeaway: NumPy operations work on entire arrays at once (vectorization). This is 10-100x faster than Python for-loops because the heavy lifting happens in optimized C code.
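Vectorization also covers boolean masking and broadcasting, two patterns you will use constantly. A minimal sketch (the array values are illustrative):

```python
import numpy as np

temps = np.array([18.5, 21.0, 25.3, 30.1, 27.8])

# Boolean mask: element-wise comparison yields a boolean array
hot = temps > 25
print(temps[hot])            # select only the elements where the mask is True

# Broadcasting: a scalar is "stretched" to match the array's shape
fahrenheit = temps * 9 / 5 + 32
print(fahrenheit)

# Broadcasting a 1D array across a 2D matrix: temps is added to each row
matrix = np.zeros((3, 5)) + temps
print(matrix.shape)          # (3, 5)
```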

Pandas — Data Analysis

Pandas provides two main data structures: Series (1D, like a column) and DataFrame (2D, like a spreadsheet/SQL table). It's the go-to tool for data loading, cleaning, analysis, and transformation.
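Before the DataFrame examples below, a quick look at Series, which behaves like a labeled 1D array (values here are invented for illustration):

```python
import pandas as pd

# A Series: 1D values with an index of labels
s = pd.Series([95000, 72000, 110000], index=["Alice", "Bob", "Charlie"])

print(s["Alice"])       # label-based access
print(s.mean())         # vectorized stats, like NumPy
print(s[s > 80000])     # boolean filtering works too
```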

Creating and Exploring DataFrames

import pandas as pd

# Create from dictionary
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [30, 25, 35, 28, 32],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [95000, 72000, 110000, 85000, 98000],
})

print(df)
print(f"\nShape: {df.shape}")   # (5, 4) — rows, cols
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nStats:\n{df.describe()}")
Output
      name  age     city  salary
0    Alice   30      NYC   95000
1      Bob   25       LA   72000
2  Charlie   35      NYC  110000
3    Diana   28  Chicago   85000
4      Eve   32       LA   98000

Shape: (5, 4)

Dtypes:
name      object
age        int64
city      object
salary     int64

Stats:
             age         salary
count   5.000000       5.000000
mean   30.000000   92000.000000
std     3.807887   14300.349646
min    25.000000   72000.000000
25%    28.000000   85000.000000
50%    30.000000   95000.000000
75%    32.000000   98000.000000
max    35.000000  110000.000000

Filtering and Selecting

# Select columns
print(df["name"])            # Single column (Series)
print(df[["name", "salary"]]) # Multiple columns (DataFrame)

# Filter rows
senior = df[df["age"] > 30]
print(f"\nAge > 30:\n{senior}")

nyc_high_salary = df[(df["city"] == "NYC") & (df["salary"] > 90000)]
print(f"\nNYC + salary > 90k:\n{nyc_high_salary}")

# Sort
by_salary = df.sort_values("salary", ascending=False)
print(f"\nBy salary (desc):\n{by_salary[['name', 'salary']]}")
Output
Age > 30:
      name  age city  salary
2  Charlie   35  NYC  110000
4      Eve   32   LA   98000

NYC + salary > 90k:
      name  age city  salary
0    Alice   30  NYC   95000
2  Charlie   35  NYC  110000

By salary (desc):
      name  salary
2  Charlie  110000
4      Eve   98000
0    Alice   95000
3    Diana   85000
1      Bob   72000
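Bracket selection covers most cases, but .loc (label-based) and .iloc (position-based) give finer control. A sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [30, 25, 35, 28, 32],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [95000, 72000, 110000, 85000, 98000],
})

# .loc: rows and columns by label (index labels here are 0..4)
print(df.loc[2, "name"])                          # single cell
print(df.loc[df["city"] == "LA", ["name", "salary"]])  # mask + column subset

# .iloc: rows and columns by integer position
print(df.iloc[0])           # first row as a Series
print(df.iloc[1:3, 0:2])    # rows 1-2, first two columns
```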

GroupBy and Aggregation

# Group by city, calculate stats
city_stats = df.groupby("city").agg(
    avg_salary=("salary", "mean"),
    count=("name", "count"),
    avg_age=("age", "mean"),
)
print(city_stats)
Output
         avg_salary  count  avg_age
city
Chicago     85000.0      1     28.0
LA          85000.0      2     28.5
NYC        102500.0      2     32.5
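Beyond agg, groupby supports transform, which returns a result aligned to the original rows; this is handy for adding group-level columns. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [95000, 72000, 110000, 85000, 98000],
})

# Add each person's city-average salary as a new column (one value per row)
df["city_avg"] = df.groupby("city")["salary"].transform("mean")

# Compare each salary against its city average
df["above_avg"] = df["salary"] > df["city_avg"]
print(df)
```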

Reading Real Data

import pandas as pd

# Read from CSV
df = pd.read_csv("data.csv")

# Read from Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Read from JSON
df = pd.read_json("data.json")

# Read from SQL
# import sqlite3
# conn = sqlite3.connect("mydb.sqlite")
# df = pd.read_sql("SELECT * FROM users", conn)

# Quick inspection
print(df.head())         # First 5 rows
df.info()                # Column types and null counts (prints directly; returns None)
print(df.isnull().sum()) # Count missing values per column
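read_csv also accepts options that do cleaning work at load time. A sketch using an in-memory CSV standing in for a real file (the contents and column names are invented for illustration):

```python
import io
import pandas as pd

# A tiny in-memory CSV; a file path would work the same way
csv_text = """user_id,signup_date,plan,notes
007,2023-01-15,pro,ok
042,2023-02-20,free,N/A
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["signup_date"],     # parse this column as datetime on load
    dtype={"user_id": str},          # keep IDs as strings (preserves "007")
    na_values=["N/A"],               # treat this string as NaN
    usecols=["user_id", "signup_date", "plan"],  # load only needed columns
)

print(df.dtypes)
print(df)
```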

Data Cleaning

Real-world data is messy. Pandas makes cleaning straightforward:

import pandas as pd

# Handle missing values (these return new objects; assign to keep the result)
df = df.dropna()                               # Drop rows with any NaN
df = df.fillna(0)                              # ...or replace NaN with 0
df["col"] = df["col"].fillna(df["col"].mean()) # ...or fill with column mean

# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"])      # Based on specific column

# Type conversion
df["date"] = pd.to_datetime(df["date"])
df["price"] = df["price"].astype(float)

# String operations
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()

# Rename columns
df = df.rename(columns={"old_name": "new_name"})

# Add computed column
df["tax"] = df["salary"] * 0.3
df["net_salary"] = df["salary"] - df["tax"]
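Put together, a cleaning pass on a small messy dataset might look like this (the data is invented for illustration):

```python
import pandas as pd

raw = pd.DataFrame({
    "name": ["  alice ", "BOB", "bob", None],
    "email": ["A@X.COM", "b@x.com", "b@x.com", "c@x.com"],
    "salary": ["95000", "72000", "72000", None],
})

df = raw.copy()
df["name"] = df["name"].str.strip().str.title()       # tidy whitespace and case
df["email"] = df["email"].str.lower()
df["salary"] = df["salary"].astype(float)             # string -> numeric (None -> NaN)
df["salary"] = df["salary"].fillna(df["salary"].mean())
df = df.drop_duplicates(subset=["email"])             # keep one row per email
print(df)
```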

Visualization with matplotlib

import matplotlib.pyplot as plt
import pandas as pd

# Simple line chart
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [10000, 12000, 15000, 14000, 18000, 22000],
    "expenses": [8000, 9000, 11000, 10000, 12000, 14000],
})

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(df["month"], df["revenue"], marker="o", label="Revenue")
ax.plot(df["month"], df["expenses"], marker="s", label="Expenses")
ax.set_title("Monthly Revenue vs Expenses")
ax.set_xlabel("Month")
ax.set_ylabel("Amount ($)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("revenue_chart.png", dpi=150)
# plt.show()  # Display in window

Pandas also has built-in plotting that wraps matplotlib:

# Bar chart directly from DataFrame
df.plot(x="month", y=["revenue", "expenses"], kind="bar", figsize=(10, 6))
plt.title("Monthly Comparison")
plt.tight_layout()
plt.savefig("bar_chart.png")

# Other chart types
# df["revenue"].plot(kind="hist")     # Histogram
# df.plot(kind="scatter", x="age", y="salary")  # Scatter
# df["city"].value_counts().plot(kind="pie")     # Pie chart
| Library    | Best For                         | Notes                                     |
|------------|----------------------------------|-------------------------------------------|
| matplotlib | Publication-quality static plots | The foundation; most customizable         |
| seaborn    | Statistical visualization        | Built on matplotlib; beautiful defaults   |
| plotly     | Interactive web charts           | Zoom, hover, export; great for dashboards |
| altair     | Declarative statistical charts   | Grammar of graphics; concise API          |
🔍 Deep Dive: EDA Workflow

Exploratory Data Analysis (EDA) typically follows this pattern: (1) Load data with pd.read_csv(), (2) Inspect with .head(), .info(), .describe(), (3) Check for missing values with .isnull().sum(), (4) Visualize distributions with histograms, (5) Look at correlations with .corr() and heatmaps, (6) Group and aggregate to find patterns. Libraries like ydata-profiling (formerly pandas-profiling) automate this entire process with a single function call.
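The steps above can be sketched as one short helper (the file name in the commented line is illustrative):

```python
import pandas as pd

def quick_eda(df: pd.DataFrame) -> None:
    """Run the standard first-look checks on a DataFrame."""
    print(df.head())             # (2) peek at the data
    df.info()                    # (2) column types and non-null counts
    print(df.describe())         # (2) summary statistics
    print(df.isnull().sum())     # (3) missing values per column
    numeric = df.select_dtypes("number")
    print(numeric.corr())        # (5) pairwise correlations
    # (4) distributions: numeric.hist(); (6) grouping depends on your columns

# quick_eda(pd.read_csv("data.csv"))   # step (1): load, then explore
```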

⚠️ Common Mistake: Chained Assignment Warning

Wrong:

df[df["age"] > 30]["salary"] = 100000  # SettingWithCopyWarning!

Why: This modifies a copy of the data, not the original DataFrame. Pandas warns you because the change is silently lost.

Instead:

df.loc[df["age"] > 30, "salary"] = 100000  # Correct!
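A quick runnable check of the correct form on toy data: .loc selects and assigns in one step, so the change lands in the original DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35], "salary": [72000, 110000]})

# .loc assigns into the original DataFrame, not a copy
df.loc[df["age"] > 30, "salary"] = 100000
print(df)
```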