The Glitched Goblet

Where Magic Meets Technology

Pandas for Data Wrangling

May 27, 2025

TL;DR Learn core pandas tricks: creating DataFrames, indexing, filtering, grouping, joins, and more. Perfect as a quick reference for your next project.

Why Pandas?

If NumPy is a calculator, pandas is a spreadsheet on jet fuel. It lets you store, query, and mutate tabular data with ease. Two key objects drive everything: the DataFrame (2‑D table) and the Series (1‑D labeled array).

1. Creating Data

1.1 DataFrames

A DataFrame is like a spreadsheet or SQL table. It’s a collection of columns, each with a name and data type. You can create one from scratch, load from CSV, or even pull from SQL databases.

import pandas as pd

# Simple sales log
sales = pd.DataFrame({
    'Units Sold': [25, 42],
    'Returns':    [3,  1]
})

Output:

   Units Sold  Returns
0          25        3
1          42        1

Name rows for extra clarity:

feedback = pd.DataFrame({
    'Alice': ['Great quality!', 'Arrived damaged.'],
    'Bob':   ['Fast shipping.',  'Color not as shown.']
}, index=['Order #1001', 'Order #1002'])

1.2 Series

A Series is a sequence of data values. It's like a Python list, but each value also carries a label. You can think of it as a single column of a DataFrame.

pd.Series(['North', 'South', 'West'], name='Direction')

Output:

0    North
1    South
2     West
Name: Direction, dtype: object

2. Reading Data

In real work you’ll load from files. Pandas makes this easy with read_csv, read_excel, and more. Here’s how to load a CSV file containing Population stats:

pop = pd.read_csv('data/world_population.csv')
print(pop.shape) # (235, 5)
pop.head()

Key read_csv options:

  • sep – delimiter (default ,)
  • index_col – set a column as the index (e.g., index_col='Name')
  • dtype – force column dtypes (useful for IDs that look numeric but aren’t)
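Here is a minimal sketch of those options in action. It reads from an in-memory string instead of a file so it runs anywhere; the city names and the `ZipID` column are invented for the example:

```python
import io
import pandas as pd

# Hypothetical CSV: semicolon-delimited, with an ID that looks numeric
csv_text = "Name;Population;ZipID\nAtlantis;12000;00042\nAvalon;8000;00108\n"

pop = pd.read_csv(
    io.StringIO(csv_text),   # a file path works the same way
    sep=';',                 # non-default delimiter
    index_col='Name',        # use the Name column as the row index
    dtype={'ZipID': str},    # keep leading zeros intact
)

print(pop.loc['Atlantis', 'ZipID'])  # '00042', not 42
```

Without the `dtype` override, pandas would parse `00042` as the integer 42 and silently drop the leading zeros.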

3. Column & Row Access

Before you can use your data, you need to know how to grab specific slices of it. Pandas offers three main approaches:

  • Bracket access: grab a column by name.
  • iloc: position-based indexing for rows and columns (row #, col #).
  • loc: label-based indexing for rows and columns (row label, column label).

employees = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Finance'],
    'Name': ['Ann', 'Ben', 'Cara', 'Dan'],
    'Salary': [65_000, 55_000, 75_000, 80_000]
})

3.1 Dot / Bracket Notation

You can access columns like attributes or dictionary keys.

employees['Salary'] # Series of salaries
employees['Salary'][0] # 65000
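Dot notation also works when the column name is a valid Python identifier (no spaces, no clash with a DataFrame method). A self-contained sketch, rebuilding the employees table so it runs on its own:

```python
import pandas as pd

employees = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Finance'],
    'Name': ['Ann', 'Ben', 'Cara', 'Dan'],
    'Salary': [65_000, 55_000, 75_000, 80_000],
})

# Attribute-style access: the same Series as employees['Salary']
print(employees.Salary.iloc[0])   # 65000

# Bracket access is the safer habit: it handles any column name
print(employees['Salary'].max())  # 80000
```

Prefer brackets for anything programmatic; dot access breaks the moment a column is named `count` or `Units Sold`.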

3.2 iloc (Position‑based)

Position-based selection works like indexing into a Python list. The iloc method selects rows and columns of a DataFrame by their integer position.

employees.iloc[2, 1] # Row 2, Col 1 ⇒ 'Cara'

3.3 loc (Label‑based)

employees = employees.set_index('Name')
employees.loc['Ben', ['Department', 'Salary']]

Note: Both loc and iloc are row-first, column-second. This is the opposite of plain Python access on a dict of columns, which is column-first, row-second — important to remember when using these methods.

Choosing between loc and iloc

  • Use iloc when you want to select rows and columns by their integer position.
    • iloc uses the standard Python indexing scheme, where the first element of the range is included and the last one is excluded. So 0:10 selects positions 0, ..., 9.
  • Use loc when you want to select rows and columns by their label.
    • loc uses the pandas indexing scheme, where both endpoints of the range are included. So 0:10 selects labels 0, ..., 10.
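The difference is easy to see side by side on a small Series with a default integer index:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# iloc slices like a Python list: the stop position is excluded
print(list(s.iloc[0:3]))  # [10, 20, 30]

# loc slices by label and includes the stop label
print(list(s.loc[0:3]))   # [10, 20, 30, 40]
```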

4. Conditional Selection

Build boolean masks to filter rows, like a SQL WHERE clause or a spreadsheet filter. Chain conditions with & (and) and | (or), wrapping each comparison in parentheses — Python's and/or keywords don't work on Series.

# Employees earning ≥ 70k
high_paid = employees[employees['Salary'] >= 70_000]

# Countries with population between 50 M and 100 M
mid_pop = pop[pop['2023'].between(50_000_000, 100_000_000)]

Pandas also ships helper methods that build these masks for you:

  • isin(): True where the value is in a given list.
  • between(): True where the value falls between two bounds.
  • str.contains(): True where the string contains a substring.
  • str.startswith() / str.endswith(): True where the string starts or ends with a given prefix or suffix.
  • str.match(): True where the string matches a regular expression.
  • str.len(): returns each string's length (compare it to build a mask).
  • isnull() / notnull(): True where the value is (or isn't) missing.
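A few of these in action on the employees table (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

employees = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Finance'],
    'Name': ['Ann', 'Ben', 'Cara', 'Dan'],
    'Salary': [65_000, 55_000, 75_000, 80_000],
})

# isin: rows whose Department is in a list
tech = employees[employees['Department'].isin(['IT', 'Finance'])]

# str.startswith: names beginning with a given letter
c_names = employees[employees['Name'].str.startswith('C')]

# Combine masks with & — parentheses around each comparison are required
well_paid_it = employees[(employees['Department'] == 'IT')
                         & (employees['Salary'] > 70_000)]

print(len(tech), len(c_names), len(well_paid_it))  # 2 1 1
```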

5. Assigning & Transforming

Data isn’t static. You’ll often craft derived columns (e.g., next year's salary) or transform existing ones. Pandas lets you vectorize math for speed, or reach for map/apply when you need bespoke logic — and none of it alters the original table unless you assign the result back.

# Salary after 5 % raise
employees['Salary_2026'] = employees['Salary'] * 1.05

Vectorized math is faster than apply.

5.1 map & apply

The function you pass to map() should expect a single value from the Series and return a transformed version of that value. map() returns a new Series where all the values have been transformed by the function.

apply() is the DataFrame equivalent: it calls a custom function on each row (with axis='columns') or on each column (the default).

avg_salary = employees['Salary'].mean()
# De‑mean salaries
employees['CenteredPay'] = employees['Salary'].map(lambda s: s - avg_salary)

# Row‑wise flag: long tenured?
employees['LongTenure'] = employees.apply(lambda r: r.get('Years', 0) > 5, axis='columns')

Note: map and apply return new, transformed Series and DataFrames, respectively. They do not modify the original Series or DataFrame.

6. Grouping & Aggregation

We'll often want to group rows together and then compute something per group — for example, employees by department. The groupby() method creates these groups, and aggregation functions (mean, count, max, custom lambdas) summarize each one.

# Avg salary by department
employees.groupby('Department')['Salary'].mean()

# Max population per continent
continents = pop.groupby('Continent').apply(lambda df: df.loc[df['2023'].idxmax(), ['Country', '2023']])

agg lets you mix multiple functions:

employees.groupby('Department')['Salary'].agg(['count', 'min', 'median', 'max'])

6.1 Multi‑Index

Pandas allows us to create a multi-index DataFrame, which is useful when we want to group data by more than one column. Assuming the employees table also carries a Role column, grouping by two columns returns a MultiIndex:

g = employees.groupby(['Department', 'Role']).size()
g.reset_index(name='Headcount')  # flatten

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.

Note: To convert a multi-index back to a single index, we can use the reset_index() method.

7. Sorting

Sorting is a common operation in data analysis. Pandas provides a number of methods for sorting data. The sort_values() method is used to sort a DataFrame by one or more columns. The sort_index() method is used to sort a DataFrame by its index.

employees.sort_values(by='Salary', ascending=False).head(3)
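sort_values can also take several columns at once, and sort_index orders rows by their labels. A self-contained sketch (the table is rebuilt here with names as the index):

```python
import pandas as pd

employees = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Finance'],
    'Salary': [65_000, 55_000, 75_000, 80_000],
}, index=['Ann', 'Ben', 'Cara', 'Dan'])

# Sort by department A-Z, then highest salary first within each department
by_dept = employees.sort_values(by=['Department', 'Salary'],
                                ascending=[True, False])

# Sort by the index (here, employee names)
by_name = employees.sort_index()

print(by_dept.index[0])  # 'Dan' — Finance comes first alphabetically
print(by_name.index[0])  # 'Ann'
```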

8. Data Types & Missing Values

Pandas automatically infers data types when reading CSVs, but you can specify them with dtype or convert later. Common types include:

  • int64 – integers
  • float64 – floating point numbers
  • object – strings or mixed types
  • category – categorical data

It's possible to change the type of a column in a DataFrame. This is done by using astype().

Missing entries show up as NaN (Not a Number). NaN is a float, so a numeric column containing missing values is stored as float64. Pandas provides methods for detecting them.

  • isnull(): Returns a boolean Series indicating whether each value is null.
  • notnull(): Returns a boolean Series indicating whether each value is not null.

To fix these values, pandas provides methods for filling in missing data:

  • fillna(): Fill missing values with a specified value.
  • replace(): Swap one value for another (handy for sentinel values like 'Unknown').

employees.dtypes
employees['Salary'] = employees['Salary'].astype('int64')

# Fill NAs in Years column with 0
employees['Years'] = employees['Years'].fillna(0)
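Putting the detection and repair methods together on a tiny, self-contained example (the tenure numbers are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical tenure data with one missing entry
years = pd.Series([3.0, np.nan, 7.0], index=['Ann', 'Ben', 'Cara'])

print(years.isnull().sum())          # 1 missing value
filled = years.fillna(0)             # replace NaN with 0
swapped = years.replace(7.0, 10.0)   # replace a specific value

print(filled['Ben'])    # 0.0
print(swapped['Cara'])  # 10.0
```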

9. Combining DataFrames

Real-world projects often scatter info across multiple CSVs or APIs—stats in one table, moves in another, competitive tiers in a third. Pandas gives three progressively more flexible tools to stitch these tables together:

  • concat – Stack DataFrames vertically or horizontally when they share the same columns or index.
  • join – Align on the index, great when you’ve already set names or IDs as the index and want a simple left/right join.
  • merge – SQL‑style operation where you match rows on one or more columns (keys).

Choosing the right tool depends on how your data is keyed and whether you’re adding rows, columns, or a relational lookup.

9.1 concat

Use when: datasets have identical columns (or identical indices when concatenating columns) and you want to glue them end‑to‑end.

q1 = pd.read_csv('sales_Q1.csv')
q2 = pd.read_csv('sales_Q2.csv')
all_sales = pd.concat([q1, q2], ignore_index=True)

9.2 join

Use when: both tables share the index you want to join on, and you’d like a fast shorthand for merge(left_index=True, right_index=True).

cust = pd.read_csv('customers.csv').set_index('CustomerID')
orders = pd.read_csv('orders.csv').set_index('CustomerID')
orders = orders.join(cust, how='left')

9.3 merge

Use when: you need fine‑grained control—inner vs. outer joins, multiple key columns, suffix handling. Mirrors SQL’s SELECT … FROM A JOIN B ON ….

products = pd.read_csv('products.csv')
order_items = pd.read_csv('order_items.csv')
full = order_items.merge(products, on='ProductID', how='inner')
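To illustrate the suffix handling mentioned above, here is a sketch with toy in-memory tables (the column names and prices are invented for the example):

```python
import pandas as pd

orders = pd.DataFrame({'ProductID': [1, 2], 'Price': [9.99, 4.50]})
catalog = pd.DataFrame({'ProductID': [1, 2], 'Price': [10.00, 5.00]})

# Both tables have a Price column; suffixes disambiguate the duplicates
merged = pd.merge(orders, catalog, on='ProductID',
                  suffixes=('_paid', '_list'))

print(list(merged.columns))  # ['ProductID', 'Price_paid', 'Price_list']
```

Without explicit suffixes, pandas falls back to the defaults `_x` and `_y`, which are much harder to read later.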

10. Renaming & Re‑indexing

rename() lets you change the index and/or column names by passing the index or columns keyword parameters. It supports multiple input formats, but the most common is a dictionary.

employees.rename(columns={'Salary': 'AnnualPay'}, inplace=True)
# Use employee ID as index
employees.set_index('EmpID', inplace=True)

Note: You will rarely need to rename individual index labels — updating column names is far more common. Swapping in a different column as the index is usually done with set_index().
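For completeness, the related rename_axis() method renames the index itself (its name, not its labels). A small self-contained sketch, using a hypothetical EmpID column:

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102],
    'Salary': [65_000, 55_000],
}).set_index('EmpID')

renamed = employees.rename(columns={'Salary': 'AnnualPay'})
renamed = renamed.rename_axis('EmployeeID')  # rename the index itself

print(renamed.index.name)     # 'EmployeeID'
print(list(renamed.columns))  # ['AnnualPay']
```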

Final Thoughts

Pandas is a powerful tool for data analysis and manipulation. It provides a wide range of methods for working with tabular data, making it easy to read, write, and transform data. Whether you are working with small datasets or large datasets, pandas can help you get the job done quickly and efficiently.

Outro

Thanks for reading! I know that this is the third reference sheet in a row. Next week I will be throwing everything together with a tutorial on predicting Magic: The Gathering card prices!

Enjoy my content? You can read more on my blog at The Glitched Goblet or follow me on BlueSky at kaemonisland.bsky.social. I write new posts each week, so be sure to check back often!