May 27, 2025
TL;DR Learn core pandas tricks: creating DataFrames, indexing, filtering, grouping, joins, and more. Perfect as a quick reference for your next project.
If NumPy is a calculator, pandas is a spreadsheet on jet fuel. It lets you store, query, and mutate tabular data with ease. Two key objects drive everything: the DataFrame (2‑D table) and the Series (1‑D labeled array).
A DataFrame is like a spreadsheet or SQL table. It's a collection of columns, each with a name and data type. You can create one from scratch, load it from a CSV, or even pull it from a SQL database.
import pandas as pd
# Simple sales log
sales = pd.DataFrame({
'Units Sold': [25, 42],
'Returns': [3, 1]
})
Output:
|   | Units Sold | Returns |
|---|---|---|
| 0 | 25 | 3 |
| 1 | 42 | 1 |
Name rows for extra clarity:
feedback = pd.DataFrame({
'Alice': ['Great quality!', 'Arrived damaged.'],
'Bob': ['Fast shipping.', 'Color not as shown.']
}, index=['Order #1001', 'Order #1002'])
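Output:

|   | Alice | Bob |
|---|---|---|
| Order #1001 | Great quality! | Fast shipping. |
| Order #1002 | Arrived damaged. | Color not as shown. |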
A Series is a sequence of data values. Kind of like a list, but with a label for each value. You can think of it as a single column in a DataFrame.
pd.Series(['North', 'South', 'West'], name='Direction')
Output:
0    North
1    South
2     West
Name: Direction, dtype: object
In real work you’ll load from files. Pandas makes this easy with read_csv, read_excel, and more. Here’s how to load a CSV file containing population stats:
pop = pd.read_csv('data/world_population.csv')
print(pop.shape) # (235, 5)
pop.head()
Key read_csv options:
- sep – delimiter (default ,)
- index_col – set a column as the index (e.g., index_col='Name')
- dtype – force column dtypes (useful for IDs that look numeric but aren’t)
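A quick sketch combining these options (the file and column names here are hypothetical):
# 'data/catalog.csv' and its columns are made up for illustration
catalog = pd.read_csv(
    'data/catalog.csv',
    sep=',',                    # explicit delimiter (the default)
    index_col='Name',           # use the Name column as the row index
    dtype={'ProductID': str},   # keep IDs as strings, not integers
)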
Before you can use your data, you need to know how to grab specific slices of it. Pandas offers two paradigms:
employees = pd.DataFrame({
'Department': ['Sales', 'HR', 'IT', 'Finance'],
'Name': ['Ann', 'Ben', 'Cara', 'Dan'],
'Salary': [65_000, 55_000, 75_000, 80_000]
})
You can access columns like attributes or dictionary keys.
employees.Salary         # attribute-style access (works when the name is a valid identifier)
employees['Salary']      # dictionary-style access; works for any column name
employees['Salary'][0]   # 65000
iloc (Position‑based)
Position-based selection is similar to how we access the elements of a list. We can use the iloc method to access the rows and columns of a DataFrame by their integer positions.
employees.iloc[2, 1] # Row 2, Col 1 ⇒ 'Cara'
loc (Label‑based)
employees = employees.set_index('Name')
employees.loc['Ben', ['Department', 'Salary']]
Note: Both loc and iloc are row-first, column-second. This is the opposite of native Python indexing, which is column-first, row-second. This is important to remember when using these methods.
loc and iloc
- Use iloc when you want to select rows and columns by their integer position. iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one is excluded, so 0:10 selects 0, ..., 9.
- Use loc when you want to select rows and columns by their label. loc uses the pandas indexing scheme, where both endpoints of the range are included, so 0:10 selects 0, ..., 10. There's a quick demo of the difference below.
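A minimal demo using the employees table above (which now has Name as its index):
# Endpoint excluded: rows 0 and 1 (Ann, Ben)
employees.iloc[0:2]
# Endpoint included: Ann, Ben, AND Cara
employees.loc['Ann':'Cara']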
Build boolean masks to filter rows. This is like using a SQL WHERE clause or a spreadsheet filter. You can chain conditions with & (and) or | (or).
# Employees earning ≥ 70k
high_paid = employees[employees['Salary'] >= 70_000]
# Countries with population between 50 M and 100 M
mid_pop = pop[pop['2023'].between(50_000_000, 100_000_000)]
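When chaining, wrap each condition in parentheses; a small sketch on the employees table:
# Employees in IT earning ≥ 70k
high_paid_it = employees[(employees['Salary'] >= 70_000) & (employees['Department'] == 'IT')]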
Pandas comes with a few built-in conditional selection methods. These are useful when we want to select rows based on a condition; a couple are demoed after the list.
- isin(): Select rows where the value is in a list.
- between(): Select rows where the value is between two values.
- str.contains(): Select rows where the value contains a string.
- str.startswith(): Select rows where the value starts with a string.
- str.endswith(): Select rows where the value ends with a string.
- str.len(): Select rows where the value has a certain length.
- str.match(): Select rows where the value matches a regular expression.
- isnull(): Select rows where the value is null.
- notnull(): Select rows where the value is not null.
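Two of these in action on the employees table (remember, Name is now the index):
# Rows whose Department is in the given list (Ann and Cara)
employees[employees['Department'].isin(['Sales', 'IT'])]
# Rows whose name starts with 'A'
employees[employees.index.str.startswith('A')]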
Data isn’t static. You’ll often craft derived columns (e.g., a projected salary) or transform existing ones. Pandas lets you vectorize math for speed, or reach for map/apply when you need bespoke logic. The DataFrame gains new columns without altering the original data.
# Salary after 5 % raise
employees['Salary_2026'] = employees['Salary'] * 1.05
Vectorized math is faster than apply.
map & apply
The function you pass to map() should expect a single value from the Series and return a transformed version of that value. map() returns a new Series where all the values have been transformed by the function.
apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.
avg_salary = employees['Salary'].mean()
# De‑mean salaries
employees['CenteredPay'] = employees['Salary'].map(lambda s: s - avg_salary)
# Row‑wise flag: long tenure? (falls back to 0 if there's no 'Years' column)
employees['LongTenure'] = employees.apply(lambda r: r.get('Years', 0) > 5, axis='columns')
Note: map and apply return new, transformed Series and DataFrames, respectively. They do not modify the original Series or DataFrame.
We'll often want to group data together, and then do something specific to the group. For example, we might want to group employees by department. We can do this by using the groupby() method. The groupby object acts as an iterator over these groups, and aggregation functions (mean, count, max, custom lambdas) summarize each group.
# Avg salary by department
employees.groupby('Department')['Salary'].mean()
# Max population per continent
continents = pop.groupby('Continent').apply(lambda df: df.loc[df['2023'].idxmax(), ['Country', '2023']])
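Because the groupby object is iterable, you can also walk the groups yourself:
# Each iteration yields a (key, sub-DataFrame) pair
for dept, group in employees.groupby('Department'):
    print(dept, group['Salary'].mean())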
agg lets you mix multiple functions:
employees.groupby('Department')['Salary'].agg(['count', 'min', 'median', 'max'])
Pandas allows us to create a multi-index DataFrame. This is useful when we want to group data by more than one column. Grouping by two columns returns a MultiIndex:
g = employees.groupby(['Department', 'Role']).size()  # assumes a 'Role' column exists
g.reset_index(name='Headcount') # flatten
Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.
Note: To convert a multi-index back to a single index, we can use the reset_index() method.
Sorting is a common operation in data analysis. Pandas provides a number of methods for sorting data. The sort_values() method sorts a DataFrame by one or more columns. The sort_index() method sorts a DataFrame by its index.
employees.sort_values(by='Salary', ascending=False).head(3)
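A couple of variations on the same table:
# Sort by the index (Name) instead of a column
employees.sort_index()
# Sort by several columns, with a direction per column
employees.sort_values(by=['Department', 'Salary'], ascending=[True, False])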
Pandas automatically infers data types when reading CSVs, but you can specify them with dtype or convert later. Common types include:
- int64 – integers
- float64 – floating point numbers
- object – strings or mixed types
- category – categorical data
It's possible to change the type of a column in a DataFrame. This is done by using astype().
Entries with missing values show up as NaN (Not a Number). NaN is technically a float, so columns containing it are stored as float64. Pandas provides some methods for dealing with missing values.
- isnull(): Returns a boolean Series indicating whether each value is null.
- notnull(): Returns a boolean Series indicating whether each value is not null.
To help fix these values, pandas provides a few methods for filling in missing data.
- fillna(): Fill in missing values with a specified value.
employees.dtypes
employees['Salary'] = employees['Salary'].astype('int64')
# Fill NAs in a (hypothetical) Years column with 0
employees['Years'] = employees['Years'].fillna(0)
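Since category was listed above, here's a small sketch of a conversion (illustrative only, not assigned back):
# Low-cardinality string columns can be stored as 'category' to save memory
employees['Department'].astype('category')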
isnull, notnull, and replace help clean data; a quick replace example follows.
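A one-liner for replace (returns a new Series; the new label here is made up):
# Swap one label for another
employees['Department'].replace('HR', 'People Ops')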
Real-world projects often scatter info across multiple CSVs or APIs: customers in one table, orders in another, products in a third. Pandas gives three progressively more flexible tools to stitch these tables together:
- concat – Stack DataFrames vertically or horizontally when they share the same columns or index.
- join – Align on the index, great when you’ve already set names or IDs as the index and want a simple left/right join.
- merge – SQL‑style operation where you match rows on one or more columns (keys).
Choosing the right tool depends on how your data is keyed and whether you’re adding rows, columns, or a relational lookup.
concat
Use when: datasets have identical columns (or identical indices when concatenating columns) and you want to glue them end‑to‑end.
q1 = pd.read_csv('sales_Q1.csv')
q2 = pd.read_csv('sales_Q2.csv')
all_sales = pd.concat([q1, q2], ignore_index=True)
join
Use when: both tables share the index you want to join on, and you’d like a fast shorthand for merge(left_index=True, right_index=True).
cust = pd.read_csv('customers.csv').set_index('CustomerID')
orders = pd.read_csv('orders.csv').set_index('CustomerID')
orders = orders.join(cust, how='left')
merge
Use when: you need fine‑grained control: inner vs. outer joins, multiple key columns, suffix handling. Mirrors SQL’s SELECT … FROM A JOIN B ON ….
products = pd.read_csv('products.csv')
order_items = pd.read_csv('order_items.csv')
full = order_items.merge(products, on='ProductID', how='inner')
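And a sketch of those finer controls, with hypothetical tables a and b that share StoreID and Date columns:
# Outer join on two keys; suffixes disambiguate overlapping column names
combined = a.merge(b, on=['StoreID', 'Date'], how='outer', suffixes=('_a', '_b'))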
Rename lets you change the index and/or column names. You can do this by specifying the index or columns keyword arguments. It supports multiple input formats, but the most common is a dictionary.
employees.rename(columns={'Salary': 'AnnualPay'}, inplace=True)
# Use a (hypothetical) employee ID column as the index
employees.set_index('EmpID', inplace=True)
Note: You will very rarely need to rename the index. It's more common to update column names; swapping in a whole new index is usually done with set_index().
Pandas is a powerful tool for data analysis and manipulation. It provides a wide range of methods for working with tabular data, making it easy to read, write, and transform data. Whether you are working with small datasets or large datasets, pandas can help you get the job done quickly and efficiently.
Thanks for reading! I know that this is the third reference sheet in a row. Next week I will be throwing everything together with a tutorial on predicting Magic: The Gathering card prices!
Enjoy my content? You can read more on my blog at The Glitched Goblet or follow me on BlueSky at kaemonisland.bsky.social. I write new posts each week, so be sure to check back often!