May 19, 2025
TL;DR Clean messy data, encode categorical variables, bundle everything in pipelines, validate with cross‑validation, and unleash XGBoost. All with hands‑on code you can run in a notebook today.
If you finished the Intro to ML post and thought, “Cool, but real data is way uglier,” this follow‑up is for you. We’ll use the same California Housing dataset so you can reuse your environment and immediately see the effects of each technique.
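All the snippets below assume you already have X_train, X_val, y_train, y_val (and the full X, y) from a train/validation split. If you're starting from a fresh notebook, a minimal setup sketch might look like this (the CSV path and column names assume the Kaggle California Housing file from the intro post; adjust them to match yours):
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed setup: the Kaggle California Housing CSV, whose target column is
# median_house_value and whose one categorical column is ocean_proximity.
housing = pd.read_csv('housing.csv')
y = housing['median_house_value']
X = housing.drop('median_house_value', axis=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=0
)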
When sensors drop packets or humans skip survey fields, you get NaN (Not‑a‑Number) entries. Most scikit‑learn models will straight‑up error if you feed them NaNs, so we must decide what to do. The three classic moves are:
Strategy | What it does | When it’s OK |
---|---|---|
Drop Columns | Delete any column containing any NaNs | Column is mostly NaNs or not predictive |
Imputation | Replace NaNs with a statistic (mean/median) | Numeric columns with moderate sparsity |
Extended Imputation | Impute and add a boolean “was_missing” flag | When the fact a value is missing may carry signal |
# Detect columns with NaNs
cols_with_missing = [c for c in X_train.columns if X_train[c].isna().any()]

# Drop them
X_drop_train = X_train.drop(cols_with_missing, axis=1)
X_drop_val = X_val.drop(cols_with_missing, axis=1)
Dropping is safe but often throws out good data. Imputation keeps columns alive:
from sklearn.impute import SimpleImputer

# Impute missing values with the column median (note: the median strategy
# only works on numeric columns)
imp = SimpleImputer(strategy='median')
X_imp_train = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)
X_imp_val = pd.DataFrame(imp.transform(X_val), columns=X_val.columns)
Sometimes the mere absence of a value matters. We signal that via extended imputation:
# Flag which cells were originally blank, then impute as before
for col in cols_with_missing:
    X_train[col + '_was_missing'] = X_train[col].isna()
    X_val[col + '_was_missing'] = X_val[col].isna()
X_ext_train = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)
X_ext_val = pd.DataFrame(imp.transform(X_val), columns=X_val.columns)
Why care? Models can only learn from what they see. Telling them “this cell was originally blank” gives another dimension to reason over.
Algorithms speak numbers, not strings. A column like ocean_proximity (values: '<1H OCEAN', 'INLAND', …) needs translation.
Ordinal encoding assigns an integer to each category. Use it only when the categories have an inherent order (e.g., low < medium < high).
from sklearn.preprocessing import OrdinalEncoder

# Identify categorical (object-dtype) columns
obj_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

ord_enc = OrdinalEncoder()
X_ord_train = X_train.copy()
X_ord_val = X_val.copy()
X_ord_train[obj_cols] = ord_enc.fit_transform(X_train[obj_cols])
X_ord_val[obj_cols] = ord_enc.transform(X_val[obj_cols])
One-hot encoding makes a binary column per category, perfect when red isn’t “greater than” blue.
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns (sparse_output requires
# scikit-learn >= 1.2; on older versions the argument is sparse=False)
oh = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_train = pd.DataFrame(oh.fit_transform(X_train[obj_cols]),
                        columns=oh.get_feature_names_out(obj_cols),
                        index=X_train.index)
OH_val = pd.DataFrame(oh.transform(X_val[obj_cols]),
                      columns=oh.get_feature_names_out(obj_cols),
                      index=X_val.index)

# Re-attach the numeric columns
num_train = X_train.drop(obj_cols, axis=1)
num_val = X_val.drop(obj_cols, axis=1)
X_oh_train = pd.concat([num_train, OH_train], axis=1)
X_oh_val = pd.concat([num_val, OH_val], axis=1)
Memory watch: one‑hot encoding explodes the column count, so it works best for low‑cardinality features (roughly 20 or fewer unique values). For high‑cardinality features (e.g., zip codes), consider hashing tricks or target encoding.
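If you want to sort columns by cardinality before deciding, here is a quick sketch (the 20-value threshold is just the rule of thumb above):
# One-hot the low-cardinality columns; handle the rest with another strategy
low_card_cols = [c for c in obj_cols if X_train[c].nunique() <= 20]
high_card_cols = [c for c in obj_cols if c not in low_card_cols]
print("one-hot candidates:", low_card_cols)
print("need another strategy:", high_card_cols)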
Copy‑pasting preprocessed arrays around eventually ends in tears. Pipelines bundle every step—imputation, encoding, model—into a single object. A pipeline is a sequence of transformations, each feeding into the next.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Split column names by dtype so each branch gets the right treatment
num_cols = [c for c in X_train.columns if X_train[c].dtype != 'object']
cat_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

# Numeric branch: impute medians
num_pipe = SimpleImputer(strategy='median')

# Categorical branch: impute the most frequent value, then one-hot encode
cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oh', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group to its branch
pre = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

model = RandomForestRegressor(n_estimators=300, random_state=0)

tree_pipe = Pipeline([
    ('prep', pre),
    ('model', model)
])
Fit once and you’re done:
tree_pipe.fit(X_train, y_train)
preds = tree_pipe.predict(X_val)
Perks: cleaner code, cross‑validation becomes trivial, and serialization (joblib.dump) ships the exact preprocessing logic to prod.
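That last perk in practice, as a minimal sketch (the filename is arbitrary):
import joblib

# Persist the fitted pipeline: preprocessing and model travel together
joblib.dump(tree_pipe, 'tree_pipe.joblib')

# ...later, or on another machine
loaded = joblib.load('tree_pipe.joblib')
print(loaded.predict(X_val.head()))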
A single train/validation split might get lucky. k‑fold cross‑validation (commonly k = 5) rotates the validation slice and averages the score.
from sklearn.model_selection import cross_val_score
neg_mae = cross_val_score(tree_pipe, X, y,
                          cv=5,
                          scoring='neg_mean_absolute_error',
                          n_jobs=-1)
print(f"CV MAE: {(-neg_mae).mean():.3f}")
Yes, it’s slower. Yes, it’s worth it for models that train within a few minutes.
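Because the whole pipeline is one estimator, comparing settings becomes a short loop. A sketch, with arbitrary (untuned) forest sizes:
# Compare a few forest sizes on equal footing with the same 5-fold CV
for n in [100, 300, 500]:
    tree_pipe.set_params(model__n_estimators=n)
    scores = -cross_val_score(tree_pipe, X, y, cv=5,
                              scoring='neg_mean_absolute_error', n_jobs=-1)
    print(f"n_estimators={n}: CV MAE {scores.mean():.3f}")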
When you’ve squeezed everything out of the Random Forest, gradient boosting often goes further. XGBoost is the industry‑standard implementation: fast, flexible, and with a ton of hyperparameters to tune. It builds trees sequentially, each new tree correcting the errors of the ensemble so far, and after enough rounds their predictions are summed into a single model.
from xgboost import XGBRegressor

booster = XGBRegressor(
    n_estimators=1000,        # generous ceiling; early stopping decides the real count
    learning_rate=0.05,
    n_jobs=4,
    early_stopping_rounds=5   # constructor argument in xgboost >= 1.6; older versions pass it to fit()
)
booster.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],  # validation set monitored for early stopping
    verbose=False
)
Tweak the key parameters (a combined example follows the table):
Param | Effect |
---|---|
n_estimators | number of trees |
learning_rate | shrinkage per tree |
max_depth / min_child_weight | controls complexity |
subsample / colsample_bytree | row/feature subsampling |
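Here is what those knobs look like combined in one constructor; the specific values are illustrative, not tuned for this dataset:
# Illustrative combination of the parameters above (values are not tuned)
booster = XGBRegressor(
    n_estimators=1000,       # upper bound; early stopping trims it
    learning_rate=0.05,      # smaller shrinkage usually needs more trees
    max_depth=6,             # deeper trees capture more feature interactions
    min_child_weight=1,      # larger values make splits more conservative
    subsample=0.8,           # fraction of rows sampled per tree
    colsample_bytree=0.8,    # fraction of features sampled per tree
    n_jobs=4,
    early_stopping_rounds=5
)
booster.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)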
Early stopping halts training when validation error stops improving—huge time saver.
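After an early-stopped fit you can check how many rounds survived; best_iteration is available on the fitted regressor when an eval_set was supplied:
from sklearn.metrics import mean_absolute_error

# The boosting round that scored best on the eval_set
print("stopped at round:", booster.best_iteration)
print("validation MAE:", mean_absolute_error(y_val, booster.predict(X_val)))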
Leakage means your model peeked at information it won’t have at prediction time. Two common culprits: target leakage (features recorded after, or derived from, the target) and train-test contamination (fitting imputers or encoders on data that includes the validation rows). The fix is to fit all transformers inside the pipeline after train_test_split, and to audit features to ensure they’d actually be available at prediction time.
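A minimal before/after sketch of the contamination case, reusing the names from earlier:
from sklearn.model_selection import train_test_split

# Leaky: an imputer fitted on ALL rows learns statistics from the validation slice
# imp = SimpleImputer(strategy='median').fit(X)   # don't do this before splitting

# Leak-free: split first, then let the pipeline fit its transformers on the
# training fold only (cross_val_score repeats this per fold automatically)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
tree_pipe.fit(X_train, y_train)   # imputers/encoders see only X_train
With leakage handled, here’s the whole toolkit at a glance: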
Problem | Go‑to Fix |
---|---|
Missing numeric values | SimpleImputer(strategy='median') |
Missing categorical values | SimpleImputer(strategy='most_frequent') |
Nominal categorical | OneHotEncoder(handle_unknown='ignore') |
Ordinal categorical | OrdinalEncoder() |
Workflow sprawl | Pipeline + ColumnTransformer |
Over‑optimistic scores | 5‑fold cross_val_score |
Last‑mile accuracy | XGBRegressor() |
Intermediate ML is about discipline: rigorous preprocessing, airtight evaluation, and leak‑proof pipelines. Master these, and algorithms like XGBoost become powerful allies instead of mysterious black boxes.
As always, try the notebook, tweak hyper‑parameters, and let me know on BlueSky if your MAE drops. Happy glitching!