May 19, 2025
TL;DR Clean messy data, encode categorical variables, bundle everything in pipelines, validate with cross‑validation, and unleash XGBoost. All with hands‑on code you can run in a notebook today.
If you finished the Intro to ML post and thought, “Cool, but real data is way uglier,” this follow‑up is for you. We’ll use the same California Housing dataset so you can reuse your environment and immediately see the effects of each technique.
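All the snippets below assume you already have X_train, X_val, y_train, y_val (and the full X, y) from a train/validation split. If you're starting from a fresh notebook, a minimal setup sketch might look like this (the CSV path and column names assume the Kaggle California Housing file from the intro post; adjust them to match yours):
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed setup: the Kaggle California Housing CSV, whose target column is
# median_house_value and whose one categorical column is ocean_proximity.
housing = pd.read_csv('housing.csv')
y = housing['median_house_value']
X = housing.drop('median_house_value', axis=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=0
)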
When sensors drop packets or humans skip survey fields, you get NaN (Not‑a‑Number) entries. Most scikit‑learn models will straight‑up error if you feed them NaNs, so we must decide what to do. The three classic moves are:
Strategy | What it does | When it’s OK |
---|---|---|
Drop Columns | Delete any column containing any NaNs | Column is mostly NaNs or not predictive |
Imputation | Replace NaNs with a statistic (mean/median) | Numeric columns with moderate sparsity |
Extended Imputation | Impute and add a boolean “was_missing” flag | When the fact a value is missing may carry signal |
# Detect columns with NaNs
cols_with_missing = [c for c in X_train.columns if X_train[c].isna().any()]

# Drop them
X_drop_train = X_train.drop(cols_with_missing, axis=1)
X_drop_val = X_val.drop(cols_with_missing, axis=1)
Dropping is safe but often throws out good data. Imputation keeps columns alive:
from sklearn.impute import SimpleImputer

# Impute missing values with the column median (note: the median strategy
# only works on numeric columns)
imp = SimpleImputer(strategy='median')
X_imp_train = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)
X_imp_val = pd.DataFrame(imp.transform(X_val), columns=X_val.columns)
Sometimes the mere absence of a value matters. We signal that via extended imputation:
# Flag which cells were originally blank, then impute as before
for col in cols_with_missing:
    X_train[col + '_was_missing'] = X_train[col].isna()
    X_val[col + '_was_missing'] = X_val[col].isna()
X_ext_train = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)
X_ext_val = pd.DataFrame(imp.transform(X_val), columns=X_val.columns)
Why care? Models can only learn from what they see. Telling them “this cell was originally blank” gives another dimension to reason over.
Algorithms speak numbers, not strings. A column like ocean_proximity (values: '<1H OCEAN', 'INLAND', …) needs translation.
Ordinal encoding assigns an integer to each category. Use it only when the categories have an inherent order (e.g., low < medium < high).
from sklearn.preprocessing import OrdinalEncoder

# Identify categorical (object-dtype) columns
obj_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

ord_enc = OrdinalEncoder()
X_ord_train = X_train.copy()
X_ord_val = X_val.copy()
X_ord_train[obj_cols] = ord_enc.fit_transform(X_train[obj_cols])
X_ord_val[obj_cols] = ord_enc.transform(X_val[obj_cols])
One-hot encoding makes a binary column per category, perfect when red isn’t “greater than” blue.
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns (sparse_output requires
# scikit-learn >= 1.2; on older versions the argument is sparse=False)
oh = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_train = pd.DataFrame(oh.fit_transform(X_train[obj_cols]),
                        columns=oh.get_feature_names_out(obj_cols),
                        index=X_train.index)
OH_val = pd.DataFrame(oh.transform(X_val[obj_cols]),
                      columns=oh.get_feature_names_out(obj_cols),
                      index=X_val.index)

# Re-attach the numeric columns
num_train = X_train.drop(obj_cols, axis=1)
num_val = X_val.drop(obj_cols, axis=1)
X_oh_train = pd.concat([num_train, OH_train], axis=1)
X_oh_val = pd.concat([num_val, OH_val], axis=1)
Memory watch: one‑hot encoding explodes the column count, so it works best for low‑cardinality features (roughly 20 or fewer unique values). For high‑cardinality features (e.g., zip codes), consider hashing tricks or target encoding.
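If you want to sort columns by cardinality before deciding, here is a quick sketch (the 20-value threshold is just the rule of thumb above):
# One-hot the low-cardinality columns; handle the rest with another strategy
low_card_cols = [c for c in obj_cols if X_train[c].nunique() <= 20]
high_card_cols = [c for c in obj_cols if c not in low_card_cols]
print("one-hot candidates:", low_card_cols)
print("need another strategy:", high_card_cols)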
Copy‑pasting preprocessed arrays around eventually ends in tears. Pipelines bundle every step—imputation, encoding, model—into a single object. A pipeline is a sequence of transformations, each feeding into the next.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Split column names by dtype so each branch gets the right treatment
num_cols = [c for c in X_train.columns if X_train[c].dtype != 'object']
cat_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

# Numeric branch: impute medians
num_pipe = SimpleImputer(strategy='median')

# Categorical branch: impute the most frequent value, then one-hot encode
cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oh', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group to its branch
pre = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

model = RandomForestRegressor(n_estimators=300, random_state=0)

tree_pipe = Pipeline([
    ('prep', pre),
    ('model', model)
])
Fit once and you’re done:
tree_pipe.fit(X_train, y_train)
preds = tree_pipe.predict(X_val)
Perks: cleaner code, cross‑validation becomes trivial, and serialization (joblib.dump) ships the exact preprocessing logic to prod.
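That last perk in practice, as a minimal sketch (the filename is arbitrary):
import joblib

# Persist the fitted pipeline: preprocessing and model travel together
joblib.dump(tree_pipe, 'tree_pipe.joblib')

# ...later, or on another machine
loaded = joblib.load('tree_pipe.joblib')
print(loaded.predict(X_val.head()))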
A single train/validation split might get lucky. k‑fold cross‑validation (commonly k = 5) rotates the validation slice and averages the score.
from sklearn.model_selection import cross_val_score
neg_mae = cross_val_score(tree_pipe, X, y,
                          cv=5,
                          scoring='neg_mean_absolute_error',
                          n_jobs=-1)
print(f"CV MAE: {(-neg_mae).mean():.3f}")
Yes, it’s slower. Yes, it’s worth it for models that train within a few minutes.
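Because the whole pipeline is one estimator, comparing settings becomes a short loop. A sketch, with arbitrary (untuned) forest sizes:
# Compare a few forest sizes on equal footing with the same 5-fold CV
for n in [100, 300, 500]:
    tree_pipe.set_params(model__n_estimators=n)
    scores = -cross_val_score(tree_pipe, X, y, cv=5,
                              scoring='neg_mean_absolute_error', n_jobs=-1)
    print(f"n_estimators={n}: CV MAE {scores.mean():.3f}")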
When you’ve squeezed everything out of the Random Forest, gradient boosting often goes further. XGBoost is the industry‑standard implementation: fast, flexible, and with a ton of hyperparameters to tune. It builds trees sequentially, each new tree correcting the errors of the ensemble so far, and after enough rounds their predictions are summed into a single model.
from xgboost import XGBRegressor

booster = XGBRegressor(
    n_estimators=1000,        # generous ceiling; early stopping decides the real count
    learning_rate=0.05,
    n_jobs=4,
    early_stopping_rounds=5   # constructor argument in xgboost >= 1.6; older versions pass it to fit()
)
booster.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],  # validation set monitored for early stopping
    verbose=False
)
Tweak the key parameters (a combined example follows the table):
Param | Effect |
---|---|
n_estimators | number of trees |
learning_rate | shrinkage per tree |
max_depth / min_child_weight | controls complexity |
subsample / colsample_bytree | row/feature subsampling |
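Here is what those knobs look like combined in one constructor; the specific values are illustrative, not tuned for this dataset:
# Illustrative combination of the parameters above (values are not tuned)
booster = XGBRegressor(
    n_estimators=1000,       # upper bound; early stopping trims it
    learning_rate=0.05,      # smaller shrinkage usually needs more trees
    max_depth=6,             # deeper trees capture more feature interactions
    min_child_weight=1,      # larger values make splits more conservative
    subsample=0.8,           # fraction of rows sampled per tree
    colsample_bytree=0.8,    # fraction of features sampled per tree
    n_jobs=4,
    early_stopping_rounds=5
)
booster.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)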
Early stopping halts training when validation error stops improving—huge time saver.
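After an early-stopped fit you can check how many rounds survived; best_iteration is available on the fitted regressor when an eval_set was supplied:
from sklearn.metrics import mean_absolute_error

# The boosting round that scored best on the eval_set
print("stopped at round:", booster.best_iteration)
print("validation MAE:", mean_absolute_error(y_val, booster.predict(X_val)))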
Leakage means your model peeked at information it won’t have at prediction time. Two common culprits: target leakage (features recorded after, or derived from, the target) and train-test contamination (fitting imputers or encoders on data that includes the validation rows). The fix is to fit all transformers inside the pipeline after train_test_split, and to audit features to ensure they’d actually be available at prediction time.
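A minimal before/after sketch of the contamination case, reusing the names from earlier:
from sklearn.model_selection import train_test_split

# Leaky: an imputer fitted on ALL rows learns statistics from the validation slice
# imp = SimpleImputer(strategy='median').fit(X)   # don't do this before splitting

# Leak-free: split first, then let the pipeline fit its transformers on the
# training fold only (cross_val_score repeats this per fold automatically)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
tree_pipe.fit(X_train, y_train)   # imputers/encoders see only X_train
With leakage handled, here’s the whole toolkit at a glance: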
Problem | Go‑to Fix |
---|---|
Missing numeric values | SimpleImputer(strategy='median') |
Missing categorical values | SimpleImputer(strategy='most_frequent') |
Nominal categorical | OneHotEncoder(handle_unknown='ignore') |
Ordinal categorical | OrdinalEncoder() |
Workflow sprawl | Pipeline + ColumnTransformer |
Over‑optimistic scores | 5‑fold cross_val_score |
Last‑mile accuracy | XGBRegressor() |
Intermediate ML is about discipline: rigorous preprocessing, airtight evaluation, and leak‑proof pipelines. Master these, and algorithms like XGBoost become powerful allies instead of mysterious black boxes.
As always, try the notebook, tweak hyper‑parameters, and let me know on BlueSky if your MAE drops. Happy glitching!