Machine Learning

Sparse Matrix and Dummy Variables

Why sparse matrix?

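XGBoost only works with matrices that contain all numeric variables; consequently, we need to one-hot encode our data (UC Business Analytics R Programming Guide). One-hot encoding turns each level of a categorical variable into its own 0/1 column, so the resulting matrix is mostly zeros, and a sparse matrix stores it compactly. Likewise, caret::preProcess can recover missing values with bagged regression trees (method = "bagImpute") (Yevhen Vasylenko), which also requires all-numeric variables.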
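To make the motivation concrete, here is a minimal sketch comparing a dense and a sparse one-hot encoding of the same data. The data frame `toy` and its columns `f` and `x` are hypothetical, made up for illustration; the sketch assumes the Matrix package (a recommended package that ships with R) is available.

```r
library(Matrix)  # provides sparse.model.matrix()

# Hypothetical toy data: one 26-level factor and one numeric column
set.seed(1)
toy <- data.frame(
  f = factor(sample(letters, 1000, replace = TRUE), levels = letters),
  x = rnorm(1000)
)

# Dense one-hot encoding; "- 1" drops the intercept so that
# every level of f gets its own 0/1 column
dense <- model.matrix(~ . - 1, data = toy)

# The same encoding, stored as a sparse matrix
sparse <- sparse.model.matrix(~ . - 1, data = toy)

dim(dense)           # rows x (26 level columns + x)
mean(dense == 0)     # share of entries that are zero
object.size(dense)   # memory used by the dense matrix
object.size(sparse)  # memory used by the sparse matrix
```

Each row has a single 1 among the 26 level columns, so most entries are zero, and the sparse version should need noticeably less memory. The gap widens as the number of factor levels grows.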

There are several ways to one-hot encode data in R.

library(tidyverse)
dd <- data.frame(
  a = gl(3, 4),
  b = gl(4, 1, 12),
  c = 1:12,
  d = sample(c("X", "Y", "Z"), 12, replace = TRUE)
)
str(dd)

'data.