Part 1: Feature Selection

The main focus is feature selection using lag features, Fourier features, and SHAP attributions (via the SHAP Explainer and TreeExplainer).
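
Lag features are listed above without code. A minimal sketch of one way to build them, assuming a DataFrame sorted by time; the helper name add_lags, its parameters, and the demo column 'A' are hypothetical and not taken from the original workflow:

```python
import pandas as pd

def add_lags(df, cols, lags=(1, 2, 3)):
    """Add shifted copies of each column (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in cols:
        for lag in lags:
            # shift(lag) moves values down, so row t holds the value from t - lag
            out[f'{col}_lag{lag}'] = out[col].shift(lag)
    return out

# Toy data for illustration only
demo = pd.DataFrame({'time': range(5), 'A': [10.0, 11.0, 12.0, 13.0, 14.0]})
lagged = add_lags(demo, cols=['A'], lags=(1, 2))
```

The first `lag` rows of each lagged column are NaN because no earlier value exists; they can be dropped or left for a model that handles missing values.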

Fourier Features

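import numpy as np

def add_fouriers(df, t_col='time', periods=(7, 30, 365), harmonics=3):
    """Add sin/cos Fourier terms of t_col for each period and harmonic."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for p in periods:
        for k in range(1, harmonics + 1):
            angle = 2 * np.pi * k * out[t_col] / p  # k-th harmonic of period p
            out[f'fourier_P{p}_k{k}_sin'] = np.sin(angle)
            out[f'fourier_P{p}_k{k}_cos'] = np.cos(angle)
    return out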

Rolling Windows

Track three rolling statistics per feature: mean, standard deviation, and z-score:

# All columns except the time index and the targets
feature_cols = [c for c in df.columns if c not in ['time', 'Y1', 'Y2']]
wins = [3, 7, 21, 63]

for w in wins:
    for col in feature_cols:
        roll = df[col].rolling(w, min_periods=1)
        df[f'{col}_mean{w}'] = roll.mean()
        # std is NaN on the first row (one observation), so z is NaN there too
        df[f'{col}_std{w}']  = roll.std()
        # 1e-9 guards against division by zero on constant windows
        df[f'{col}_z{w}']    = (df[col] - df[f'{col}_mean{w}']) / (df[f'{col}_std{w}'] + 1e-9)

SHAP Analysis

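import shap
import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
# eval_set only tracks the test metric here; it does not change the fitted model
model.fit(X_train_s, y_train, eval_set=[(X_test_s, y_test)], verbose=False)
# X_train_s serves as the background data for the explainer
explainer = shap.Explainer(model, X_train_s)
shap_values = explainer(X_test_s)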

Note: n_estimators=200 and learning_rate=0.05 were optimal given the small number of training rows. SHAP values were averaged across the folds of a TimeSeriesSplit cross-validator.


Additional Tests

  • ACF and PACF tests on Y1 and Y2
  • Lasso coefficients for linear association
  • Granger causality tests for predictive relationships between variables at various lags (up to 79999)
  • Periodogram for seasonality patterns in Y1 & Y2
  • PCA for feature importance
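
As an illustration of the periodogram check in the list above, here is a minimal NumPy-only sketch. It assumes evenly spaced samples one time unit apart and runs on synthetic data, not on Y1 or Y2; the function name dominant_period is hypothetical.

```python
import numpy as np

def dominant_period(y, top_k=1):
    """Return the top_k periods with the highest periodogram power (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                        # remove the mean so the zero-frequency bin is empty
    power = np.abs(np.fft.rfft(y)) ** 2     # unnormalized periodogram
    freqs = np.fft.rfftfreq(len(y), d=1.0)  # cycles per time step
    # Rank the non-zero frequencies by power, strongest first
    order = np.argsort(power[1:])[::-1][:top_k] + 1
    return 1.0 / freqs[order]               # convert frequency to period

# Synthetic series with a 7-step cycle plus light noise (illustration only)
rng = np.random.default_rng(0)
t = np.arange(700)
y_synth = np.sin(2 * np.pi * t / 7) + 0.1 * rng.standard_normal(700)
periods_found = dominant_period(y_synth)
```

A strong peak at a given period suggests including that period in the periods argument of add_fouriers.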