Feature Engineering for Machine Learning

Manually engineering features for ML model training is often tedious and error-prone. AI-Link reduces this work to simple programmatic steps that automate feature engineering and ensure that engineered features are managed consistently at the warehouse level.

For each of the examples below, assume we've initialized a DataModel object called my_data_model via a process like the one described in the Connect to AtScale section, and that our environment includes the following import:

from atscale.eda import feature_engineering

Feature Scaling

These functions normalize feature magnitude to regulate the influence of a given feature on the model's loss. AI-Link offers a variety of scaling methods; the example below uses min-max scaling over a range of 0 to 1 to create a scaled version of example_numeric_feature_name in our data model called newly_scaled_feature:

feature_engineering.create_scaled_feature_minmax(
    data_model=my_data_model,
    new_feature_name="newly_scaled_feature",
    numeric_feature_name="example_numeric_feature_name",
    min_value=0,
    max_value=1
)
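Under the hood, min-max scaling maps each value x to (x − min) / (max − min), rescaled to the target range. The pandas sketch below illustrates the arithmetic only; it is not AI-Link code, and the feature names are placeholders:

```python
import pandas as pd

def minmax_scale(series: pd.Series, min_value: float = 0, max_value: float = 1) -> pd.Series:
    """Rescale a numeric series so its minimum maps to min_value and its maximum to max_value."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo) * (max_value - min_value) + min_value

df = pd.DataFrame({"example_numeric_feature_name": [10.0, 20.0, 40.0]})
df["newly_scaled_feature"] = minmax_scale(df["example_numeric_feature_name"])
# scaled values: 0.0, 0.333..., 1.0
```

Note that, unlike this local sketch, the AI-Link call writes the scaled feature back to the data model so it stays consistent across consumers.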

Time Series Features

The generate_time_series_features function generates various time series features based on rolling window statistics, including the minimum, maximum, mean, standard deviation, and sum of values over different time intervals. While these sorts of features are often constructed by hand, AI-Link users build them automatically with respect to the time hierarchies and time hierarchy levels of their choice. The example below builds time series features from example_numeric_feature_name_1 and example_numeric_feature_name_2, which are contained in the my_df DataFrame (i.e., the output of some get_data call on my_data_model):

feature_engineering.generate_time_series_features(
    data_model=my_data_model,
    dataframe=my_df,
    numeric_features=[
        "example_numeric_feature_name_1",
        "example_numeric_feature_name_2"
    ],
    time_hierarchy="example_time_hierarchy",
    level="example_level"
)
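The rolling-window statistics behind these features can be sketched in plain pandas. This is an illustrative approximation, not what generate_time_series_features does internally, and the column names and window size are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "example_numeric_feature_name_1": [1.0, 3.0, 2.0, 5.0, 4.0],
})

window = 3  # number of time periods per rolling window
rolled = df["example_numeric_feature_name_1"].rolling(window, min_periods=1)

# One new column per rolling statistic, analogous to the generated features
df["rolling_min_3"] = rolled.min()
df["rolling_max_3"] = rolled.max()
df["rolling_mean_3"] = rolled.mean()
df["rolling_sum_3"] = rolled.sum()
```

Where this sketch uses a fixed row-count window, AI-Link aligns the windows to the levels of the time hierarchy you specify.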

One-Hot Encoding

One-hot encoding represents categorical features numerically with 0s and 1s so they can be passed as input to machine learning models. In the example below, we create a one-hot encoded version of example_categorical_feature_name in our semantic model:

feature_engineering.create_one_hot_encoded_features(
    data_model=my_data_model,
    categorical_feature="example_categorical_feature_name"
)
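Conceptually, one-hot encoding expands a single categorical column into one indicator column per category. The pandas sketch below shows the transformation itself; the feature name and values are placeholders, and this is not how AI-Link implements the call:

```python
import pandas as pd

df = pd.DataFrame({"example_categorical_feature_name": ["red", "blue", "red"]})

# One indicator column per category, with 0/1 values
encoded = pd.get_dummies(
    df["example_categorical_feature_name"],
    prefix="example_categorical_feature_name",
    dtype=int,
)
# columns: example_categorical_feature_name_blue, example_categorical_feature_name_red
```

As with the other functions, the advantage of the AI-Link call over a local transformation like this is that the encoded features are defined once in the data model rather than recomputed in each pipeline.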