feature_engineering

atscale.eda.feature_engineering.create_binned_feature

Creates a new feature that is a binned version of an existing numeric feature. If the value of numeric_feature_name is Null, it will be placed into bin number -1.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to bin
    • bin_edges (List[float]) – The edges to use to compute the bins, left inclusive. Contents of bin_edges are interpreted in ascending order.
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the created feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
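
As a rough illustration of the left-inclusive binning and the -1 bin for Null values described above, the following standalone sketch assigns bin numbers in plain Python. The helper name assign_bin and the numbering of the non-null bins (0 for values below the first edge) are assumptions made for illustration; they are not taken from the AtScale implementation.

```python
from bisect import bisect_right
from typing import List, Optional


def assign_bin(value: Optional[float], bin_edges: List[float]) -> int:
    """Return a bin number for value; None (Null) goes to bin -1."""
    if value is None:
        return -1
    # Edges are interpreted in ascending order regardless of input order.
    edges = sorted(bin_edges)
    # bisect_right makes each edge the inclusive left boundary of its bin.
    # Assumption: values below the first edge land in bin 0.
    return bisect_right(edges, value)


edges = [0.0, 10.0, 20.0]
bins = [assign_bin(v, edges) for v in (None, -5.0, 9.9, 10.0, 25.0)]
```

Here a value equal to an edge (10.0) falls into the bin that starts at that edge, which is what "left inclusive" means.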

atscale.eda.feature_engineering.create_correlation_feature

Creates a new feature off of the published project showing the correlation of two features.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • hierarchy_name (str) – The query name of the hierarchy used in the calculation
    • numeric_feature_1_name (str) – The query name of the first feature in the correlation calculation
    • numeric_feature_2_name (str) – The query name of the second feature in the correlation calculation
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (str , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the created feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.

atscale.eda.feature_engineering.create_covariance_feature

Creates a new feature off of the published project showing the covariance of two features.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • hierarchy_name (str) – The query name of the hierarchy used in the calculation
    • numeric_feature_1_name (str) – The query name of the first feature in the covariance calculation
    • numeric_feature_2_name (str) – The query name of the second feature in the covariance calculation
    • use_sample (bool , optional) – Whether the covariance being calculated is the sample covariance. Defaults to True.
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (str , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the created feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
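
The use_sample flag distinguishes sample covariance (divide by n - 1) from population covariance (divide by n). The sketch below shows that difference in plain Python; the function name covariance is hypothetical and this is not how AtScale computes the feature internally.

```python
from typing import Sequence


def covariance(xs: Sequence[float], ys: Sequence[float], use_sample: bool = True) -> float:
    """Covariance of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < (2 if use_sample else 1):
        raise ValueError("not enough data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sample covariance divides by n - 1, population covariance by n.
    return total / (n - 1 if use_sample else n)


sample_cov = covariance([1, 2, 3, 4], [2, 4, 6, 8])
population_cov = covariance([1, 2, 3, 4], [2, 4, 6, 8], use_sample=False)
```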

atscale.eda.feature_engineering.create_net_error_calculation

Creates a calculation for the net error of a predictive feature compared to the actual feature. Returns Null if the value of either predicted_feature_name or actual_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The Data Model that the feature will be created in
    • new_feature_name (str) – The query name of the new feature
    • predicted_feature_name (str) – The query name of the feature containing predictions
    • actual_feature_name (str) – The query name of the feature to compare the predictions to
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the created feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.

atscale.eda.feature_engineering.create_one_hot_encoded_features

Creates a one hot encoded feature for each value in the given categorical feature. Works off of the published project.

  • Parameters:
    • data_model (DataModel) – The data model to add the features to.
    • categorical_feature (str) – The query name of the categorical feature to pull the values from.
    • hierarchy_name (str , optional) – The query name of the hierarchy to use for the feature. Only necessary if the feature is duplicated in multiple hierarchies.
    • allow_large_cardinality (bool , optional) – Whether to allow the one hot encoding to generate more than 20 columns. Will raise an error if over the limit and this is set to False. Defaults to False.
    • description (str , optional) – A description to add to the new features. Defaults to None.
    • folder (str , optional) – The folder to put the new features in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – A format string for the new features. Defaults to None.
    • publish (bool , optional) – Whether to publish the project after creating the features. Defaults to True.
  • Returns: The query names of the newly created features
  • Return type: List[str]
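
To make the idea concrete, here is a minimal sketch of one hot encoding with a cardinality guard like the one allow_large_cardinality controls. The 20-column limit comes from the parameter description above; the function name, the prefix argument, and the naming of the resulting columns are assumptions for illustration only.

```python
from typing import Dict, List

MAX_COLUMNS = 20  # limit mentioned in the allow_large_cardinality description


def one_hot_encode(
    values: List[str], prefix: str, allow_large_cardinality: bool = False
) -> Dict[str, List[int]]:
    """Return one 0/1 column per distinct value found in values."""
    categories = sorted(set(values))
    if len(categories) > MAX_COLUMNS and not allow_large_cardinality:
        raise ValueError(
            f"{len(categories)} distinct values exceeds the limit of {MAX_COLUMNS}"
        )
    # Hypothetical naming scheme: "<prefix>_<category>".
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


encoded = one_hot_encode(["red", "blue", "red"], prefix="color")
```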

atscale.eda.feature_engineering.create_pct_error_calculation

Creates a calculation for the percent error of a predictive feature compared to the actual feature. Returns Null if the value of either predicted_feature_name or actual_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • predicted_feature_name (str) – The query name of the feature containing predictions
    • actual_feature_name (str) – The query name of the feature to compare the predictions to
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
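
The net error and percent error calculations can be pictured with the sketch below, which also shows the Null propagation both descriptions mention. The sign convention (predicted minus actual), the scaling to a percentage, and the handling of a zero actual value are assumptions; the reference above does not state them.

```python
from typing import Optional


def net_error(predicted: Optional[float], actual: Optional[float]) -> Optional[float]:
    """Assumed convention: predicted minus actual; None if either input is None."""
    if predicted is None or actual is None:
        return None
    return predicted - actual


def pct_error(predicted: Optional[float], actual: Optional[float]) -> Optional[float]:
    """Assumed convention: net error as a percentage of the actual value."""
    if predicted is None or actual is None:
        return None
    if actual == 0:
        # Assumption: treat an undefined ratio as missing rather than raising.
        return None
    return 100.0 * (predicted - actual) / actual


net = net_error(110.0, 100.0)
pct = pct_error(110.0, 100.0)
```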

atscale.eda.feature_engineering.create_percent_change

Creates a time over time calculation. Returns Null if the value of either numeric_feature_name or the lookback of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the numeric feature to use for the calculation
    • hierarchy_name (str) – The query name of the time hierarchy used in the calculation
    • level_name (str) – The query name of the level within the time hierarchy
    • time_length (int) – The length of the lag
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
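
A time over time calculation compares each value with the value time_length periods earlier at the chosen level. The following sketch works on an ordered list of values, one per period; the function name, the choice to return a fraction rather than a percentage, and the handling of a zero lookback value are assumptions for illustration.

```python
from typing import List, Optional


def percent_change(values: List[Optional[float]], time_length: int) -> List[Optional[float]]:
    """Change of each value relative to the value time_length periods earlier."""
    if time_length < 1:
        raise ValueError("time_length must be at least 1")
    result: List[Optional[float]] = []
    for i, current in enumerate(values):
        previous = values[i - time_length] if i >= time_length else None
        # Null if the current value or its lookback is missing.
        # Assumption: a zero lookback value also yields Null.
        if current is None or previous is None or previous == 0:
            result.append(None)
        else:
            result.append((current - previous) / previous)
    return result


changes = percent_change([100.0, 110.0, None, 121.0], time_length=1)
```

Note that the last entry is Null because its lookback (the third value) is Null, matching the behavior described above.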

atscale.eda.feature_engineering.create_period_to_date

Creates a period-to-date calculation. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the numeric feature to use for the calculation
    • hierarchy_name (str) – The query name of the time hierarchy used in the calculation
    • level_name (str) – The query name of the level within the time hierarchy
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
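
A period-to-date value can be thought of as a running aggregate that resets whenever a new period begins at the chosen level. The sketch below assumes a running sum and rows already ordered in time; the function name and the (period, value) row layout are hypothetical.

```python
from typing import Hashable, List, Optional, Tuple


def period_to_date(rows: List[Tuple[Hashable, Optional[float]]]) -> List[Optional[float]]:
    """Running sum of value within each period; rows must be ordered in time."""
    totals: List[Optional[float]] = []
    current_period = object()  # sentinel that never equals a real period key
    running = 0.0
    for period, value in rows:
        if period != current_period:
            # A new period starts: reset the running total.
            current_period, running = period, 0.0
        if value is None:
            # Null input gives a Null output for that row.
            totals.append(None)
            continue
        running += value
        totals.append(running)
    return totals


month_to_date = period_to_date(
    [("2024-01", 1.0), ("2024-01", 2.0), ("2024-02", 5.0), ("2024-02", None), ("2024-02", 1.0)]
)
```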

atscale.eda.feature_engineering.create_scaled_feature_log_transformed

Creates a new feature that is log transformed. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.

atscale.eda.feature_engineering.create_scaled_feature_maxabs

Creates a new feature that is maxabs scaled. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • maxabs (float) – The max absolute value of any data point from the base feature
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.

atscale.eda.feature_engineering.create_scaled_feature_minmax

Creates a new feature that is minmax scaled. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • min (float) – The min from the base feature
    • max (float) – The max from the base feature
    • feature_min (float , optional) – The min for the scaled feature. Defaults to 0.
    • feature_max (float , optional) – The max for the scaled feature. Defaults to 1.
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
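
The min, max, feature_min, and feature_max parameters correspond to the usual min-max scaling formula, sketched below. This is the standard formula rather than a confirmed copy of AtScale's expression, and the function name is hypothetical.

```python
from typing import Optional


def minmax_scale(
    value: Optional[float],
    min_value: float,
    max_value: float,
    feature_min: float = 0.0,
    feature_max: float = 1.0,
) -> Optional[float]:
    """Map value from [min_value, max_value] onto [feature_min, feature_max]."""
    if value is None:
        return None
    if max_value == min_value:
        raise ValueError("max_value and min_value must differ")
    fraction = (value - min_value) / (max_value - min_value)
    return fraction * (feature_max - feature_min) + feature_min


default_range = minmax_scale(5.0, 0.0, 10.0)
custom_range = minmax_scale(5.0, 0.0, 10.0, feature_min=-1.0, feature_max=1.0)
```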

atscale.eda.feature_engineering.create_scaled_feature_power_transformed

Creates a new feature that is power transformed. Parameter ‘method’ must be either ‘box-cox’ or ‘yeo-johnson’. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • power (float) – The exponent used in the scaling
    • method (str , optional) – Which power transformation method to use. Defaults to ‘yeo-johnson’.
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
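
For reference, the two supported methods are conventionally defined as in the sketch below, where power plays the role of the lambda parameter. That mapping of power to lambda is an assumption, as is the function name; the formulas themselves are the standard Box-Cox and Yeo-Johnson definitions.

```python
import math
from typing import Optional


def power_transform(value: Optional[float], power: float, method: str = "yeo-johnson") -> Optional[float]:
    """Apply a Box-Cox or Yeo-Johnson transform with lambda = power."""
    if method not in ("box-cox", "yeo-johnson"):
        raise ValueError("method must be 'box-cox' or 'yeo-johnson'")
    if value is None:
        return None
    if method == "box-cox":
        if value <= 0:
            raise ValueError("box-cox requires strictly positive values")
        return math.log(value) if power == 0 else (value**power - 1) / power
    # Yeo-Johnson handles zero and negative values as well.
    if value >= 0:
        return math.log(value + 1) if power == 0 else ((value + 1) ** power - 1) / power
    if power == 2:
        return -math.log(1 - value)
    return -(((1 - value) ** (2 - power)) - 1) / (2 - power)


box_cox = power_transform(4.0, power=0.5, method="box-cox")
yeo_johnson_pos = power_transform(3.0, power=2.0)
yeo_johnson_neg = power_transform(-3.0, power=0.0)
```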

atscale.eda.feature_engineering.create_scaled_feature_robust

Creates a new feature that is robust scaled; mirrors default behavior of scikit-learn.preprocessing.RobustScaler. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • median (float , optional) – The median of the base feature. Defaults to 0.
    • interquartile_range (float , optional) – The interquartile range of the base feature. Defaults to 1.
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
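
Robust scaling subtracts the median and divides by the interquartile range, both of which are supplied as parameters here. The sketch below shows one way those two statistics might be computed beforehand using only the standard library; the function name and the sample data are hypothetical, and the quartile method is chosen to use linear interpolation between data points.

```python
import statistics
from typing import Optional


def robust_scale(value: Optional[float], median: float = 0.0, interquartile_range: float = 1.0) -> Optional[float]:
    """Center on the median and scale by the interquartile range."""
    if value is None:
        return None
    return (value - median) / interquartile_range


data = [1.0, 2.0, 3.0, 4.0, 5.0]
median = statistics.median(data)
# method="inclusive" interpolates linearly between data points.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
scaled = [robust_scale(v, median, iqr) for v in data]
```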

atscale.eda.feature_engineering.create_scaled_feature_unit_vector_norm

Creates a new feature that is unit vector normalized. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • magnitude (float) – The magnitude of the base feature, i.e. the square root of the sum of the squares of numeric_feature’s data points
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.
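
Because magnitude must be supplied by the caller, it helps to see how it follows from the definition above (the square root of the sum of squares). The sketch below computes it for a small, made-up list of values and applies the normalization; the function names are hypothetical.

```python
import math
from typing import List, Optional


def magnitude_of(values: List[float]) -> float:
    """Square root of the sum of the squares of the data points."""
    return math.sqrt(sum(v * v for v in values))


def unit_vector_scale(value: Optional[float], magnitude: float) -> Optional[float]:
    """Divide a data point by the magnitude of the whole feature."""
    if value is None:
        return None
    return value / magnitude


data = [3.0, 4.0]
magnitude = magnitude_of(data)
normalized = [unit_vector_scale(v, magnitude) for v in data]
```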

atscale.eda.feature_engineering.create_scaled_feature_z_score

Creates a new feature that is standard scaled. Returns Null if the value of numeric_feature_name is Null.

  • Parameters:
    • data_model (DataModel) – The DataModel that the feature will be written into
    • new_feature_name (str) – The query name of the new feature
    • numeric_feature_name (str) – The query name of the feature to scale
    • mean (float , optional) – The mean from the base feature. Defaults to 0.
    • standard_deviation (float , optional) – The standard deviation from the base feature. Defaults to 1.
    • description (str , optional) – The description for the feature. Defaults to None.
    • caption (str , optional) – The caption for the feature. Defaults to None.
    • folder (str , optional) – The folder to put the feature in. Defaults to None.
    • format_string (Union [enums.FeatureFormattingType , str ] , optional) – The format string for the feature. Defaults to None.
    • visible (bool , optional) – Whether the feature will be visible to BI tools. Defaults to True.
    • publish (bool , optional) – Whether or not the updated project should be published. Defaults to True.

atscale.eda.feature_engineering.generate_time_series_features

Generates time series features off of the published project, like rolling statistics and period to date for the given numeric features, using the time hierarchy from the given data model. The core of the function is built around the groupby function, like so: dataframe[groupby(group_features + hierarchy_levels)][shift(shift_amount)][rolling(interval)][{aggregate function}]

  • Parameters:
    • data_model (DataModel) – The data model to use.
    • dataframe (pandas.DataFrame) – the pandas dataframe with the features.
    • numeric_features (List[str]) – The list of numeric feature query names to build time series features of.
    • time_hierarchy (str) – The query name of the time hierarchy to use to derive features.
    • level (str) – The query name of the level within the time hierarchy to derive the features at.
    • group_features (List[str] , optional) – The list of features to group by. Note that this acts as a logical grouping as opposed to a dimensionality reduction when paired with shifts or intervals. Defaults to None.
    • intervals (List[int] , optional) – The intervals to create the features over. Will use default values based on the time step of the given level if None. Defaults to None.
    • shift_amount (int , optional) – The amount of rows to shift the new features. Defaults to 0.
  • Returns: A DataFrame containing the original columns and the newly generated ones
  • Return type: DataFrame
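
The groupby, shift, rolling, aggregate chain described above can be approximated directly in pandas. The sketch below builds a single rolling-sum column for one numeric feature; the function name, the generated column name, and the sample data are all hypothetical, and the real function likely produces several statistics per interval.

```python
import pandas as pd
from typing import List


def rolling_sum_feature(
    dataframe: pd.DataFrame,
    numeric_feature: str,
    group_features: List[str],
    level: str,
    interval: int,
    shift_amount: int = 0,
) -> pd.DataFrame:
    """Add a shifted rolling sum of numeric_feature within each group."""
    ordered = dataframe.sort_values(group_features + [level]).copy()
    # Shift within each group so rows never borrow values from another group.
    shifted = ordered.groupby(group_features)[numeric_feature].shift(shift_amount)
    keys = [ordered[name] for name in group_features]
    new_column = f"{numeric_feature}_{interval}_{level}_sum"  # hypothetical name
    ordered[new_column] = shifted.groupby(keys).transform(
        lambda series: series.rolling(interval).sum()
    )
    return ordered


frame = pd.DataFrame(
    {
        "store": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "month": [1, 2, 3, 4, 1, 2, 3, 4],
        "sales": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
    }
)
out = rolling_sum_feature(frame, "sales", ["store"], "month", interval=2, shift_amount=1)
```

With a shift of 1 and an interval of 2, the first two rows of each store have no complete window and come out as NaN, which illustrates why group_features acts as a logical grouping rather than a reduction.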

atscale.eda.feature_engineering.join_udf

Creates a measure for each column in target_columns, named as the column is presented in the query. For example, target_columns=[‘“predicted_sales” as “sales_prediction”’, ‘“confidence”’] would make two measures named ‘sales_prediction’ and ‘confidence’ respectively. The join_columns will be joined to join_features so that the target columns can be queried in tandem with the join_features and aggregate properly. If the join_columns already match the names of the categorical features in the data model, join_features can be omitted to use the names of the join_columns. The measures will be created from a QDS (Query Dataset) which uses the following query: ‘SELECT <target_column1, target_column2, … target_columnN, join_column1, join_column2, …> FROM <udf_call>’. Each target column will have a sum aggregate feature created with “_SUM” appended to the column name.

  • Parameters:
    • data_model (DataModel) – The AtScale data model to create the new features in
    • target_columns (List[str]) – A list of target columns which will be made into features, proper quoting for the data warehouse used is required. Feature names will be based on the name of the column as queried. These strings represent raw SQL and thus a target column can be a calculated column or udf call as long as it is proper SQL syntax.
    • udf_call (str) – A valid SQL statement that will be placed directly after a FROM clause and a space with no parentheses.
    • join_features (list , optional) – a list of feature query names in the data model to use for joining. If None it will not join the qds to anything. Defaults to None for no joins.
    • join_columns (list , optional) – The columns in the from statement to join to the join_features. List must be either None or the same length and order as join_features. Defaults to None to use identical names to the join_features. If multiple columns are needed for a single join they should be in a nested list. Data warehouse specific quoting is not required, join_columns should be passed as strings and if quotes are required for the data model’s data warehouse, they will be inserted automatically.
    • roleplay_features (list , optional) – The roleplays to use on the relationships. List must be either None or the same length and order as join_features. Use ‘’ to not roleplay that relationship. Defaults to None.
    • folder (str) – Optionally specifies a folder to put the created features in. If the folder does not exist it will be created.
    • qds_name (str) – Optionally specifies the name of the Query Dataset that is created. Defaults to None to be named AI_LINK_UDF_QDS_<N> where <N> is 1 or the minimum number that doesn’t conflict with existing dataset names.
    • warehouse_id (str , optional) – Defaults to None. The id of the warehouse that datasets in the data model query from. This parameter is only required if no dataset has been created in the data model yet.
    • allow_aggregates (bool , optional) – Whether to allow aggregates to be built off of the QDS. Defaults to True.
    • create_hinted_aggregate (bool , optional) – Whether to generate an aggregate table for all measures and keys in this QDS to improve join performance. Defaults to False.
    • publish (bool) – Defaults to True. Whether the updated project should be published or only the draft should be updated.

atscale.eda.feature_engineering.write_linear_regression_model

Writes a scikit-learn LinearRegression model, which takes AtScale features exclusively as input, to the given Published DataModel as a sum aggregated feature with the given name. The feature will return the output of the coefficients and intercept in the model applied to feature_inputs as defined in AtScale. Omitting feature_inputs will use the names of the columns passed at training time and error if any names are not in the data model.

  • Parameters:
    • data_model (DataModel) – The AtScale DataModel to add the regression into.
    • regression_model (LinearRegression) – The scikit-learn LinearRegression model to build into a feature.
    • new_feature_name (str) – The query name of the created feature.
    • granularity_levels (List[str]) – List of the query names for the categorical levels with the greatest levels of granularity that predictions with this model can be run on.
    • feature_inputs (List[str]) – List of query names of input features in the input order.
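
The "output of the coefficients and intercept in the model applied to feature_inputs" amounts to the usual linear combination. The sketch below evaluates it in plain Python for made-up coefficients, without scikit-learn or AtScale; the function name is hypothetical, and with a fitted scikit-learn model the coefficients and intercept would come from its coef_ and intercept_ attributes.

```python
from typing import Sequence


def linear_prediction(coefficients: Sequence[float], intercept: float, inputs: Sequence[float]) -> float:
    """intercept + sum of coefficient * input, in input order."""
    if len(coefficients) != len(inputs):
        raise ValueError("coefficients and inputs must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, inputs))


# Made-up coefficients for two input features.
prediction = linear_prediction([2.0, -1.0], intercept=0.5, inputs=[3.0, 4.0])
```

The order of feature_inputs matters because each coefficient is paired with the input in the same position.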

atscale.eda.feature_engineering.write_logistic_regression_model

Writes a scikit-learn binary LogisticRegression model, which takes AtScale features exclusively as input, to the given Published DataModel as a sum aggregated feature with the given name. The feature will return the output of the coefficients and intercept in the model applied to feature_inputs as defined in AtScale. Omitting feature_inputs will use the names of the columns passed at training time and error if any names are not in the data model.

  • Parameters:
    • data_model (DataModel) – The AtScale DataModel to add the regression into.
    • regression_model (LogisticRegression) – The scikit-learn LogisticRegression model to build into a feature.
    • new_feature_name (str) – The query name of the created feature.
    • granularity_levels (List[str]) – List of the query names for the categorical levels with the greatest levels of granularity that predictions with this model can be run on.
    • feature_inputs (List[str]) – List of query names of input features in the input order.
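
For a binary logistic regression, the coefficients and intercept are typically combined into a linear score and passed through the logistic (sigmoid) function to obtain a probability. Whether the AtScale feature returns this probability or something else is not stated above, so treat the sketch below, including its function name and made-up coefficients, as an assumption.

```python
import math
from typing import Sequence


def logistic_probability(coefficients: Sequence[float], intercept: float, inputs: Sequence[float]) -> float:
    """Sigmoid of the linear score intercept + sum(coefficient * input)."""
    if len(coefficients) != len(inputs):
        raise ValueError("coefficients and inputs must have the same length")
    score = intercept + sum(c * x for c, x in zip(coefficients, inputs))
    return 1.0 / (1.0 + math.exp(-score))


# Made-up coefficients; a score of 0 corresponds to a probability of 0.5.
even_odds = logistic_probability([1.0, -1.0], intercept=0.0, inputs=[2.0, 2.0])
likely = logistic_probability([1.0], intercept=0.0, inputs=[2.0])
```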

atscale.eda.feature_engineering.write_snowpark_udf_to_qds

Writes a single column output of a udf into the given data_model as a feature. For example, if a udf created in Snowpark ‘udf’ outputs predictions based on a given set of features ‘[f]’, then calling write_snowpark_udf_to_qds(data_model=atmodel, udf_name=udf, new_feature_name=‘predictions’, feature_inputs=f) will create a new feature called ‘predictions’ which can be included in any query that excludes categorical features that are not accounted for in ‘[f]’ (no feature not in same dimension at same level or lower in [f]). Currently only supports Snowflake udfs.

  • Parameters:
    • data_model (DataModel) – The AtScale data model to create the new feature in
    • udf_name (str) – The name of an existing udf which outputs a single column for every row of input. The full name space should be passed (ex. ‘“DB”.”SCHEMA”.udf_name’).
    • new_feature_name (str) – The query name of the newly created feature from the output of the udf.
    • feature_inputs (List[str]) – The query names of features in data_model that are the inputs for the udf, in the order they are passed to the udf.
    • publish (bool , optional) – Whether to publish the project after updating. Defaults to True.