data_model
class atscale.data_model.data_model.DataModel
Creates an object corresponding to an AtScale Data Model. Takes an existing model id and AtScale Catalog object to construct an object that deals with functionality related to model datasets and columns, as well as reading data.
property caption : str
Getter for the caption instance variable.
- Returns: The caption of this model
- Return type: str
property catalog : Catalog
Getter for the catalog instance variable.
- Returns: The Catalog object this model belongs to.
- Return type: Catalog
column_exists
Checks if the given column name exists in the dataset.
- Parameters:
- dataset_name (str) – the name of the dataset we pull the columns from, case-sensitive.
- column_name (str) – the name of the column to check, case-sensitive
- Returns: True if the column name is found, else False.
- Return type: bool
dataset_exists
Returns whether a given dataset_name exists in the data model, case-sensitive.
- Parameters: dataset_name (str) – the name of the dataset to try and find
- Returns: True if the dataset name is found, else False.
- Return type: bool
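A minimal sketch of guarding a column lookup with the two existence checks above. `StubModel` and `require_column` are hypothetical names introduced for illustration; the stub only mimics the two documented methods so the snippet runs without an AtScale connection. Checking the dataset before the column is a choice made in this sketch, not documented library behavior.

```python
class StubModel:
    """Hypothetical stand-in exposing the two methods documented above."""

    def __init__(self, datasets):
        # Maps dataset name -> list of column names (made-up sample data).
        self._datasets = datasets

    def dataset_exists(self, dataset_name):
        return dataset_name in self._datasets

    def column_exists(self, dataset_name, column_name):
        return column_name in self._datasets.get(dataset_name, [])


def require_column(model, dataset_name, column_name):
    """Raise a descriptive error unless the dataset and column both exist."""
    if not model.dataset_exists(dataset_name):
        raise ValueError(f"Dataset not found (names are case-sensitive): {dataset_name}")
    if not model.column_exists(dataset_name, column_name):
        raise ValueError(f"Column not found in {dataset_name}: {column_name}")


model = StubModel({"sales_fact": ["order_id", "amount"]})
require_column(model, "sales_fact", "amount")  # passes silently
```

With a real DataModel instance, only `require_column` would be needed; because both checks are case-sensitive, `"Sales_Fact"` would be rejected here.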
generate_time_series_features
Generates time series features from the data model, such as rolling statistics and period-to-date values, for the given numeric features, using the time hierarchy from the given data model. The core of the function is built around the groupby function, like so:
dataframe[groupby(group_features + hierarchy_levels)][shift(shift_amount)][rolling(interval)][{aggregate function}]
- Parameters:
- dataframe (pandas.DataFrame) – The pandas DataFrame with the features.
- numeric_features (List[str]) – The list of numeric feature query names to build time series features from.
- time_hierarchy (str) – The query name of the time hierarchy to use to derive features.
- level (str) – The query name of the level within the time hierarchy to derive the features at.
- group_features (List[str], optional) – The list of features to group by. Note that this acts as a logical grouping as opposed to a dimensionality reduction when paired with shifts or intervals. Defaults to None.
- intervals (List[int], optional) – The intervals to create the features over. Will use default values based on the time step of the given level if None. Defaults to None.
- shift_amount (int, optional) – The number of rows to shift the new features. Defaults to 0.
- Returns: A DataFrame containing the original columns and the newly generated ones
- Return type: DataFrame
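The groupby, shift, and rolling chain described above can be sketched in plain pandas. This is an illustration of the pattern under stated assumptions, not the library's implementation: the sample data, the `month` column standing in for a time hierarchy level, and the output column name `sales_roll_mean` are all made up, and the sketch groups only by `group_features` while ordering by the time level.

```python
import pandas as pd

# Made-up sample data: two stores, four months each.
df = pd.DataFrame({
    "store": ["A"] * 4 + ["B"] * 4,
    "month": [1, 2, 3, 4] * 2,
    "sales": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
})

group_features = ["store"]
shift_amount = 1
interval = 2

# Order rows by group and then by the time level before windowing.
df = df.sort_values(group_features + ["month"])
keys = [df[col] for col in group_features]

# Shift within each group so one store's rows never leak into another's.
shifted = df.groupby(group_features)["sales"].shift(shift_amount)

# Rolling mean over the shifted values, again computed per group.
df["sales_roll_mean"] = shifted.groupby(keys).transform(
    lambda s: s.rolling(interval).mean()
)

print(df)
```

For store A the shifted series is NaN, 10, 20, 30, so the two-row rolling mean is NaN, NaN, 15, 25. This shows why group_features acts as a logical grouping: every input row is kept and the window simply restarts for each group.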
get_all_categorical_feature_names
Returns a list of all published categorical features (i.e., hierarchy levels and secondary attributes) in the given DataModel.
- Parameters: folder (str, optional) – The name of a folder in the DataModel containing features to exclusively list. Defaults to None to not filter by folder.
- Returns: A list of the query names of categorical features in the DataModel and, if given, in the folder.
- Return type: List[str]
get_all_numeric_feature_names
Returns a list of all published numeric features (i.e., aggregate and calculated measures) in the data model.
- Parameters: folder (str, optional) – The name of a folder in the data model containing measures to exclusively list. Defaults to None to not filter by folder.
- Returns: A list of the query names of numeric features in the data model and, if given, in the folder.
- Return type: List[str]
get_columns
Gets all currently visible columns in a given dataset, case-sensitive.
- Parameters: dataset_name (str) – the name of the dataset to get columns from, case-sensitive.
- Returns: the columns in the given dataset
- Return type: Dict
get_connected_warehouse
Returns information about the warehouse used by this data model.
- Returns: A dictionary describing the connected warehouse
- Return type: Dict
get_data
Submits a query against the data model using the supplied information and returns the results in a pandas DataFrame. Be sure that values passed to filters match the data type of the feature being filtered. Decimal precision in returned numeric features may differ from other variations of the get_data function.
- Parameters:
- feature_list (List[str]) – The list of feature query names to query.
- filter_equals (Dict[str, Any], optional) – Filters results based on the feature equaling the value. Defaults to None.
- filter_greater (Dict[str, Any], optional) – Filters results based on the feature being greater than the value. Defaults to None.
- filter_less (Dict[str, Any], optional) – Filters results based on the feature being less than the value. Defaults to None.
- filter_greater_or_equal (Dict[str, Any], optional) – Filters results based on the feature being greater than or equal to the value. Defaults to None.
- filter_less_or_equal (Dict[str, Any], optional) – Filters results based on the feature being less than or equal to the value. Defaults to None.
- filter_not_equal (Dict[str, Any], optional) – Filters results based on the feature not equaling the value. Defaults to None.
- filter_in (Dict[str, list], optional) – Filters results based on the feature being contained in the values. Defaults to None.
- filter_not_in (Dict[str, list], optional) – Filters results based on the feature not being contained in the values. Defaults to None.
- filter_between (Dict[str, tuple], optional) – Filters results based on the feature being between the values. Defaults to None.
- filter_like (Dict[str, str], optional) – Filters results based on the feature being like the clause. Defaults to None.
- filter_not_like (Dict[str, str], optional) – Filters results based on the feature not being like the clause. Defaults to None.
- filter_rlike (Dict[str, str], optional) – Filters results based on the feature being matched by the regular expression. Defaults to None.
- filter_null (List[str], optional) – Filters results to show null values of the specified features. Defaults to None.
- filter_not_null (List[str], optional) – Filters results to exclude null values of the specified features. Defaults to None.
- order_by (List[Tuple[str, str]], optional) – The sort order for the returned dataframe. Accepts a list of tuples of the feature query name and ordering respectively: [('feature_name_1', 'DESC'), ('feature_2', 'ASC'), …]. Defaults to None for AtScale Engine default sorting.
- limit (int, optional) – Limit the number of results. Defaults to None for no limit.
- comment (str, optional) – A comment string to build into the query. Defaults to None for no comment.
- use_aggs (bool, optional) – Whether to allow the query to use aggs. Defaults to True.
- gen_aggs (bool, optional) – Whether to allow the query to generate aggs. Defaults to True.
- fake_results (bool, optional) – Whether to use fake results, often used to train aggregates with queries that will frequently be used. Defaults to False.
- use_local_cache (bool, optional) – Whether to allow the query to use the local cache. Defaults to True.
- use_aggregate_cache (bool, optional) – Whether to allow the query to use the aggregate cache. Defaults to True.
- timeout (int, optional) – The number of minutes to wait for a response before timing out. Defaults to 10.
- use_postgres (bool, optional) – Whether to use Postgres dialect for inbound query. Defaults to True.
- Returns: A pandas DataFrame containing the query results.
- Return type: DataFrame
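A hedged sketch of assembling the keyword arguments listed above before calling get_data. The feature query names and filter values are hypothetical, and `check_order_by` is a helper introduced here for illustration; its insistence on upper-case 'ASC' or 'DESC' mirrors the documented examples and may be stricter than the library itself.

```python
def check_order_by(order_by):
    """Validate a list of (feature query name, 'ASC' | 'DESC') tuples."""
    for feature, direction in order_by:
        if not isinstance(feature, str):
            raise TypeError(f"feature name must be str, got {type(feature).__name__}")
        if direction not in ("ASC", "DESC"):
            raise ValueError(f"ordering must be 'ASC' or 'DESC', got {direction!r}")
    return order_by


# Hypothetical feature query names; substitute names from your own model.
# Filter values use the same data type as the feature they filter.
query_args = {
    "feature_list": ["order_year", "product_category", "total_sales"],
    "filter_equals": {"order_year": 2023},
    "filter_in": {"product_category": ["Bikes", "Accessories"]},
    "filter_between": {"total_sales": (100, 5000)},
    "filter_not_null": ["product_category"],
    "order_by": check_order_by([("total_sales", "DESC")]),
    "limit": 100,
    "timeout": 5,  # minutes, per the parameter description
}

# With a connected DataModel instance named data_model (not created here):
# df = data_model.get_data(**query_args)
```

Keeping the arguments in one dictionary makes it easy to reuse the same filters across the get_data variants that accept them.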
get_data_direct
Generates an AtScale query against the data model to get the given features, translates it to a database query, and submits it directly to the database using the SQLConnection. The results are returned as a pandas DataFrame. Be sure that values passed to filters match the data type of the feature being filtered. Decimal precision in returned numeric features may differ from other variations of the get_data function.
- Parameters:
- dbconn (SQLConnection) – The connection to use to submit the query to the database.
- feature_list (List[str]) – The list of feature query names to query.
- filter_equals (Dict[str, Any], optional) – A dictionary of features to filter for equality to the value. Defaults to None.
- filter_greater (Dict[str, Any], optional) – A dictionary of features to filter greater than the value. Defaults to None.
- filter_less (Dict[str, Any], optional) – A dictionary of features to filter less than the value. Defaults to None.
- filter_greater_or_equal (Dict[str, Any], optional) – A dictionary of features to filter greater than or equal to the value. Defaults to None.
- filter_less_or_equal (Dict[str, Any], optional) – A dictionary of features to filter less than or equal to the value. Defaults to None.
- filter_not_equal (Dict[str, Any], optional) – A dictionary of features to filter not equal to the value. Defaults to None.
- filter_in (Dict[str, list], optional) – A dictionary of features to filter in a list. Defaults to None.
- filter_not_in (Dict[str, list], optional) – Filters results based on the feature not being contained in the values. Defaults to None.
- filter_between (Dict[str, tuple], optional) – A dictionary of features to filter between the tuple values. Defaults to None.
- filter_like (Dict[str, str], optional) – A dictionary of features to filter like the value. Defaults to None.
- filter_not_like (Dict[str, str], optional) – Filters results based on the feature not being like the clause. Defaults to None.
- filter_rlike (Dict[str, str], optional) – A dictionary of features to filter rlike the value. Defaults to None.
- filter_null (List[str], optional) – A list of features to filter for null. Defaults to None.
- filter_not_null (List[str], optional) – A list of features to filter for not null. Defaults to None.
- order_by (List[Tuple[str, str]], optional) – The sort order for the returned dataframe. Accepts a list of tuples of the feature query name and ordering respectively: [('feature_name_1', 'DESC'), ('feature_2', 'ASC'), …]. Defaults to None for AtScale Engine default sorting.
- limit (int, optional) – A limit to put on the query. Defaults to None.
- comment (str, optional) – A comment to put in the query. Defaults to None.
- use_aggs (bool, optional) – Whether to allow the query to use aggs. Defaults to True.
- gen_aggs (bool, optional) – Whether to allow the query to generate aggs. Defaults to True.
- Returns: The results of the query as a DataFrame
- Return type: DataFrame
get_data_spark
Uses the provided spark_session to execute a query generated by the AtScale query engine against the data model. Returns the results in a spark DataFrame. Be sure that values passed to filters match the data type of the feature being filtered. Decimal precision in returned numeric features may differ from other variations of the get_data function.
- Parameters:
- feature_list (List[str]) – The list of feature query names to query.
- spark_session (pyspark.sql.SparkSession) – The pyspark SparkSession to execute the query with.
- filter_equals (Dict[str, Any], optional) – Filters results based on the feature equaling the value. Defaults to None.
- filter_greater (Dict[str, Any], optional) – Filters results based on the feature being greater than the value. Defaults to None.
- filter_less (Dict[str, Any], optional) – Filters results based on the feature being less than the value. Defaults to None.
- filter_greater_or_equal (Dict[str, Any], optional) – Filters results based on the feature being greater than or equal to the value. Defaults to None.
- filter_less_or_equal (Dict[str, Any], optional) – Filters results based on the feature being less than or equal to the value. Defaults to None.
- filter_not_equal (Dict[str, Any], optional) – Filters results based on the feature not equaling the value. Defaults to None.
- filter_in (Dict[str, list], optional) – Filters results based on the feature being contained in the values. Defaults to None.
- filter_not_in (Dict[str, list], optional) – Filters results based on the feature not being contained in the values. Defaults to None.
- filter_between (Dict[str, tuple], optional) – Filters results based on the feature being between the values. Defaults to None.
- filter_like (Dict[str, str], optional) – Filters results based on the feature being like the clause. Defaults to None.
- filter_not_like (Dict[str, str], optional) – Filters results based on the feature not being like the clause. Defaults to None.
- filter_rlike (Dict[str, str], optional) – Filters results based on the feature being matched by the regular expression. Defaults to None.
- filter_null (List[str], optional) – Filters results to show null values of the specified features. Defaults to None.
- filter_not_null (List[str], optional) – Filters results to exclude null values of the specified features. Defaults to None.
- order_by (List[Tuple[str, str]], optional) – The sort order for the returned dataframe. Accepts a list of tuples of the feature query name and ordering respectively: [('feature_name_1', 'DESC'), ('feature_2', 'ASC'), …]. Defaults to None for AtScale Engine default sorting.
- limit (int, optional) – Limit the number of results. Defaults to None for no limit.
- comment (str, optional) – A comment string to build into the query. Defaults to None for no comment.
- use_aggs (bool, optional) – Whether to allow the query to use aggs. Defaults to True.
- gen_aggs (bool, optional) – Whether to allow the query to generate aggs. Defaults to True.
- Returns: A pyspark DataFrame containing the query results.
- Return type: pyspark.sql.dataframe.DataFrame
get_data_spark_jdbc
Uses the provided information to establish a JDBC connection to the underlying data warehouse. Generates a query against the data model and uses the provided spark_session to execute it. Returns the results in a spark DataFrame. Be sure that values passed to filters match the data type of the feature being filtered. Decimal precision in returned numeric features may differ from other variations of the get_data function.
- Parameters:
- feature_list (List[str]) – The list of feature query names to query.
- spark_session (pyspark.sql.SparkSession) – The pyspark SparkSession to execute the query with.
- jdbc_format (str) – The driver class name. For example: 'jdbc', 'net.snowflake.spark.snowflake', 'com.databricks.spark.redshift'.
- jdbc_options (Dict[str, str]) – Case-insensitive connection options for JDBC.
- filter_equals (Dict[str, Any], optional) – Filters results based on the feature equaling the value. Defaults to None.
- filter_greater (Dict[str, Any], optional) – Filters results based on the feature being greater than the value. Defaults to None.
- filter_less (Dict[str, Any], optional) – Filters results based on the feature being less than the value. Defaults to None.
- filter_greater_or_equal (Dict[str, Any], optional) – Filters results based on the feature being greater than or equal to the value. Defaults to None.
- filter_less_or_equal (Dict[str, Any], optional) – Filters results based on the feature being less than or equal to the value. Defaults to None.
- filter_not_equal (Dict[str, Any], optional) – Filters results based on the feature not equaling the value. Defaults to None.
- filter_in (Dict[str, list], optional) – Filters results based on the feature being contained in the values. Defaults to None.
- filter_not_in (Dict[str, list], optional) – Filters results based on the feature not being contained in the values. Defaults to None.
- filter_between (Dict[str, tuple], optional) – Filters results based on the feature being between the values. Defaults to None.
- filter_like (Dict[str, str], optional) – Filters results based on the feature being like the clause. Defaults to None.
- filter_not_like (Dict[str, str], optional) – Filters results based on the feature not being like the clause. Defaults to None.
- filter_rlike (Dict[str, str], optional) – Filters results based on the feature being matched by the regular expression. Defaults to None.
- filter_null (List[str], optional) – Filters results to show null values of the specified features. Defaults to None.
- filter_not_null (List[str], optional) – Filters results to exclude null values of the specified features. Defaults to None.
- order_by (List[Tuple[str, str]], optional) – The sort order for the returned dataframe. Accepts a list of tuples of the feature query name and ordering respectively: [('feature_name_1', 'DESC'), ('feature_2', 'ASC'), …]. Defaults to None for AtScale Engine default sorting.
- limit (int, optional) – Limit the number of results. Defaults to None for no limit.
- comment (str, optional) – A comment string to build into the query. Defaults to None for no comment.
- use_aggs (bool, optional) – Whether to allow the query to use aggs. Defaults to True.
- gen_aggs (bool, optional) – Whether to allow the query to generate aggs. Defaults to True.
- Returns: A pyspark DataFrame containing the query results.
- Return type: pyspark.sql.dataframe.DataFrame
get_database_query
Returns a database query generated using the data model to get the given features. Be sure that values passed to filters match the data type of the feature being filtered.
- Parameters:
- feature_list (List[str]) – The list of feature query names to query.
- filter_equals (Dict[str, Any], optional) – A dictionary of features to filter for equality to the value. Defaults to None.
- filter_greater (Dict[str, Any], optional) – A dictionary of features to filter greater than the value. Defaults to None.
- filter_less (Dict[str, Any], optional) – A dictionary of features to filter less than the value. Defaults to None.
- filter_greater_or_equal (Dict[str, Any], optional) – A dictionary of features to filter greater than or equal to the value. Defaults to None.
- filter_less_or_equal (Dict[str, Any], optional) – A dictionary of features to filter less than or equal to the value. Defaults to None.
- filter_not_equal (Dict[str, Any], optional) – A dictionary of features to filter not equal to the value. Defaults to None.
- filter_in (Dict[str, list], optional) – A dictionary of features to filter in a list. Defaults to None.
- filter_not_in (Dict[str, list], optional) – A dictionary of features to filter not in a list. Defaults to None.
- filter_between (Dict[str, tuple], optional) – A dictionary of features to filter between the tuple values. Defaults to None.
- filter_like (Dict[str, str], optional) – A dictionary of features to filter like the value. Defaults to None.
- filter_not_like (Dict[str, str], optional) – A dictionary of features to filter not like the value. Defaults to None.
- filter_rlike (Dict[str, str], optional) – A dictionary of features to filter rlike the value. Defaults to None.
- filter_null (List[str], optional) – A list of features to filter for null. Defaults to None.
- filter_not_null (List[str], optional) – A list of features to filter for not null. Defaults to None.
- order_by (List[Tuple[str, str]], optional) – The sort order for the returned query. Accepts a list of tuples of the feature query name and ordering respectively: [('feature_name_1', 'DESC'), ('feature_2', 'ASC'), …]. Defaults to None for AtScale Engine default sorting.
- limit (int, optional) – A limit to put on the query. Defaults to None.
- comment (str, optional) – A comment to put in the query. Defaults to None.
- use_aggs (bool, optional) – Whether to allow the query to use aggs. Defaults to True.
- gen_aggs (bool, optional) – Whether to allow the query to generate aggs. Defaults to True.
- Returns: The generated database query
- Return type: str
get_dataset
Gets the metadata of a dataset.
- Parameters: dataset_name (str) – The name of the dataset to pull.
- Returns: A dictionary of the metadata for the dataset.
- Return type: Dict
get_dataset_names
Gets the names of all datasets currently used by the DataModel and returns them as a list.
- Returns: list of dataset names
- Return type: List[str]
get_dimension_dataset_names
Gets the names of all dimension datasets currently used by the DataModel and returns them as a list.
- Returns: list of dimension dataset names
- Return type: List[str]
get_dimensions
Gets a dictionary of dictionaries with the published dimension names and metadata.
- Returns: A dictionary of dictionaries where the dimension names are the keys in the outer dictionary, while the inner keys are the following: 'description', 'type' (value is Time or Standard).
- Return type: Dict
get_fact_dataset_names
Gets the names of all fact datasets currently used by the DataModel and returns them as a list.
- Returns: list of fact dataset names
- Return type: List[str]
get_feature_description
Returns the description of a given published feature.
- Parameters: feature (str) – The query name of the feature to retrieve the description of.
- Returns: The description of the given feature.
- Return type: str
get_feature_expression
Returns the expression of a given published feature.
- Parameters: feature (str) – The query name of the feature to return the expression of.
- Returns: The expression of the given feature.
- Return type: str
get_features
Gets the feature names and metadata for each feature in the published DataModel.
- Parameters:
- feature_list (List[str], optional) – A list of feature query names to return. Defaults to None to return all. All features in this list must exist in the model.
- folder_list (List[str], optional) – A list of folders to filter by. Defaults to None to ignore folder.
- feature_type (enums.FeatureType, optional) – The type of features to filter by. Options include enums.FeatureType.ALL, enums.FeatureType.CATEGORICAL, or enums.FeatureType.NUMERIC. Defaults to ALL.
- Returns: A dictionary of dictionaries where the feature names are the keys in the outer dictionary, while the inner keys are the following: 'data_type' (value is a level-type, 'Aggregate', or 'Calculated'), 'description', 'expression', 'caption', 'folder', and 'feature_type' (value is Numeric or Categorical).
- Return type: Dict
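To make the nested return shape concrete, here is a small sketch that works on a hand-written dictionary shaped like the description above. The feature names and all values are made up, and the value types of 'folder' and 'expression', as well as the level-type string 'Standard', are assumptions; `names_by_type` is a hypothetical helper, not part of the library.

```python
# Made-up sample shaped like the documented return value of get_features.
features = {
    "total_sales": {
        "data_type": "Aggregate", "description": "Sum of sales",
        "expression": "", "caption": "Total Sales",
        "folder": "Sales", "feature_type": "Numeric",
    },
    "avg_order_value": {
        "data_type": "Calculated", "description": "Sales per order",
        "expression": "[Measures].[total_sales] / [Measures].[order_count]",
        "caption": "Average Order Value",
        "folder": "Sales", "feature_type": "Numeric",
    },
    "product_category": {
        "data_type": "Standard",  # placeholder level type (assumption)
        "description": "Product grouping", "expression": "",
        "caption": "Product Category",
        "folder": "Products", "feature_type": "Categorical",
    },
}


def names_by_type(features, feature_type):
    """Return the sorted feature names whose 'feature_type' matches."""
    return sorted(
        name for name, meta in features.items()
        if meta["feature_type"] == feature_type
    )


print(names_by_type(features, "Numeric"))      # ['avg_order_value', 'total_sales']
print(names_by_type(features, "Categorical"))  # ['product_category']
```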
get_folders
Returns a list of the available folders in the published DataModel.
- Returns: A list of the available folders
- Return type: List[str]
get_hierarchies
Gets a dictionary of dictionaries with the published hierarchy names and metadata. Secondary attributes are treated as their own hierarchies; they are hidden by default but can be shown with the secondary_attribute parameter.
- Parameters:
- secondary_attribute (bool, optional) – Whether to include secondary attributes. True returns hierarchies and secondary attributes; False returns only hierarchies that are not secondary attributes. Defaults to False.
- folder_list (List[str], optional) – The list of folders in the data model containing hierarchies to exclusively list. Defaults to None to not filter by folder.
- Returns: A dictionary of dictionaries where the hierarchy names are the keys in the outer dictionary, while the inner keys are the following: 'dimension', 'description', 'caption', 'folder', 'type' (value is Time or Standard), 'secondary_attribute'.
- Return type: Dict
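One use of this metadata is picking out time hierarchies, for example as candidates for the time_hierarchy argument of generate_time_series_features. The sketch below runs on a hand-written dictionary shaped like the description above; the hierarchy names and values are made up, the boolean type of 'secondary_attribute' is an assumption, and `time_hierarchy_names` is a hypothetical helper.

```python
# Made-up sample shaped like the documented return value of get_hierarchies.
hierarchies = {
    "order_date_hierarchy": {
        "dimension": "Order Date", "description": "", "caption": "Order Date",
        "folder": "Dates", "type": "Time", "secondary_attribute": False,
    },
    "product_hierarchy": {
        "dimension": "Product", "description": "", "caption": "Product",
        "folder": "Products", "type": "Standard", "secondary_attribute": False,
    },
    "product_color": {
        "dimension": "Product", "description": "", "caption": "Color",
        "folder": "Products", "type": "Standard", "secondary_attribute": True,
    },
}


def time_hierarchy_names(hierarchies):
    """Return the sorted names of hierarchies whose 'type' is 'Time'."""
    return sorted(
        name for name, meta in hierarchies.items() if meta["type"] == "Time"
    )


print(time_hierarchy_names(hierarchies))  # ['order_date_hierarchy']
```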
get_hierarchy_levels
Gets a list of strings for the levels of a given published hierarchy.
- Parameters: hierarchy_name (str) – The query name of the hierarchy
- Returns: A list containing the hierarchy’s levels
- Return type: List[str]
get_secondary_attributes_at_level
Gets the secondary attributes that are tied to the provided level.
- Parameters: level (str) – The level whose secondary attributes should be returned.
- Returns: A list of attribute names
- Return type: List[str]
property id : str
Getter for the id instance variable.
- Returns: The id of this model
- Return type: str
is_perspective()
Checks if this DataModel is a perspective.
- Returns: True if this is a perspective, else False.
- Return type: bool
property name : str
Getter for the name instance variable. The name of the data model.
- Returns: The textual identifier for the data model.
- Return type: str
submit_atscale_query
Submits the given query against the data model and returns the results in a pandas DataFrame.
- Parameters:
- query (str) – The SQL query to submit.
- use_aggs (bool, optional) – Whether to allow the query to use aggs. Defaults to True.
- gen_aggs (bool, optional) – Whether to allow the query to generate aggs. Defaults to True.
- fake_results (bool, optional) – Whether to use fake results, often used to train aggregates with queries that will frequently be used. Defaults to False.
- use_local_cache (bool, optional) – Whether to allow the query to use the local cache. Defaults to True.
- use_aggregate_cache (bool, optional) – Whether to allow the query to use the aggregate cache. Defaults to True.
- timeout (int, optional) – The number of minutes to wait for a response before timing out. Defaults to 10.
- Returns: A pandas DataFrame containing the query results.
- Return type: DataFrame
validate_mdx
Verifies whether the given MDX expression is valid for the current data model.
- Parameters: expression (str) – The MDX expression for the feature.
- Returns: True if the MDX expression is valid.
- Return type: bool