About Incremental Rebuilds
Incremental rebuilds of aggregate tables append new rows and update existing rows that fall within a time period that you specify.
Incremental indicators
To specify that you want aggregate tables for a model to be rebuilt incrementally, specify a column in the fact dataset to use as an incremental indicator. This column must have monotonically increasing values, such as a numeric UNIX timestamp (seconds since epoch) or a Timestamp/DateTime. The values in this column enable the AtScale engine both to append rows to an aggregate table and to update rows during an incremental rebuild.
Incremental aggregates are database views that consolidate one or more incrementally built tables. Periodically, these tables are consolidated to maintain optimal query performance; you can specify how many tables to allow before consolidation takes place. The distribution of data in the underlying aggregate tables is based on the incremental indicator values, and tables are segmented by ranges of those values. Incremental indicator values are used as an internal tracking mechanism and are not intended for querying in the incrementally built aggregate.
Supported incremental indicator types
AtScale supports the following incremental indicator types:
- Long
- Integer
- Decimal (38,0)
- Timestamp
- DateTime
Be aware of the following:
- Decimal is only supported for Snowflake data warehouses.
- The fully supported precision for TIMESTAMP is up to TIMESTAMP(9) or nanoseconds. When reading a TIMESTAMP(12) value, the last 3 digits are truncated.
- For Google BigQuery, you can use the Timestamp Incremental Indicator to lower your costs by leveraging Timestamp-based partition pruning.
Appending Rows
During an incremental aggregate build, AtScale retrieves rows from the fact dataset that have an incremental indicator value greater than the last value recorded in the incremental aggregate.
Example
Suppose that the last incremental indicator key value used in the incremental aggregate is 1475191168000 (the UNIX timestamp in milliseconds since epoch for Thu, 29 Sep 2016 23:19:28 GMT). When an incremental rebuild of the aggregate begins, the engine looks in the fact dataset for rows with an incremental indicator value greater than this value, performs the necessary data aggregation, and appends the resulting rows to the aggregate.
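The append pass described above can be sketched in a few lines of Python. This is a minimal illustration of the selection logic only, not AtScale internals; the column name orderdatekey and the row values are hypothetical.

```python
# Hypothetical sketch of the append phase of an incremental rebuild.
# The engine keeps only fact rows whose indicator value exceeds the
# last value recorded in the aggregate.

LAST_INDICATOR = 1475191168000  # last indicator value in the aggregate

fact_rows = [
    {"orderdatekey": 1475191167000, "sales": 10.0},  # already aggregated
    {"orderdatekey": 1475191169000, "sales": 25.0},  # new since last build
    {"orderdatekey": 1475277600000, "sales": 5.0},   # new since last build
]

# Rows with an indicator greater than the last recorded value are
# aggregated and appended; older rows are skipped entirely.
new_rows = [r for r in fact_rows if r["orderdatekey"] > LAST_INDICATOR]
```

In a real build this filter is pushed down to the data warehouse as a WHERE clause rather than evaluated in application code.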
When the incremental indicator column type is a timestamp, the grace period is always interpreted as seconds.
Updating Rows
You also specify a grace period when you select a fact dataset's column to use as an incremental indicator. When the AtScale engine starts an incremental build, the grace period determines how far back in time the engine looks for updates to rows in the fact dataset, based on the units that are in the indicator. The engine then updates rows in the aggregate table where the incremental indicator matches those of rows in the fact dataset.
At the start of an incremental rebuild, the AtScale engine runs a query against an aggregate to find the latest incremental indicator value, which determines what new data has been added since the last update. The query uses a WHERE filter so that only data since the last update is scanned, which enables partition pruning against partitioned fact datasets.
Moreover, if more than one aggregate configured for incremental rebuilds is based on the same fact datasets, the engine runs the query to see what new data has been added since the last update and uses the result for the remaining aggregates. For example, suppose that six aggregates configured for incremental rebuilds are based on the table factinternetsales. The query is run on the first of these six aggregates and the result is used for all six. This process saves time and resources, versus the alternative of running the query for each aggregate.
Example
As in the previous example, suppose that 1475191168000 is the last incremental indicator key value in an aggregate table. The grace period is 86400 seconds (one 24-hour period). The engine looks back in the fact dataset to the first row where the indicator is 86400 seconds earlier than the moment that the incremental build began. The engine also does this in the aggregate table.
Next, the engine compares the data in the fact set with the aggregated data in the aggregate table, moving forward in time in the data until the engine reaches the point in time when the incremental build began. As it compares data, the engine updates the aggregated data when it finds changes in the fact dataset.
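The grace-period lookback and update pass can be sketched as follows. This is an illustrative Python sketch, not AtScale internals; the indicator values, the dictionary-based aggregate, and the merge logic are all hypothetical simplifications of the comparison described above.

```python
# Hypothetical sketch of the grace-period update pass. The indicator
# here is a UNIX timestamp in seconds, and the grace period is one
# 24-hour period (86400 seconds).

GRACE_PERIOD = 86400  # seconds

def lookback_start(build_start: int, grace: int = GRACE_PERIOD) -> int:
    """First indicator value re-examined for late-arriving updates."""
    return build_start - grace

def apply_updates(aggregate: dict, fact_window: dict) -> dict:
    """Replace aggregated values wherever re-aggregated fact data
    inside the grace window differs from what the aggregate holds."""
    merged = dict(aggregate)
    merged.update(fact_window)  # fact data wins inside the window
    return merged

# Aggregate rows keyed by indicator value, and a late-arriving change
# to one row inside the grace window:
agg = {1475104768: 100.0, 1475150000: 50.0}
fact = {1475150000: 75.0}
updated = apply_updates(agg, fact)
```

Only rows whose indicator falls inside the grace window are re-examined; everything older is assumed final and left untouched.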
Partition-based incremental aggregate rebuilds
Partition-based incremental aggregates are a Public Preview feature.
Although AtScale's standard, view-based incremental aggregate functionality accelerates aggregate build times, it also adds a significant run-time penalty to report queries that select from incrementally-built aggregate tables.
If you have encountered this issue, you may want to enable AtScale's partition-based incremental aggregates. This functionality leverages the data warehouse's native partition (or cluster block) pruning capabilities when building and querying incremental aggregates, thereby reducing the run-time query overhead observed in the view-based implementation.
Partition-based incremental aggregates are only supported for the following data warehouse platforms: Snowflake, Databricks, BigQuery.
The following sections describe how partition-based incremental aggregates work in AtScale.
Incremental indicators and grace periods
Like AtScale's view-based incremental aggregate functionality, partition-based incremental aggregates rely on an incremental indicator to track changes in the data, and a grace period to account for data that may have arrived late. For more information on how these work, refer to Incremental indicators above.
There is a known issue with partition-based incremental aggregate builds that arises when the incremental indicator column is a number type, such as Long, Decimal(38,0), or BigInteger. When using such a type, the incremental build routine does not correctly convert the grace period time offset and rebuilds more records than specified by the grace period parameter. This issue does not affect models that use an incremental indicator column of a date-like type, such as Date, Timestamp, or DateTime.
Partition strategies
In addition to the incremental indicator and grace period, partition-based incremental aggregates require a partition strategy, which determines the method used to select the partition key:
- Distribution: The system uses statistics to determine the partition key.
- User: The user defines the partition key.
- Indicator: The incremental indicator is used as the partition key.
You can configure the partition strategy using the aggregate.incrementalUpdate.partition.strategy setting, which is available at both the global and model levels. For more information, see Setting Properties to Allow Incremental Rebuilds of Aggregates, Aggregate Settings (global level), and Aggregate Settings (model level).
The following sections describe each partition strategy in detail.
Distribution strategy
With the distribution partition strategy, AtScale automatically selects the best partition key using the following workflow.
When using the distribution strategy, if you specify a partition key for the aggregate, the system does not build aggregates incrementally. Instead, it builds a non-incremental aggregate with the user-defined partition.
Step 1: Identify candidate partition keys
First, the system identifies candidate partition keys. Preference is given to columns that are already present in the aggregate definition (for user-defined aggregates), or that are provided as key hints (for system-defined aggregates). Only these candidates are evaluated; all others from the hierarchy are ignored.
For UDAs, this means you can influence partition key selection through the aggregate definition. For example, if a UDA already includes a Month column, and Month is a valid candidate, Month will be selected even if Quarter might score better.
For system-defined aggregates, hinted keys are partition keys on the query the aggregate originated from. The system determines which of these belong to a hierarchy referenced by the incremental indicator and ignores other candidates.
If no candidates are included in the aggregate definition or provided as key hints, the system examines the incremental indicator's hierarchy, taking all keys referenced by the indicator and walking up the hierarchy tree for each to obtain its ancestor levels. Each ancestor level's primary key becomes a candidate. Additionally, the system determines the granularity of each candidate, or the number of ancestors it has.
For example, say the incremental indicator is a column called orderdatekey, which is part of a hierarchy that contains the levels Day, Month, Quarter, and Year. These levels become the candidate keys, with the following granularities: Year=0, Quarter=1, Month=2, Day=3.
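The candidate derivation in the example above can be expressed as a short Python sketch. The hierarchy levels are taken from the example; the data structures are hypothetical, not AtScale internals.

```python
# Hypothetical sketch: deriving candidate partition keys from the
# incremental indicator's hierarchy, ordered coarse to fine.
hierarchy = ["Year", "Quarter", "Month", "Day"]

# Each level's primary key becomes a candidate; its granularity is
# the number of ancestor levels above it in the hierarchy.
candidates = {level: depth for depth, level in enumerate(hierarchy)}
```

This yields Year=0, Quarter=1, Month=2, Day=3, matching the granularities in the example.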
Step 2: Estimate the size of the partitions
Next, the system estimates the size of the partitions based on the total row count of the aggregate without partitions and the cardinality of each partition key candidate.
Step 3: Score the candidates
Finally, the system scores each candidate based on the following criteria:
- How close the size of the candidate partition is to the target partition size. The target partition size is configured by the aggregate.incrementalUpdate.partition.targetSize setting, which is available at both the global and model levels. By default, this is set to 1000000.
- The granularity of the partition. A slight preference is given to partitions with finer granularity.
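AtScale's exact scoring formula is not documented, but the two criteria above can be combined in a sketch like the following. The weights and the shape of the formula are assumptions for illustration only; in this sketch, as in AtScale, a lower score is better.

```python
# Hypothetical scoring sketch. The distance term measures how far the
# estimated partition size is from the target; the granularity term
# applies a slight preference for finer-grained partitions.

TARGET_SIZE = 1_000_000  # aggregate.incrementalUpdate.partition.targetSize

def score(total_rows: int, cardinality: int, granularity: int) -> float:
    """Lower is better."""
    est_partition_size = total_rows / max(cardinality, 1)
    distance = abs(est_partition_size - TARGET_SIZE) / TARGET_SIZE
    return distance - 0.01 * granularity  # assumed tie-break weight

# Example: a 10M-row aggregate. Month (cardinality 12, granularity 2)
# yields partitions near the target; Day (cardinality 365,
# granularity 3) yields partitions far below it.
month = score(10_000_000, 12, granularity=2)
day = score(10_000_000, 365, granularity=3)
```

With these illustrative numbers, Month scores lower than Day and would be selected.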
Step 4: Select the best candidate
Once all candidates have been identified and scored, the system selects the one with the lowest score.
User strategy
With the user partition strategy, the partition key is defined by the user. For UDAs, this is done by including the partition: key element in the aggregate definition within the model's SML file. For system-defined aggregates, it is done by defining a partition in the model's SML file.
The aggregate is only built incrementally if the key corresponds to a level in the dimension hierarchy referenced by the aggregate's incremental indicator. For example, let's say you have a dataset whose incremental indicator is a column called orderdatekey. orderdatekey corresponds to a level in a hierarchy that contains the following levels: Day, Month, Quarter, Year. If the partition key you use points to the Month level, then your key is valid and defines a partition. If your partition key points to a level called ProductCategory, which is in a different hierarchy than orderdatekey, then it is invalid and is not used as a partition.
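The validity check described above amounts to a simple membership test against the indicator's hierarchy. The sketch below uses the level names from the example; it is an illustration of the rule, not AtScale's implementation.

```python
# Hypothetical sketch of user-strategy partition key validation:
# the key must name a level in the hierarchy referenced by the
# dataset's incremental indicator (orderdatekey, in the example).
indicator_hierarchy = {"Day", "Month", "Quarter", "Year"}

def is_valid_partition_key(level: str) -> bool:
    """True only if the key is a level in the indicator's hierarchy."""
    return level in indicator_hierarchy

# "Month" is in the orderdatekey hierarchy, so it defines a partition;
# "ProductCategory" belongs to a different hierarchy, so it is ignored.
```

An invalid key does not fail the build; the aggregate is simply built without an incremental partition.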
For more information on the partition properties, refer to the AtScale SML Object Documentation on GitHub.
Indicator strategy
With the indicator strategy, the dataset's incremental indicator column is used as the partition key.
When using the indicator strategy, if you specify a partition key for the aggregate, the system does not build aggregates incrementally. Instead, it builds a non-incremental aggregate with the user-defined partition.
Incremental rebuilds of aggregates that use joins
You can create aggregate tables that use joins to one or more dimensional datasets, and such aggregate tables can be rebuilt incrementally. See Aggregates for Fact Datasets that Use Joins.
More information
Setting Properties to Allow Incremental Rebuilds of Aggregates