Adding Databricks Data Warehouses

A Databricks data warehouse contains the tables and views that you want to import as fact datasets and dimension datasets into AtScale models. It also contains aggregate table instances that either you or the AtScale engine creates in a database that you specify.

info

To configure a Databricks data warehouse, you must have a special AtScale license. Ask AtScale Support about obtaining a Databricks-enabled license.

Prerequisites

For AtScale:

  • You must have the datawarehouses_admin realm role assigned in the Identity Broker.
  • You must know the schema to use for building aggregate tables in the data warehouse. AtScale reads and writes aggregate data in this schema, so the AtScale service account must have ALL privileges on it. BI tool user accounts should not have SELECT permission on it.

For Databricks:

  • Use a supported version of Databricks. For more information, see Supported Tools and Platforms.
  • When using Databricks SQL or Unity Catalog, make sure they are configured as needed. For details, see Databricks SQL guide and What is Unity Catalog.
  • Databricks does not support JDBC authentication with a username and password, so you must generate a personal access token instead. Databricks recommends using access tokens that belong to service principals; for more information about obtaining them, see Manage service principals. When creating the Databricks connection (described below), use the token as the password and the literal string token as the username. (A quick way to verify the token is sketched after the note below.)
note

Alternatively, you can use Databricks Partner Connect, in which case Databricks provisions the required resources (a service principal, a personal access token, and a SQL warehouse) on your behalf. For more information, see Databricks' AtScale Partner Connect documentation for AWS or Azure.
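
Before adding the connection in AtScale, you may want to confirm that the personal access token authenticates successfully. The following is a minimal sketch, assuming the databricks-sql-connector Python package; the hostname, HTTP path, and token values are placeholders:

```python
# Sanity check: confirm a Databricks personal access token can authenticate.
# server_hostname, http_path, and access_token below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())  # a single row confirms the token works
```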

Prerequisites for enabling impersonation

AtScale supports impersonation for Databricks data warehouses. To configure impersonation, you must use Microsoft Entra ID with OpenID Connect as your IdP. For instructions on configuring this in the Identity Broker, see Configuring Microsoft Entra ID With OpenID Connect.

You must also do the following:

  1. Ensure that your Databricks instance is registered in Entra ID.

  2. In Azure:

    1. Open the application for the AtScale Identity Broker.
    2. On the Overview page, click View API Permissions, then click Add a permission.
    3. In the Request API permissions pane, go to the APIs my organization uses tab, then search for and select AzureDatabricks.
    4. Select the user_impersonation checkbox.
    5. Click Add permissions.
  3. In the Identity Broker:

    1. In the left-hand navigation, select Identity providers, then open your Entra ID IdP for editing.
    2. Open the Advanced dropdown and enable the Disable user info option.
    3. In the Scopes field, enter 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default. This is the URL-encoded form of 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default, which is the application ID of the Azure Databricks resource followed by the default scope. For more information, refer to the Microsoft documentation. (The token exchange this scope enables is sketched after this section.)
    4. Go to the Mappers tab and click Add mapper.
    5. Define an attribute mapper for one of your Entra ID attributes that covers all users who require access to your Databricks data warehouse.
    6. In the Role field, select the query_user and impersonation_user roles.
    7. Click Save.

When you add your Databricks data warehouse in AtScale, be sure to select the Enable to impersonate the client when connecting to data warehouse checkbox. For instructions, see the following section.
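
With impersonation enabled, the Identity Broker exchanges the signed-in user's token for an Entra ID access token scoped to the Databricks resource. You do not need to perform this exchange yourself, but the following minimal sketch of Microsoft's on-behalf-of flow (using Python's requests package; all tenant, client, and assertion values are placeholders) illustrates what the scope configured above is used for:

```python
# Illustrative only: the on-behalf-of (OBO) token exchange performed for
# Databricks impersonation. All credential values below are placeholders.
import requests

TENANT_ID = "<entra-tenant-id>"

response = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "client_id": "<identity-broker-app-id>",
        "client_secret": "<identity-broker-app-secret>",
        "assertion": "<signed-in-user-access-token>",
        # Application ID of the Azure Databricks resource plus the default
        # scope (the decoded form of the Scopes value configured above).
        "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
        "requested_token_use": "on_behalf_of",
    },
)
print(response.json().get("access_token", response.text))
```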

Add a data warehouse

  1. In Design Center, open the Datasources panel.

  2. Click the + icon and select Databricks.


    The Add Data Warehouse panel opens.

  3. Enter a unique Name for the data warehouse. AtScale displays this name in the Design Center and uses it in log files.

  4. (Optional) Enter the External connection ID. This field defaults to the Name value; to override it, enable the Override generated value? toggle.

  5. Enter the name of the Aggregate Catalog, which AtScale uses when creating aggregate tables. You must create this; AtScale does not create one automatically. Also, be aware of the following:

    • If the catalog name contains non-alphabetic characters, surround the name with backticks (`).
    • AtScale supports Databricks SQL clusters both with and without Unity Catalog enabled; three-part namespace support is therefore enabled for Databricks SQL.
  6. Enter the name of the Aggregate Schema for AtScale to use when creating aggregate tables. You must create this schema; AtScale does not create one automatically.

    If Unity Catalog is disabled, the aggregate schema resides in the default hive_metastore catalog. If Unity Catalog is enabled, you choose the catalog and schema in which aggregates are stored, as shown in the sketch below.
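
    Because AtScale does not create these objects, you can create them and grant privileges up front. The following is a minimal sketch, assuming Unity Catalog, the databricks-sql-connector Python package, and hypothetical atscale_aggs / aggregates names and service-account principal:

    ```python
    # Create the aggregate catalog and schema, then grant the AtScale
    # service account full privileges. All identifiers are hypothetical.
    from databricks import sql

    with sql.connect(
        server_hostname="<workspace-host>.cloud.databricks.com",
        http_path="/sql/1.0/warehouses/<warehouse-id>",
        access_token="<admin-personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("CREATE CATALOG IF NOT EXISTS atscale_aggs")
            cursor.execute("CREATE SCHEMA IF NOT EXISTS atscale_aggs.aggregates")
            # The AtScale service account needs ALL PRIVILEGES on this schema
            # (see the Prerequisites section above).
            cursor.execute(
                "GRANT ALL PRIVILEGES ON SCHEMA atscale_aggs.aggregates "
                "TO `atscale-service@example.com`"
            )
    ```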

  7. (Optional) In the Impersonation section, select the Enable to impersonate the client when connecting to data warehouse checkbox to enable impersonation. Note that this requires you to have completed the prerequisites for enabling impersonation above.

    When this option is enabled, the following options appear:

    • Always use Canary Queries: Determines whether canary queries are required when queries cannot be satisfied by aggregates (aggregate misses).
    • Allow Partial Aggregate Usage: Allows mixed aggregate and raw data queries.
  8. In the Table storage formats section, select Delta.

  9. If you need to use objects in a file system:

    1. In the Configure Filesystem section, enable the Enable file system option.
    2. Select the file system type.
    3. Enter the URI and other required details for the filesystem.
  10. In the Access Controls section, add the users and groups that you want to be able to access the data warehouse and its data. For more information on how data warehouse security works in AtScale, see Data Warehouse Security.

  11. Add a connection to the data warehouse. You must have at least one defined to finish adding the data warehouse.

    1. In the Connections section, click Connection Required. The Add Connection panel opens.

    2. Complete the following fields:

      • Name: The name of the connection.
      • Host: The location of the Databricks server.
      • Port: The port number to route transmitted data.
    3. In the Authorization section:

      1. Enter the Username AtScale uses when connecting to the data warehouse. Per the prerequisites above, this is typically the literal string token.
      2. (Optional) In the Authentication Method field, select Password, then enter the password AtScale uses when connecting to the data warehouse (typically the personal access token).
    4. (Optional) In the Additional Settings section:

      1. Access your Databricks cluster's JDBC connection details and note the values of the following flags: transportMode, ssl, httpPath, AuthMech, UID, PWD, and (optionally) UseNativeQuery=1.
      2. In the Extra jdbc flags field, enter all of the above flags for the connection (except httpPath, which will be set in the next step). For example: transportMode=http;ssl=1;AuthMech=3;UID=token;PWD=<tokenID>;UseNativeQuery=1
      3. In the Http path field, enter the httpPath flag. For example: httpPath=<path>
    5. Click Test connection to test the connection.

    6. Click Add.

      The connection is added to the data warehouse. You can add more connections as needed.
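
    For reference, the Extra jdbc flags and Http path values together form a standard Databricks JDBC URL. The following minimal Python sketch assembles one from the example flags above; all values are placeholders, and the jdbc:databricks:// prefix assumes the current Databricks JDBC driver:

    ```python
    # Assemble a Databricks JDBC URL from the flags described in this step.
    # Host, token, and HTTP path values are placeholders.
    flags = {
        "transportMode": "http",
        "ssl": "1",
        "AuthMech": "3",  # username/password auth; here: token / <tokenID>
        "UID": "token",
        "PWD": "<personal-access-token>",
        "UseNativeQuery": "1",  # optional
        "httpPath": "/sql/1.0/warehouses/<warehouse-id>",
    }
    extra = ";".join(f"{key}={value}" for key, value in flags.items())
    print(f"jdbc:databricks://<host>:443;{extra}")
    ```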

  12. (Optional) If your data warehouse has multiple connections, you can configure their query mappings. This determines which types of queries are permitted per connection. A query role can be assigned to only one connection at a time.

    In the Query Mapping section, select the connection you want to assign to each type of query:

    • Small Interactive Queries: Small user queries. These are usually BI queries that can be optimized by AtScale aggregates or ones that involve low-cardinality dimensions.
    • Large Interactive Queries: Large user queries. These are usually BI queries involving full scans of the raw fact table or ones that involve high-cardinality dimensions.
    • Aggregate Build Queries: Queries made when building aggregate tables. This mapping should be assigned to a connection that has more compute power, and that is not subject to resource limitations that would prevent it from building aggregate tables.
    • System Queries: AtScale system processing, such as running aggregate maintenance jobs and querying system statistics.
    • Canary Queries: Queries that run periodically to monitor system response times, workload, etc.
  13. Click Save Data Warehouse.