Adding Databricks Data Warehouses
A Databricks data warehouse contains the tables and views that you want to import as fact datasets and dimension datasets into AtScale models. It also contains aggregate table instances that either you or the AtScale engine creates in a database that you specify.
To configure a Databricks data warehouse, you must have a special AtScale license. Ask your Account Executive about obtaining a Databricks-enabled license.
Before you begin
For AtScale:
- You must be logged in as an admin user.
- You must know the schema to use for building aggregate tables in the data warehouse. AtScale reads and writes aggregate data in this schema, so the AtScale service account user must have ALL privileges on it. BI tool user accounts should not have SELECT permission on this schema.
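For illustration, a minimal sketch of such a grant, issued here through the databricks-sql-connector Python package (the schema name, service account, and endpoint values are all hypothetical placeholders):

```python
# Hypothetical sketch: grant the AtScale service account full privileges on
# the aggregate schema. All names and endpoint values are placeholders.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="dapiXXXXXXXXXXXXXXXX",                           # admin token
) as conn:
    with conn.cursor() as cur:
        # ALL PRIVILEGES lets AtScale both read and write aggregate data;
        # BI tool accounts get no grants on this schema.
        cur.execute(
            "GRANT ALL PRIVILEGES ON SCHEMA atscale_aggs "
            "TO `atscale-service@example.com`"
        )
```

You can run the equivalent GRANT statement in the Databricks SQL editor instead; the important point is that only the AtScale service account holds these privileges.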
For Databricks:
- Use a supported version of Databricks. For more information, see Supported Tools and Platforms.
- Configure supported features. When using Databricks SQL or Unity Catalog, make sure they are configured as needed. For details, see the Databricks SQL guide and What is Unity Catalog.
- Generate a personal access token. Because Databricks does not support JDBC authentication with a username and password, you must generate a personal access token. Databricks recommends using access tokens that belong to service principals; for more information about obtaining them, see Manage service principals. When creating the Databricks connection (described below), use the token as the password and `token` as the username.
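Before entering the token in AtScale, you may want to confirm that it authenticates. A minimal sketch using the databricks-sql-connector Python package, with placeholder endpoint values:

```python
# Hypothetical connectivity check: authenticates with the same personal
# access token that AtScale later uses as the JDBC password.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="dapiXXXXXXXXXXXXXXXX",                           # your token
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())  # (1,) means the token and endpoint are valid
```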
Prerequisites for Enabling Impersonation
AtScale supports impersonation for Databricks data warehouses. To configure impersonation, you must use Microsoft Entra ID with OpenID Connect as your IdP. For instructions on configuring this in the Identity Broker, see Configuring Microsoft Entra ID With OpenID Connect.
You must also do the following:
- Ensure that your Databricks instance is instantiated in Entra ID.
- In Azure:
- Open the application for the AtScale Identity Broker.
- On the Overview page, click View API Permissions, then click Add a permission.
- In the Request API permissions pane, go to the APIs my organization uses tab, then search for and select AzureDatabricks.
- Select the user_impersonation checkbox.
- Click Add permissions.
- In the Identity Broker:
- In the left-hand navigation, select Identity providers, then open your Entra ID IdP for editing.
- Open the Advanced dropdown and enable the Disable user info option.
- In the Scopes field, enter `2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default`. This is the programmatic ID for Databricks with the default scope. For more information, refer to the Microsoft documentation.
- Go to the Mappers tab and click Add mapper.
- Define an attribute mapper to one of your Entra ID attributes that includes all users that require access to your Databricks data warehouse.
- In the Role field, select the `query_user` and `impersonation_user` roles.
- Click Save.
When you add your Databricks data warehouse in AtScale, be sure to enable the Enable to impersonate the client when connecting to data warehouse checkbox. For instructions, see the following section.
Add the Data Warehouse
- In Design Center, open the main menu and select Settings. The Settings page opens.
- In the Settings panel, click Data Warehouses.
- In the Add Data Warehouse section, click the Databricks icon. The Add Data Warehouse panel opens.
- Enter a unique Name for the data warehouse. AtScale displays this name in the Design Center and uses it in log files.
- (Optional) Enter an External connection ID. This field defaults to the Name value but can be overridden by enabling the Override generated value? option.
- Enter the name of the Aggregate Catalog, which AtScale uses when creating aggregate tables. You must create this catalog yourself; AtScale does not create one automatically. Also, be aware of the following:
- If the catalog name contains non-alphabetic characters, surround the name with backtick (`) characters.
- AtScale supports Databricks SQL clusters with Unity Catalog either enabled or disabled, so three-part namespace support is enabled for Databricks SQL.
- Enter the name of the Aggregate Schema for AtScale to use when creating aggregate tables. You must create this schema yourself; AtScale does not create one automatically. With Unity Catalog disabled, the aggregate schema resides in the default catalog, hive_metastore. With Unity Catalog enabled, you choose the catalog and schema in which the aggregates are stored, as shown in the sketch below.
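Because AtScale does not create the catalog or schema, you might pre-create them as follows; a minimal sketch using the databricks-sql-connector Python package, with hypothetical names and endpoint values throughout:

```python
# Hypothetical sketch: pre-create the aggregate catalog and schema that
# AtScale will write to. All names and endpoint values are placeholders.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="dapiXXXXXXXXXXXXXXXX",                           # admin token
) as conn:
    with conn.cursor() as cur:
        # Backticks are needed because this example name contains a hyphen
        # (a non-alphabetic character). With Unity Catalog disabled, skip
        # CREATE CATALOG; the schema lands in hive_metastore instead.
        cur.execute("CREATE CATALOG IF NOT EXISTS `atscale-aggs`")
        cur.execute("CREATE SCHEMA IF NOT EXISTS `atscale-aggs`.aggregates")
```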
- (Optional) Specify whether the data warehouse is a Read-only source.
- Select this option if AtScale will be configured to access this database with credentials that do not have permission to create new tables, or if you need to prevent AtScale from placing any aggregate tables in this data warehouse.
- If you enable Read-only source when editing an existing data warehouse that already contains aggregates, all aggregates stored in this data warehouse are deactivated.
- (Optional) In the Impersonation section, select the Enable to impersonate the client when connecting to data warehouse checkbox to enable impersonation. Note that this requires you to have completed the prerequisites for enabling impersonation above.
When this checkbox is selected, the following options appear:
- Always use Canary Queries: Determines whether canary queries are required in the event of aggregate misses.
- Allow Partial Aggregate Usage: Allows mixed aggregate and raw data queries.
- In the Table storage formats section, specify the storage format for new tables (if you have a preference). Select default to let the AtScale engine use the format configured in the `aggregates.tableConfig.preferredStorageFormat` engine setting. For more information, see Engine Settings for Both System-Defined and User-Defined Aggregates.
- If you need to use objects in a file system:
- In the Configure Filesystem section, enable the Enable file system option.
- Select the file system type.
- Enter the URI and other required details for the filesystem.
- Click Apply.
Add a Databricks Connection
After setting up your data warehouse, you must add a connection to the data warehouse before you can run queries. You can establish multiple connections with different configurations to the same data warehouse. If multiple connections are configured, you can assign which types of queries are permitted per connection; for details, see Configuring Query Mapping.
To add a Databricks connection:
- On the Data Warehouses page, click the expand icon for your data warehouse.
- In the Add Connection field, click the plus icon. The Add Connection panel opens.
- Complete the following fields:
- Name: The name of the connection.
- Host: The location of the Databricks server.
- Port: The port number used for the connection.
- Database: The name of the default database to use after connecting.
- (Optional) In the Extra jdbc flags field, enter the JDBC flags that are needed for the connection:
- Access your Databricks cluster to retrieve your JDBC information.
- Make sure all of the following flags are set: `transportMode`, `ssl`, `httpPath`, `AuthMech`, `UID`, `PWD`.
- (Optional) Set the `UseNativeQuery=1` flag. The following is an example: `transportMode=http;ssl=1;httpPath=<path>;AuthMech=3;UID=token;PWD=<tokenID>;UseNativeQuery=1`.
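As a reference for what each flag does, here is a small sketch that assembles the example string above (the `httpPath` and token values are placeholders):

```python
# Assemble the extra JDBC flags from the example above. The httpPath and
# token values are placeholders from your cluster's JDBC/ODBC details.
flags = ";".join([
    "transportMode=http",  # Databricks JDBC connections run over HTTP transport
    "ssl=1",               # encrypt traffic to the cluster
    "httpPath=/sql/1.0/warehouses/abc123def456",  # placeholder HTTP path
    "AuthMech=3",          # username/password mechanism (token-based here)
    "UID=token",           # the literal username 'token'
    "PWD=dapiXXXXXXXXXXXXXXXX",  # placeholder personal access token
    "UseNativeQuery=1",    # optional: pass queries through unmodified
])
print(flags)  # paste the result into the Extra jdbc flags field
```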
- In the Authorization section, enter the username AtScale uses when connecting to the data warehouse. For token-based authentication, this is the literal string `token`, as described in Before you begin.
- In the Authentication Method field, select the authentication method to use when connecting to the data warehouse, then complete the credential fields as needed.
- Click Test connection to verify that AtScale can connect to the data warehouse.
- Click Apply.
Next Steps
After configuring your data warehouses and connections, you can configure which types of queries are permitted for each connection. For more information, see Configuring Query Mapping.