Warehouse-Level Machine Learning
AI-Link can push data science and machine learning workloads down to the warehouse level, allowing users to perform inference and explore data without moving it or incurring heavy computational costs on their own machines. Instead of extracting data from the warehouse and analyzing it locally, a user working in Python can simply call a function that performs all the necessary computation in their data warehouse.
Linear Regression
Linear regression predicts the value of some target feature y via a linear combination of n other predictor features x_1, ..., x_n, i.e.:
y = c_1x_1 + ... + c_nx_n
AI-Link determines the set of constants c_1, ..., c_n for the features x_1, ..., x_n so that users can predict the value of y using just the predictor features.
Below is an example of how one would call this functionality:
```python
from atscale.eda.linear_regression import linear_regression

coefs = linear_regression(
    dbconn=db,
    data_model=data_model,
    predictors=["total_number_of_customers"],
    prediction_target=["total_sales"],
    granularity_levels=["month"]
)
```
In the above example, `db` is a `SQLConnection` object (e.g., a `Snowflake` object) corresponding to the user's data warehouse, and `data_model` is the user's data model. The `predictors` parameter is the list of features in the user's data model serving as inputs to the regression (i.e., x_1, ..., x_n in the model definition above), while `prediction_target` is the feature being estimated (i.e., y). The `granularity_levels` parameter specifies the granularity of the output as well as of the training data passed to the model – in the example above, it specifies that the total customers will be aggregated at the month level in order to forecast total monthly sales. Lastly, the `coefs` output contains the coefficients c_1, ..., c_n referenced above.
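To show how the returned coefficients might then be used, here is a hedged sketch. The exact return type of `linear_regression` isn't documented here, so a plain list of coefficients aligned with the `predictors` list is assumed, with made-up values:

```python
# Hypothetical use of fitted coefficients (made-up values, not real
# linear_regression output). Assumes coefs is a list aligned with the
# predictors list passed in the call above.
coefs = [2.5]         # assumed coefficient for total_number_of_customers
new_values = [400.0]  # e.g., 400 customers in some future month

# Apply the model: y = c_1*x_1 + ... + c_n*x_n
predicted_total_sales = sum(c * x for c, x in zip(coefs, new_values))
```

The key point is that once the warehouse has done the fitting, prediction is just a lightweight dot product the user can run anywhere.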
Principal Components Analysis
Principal components analysis (PCA) is an exploratory data analysis technique for quantifying the interrelation of features in a dataset. For a dataset with n features, PCA returns a collection of principal components – i.e., the vectors x_1, ..., x_n – as well as a collection of corresponding scalar weights w_1, ..., w_n. At a high level, each vector describes a pattern of interrelation among the n features, and each corresponding weight describes how prevalent the pattern is in the data.
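To make the mechanics concrete, here is a minimal local sketch of PCA for two features – not AI-Link's warehouse implementation. The weights are the eigenvalues of the sample covariance matrix and the principal components are its unit eigenvectors, written out in closed form for the 2x2 case:

```python
import math

# Minimal 2-feature PCA sketch (illustrative only, not AI-Link's
# implementation): eigen-decomposition of the 2x2 sample covariance matrix.
def pca_2d(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix (closed form)
    disc = math.sqrt((a - c) ** 2 + 4 * b ** 2)
    w1 = (a + c + disc) / 2  # weight of the dominant pattern
    w2 = (a + c - disc) / 2
    # Unit eigenvector for w1: solve (a - w1)*v1 + b*v2 = 0
    v = (b, w1 - a) if b != 0 else (1.0, 0.0)
    norm = math.hypot(*v)
    pc1 = (v[0] / norm, v[1] / norm)
    return (w1, w2), pc1

# Two perfectly correlated series: a single pattern explains all the variation,
# so the second weight comes out (numerically) zero.
weights, pc1 = pca_2d([1, 2, 3, 4], [2, 4, 6, 8])
```

Here the dominant component points along the direction (1, 2), reflecting that the second series is exactly twice the first.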
Below is an example of how one would call this functionality:
```python
from atscale.eda.pca import pca

pcs, weights = pca(
    dbconn=db,
    data_model=data_model,
    pc_num=2,
    numeric_features=[
        "Stock_A_price",
        "Stock_B_price",
        "Stock_C_price"
    ],
    granularity_levels=["date"]
)
```
In the above example, `db` is a `SQLConnection` object (e.g., a `Snowflake` object) corresponding to the user's data warehouse, and `data_model` is the user's data model. The `pc_num` parameter indicates how many principal components/weights the user wants returned. The `numeric_features` parameter indicates the features making up the dataset – in this case, time series data for three different stocks. The `granularity_levels` parameter specifies the granularity of the dataset; in this case, daily stock prices are provided.
With this output, a user could investigate whether price fluctuations in Stocks A, B, and C are related – and if so, by how much.
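One hedged way to read that output – assuming the returned weights behave like explained variances, which isn't confirmed by the API description above – is to normalize them into shares of total variation (illustrative numbers, not real `pca()` output):

```python
# Illustrative weights for three principal components (made-up numbers,
# not actual pca() output). Normalizing them gives each pattern's share
# of the total variation across the three stock prices.
weights = [8.1, 1.5, 0.4]
total = sum(weights)
shares = [w / total for w in weights]
# A large first share would suggest a single pattern (e.g., all three
# stocks moving together) dominates the price fluctuations.
```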
Summary Statistics
AI-Link also supports warehouse-level computation of familiar summary statistics like variance, standard deviation, covariance, and correlation for features in a user's data model. For instance, a user can find the standard deviation of the `total_sales` feature measured daily as follows:
```python
from atscale.eda.stats import std

stdev = std(
    dbconn=db,
    data_model=data_model,
    feature="total_sales",
    granularity_levels=["date"]
)
```
In the above example, `db` is a `SQLConnection` object (e.g., a `Snowflake` object) corresponding to the user's data warehouse, and `data_model` is the user's data model.
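For intuition, the warehouse-side computation corresponds to an ordinary standard deviation over the daily values. A local sketch on made-up numbers (whether AI-Link computes the sample or population statistic isn't specified here, so the sample version is assumed):

```python
import statistics

# Made-up daily total_sales values (illustrative only).
daily_sales = [120.0, 135.0, 128.0, 150.0, 117.0]

# Sample standard deviation: the local analogue of the warehouse-level
# std() call, without any data ever leaving the warehouse in AI-Link's case.
stdev_local = statistics.stdev(daily_sales)
```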