Splunk has hundreds of apps that extend its core
capabilities. The Machine Learning Toolkit app delivers new ML-specific search commands,
visualizations, and other tools for performing machine learning on your Splunk data. In addition,
we have ML Assistants that guide you through the process of building custom models and
operationalizing them on the Splunk Platform. In this video, we’ll explore the “Forecast
Time Series” assistant, which helps us forecast future values of a metric using past values.
Forecasting is a type of prediction in which we estimate future values from past
values. Note that most of the input fields and dashboard
panels in this assistant have tooltips, giving us more information about the particular field
or panel. Hover over a panel title or field label to reveal a tooltip.
In this example, we will forecast the trade-weighted index for a currency with the ARIMA
forecasting algorithm. Using the search field, we first load the exchange dataset that is
packaged with the app. After clicking the search button, we get the “Raw Data Preview”
panel below, which shows the exchange rate field over time. The time gap between
consecutive rates is one month.
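Outside of Splunk, the same starting point can be sketched in Python with pandas; the file
name and column names below are only hypothetical stand-ins for the packaged exchange
dataset.

```python
import pandas as pd

# Hypothetical stand-in for the packaged exchange dataset: a CSV with a timestamp column
# and a monthly trade-weighted 'rate' column (both names are assumptions).
df = pd.read_csv("exchange.csv", parse_dates=["_time"])
rate = df.set_index("_time")["rate"].asfreq("MS")  # assumes one value at the start of each month
print(rate.head())
```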
Next, we set the algorithm to ‘ARIMA’ and select ‘rate’ as the field to forecast. After
that, we choose the remaining parameters for the ARIMA algorithm.
ARIMA works on stationary datasets, i.e., time series whose mean and variance are constant
over time. Such datasets do not have trends or varying seasonality. A non-stationary
dataset can be transformed into a stationary one before processing it with ARIMA. If you
have data with seasonality or trends, you should also try the Kalman Filter algorithm
(i.e., the predict command).
The three primary parameters for ARIMA are:
1) Auto-Regressive (AR) or p: This parameter
controls the use of past values in the regression equation. For example, my current month’s
bank balance would depend on my previous month’s bank balance. p is the number of lags (past
values) used in the model.
2) Integrated (I) or d: d controls differencing
a series, which involves simply subtracting its current and previous values d times.
Often, differencing is used to remove trend from a series when the stationarity assumption
is not met (see the short sketch after this list).
3) Moving Average (MA) or q: This component
models the error as a combination of previous error terms. The order q determines
the number of such terms to include.
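As a small illustration of the differencing that d controls, here is a sketch in Python
with pandas; the numbers are made up.

```python
import pandas as pd

# Toy series with a trend (made-up values) to show what differencing does.
rate_toy = pd.Series([1.10, 1.15, 1.21, 1.28, 1.36])

diff1 = rate_toy.diff().dropna()          # d = 1: subtract the previous value once
diff2 = rate_toy.diff().diff().dropna()   # d = 2: difference the differenced series again

print(diff1.round(2).tolist())  # [0.05, 0.06, 0.07, 0.08] - the trend becomes a small increment
print(diff2.round(2).tolist())  # [0.01, 0.01, 0.01] - roughly constant after a second difference
```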
Once we are done selecting the p, d, and q parameters, we set the number of points to be
used for evaluating the forecast, called holdback, and the number of points to be forecast
into the future, called future timespan.
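Conceptually, the same fit-and-forecast step can be sketched in Python with statsmodels,
which provides the same family of ARIMA models; the order, holdback, and future timespan
below are placeholder values, and 'rate' is the monthly series loaded earlier.

```python
from statsmodels.tsa.arima.model import ARIMA

p, d, q = 4, 0, 1       # hypothetical ARIMA order
holdback = 10           # points withheld for evaluating the forecast
future_timespan = 20    # points to forecast beyond the last observation

# Fit on everything except the held-back points.
train = rate.iloc[:-holdback]
model = ARIMA(train, order=(p, d, q)).fit()

# Forecast over the withheld period plus the requested future points,
# with a 95% confidence interval (alpha = 0.05).
forecast = model.get_forecast(steps=holdback + future_timespan)
predicted = forecast.predicted_mean
conf_int = forecast.conf_int(alpha=0.05)
```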
Next, we have the option to select the confidence interval. A common choice is 95%. The
most appropriate confidence interval depends on the use case and on characteristics of the
data, which are beyond the scope of this video.
Clicking the forecast button computes a forecast for the specified field. To evaluate how
well the withheld data has been forecast, the assistant automatically generates a
visualization and computes validation statistics using the withheld data.
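Continuing the sketch above, the validation statistics on the withheld points can be
computed roughly like this (scikit-learn is used here only for convenience).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Compare the withheld actual values with the first `holdback` forecast points.
actual = rate.iloc[-holdback:]
predicted_holdback = predicted.iloc[:holdback]

r2 = r2_score(actual, predicted_holdback)
rmse = np.sqrt(mean_squared_error(actual, predicted_holdback))
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.4f}")
```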
The Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF) charts are
used to select better values for ARIMA’s configuration parameters p, d, and q. To learn
about commonly used parameter combinations for ARIMA, you can read this article. The overall
process of using the ACF and PACF charts to choose the best values of p and q is again
beyond the scope of this video; you can refer to these articles if you want to learn more
about ACF and PACF.
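For reference, similar ACF and PACF charts can be produced in Python with the statsmodels
plotting helpers.

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Significant spikes in the ACF suggest candidate values for q,
# and spikes in the PACF suggest candidate values for p.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(rate, lags=24, ax=axes[0])
plot_pacf(rate, lags=24, ax=axes[1])
plt.show()
```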
For the value of d, you need to determine whether your data is stationary; if it is not,
you can try setting d to 1. If that is not sufficient, increase the value of d further,
although typical values for d are zero or one.
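One common way to check stationarity, shown here as a statsmodels sketch, is the Augmented
Dickey-Fuller test; the 0.05 threshold is just a conventional choice.

```python
from statsmodels.tsa.stattools import adfuller

# A small p-value (e.g., < 0.05) suggests the series is already stationary, so d = 0
# may be enough; otherwise try a single difference (d = 1) and test again.
adf_stat, p_value = adfuller(rate.dropna())[:2]
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")

if p_value >= 0.05:
    adf_stat_d1, p_value_d1 = adfuller(rate.diff().dropna())[:2]
    print(f"After one difference: p-value = {p_value_d1:.3f}")
```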
The Forecast Visualization shows the metric, the forecast, and the confidence envelope. The
plot is divided into three parts: training, withholding, and forecast. The training part of
the visualization shows the actual and predicted values of the rate; in this case, the
predicted values are close to the real values, which is what we want. Next is the withheld
data, where we evaluate the quality of our forecast by comparing the forecast values with
the actual values; these values are also used to compute validation statistics like R^2 and
RMSE. Finally, we have the forecast itself, where there are no actual values to compare
against. The growing uncertainty as we forecast further into the future is reflected in the
widening confidence envelope.
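As a rough equivalent of this visualization, the forecast and confidence envelope from the
earlier sketch could be plotted with matplotlib.

```python
import matplotlib.pyplot as plt

# Actual series, forecast, and confidence envelope; the envelope widens as the
# forecast extends further beyond the observed data.
plt.figure(figsize=(10, 4))
plt.plot(rate.index, rate, label="actual rate")
plt.plot(predicted.index, predicted, label="forecast")
plt.fill_between(conf_int.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1],
                 alpha=0.2, label="95% confidence envelope")
plt.legend()
plt.show()
```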
Similar to the ACF and PACF charts, the ACF and PACF residual charts are computed on the
residuals, i.e., the differences between the forecast and actual values. For a good model,
the ACF and PACF values of the residuals should be close to zero. For more details, consult
the documentation.
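The same kind of residual diagnostics can be sketched by running ACF and PACF on the
residuals of the fitted model from the earlier example.

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# For a good model, the ACF and PACF of the residuals should be close to zero at all lags.
residuals = model.resid
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(residuals, lags=24, ax=axes[0])
plot_pacf(residuals, lags=24, ax=axes[1])
plt.show()
```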
The Forecast Outliers panel shows the number of outliers detected in the forecast, meaning
values that fell outside the confidence bounds. These outliers can also be plotted over time
using the ‘Plot Outliers Over Time’ button.
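As a sketch of what falling outside the confidence bounds means, the withheld points can be
compared against the envelope from the earlier example.

```python
# Count withheld points that fall outside the confidence envelope.
lower = conf_int.iloc[:holdback, 0].to_numpy()
upper = conf_int.iloc[:holdback, 1].to_numpy()
actual_vals = rate.iloc[-holdback:].to_numpy()

outliers = int(((actual_vals < lower) | (actual_vals > upper)).sum())
print(f"{outliers} of {holdback} withheld points fall outside the confidence bounds")
```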
On the “Load Existing Settings” tab, we can compare the different setups we have used for
forecasting, along with their ARIMA parameters and validation statistics such as R-squared
and root mean squared error, and reload the one that worked best.
These results could then be sent to a financial analyst, who can balance spending based on
the forecast rate.
These assistants not only walk you through the steps required to forecast a field, but also
provide a variety of options for reusing and operationalizing the setup. You can view the
SPL commands used at every step, create alerts based on the output, or use the validation
panels on your own dashboards.
To learn more about the Machine Learning Toolkit, including the other Assistants, browse
through our ML videos.