Forest-based Forecast (Space Time Pattern Mining)

Summary

Forecasts the values of each location of a space-time cube using an adaptation of the random forest algorithm, which is a supervised machine learning method developed by Leo Breiman and Adele Cutler. The forest regression model is trained using time windows on each location of the space-time cube.

Learn more about how Forest-based Forecast works

Illustration

A forecast time series using the Forest-based Forecast tool is shown.

Usage

  • This tool accepts netCDF files created by the Create Space Time Cube By Aggregating Points, Create Space Time Cube From Defined Locations, Create Space Time Cube From Multidimensional Raster Layer, and Subset Space Time Cube tools.

  • Compared to other forecasting tools in the Time Series Forecasting toolset, this tool is the most complex but includes the fewest assumptions about the data. It is recommended for time series with complicated shapes and trends that are difficult to model with simple mathematical functions or when the assumptions of other methods are not satisfied. It is also recommended if your space-time cube has other variables that are related to the variable being forecast. These variables can be included as explanatory variables to improve the forecast.

    Additionally, this tool is the only forecasting tool that allows you to build models at varying geographic scales. Rather than building an independent forecast model at each location of the space-time cube, this tool allows you to build a single global forecast model that uses each location as training data. If there are time series clustering results for any variable of the input space-time cube, you can also build a different forecast model for each cluster.

  • The Model Scale parameter can be used to specify the scale at which the forest-based models are estimated. The parameter has the following three options:

    • Individual location—A different model will be independently estimated for each location of the space-time cube. This is the default.
    • Entire cube—A single model will be estimated using all locations as training data. The shared model will be used to forecast future values at every location.
    • Time series cluster—A different model will be independently estimated for each cluster of a time series clustering result. Provide the variable with time series clustering results in the Cluster Variable parameter. You must use the Time Series Clustering tool on the variable. You can use any variable with time series clustering results, including the analysis variable.

    Learn more about estimating models at different scales

  • Multiple forecasted space-time cubes can be compared and merged using the Evaluate Forecasts By Location tool. This allows you to create multiple forecast cubes using different forecasting tools and parameters, and the tool will identify the best forecast for each location using either Forecast root mean square error (RMSE) or Validation RMSE.

  • While predicting future values, the tool builds two models that serve different purposes.

    • Forecast model—This model is used to forecast values of the space-time cube by building a forest using the values of the time series and using this forest to forecast the values of future time steps. The fit of the forecast model to the values of the space-time cube is measured by the Forecast RMSE value.
    • Validation model—This model is used to validate the forecast model and test how accurately it can forecast values. If a number greater than 0 is specified for the Number of Time Steps to Exclude for Validation parameter, this model is built using the time steps that were included and is used to forecast the values of the time steps that were excluded. This allows you to see how well the model can forecast values. The fit of the forecasted values to the excluded values is measured by the Validation RMSE value.

    Learn more about the forecast model, validation model, and RMSE statistics

  • The Output Features parameter values will be added to the Contents pane with rendering based on the final forecasted time step.

  • This tool creates geoprocessing messages and pop-up charts you can use to understand and visualize the forecast results. The messages contain information about the structure of the space-time cube and summary statistics of the RMSE values and season lengths. Click a feature using the Explore navigation tool to display a line chart in the Pop-up pane showing the values of the space-time cube, fitted forest values, forecasted values, and confidence bounds for that location.

  • You can include explanatory variables to improve the forecasts using the Other Variables parameter. If any other variables are provided, the forecast model is a multivariate forest-based forecast. Each explanatory variable is converted to time-lagged factors within each time window that are used to train the forest model. This allows you to estimate any lagged (delayed) effect between the explanatory variables and the analysis variable. For example, a rise in the number of hospitalizations during a pandemic may predict the number of deaths 14 days later, whereas the number of hospitalizations may poorly predict the number of deaths in the next 3 days. The number of time lags is equal to the value of the Time Step Window parameter, so the time window must be wider than any lagged effect you want to capture.

    The Output Importance Table parameter creates a table displaying the most important factors at each location and includes a Time Lag Importance bar chart that displays counts of the most important factors across all locations, sorted by time lag within the time window. This allows you to see which variables were important in predicting the value of the analysis variable, as well as to visualize the associated lag when the factor was most important. For example, if the number of hospitalizations is associated with the number of deaths 14 days later, the time step window should be at least 14 days, and a large count of hospitalizations should be observed approximately 14 days before the end of the time window.

    The number of factors deemed important at each location depends on the Importance Threshold parameter value. For example, if a value of 15 is used, the top 15 percent of factors for each location will be included in the table and chart.

  • The Outlier Option parameter can be used to detect statistically significant outliers in the time series values at each location.

    Learn more about detecting time series outliers

  • If you choose the Identify outliers option for the Outlier Option parameter, it is recommended that you provide a value for the Time Step Window parameter rather than leaving the parameter empty and estimating a different time step window at each location. For each location, the forest model uses the time steps in the first time step window to train the forecast model, and outliers are only detected for the remaining time steps. If different locations exclude different numbers of time steps for training, summary statistics such as the mean, minimum, and maximum number of outliers per time step or per location may be misleading.

  • If any explanatory variables are included in the Other Variables parameter, or if the Entire cube or Time series cluster option is specified for the Model Scale parameter, only the Build model by value option is available for the Forecast Approach parameter. Additionally, the processing time will increase when using any of these options.

  • Deciding how many time steps to exclude for validation is an important choice. The more time steps are excluded, the fewer time steps remain to estimate the validation model. However, if too few time steps are excluded, the Validation RMSE will be estimated using a small amount of data and may be misleading. It is recommended that you exclude as many time steps as possible while maintaining sufficient time steps to estimate the validation model. It is also suggested that you withhold at least as many time steps for validation as the number of time steps you intend to forecast, if your space-time cube has enough time steps to allow this.
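
The trade-off above can be illustrated with a short sketch. It is not the tool's implementation: the forecaster is a hypothetical stand-in that repeats the last observed value, and the helper names (holdout_split, naive_forecast, rmse) are invented for illustration. It only shows how withholding the final time steps and scoring the forecast against them produces a validation RMSE.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def holdout_split(series, n_exclude):
    """Split a series into the steps kept for fitting and the final steps withheld."""
    if not 0 < n_exclude < len(series):
        raise ValueError("n_exclude must be between 1 and len(series) - 1")
    cut = len(series) - n_exclude
    return series[:cut], series[cut:]


def naive_forecast(train, n_steps):
    """Hypothetical stand-in for a real model: repeat the last observed value."""
    return [train[-1]] * n_steps


# Ten observed time steps at one location (made-up values).
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 14.0, 16.0]

# Withhold the final two time steps, forecast them, and score the result.
train, held_out = holdout_split(series, 2)
predicted = naive_forecast(train, len(held_out))
validation_rmse = rmse(held_out, predicted)
```

Withholding more time steps bases the RMSE on more comparisons but leaves fewer values in train, which is the balance the recommendation above asks you to strike.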

Parameters

Label | Explanation | Data Type
Input Space Time Cube

The netCDF cube containing the variable to forecast to future time steps. This file must have an .nc file extension and must have been created using the Create Space Time Cube By Aggregating Points, Create Space Time Cube From Defined Locations, or Create Space Time Cube From Multidimensional Raster Layer tool.

File
Analysis Variable

The numeric variable in the netCDF file that will be forecasted to future time steps.

String
Output Features

The output feature class of all locations in the space-time cube with forecasted values stored as fields. The layer displays the forecast for the final time step and contains pop-up charts showing the time series, forecasts, and 90 percent confidence bounds for each location.

Feature Class
Output Space Time Cube
(Optional)

A new space-time cube (.nc file) containing the values of the input space-time cube with the forecasted time steps appended. The Visualize Space Time Cube in 3D tool can be used to see all of the observed and forecasted values simultaneously.

File
Number of Time Steps to Forecast
(Optional)

A positive integer specifying the number of time steps to forecast. This value cannot be larger than 50 percent of the total time steps in the input space-time cube. The default value is one time step.

Long
Time Step Window
(Optional)

The number of previous time steps that will be used when training the model. If the data displays seasonality (repeating cycles), provide the number of time steps corresponding to one season. This value cannot be larger than one-third of the number of time steps in the input space-time cube. When using individual location model scale, if no value is provided, a time window is estimated for each location using a spectral density function. When using entire cube or time series cluster model scales, if no value is provided, one-fourth of the number of time steps will be used.

Learn more about seasonality and choosing a time window

Long
Number of Time Steps to Exclude for Validation
(Optional)

The number of time steps at the end of each time series to exclude for validation. The default value is 10 percent (rounded down) of the number of input time steps, and this value cannot be larger than 25 percent of the number of time steps. Provide the value 0 to not exclude any time steps.

Long
Number of Trees
(Optional)

The number of trees that will be created in the forest model. More trees generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100, and the value must be at least 1 and not greater than 1,000.

Long
Minimum Leaf Size
(Optional)

The minimum number of observations that are required to keep a leaf (that is, the terminal node on a tree without further splits). For very large data, increasing this number will decrease the run time of the tool.

Long
Maximum Tree Depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chance of overfitting the model. If no value is provided, a value will be identified by the tool based on the number of trees created by the model and the size of the time step window.

Long
Percentage of Training Available per Tree (%)
(Optional)

The percent of training data that will be used to fit the forecast model. The training data consists of associated explanatory and dependent variables constructed using time windows. All remaining training data will be used to optimize the parameters of the forecast model. The default is 100 percent.

Long
Forecast Approach
(Optional)

Specifies how the explanatory and dependent variables will be represented when training the forest model at each location.

To train the forest model that will be used to forecast, sets of explanatory and dependent variables must be created using time windows. Use this parameter to specify whether these variables will be linearly detrended and whether the dependent variable will be represented by its raw value or by the residual of a linear regression model. This linear regression model uses all time steps within a time window as explanatory variables and uses the following time step as the dependent variable. The residual is calculated by subtracting the predicted value based on linear regression from the raw value of the dependent variable.

If any variables are provided in the Other Variables parameter or if Entire cube or Time series cluster is specified for the Model Scale parameter, the Build model by value option will be the only available forecast approach.

  • Build model by value: Values within the time window will not be detrended, and the dependent variable will be represented by its raw value. If any other variables are provided or if the model scale is not individual location, this will be the only available forecast approach and will be the default.
  • Build model by value after detrending: Values within the time window will be linearly detrended, and the dependent variable will be represented by its detrended value. This is the default.
  • Build model by residual: Values within the time window will not be detrended, and the dependent variable will be represented by the residual of a linear regression model using the values within the time window as explanatory variables.
  • Build model by residual after detrending: Values within the time window will be linearly detrended, and the dependent variable will be represented by the residual of a linear regression model using the detrended values within the time window as explanatory variables.
String
Outlier Option
(Optional)

Specifies whether statistically significant time series outliers will be identified.

  • None: Outliers will not be identified. This is the default.
  • Identify outliers: Outliers will be identified using the Generalized ESD test.
String
Level of Confidence
(Optional)

Specifies the confidence level for the test for time series outliers.

  • 90%: The confidence level for the test is 90 percent. This is the default.
  • 95%: The confidence level for the test is 95 percent.
  • 99%: The confidence level for the test is 99 percent.
String
Maximum Number of Outliers
(Optional)

The maximum number of time steps that can be declared outliers for each location. The default value corresponds to 5 percent (rounded down) of the number of time steps of the input space-time cube (a value of at least 1 will always be used). This value cannot exceed 20 percent of the number of time steps.

Long
Other Variables
(Optional)

Other variables of the input space-time cube that will be used as explanatory variables to improve the forecasts.

String
Importance Threshold (%)
(Optional)

The percent of factors deemed most important for forecasting the analysis variable. For example, if the value is 20, the top 20 percent of factors for each location will be included in the importance table. Each variable (the analysis variable and each explanatory variable) is represented as a factor once for each time step in the time step window, so the number of factors at each location is the length of the time window multiplied by the number of variables. The number of factors is multiplied by the importance threshold to determine the number of important factors for each forecast model. The default is 10, and the value must be an integer between 1 and 100.

Long
Output Importance Table
(Optional)

The output table that will contain the most important factors at each location. For individual location model scale, each important factor at each location of the space-time cube will be represented as a row in the table with fields containing the name of the variable and associated time lag. For entire cube and time series cluster model scales, all important factors in the entire cube or cluster model will be represented by a row. The table will include a chart displaying the most important factors across all locations separated by time lag. The chart allows you to visualize lagged effects between the explanatory variables and the variable being forecasted.

Table
Model Scale
(Optional)

Specifies the scale that will be used to estimate the forecast and validation models.

  • Individual location: A different forecast model and validation model will be estimated for each location. This is the default.
  • Entire cube: A single forecast model and validation model will be estimated using all locations as training data.
  • Time series cluster: A forecast and validation model will be estimated for each cluster of a time series clustering result. Provide the variable with time series clustering results in the Cluster Variable parameter.
String
Cluster Variable
(Optional)

The variable that will be used to group the locations of the space-time cube into regions, and different forecast and validation models will be estimated for each region. The variable must have time series clustering results to be used. The cluster variable can be any variable of the space-time cube including the analysis variable.

String
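
The Time Step Window and Importance Threshold descriptions above can be made concrete with a small sketch. It builds time-lagged factors from plain Python lists and counts how many factors the threshold would keep. The helper names are invented for illustration, and the rounding in important_factor_count (round down, keep at least one) is an assumption; the tool's internal construction and rounding are not described here.

```python
def build_window_samples(variables, target, window):
    """Turn time series into (factors, next value) training pairs.

    variables: dict mapping a variable name to its list of values
    target: name of the analysis variable (must be a key of variables)
    window: number of previous time steps used as factors
    """
    n_steps = len(variables[target])
    if any(len(values) != n_steps for values in variables.values()):
        raise ValueError("all variables must have the same number of time steps")
    if not 1 <= window < n_steps:
        raise ValueError("window must be between 1 and n_steps - 1")

    samples = []
    for end in range(window, n_steps):
        factors = {}
        for name, values in variables.items():
            # One factor per variable per lag: lag 1 is the step just before `end`.
            for lag in range(1, window + 1):
                factors[(name, lag)] = values[end - lag]
        samples.append((factors, variables[target][end]))
    return samples


def important_factor_count(window, n_variables, threshold_percent):
    """Number of factors kept for a given threshold (rounding rule is assumed)."""
    total_factors = window * n_variables
    return max(1, total_factors * threshold_percent // 100)


# Made-up values: the analysis variable plus one explanatory variable.
variables = {
    "deaths": [1, 2, 3, 4, 5],
    "hospitalizations": [10, 20, 30, 40, 50],
}
samples = build_window_samples(variables, "deaths", window=2)
first_factors, first_value = samples[0]

# A 14-step window with 2 variables gives 28 factors; a threshold of 10 keeps 2.
kept = important_factor_count(window=14, n_variables=2, threshold_percent=10)
```

Because each sample holds one factor per variable per lag, a lagged effect longer than the window never appears among the factors, which is why the window must be at least as wide as the lag you want to capture.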

arcpy.stpm.ForestBasedForecast(in_cube, analysis_variable, output_features, {output_cube}, {number_of_time_steps_to_forecast}, {time_window}, {number_for_validation}, {number_of_trees}, {minimum_leaf_size}, {maximum_depth}, {sample_size}, {forecast_approach}, {outlier_option}, {level_of_confidence}, {maximum_number_of_outliers}, {other_variables}, {importance_threshold}, {output_importance_table}, {model_scale}, {cluster_variable})
Name | Explanation | Data Type
in_cube

The netCDF cube containing the variable to forecast to future time steps. This file must have an .nc file extension and must have been created using the Create Space Time Cube By Aggregating Points, Create Space Time Cube From Defined Locations, or Create Space Time Cube From Multidimensional Raster Layer tool.

File
analysis_variable

The numeric variable in the netCDF file that will be forecasted to future time steps.

String
output_features

The output feature class of all locations in the space-time cube with forecasted values stored as fields. The layer displays the forecast for the final time step and contains pop-up charts showing the time series, forecasts, and 90 percent confidence bounds for each location.

Feature Class
output_cube
(Optional)

A new space-time cube (.nc file) containing the values of the input space-time cube with the forecasted time steps appended. The Visualize Space Time Cube in 3D tool can be used to see all of the observed and forecasted values simultaneously.

File
number_of_time_steps_to_forecast
(Optional)

A positive integer specifying the number of time steps to forecast. This value cannot be larger than 50 percent of the total time steps in the input space-time cube. The default value is one time step.

Long
time_window
(Optional)

The number of previous time steps that will be used when training the model. If the data displays seasonality (repeating cycles), provide the number of time steps corresponding to one season. This value cannot be larger than one-third of the number of time steps in the input space-time cube. When using individual location model scale, if no value is provided, a time window is estimated for each location using a spectral density function. When using entire cube or time series cluster model scales, if no value is provided, one-fourth of the number of time steps will be used.

Long
number_for_validation
(Optional)

The number of time steps at the end of each time series to exclude for validation. The default value is 10 percent (rounded down) of the number of input time steps, and this value cannot be larger than 25 percent of the number of time steps. Provide the value 0 to not exclude any time steps.

Long
number_of_trees
(Optional)

The number of trees that will be created in the forest model. More trees generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100, and the value must be at least 1 and not greater than 1,000.

Long
minimum_leaf_size
(Optional)

The minimum number of observations that are required to keep a leaf (that is, the terminal node on a tree without further splits). For very large data, increasing this number will decrease the run time of the tool.

Long
maximum_depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chance of overfitting the model. If no value is provided, a value will be identified by the tool based on the number of trees created by the model and the size of the time step window.

Long
sample_size
(Optional)

The percent of training data that will be used to fit the forecast model. The training data consists of associated explanatory and dependent variables constructed using time windows. All remaining training data will be used to optimize the parameters of the forecast model. The default is 100 percent.

Learn more about training the forest forecast model

Long
forecast_approach
(Optional)

Specifies how the explanatory and dependent variables will be represented when training the forest model at each location.

To train the forest model that will be used to forecast, sets of explanatory and dependent variables must be created using time windows. Use this parameter to specify whether these variables will be linearly detrended and whether the dependent variable will be represented by its raw value or by the residual of a linear regression model. This linear regression model uses all time steps within a time window as explanatory variables and uses the following time step as the dependent variable. The residual is calculated by subtracting the predicted value based on linear regression from the raw value of the dependent variable.

If any variables are provided in the other_variables parameter or if ENTIRE_CUBE or TIME_SERIES_CLUSTER is specified for the model_scale parameter, the VALUE option will be the only available forecast approach.

Learn more about the Forecast Approach parameter

  • VALUE: Values within the time window will not be detrended, and the dependent variable will be represented by its raw value. If any other variables are provided or if the model scale is not individual location, this will be the only available forecast approach and will be the default.
  • VALUE_DETREND: Values within the time window will be linearly detrended, and the dependent variable will be represented by its detrended value. This is the default.
  • RESIDUAL: Values within the time window will not be detrended, and the dependent variable will be represented by the residual of a linear regression model using the values within the time window as explanatory variables.
  • RESIDUAL_DETREND: Values within the time window will be linearly detrended, and the dependent variable will be represented by the residual of a linear regression model using the detrended values within the time window as explanatory variables.
String
outlier_option
(Optional)

Specifies whether statistically significant time series outliers will be identified.

  • NONE: Outliers will not be identified. This is the default.
  • IDENTIFY: Outliers will be identified using the Generalized ESD test.
String
level_of_confidence
(Optional)

Specifies the confidence level for the test for time series outliers.

  • 90%: The confidence level for the test is 90 percent. This is the default.
  • 95%: The confidence level for the test is 95 percent.
  • 99%: The confidence level for the test is 99 percent.
String
maximum_number_of_outliers
(Optional)

The maximum number of time steps that can be declared outliers for each location. The default value corresponds to 5 percent (rounded down) of the number of time steps of the input space-time cube (a value of at least 1 will always be used). This value cannot exceed 20 percent of the number of time steps.

Long
other_variables
[other_variables,...]
(Optional)

Other variables of the input space-time cube that will be used as explanatory variables to improve the forecasts.

String
importance_threshold
(Optional)

The percent of factors deemed most important for forecasting the analysis variable. For example, if the value is 20, the top 20 percent of factors for each location will be included in the importance table. Each variable (the analysis variable and each explanatory variable) is represented as a factor once for each time step in the time step window, so the number of factors at each location is the length of the time window multiplied by the number of variables. The number of factors is multiplied by the importance threshold to determine the number of important factors for each forecast model. The default is 10, and the value must be an integer between 1 and 100.

Long
output_importance_table
(Optional)

The output table that will contain the most important factors at each location. For individual location model scale, each important factor at each location of the space-time cube will be represented as a row in the table with fields containing the name of the variable and associated time lag. For entire cube and time series cluster model scales, all important factors in the entire cube or cluster model will be represented by a row. The table will include a chart displaying the most important factors across all locations separated by time lag. The chart allows you to visualize lagged effects between the explanatory variables and the variable being forecasted.

Table
model_scale
(Optional)

Specifies the scale that will be used to estimate the forecast and validation models.

  • INDIVIDUAL_LOCATION: A different forecast model and validation model will be estimated for each location. This is the default.
  • ENTIRE_CUBE: A single forecast model and validation model will be estimated using all locations as training data.
  • TIME_SERIES_CLUSTER: A forecast and validation model will be estimated for each cluster of a time series clustering result. Provide the variable with time series clustering results in the cluster_variable parameter.
String
cluster_variable
(Optional)

The variable that will be used to group the locations of the space-time cube into regions, and different forecast and validation models will be estimated for each region. The variable must have time series clustering results to be used. The cluster variable can be any variable of the space-time cube including the analysis variable.

String
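
Several parameters above are bounded by the number of time steps in the input cube. As a rough aid, the following sketch collects those documented limits and defaults in one place and flags values that exceed them before the tool is run. The function names are hypothetical, and rounding each percentage-based limit down is an assumption; the descriptions only state rounding for the two defaults.

```python
def parameter_limits(n_time_steps):
    """Limits and defaults from the parameter descriptions for a given cube length.

    Integer division rounds down; for the maximums this is an assumption.
    """
    return {
        "max_forecast_steps": n_time_steps // 2,        # 50 percent
        "max_time_window": n_time_steps // 3,           # one-third
        "default_validation_steps": n_time_steps // 10,  # 10 percent, rounded down
        "max_validation_steps": n_time_steps // 4,      # 25 percent
        "default_max_outliers": max(1, n_time_steps // 20),  # 5 percent, at least 1
        "max_outliers": n_time_steps // 5,              # 20 percent
    }


def find_problems(n_time_steps, number_of_time_steps_to_forecast=1,
                  time_window=None, number_for_validation=None,
                  number_of_trees=100, maximum_number_of_outliers=None):
    """Return the names of parameters whose values fall outside the stated limits."""
    limits = parameter_limits(n_time_steps)
    problems = []
    if not 1 <= number_of_time_steps_to_forecast <= limits["max_forecast_steps"]:
        problems.append("number_of_time_steps_to_forecast")
    if time_window is not None and time_window > limits["max_time_window"]:
        problems.append("time_window")
    if number_for_validation is not None and not (
            0 <= number_for_validation <= limits["max_validation_steps"]):
        problems.append("number_for_validation")
    if not 1 <= number_of_trees <= 1000:
        problems.append("number_of_trees")
    if (maximum_number_of_outliers is not None
            and maximum_number_of_outliers > limits["max_outliers"]):
        problems.append("maximum_number_of_outliers")
    return problems


# The same settings checked against a 60-step cube and a 12-step cube.
limits_60 = parameter_limits(60)
problems_long_cube = find_problems(60, 4, 3, 5, 100, 4)
problems_short_cube = find_problems(12, 4, 3, 5, 100, 4)
```

Settings that are valid for a long cube can fail on a short one: with 12 time steps, withholding 5 steps and allowing 4 outliers both exceed the limits computed here.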

Code sample

ForestBasedForecast example 1 (Python window)

The following Python script demonstrates how to use the ForestBasedForecast function.

# Forecast four time steps using a random forest with detrending.
arcpy.stpm.ForestBasedForecast("CarTheft.nc", "Cars_NONE_ZEROS",
           "Analysis.gdb/Forecasts", "outForecastCube.nc", 4, 3,
           5, 100, "", "", 100, "VALUE_DETREND", "", "", "", "",
           "", "", "INDIVIDUAL_LOCATION")

ForestBasedForecast example 2 (stand-alone script)

The following Python script demonstrates how to use the ForestBasedForecast function to forecast counts of car theft.

# Forecast car theft counts using a random forest.

# Import system modules.
import arcpy

# Set property to overwrite existing output, by default.
arcpy.env.overwriteOutput = True

# Set workspace.
workspace = r"C:\Analysis"
arcpy.env.workspace = workspace

# Forecast four time steps using a random forest built on residuals,
# and identify outliers at the 90 percent confidence level.
arcpy.stpm.ForestBasedForecast("CarTheft.nc", "Cars_NONE_ZEROS", "Analysis.gdb/Forecasts",
           "outForecastCube.nc", 4, 3, 5, 100, "", "", 100, "RESIDUAL", "IDENTIFY",
           "90%", 4, None, 10, None, "INDIVIDUAL_LOCATION")

# Create a feature class visualizing the forecasts.
arcpy.stpm.VisualizeSpaceTimeCube3D("outForecastCube.nc", "Cars_NONE_ZEROS", "VALUE",
           "Analysis.gdb/ForecastsFC")

ForestBasedForecast example 3 (stand-alone script)

The following Python script demonstrates how to use the ForestBasedForecast function to forecast PM2.5 using other variables to improve the forecast.

import arcpy
arcpy.env.workspace = "C:/Analysis"

# Forecast twelve time steps using a random forest.
# Use the entire cube model scale and multiple other variables.
# Create an importance table with the top 10 percent of most important factors.
arcpy.stpm.ForestBasedForecast("air_quality_cities.nc", "PM25",
           "Analysis.gdb/PM25_forecast", "PM25_forecast_cube.nc", 12, None,
           30, 100, None, None, 100, "VALUE", "NONE", "90%", 15,
           "CO;HUMIDITY;O3;PRESSURE;TEMPERATURE;WINDSPEED", 10,
           "Analysis.gdb/pm25_importance", "ENTIRE_CUBE")

ForestBasedForecast example 4 (stand-alone script)

The following Python script demonstrates how to use the ForestBasedForecast function to forecast county populations using clusters of counties with similar populations.

import arcpy
arcpy.env.workspace = "C:/Analysis"

# Run time series clustering to cluster counties by population value.
arcpy.stpm.TimeSeriesClustering("USA_County_Population_1969_2019.nc", 
           "POPULATION_SUM_ZEROS",
           "Analysis.gdb/USA_County_Population_TimeSeriesClustering",
           "VALUE", None, None, None, "CREATE_POPUP")

# Run forest-based forecast models on each time series cluster
arcpy.stpm.ForestBasedForecast("USA_County_Population_1969_2019.nc",
          "POPULATION_SUM_ZEROS", 
          "Analysis.gdb/USA_County_Population_ForestBasedForecast", 
          "USA_County_Population_ForestBasedForecast_cube.nc", 20, 
          None, 3, 100, None, None, 100, "VALUE", "NONE", "90%", 1,
          None, 10, None, "TIME_SERIES_CLUSTER", "POPULATION_SUM_ZEROS")
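
The samples above pass every argument by position, which requires placeholders for skipped optional parameters. As an alternative sketch, the arguments can be collected in a dictionary keyed by the parameter names from the syntax section and passed as keyword arguments. The helper names are hypothetical, the values repeat those of example 3, and it is assumed that omitting an optional parameter is equivalent to passing None. The arcpy import is placed inside run_forecast so the sketch can be loaded outside an ArcGIS Pro environment.

```python
def build_forecast_arguments():
    """Keyword arguments mirroring example 3; skipped options keep their defaults."""
    return {
        "in_cube": "air_quality_cities.nc",
        "analysis_variable": "PM25",
        "output_features": "Analysis.gdb/PM25_forecast",
        "output_cube": "PM25_forecast_cube.nc",
        "number_of_time_steps_to_forecast": 12,
        "number_for_validation": 30,
        "number_of_trees": 100,
        "sample_size": 100,
        "forecast_approach": "VALUE",
        "other_variables": "CO;HUMIDITY;O3;PRESSURE;TEMPERATURE;WINDSPEED",
        "importance_threshold": 10,
        "output_importance_table": "Analysis.gdb/pm25_importance",
        "model_scale": "ENTIRE_CUBE",
    }


def run_forecast(arguments):
    """Run the tool with keyword arguments; requires an ArcGIS Pro Python environment."""
    import arcpy  # imported here so the module loads without ArcGIS Pro
    return arcpy.stpm.ForestBasedForecast(**arguments)


arguments = build_forecast_arguments()
# run_forecast(arguments)  # uncomment when running inside ArcGIS Pro
```

Naming each argument makes it easier to see which optional parameters are set and avoids miscounting positions in a 20-parameter call.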

Licensing information

  • Basic: Yes
  • Standard: Yes
  • Advanced: Yes

Related topics