Forest-based Classification and Regression (Spatial Statistics)

Summary

Creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, which is a supervised machine learning method. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables can take the form of fields in the attribute table of the training features, raster datasets, and distance features used to calculate proximity values for use as additional variables. In addition to validation of model performance based on the training data, predictions can be made to either features or a prediction raster.

Learn more about how Forest-based Classification and Regression works

Illustration

Forest-based Classification and Regression tool illustration

Usage

  • This tool creates hundreds of trees, called an ensemble of decision trees, that are used to create a model that can then be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important, as individual trees may have issues with overfitting a model; however, combining multiple trees in a forest for prediction addresses the overfitting problem associated with a single tree.

  • This tool can be used in three different operation modes. The Train option can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Predict to features or Predict to raster option. This is a data-driven tool and performs best on large datasets. The tool should be trained on at least several hundred features for best results. It is not an appropriate tool for very small datasets.

  • The Input Training Features parameter value can be points or polygons. This tool does not work with multipart data.

  • An ArcGIS Spatial Analyst extension license is required to use rasters as explanatory variables or to predict to an Output Prediction Surface value.

  • This tool produces a variety of outputs. Output Trained Features will contain all of the Input Training Features values used in the model created as well as all of the explanatory variables used in the model (including input fields used, distances calculated, and raster values extracted or calculated). It will also contain predictions for all of the features used for training the model, which can be helpful in assessing the performance of the model created. When using this tool for prediction, it will produce either a new feature class containing the Output Predicted Features values or a new Output Prediction Surface value if explanatory rasters are provided.

  • When using the Predict to features option, a new feature class containing the Output Predicted Features values will be created. When the Predict to Raster option is used, a new Output Prediction Surface value will be created.

  • This tool also creates messages and charts to help you understand the performance of the model created. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Forest-based Classification and Regression tool through the Geoprocessing history. The messages include information on the model characteristics, out-of-bag errors, variable importance, and validation diagnostics.

    You can use the Output Variable Importance Table parameter to create a table to display a chart of variable importance for evaluation. The top 20 variable importance values are also reported in the messages window. The chart can be accessed directly below the layer in the Contents pane.

  • Explanatory variables can come from fields or be calculated from distance features or extracted from rasters. You can use any combination of these explanatory variable types, but at least one type is required. The explanatory variables (from fields, distance features, or rasters) used should contain a variety of values. If the explanatory variable is categorical, check the Categorical check box (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.

  • Distance features are used to automatically create explanatory variables representing a distance from the provided features to the Input Training Features values. Distances will be calculated from each of the input Explanatory Training Distance Features values to the nearest Input Training Features value. If the input Explanatory Training Distance Features values are polygons or lines, the distance attributes will be calculated as the distance between the closest segments of the pair of features. However, distances are calculated differently for polygons and lines. See How proximity tools calculate distance for details.

  • If the Input Training Features values are points and you are using the Explanatory Training Rasters parameter, the tool drills down to extract explanatory variables at each point location. For multiband rasters, only the first band is used.

  • Although you can have multiple layers with the same name in the Contents pane, the tool does not accept explanatory layers with the same name or remove duplicate layer names in the drop-down lists. To avoid this issue, ensure that each layer has a unique name.

  • If the Input Training Features values are polygons, the Variable to Predict parameter value is categorical, and you are using Explanatory Training Rasters values exclusively, the Convert Polygons to Raster Resolution for Training parameter will be available. If you check this parameter, the polygon will be divided into points at the centroid of each raster cell whose centroid falls within the polygon. The raster values at each point location are then extracted and used to train the model. A bilinear sampling method is used for numeric variables, and the nearest method is used for categorical variables. The default cell size of the converted polygons will be the maximum cell size of input rasters. However, you can change this using the Cell Size environment setting. If this parameter is not checked, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters.

    Polygons are converted to raster resolution (left) or assigned an average value (right).

  • There must be variation in the data used for each explanatory variable specified. If you receive an error that there is no variation in one of the fields or rasters specified, you can try running the tool again, marking that variable as categorical. If 95 percent of the features have the same value for a particular variable, that variable is flagged as having no variation.

  • The Compensate for Sparse Categories parameter can be used if the variation in the categories is unbalanced. For example, if you have some categories that occur hundreds of times in the dataset and a few that occur significantly less often, checking this parameter will ensure that each category is represented in each tree to create balanced models.

  • When matching explanatory variables, the Prediction and Training fields must be of the same type (a double field in Training must be matched to a double field in Prediction).

  • Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. When predicting a value based on explanatory variables much higher or lower than the range of the original training dataset, the model will estimate the value to be around the highest or lowest value in the original dataset. This tool may perform poorly when trying to predict with explanatory variables that are out of range of the explanatory variables used to train the model.

  • The tool will fail if categories exist in the prediction explanatory variables that are not present in the training features.

  • To use mosaic datasets as explanatory variables, use the Make Mosaic Layer tool first and copy the full path to the layer into the tool or use the Make Mosaic Layer tool and the Make Raster Layer tool to adjust the processing template for the mosaic dataset.

  • The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will result in more accurate model prediction, but the model will take longer to calculate.

  • When the Calculate Uncertainty parameter is checked, the tool will calculate a 90 percent prediction interval around each predicted value of the Variable to Predict value. When Prediction Type is Train only or Predict to features, two fields will be added to either the Output Trained Features value or the Output Predicted Features value. These fields, ending with _P05 and _P95, represent the lower and upper bounds of the prediction interval, respectively. Given the same explanatory variables, you can predict with 90 percent confidence that the value of a new observation will fall within the interval. When the Predict to raster option is used, two additional rasters representing the lower and upper bounds of the prediction interval will be added to the Contents pane.

  • For performance reasons, the Explanatory Training Distance Features parameter is not available when using the Prediction Type parameter's Predict to raster option. To include distances to features as explanatory variables, calculate distance rasters using the Distance Accumulation tool, and include the distance rasters in the Explanatory Training Rasters parameter.

  • This tool supports parallel processing for prediction and uses 50 percent of available processors by default. The number of processors can be increased or decreased using the Parallel Processing Factor environment.

  • To learn more about how this tool works and understand the output messages and charts, see How Forest-based Classification and Regression works.

    References:

    • Breiman, Leo. Out-Of-Bag Estimation. 1996.
    • Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
    • Breiman, Leo. "Random Forests". Machine Learning. 45 (1): 5-32. doi:10.1023/A:1010933404324. 2001.
    • Breiman, L., J.H. Friedman, R.A. Olshen, C.J. Stone. Classification and regression trees. New York: Routledge. Chapter 4. 2017.
    • Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer, Berlin, Heidelberg.
    • Gini, C. (1912). Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini, T). Rome: Libreria Eredi Virgilio Veschi.
    • Grömping, U. (2009). Variable importance assessment in regression: linear regression versus random forest. The American Statistician, 63(4), 308-319.
    • Ho, T. K. (1995, August). Random decision forests. In Document analysis and recognition, 1995., proceedings of the third international conference on Document Analysis and Recognition. (Vol. 1, pp. 278-282). IEEE.
    • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
    • LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91(436), 1641-1650.
    • Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica sinica, 815-840.
    • Meinshausen, Nicolai. "Quantile regression forests." Journal of Machine Learning Research 7. Jun (2006): 983-999.
    • Nadeau, C., & Bengio, Y. (2000). Inference for the generalization error. In Advances in neural information processing systems (pp. 307-313).
    • Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1), 307.
    • Zhou, Z. H. (2012). Ensemble methods: foundations and algorithms. CRC press.
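
Several of the usage notes above state rules that can be checked before running the tool: a variable in which 95 percent of the features share one value is flagged as having no variation, categorical explanatory variables are limited to 60 unique values, and prediction fails when the prediction data contains categories that are absent from the training data. The following is a minimal pure-Python sketch of such pre-flight checks. The helper names are hypothetical and are not part of arcpy, and treating the 95 percent rule as "at least 95 percent, applied to variables not marked categorical" is an assumption; the tool's internal check may differ.

```python
from collections import Counter

MAX_CATEGORIES = 60         # limit stated in the usage notes for categorical variables
NO_VARIATION_SHARE = 0.95   # share of identical values that flags "no variation"


def dominant_share(values):
    """Return the fraction of values taken up by the single most common value."""
    if not values:
        raise ValueError("values must not be empty")
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def check_explanatory_variable(values, categorical=False):
    """Return a list of warnings for one explanatory variable (hypothetical helper)."""
    warnings = []
    # Assumption: the no-variation rule is applied to variables not marked categorical.
    if not categorical and dominant_share(values) >= NO_VARIATION_SHARE:
        warnings.append("no variation: consider marking as categorical")
    if categorical and len(set(values)) > MAX_CATEGORIES:
        warnings.append("more than 60 unique categories")
    return warnings


def unseen_categories(training_values, prediction_values):
    """Return categories present in the prediction data but absent from training."""
    return set(prediction_values) - set(training_values)


# Illustrative data only: 96 of 100 slope values are identical.
slope = [0.0] * 96 + [1.5, 2.0, 3.1, 4.2]
slope_warnings = check_explanatory_variable(slope)
missing = unseen_categories(["forest", "urban"], ["forest", "water"])
```

Running checks like these on the attribute values before training can surface problems that would otherwise only appear as tool errors.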

Parameters

Label | Explanation | Data Type
Prediction Type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, predict features, or create a prediction surface.

  • Train only: A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of the model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default.
  • Predict to features: Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table and chart of variable importance.
  • Predict to raster: A prediction raster will be generated for the area where the explanatory rasters intersect. Explanatory rasters must be provided for both the training area and the area to be predicted. The output of this option will be a prediction surface, model diagnostics in the messages window, and an optional table and chart of variable importance.
String
Input Training Features

The feature class containing the Variable to Predict parameter value and, optionally, the explanatory training variables from fields.

Feature Layer
Variable to Predict
(Optional)

The variable from the Input Training Features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
Treat Variable as Categorical
(Optional)

Specifies whether the Variable to Predict value is a categorical variable.

  • Checked—The Variable to Predict value is a categorical variable and the tool will perform classification.
  • Unchecked—The Variable to Predict value is continuous and the tool will perform regression. This is the default.
Boolean
Explanatory Training Variables
(Optional)

A list of fields representing the explanatory variables that help predict the value or category of the Variable to Predict value. Check the Categorical check box for any variables that represent classes or categories (such as land cover or presence or absence).

Value Table
Explanatory Training Distance Features
(Optional)

The explanatory training distance features. Explanatory variables will be automatically created by calculating a distance from the provided features to the Input Training Features values. Distances will be calculated from each of the input Explanatory Training Distance Features values to the nearest Input Training Features value. If the input Explanatory Training Distance Features values are polygons or lines, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
Explanatory Training Rasters
(Optional)

The explanatory training variables extracted from rasters. Explanatory training variables will be automatically created by extracting raster cell values. For each feature in the Input Training Features parameter, the value of the raster cell is extracted at that exact location. Bilinear raster resampling is used when extracting the raster value for continuous rasters. Nearest neighbor assignment is used when extracting a raster value from categorical rasters. Check the Categorical check box for any rasters that represent classes or categories such as land cover or presence or absence.

Value Table
Input Prediction Features
(Optional)

A feature class representing locations where predictions will be made. This feature class must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Feature Layer
Output Predicted Features
(Optional)

The output feature class containing the prediction results.

Feature Class
Output Prediction Surface
(Optional)

The output raster containing the prediction results. The default cell size will be the maximum cell size of the raster inputs. To set a different cell size, use the Cell Size environment setting.

Raster Dataset
Match Explanatory Variables
(Optional)

A list of the Explanatory Variables values specified from the Input Training Features parameter on the right and corresponding fields from the Input Prediction Features parameter on the left.

Value Table
Match Distance Features
(Optional)

A list of the Explanatory Distance Features values specified for the Input Training Features parameter on the right and corresponding feature sets from the Input Prediction Features parameter on the left.

The Explanatory Distance Features values that are more appropriate for the Input Prediction Features parameter can be provided if those used for training are in a different study area or time period.

Value Table
Match Explanatory Rasters
(Optional)

A list of the Explanatory Rasters values specified for the Input Training Features parameter on the right and corresponding rasters from the Input Prediction Features parameter or the Prediction Surface parameter to be created on the left.

The Explanatory Rasters values that are more appropriate for the Input Prediction Features parameter can be provided if those used for training are in a different study area or time period.

Value Table
Output Trained Features
(Optional)

The explanatory variables used for training (including sampled raster values and distance calculations), as well as the observed Variable to Predict field and accompanying predictions that will be used to further assess performance of the trained model.

Feature Class
Output Variable Importance Table
(Optional)

The table that will contain information describing the importance of each explanatory variable (fields, distance features, and rasters) used in the model created. The chart created from this table can be accessed in the Contents pane.

Table
Convert Polygons to Raster Resolution for Training
(Optional)

Specifies how polygons will be treated when training the model if the Input Training Features values are polygons with a categorical Variable to Predict value and only Explanatory Training Rasters values have been specified.

  • Checked—The polygon will be divided into all of the raster cells with centroids falling within the polygon. The raster values at each centroid will be extracted and used to train the model. The model will no longer be trained on the polygon; it will be trained on the raster values extracted for each cell centroid. This is the default.
    Polygon divided into raster cells
  • Unchecked—Each polygon will be assigned the average value of the underlying continuous rasters and the majority for underlying categorical rasters.
    Polygon value assigned as either average or majority

Boolean
Number of Trees
(Optional)

The number of trees that will be created in the forest model. More trees generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Long
Minimum Leaf Size
(Optional)

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5 and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Long
Maximum Tree Depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Long
Data Available per Tree (%)
(Optional)

The percentage of the Input Training Features values that will be used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.

Long
Number of Randomly Sampled Variables
(Optional)

The number of explanatory variables that will be used to create each decision tree.

Each of the decision trees in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting the model, particularly if there are one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables (fields, distances, and rasters combined) if the Variable to Predict value is numeric or divide the total number of explanatory variables (fields, distances, and rasters combined) by 3 if the Variable to Predict value is categorical.

Long
Training Data Excluded for Validation (%)
(Optional)

The percentage (between 10 percent and 50 percent) of the Input Training Features values that will be reserved as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values. The default is 10 percent.

Double
Output Classification Performance Table (Confusion Matrix)
(Optional)

A confusion matrix for classification summarizing the performance of the model created. This table can be used to calculate other diagnostics in addition to the accuracy and sensitivity measures the tool calculates in the output messages.

Table
Output Validation Table
(Optional)

If the Number of Runs for Validation value is greater than 2, this table creates a chart of the distribution of R² for each model. This distribution can be used to assess the stability of the model.

Table
Compensate for Sparse Categories
(Optional)

Specifies whether each category in the training dataset, regardless of its frequency, will be represented in each tree.

  • Checked—Each tree will include every category that is represented in the training dataset.
  • Unchecked—Each tree will be created based on a random sample of the categories in the training dataset. This is the default.

Boolean
Number of Runs for Validation
(Optional)

The number of iterations of the tool. The distribution of the R² for each run can be displayed using the Output Validation Table parameter. When this is set and predictions are being generated, only the model that produced the highest R² value will be used for predictions.

Long
Calculate Uncertainty
(Optional)

Specifies whether prediction uncertainty will be calculated when training, predicting to features, or predicting to raster.

  • Checked—A prediction uncertainty interval will be calculated.
  • Unchecked—Uncertainty will not be calculated. This is the default.
Boolean
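
The two percentage parameters above, Training Data Excluded for Validation (%) and Data Available per Tree (%), can be turned into approximate feature counts with simple arithmetic. The sketch below is illustrative only. The function name is hypothetical, the rounding is an assumption, and so is the order of operations (validation features are set aside first, then the per-tree percentage is applied to what remains); the tool may compute these counts differently.

```python
def training_counts(n_features, validation_pct=10.0, sample_size_pct=100):
    """Estimate how many features feed validation, training, and each tree.

    Hypothetical helper based on the parameter descriptions:
    - validation_pct: between 10 and 50, default 10
    - sample_size_pct: percentage of data available per tree, default 100
    - each tree samples roughly two-thirds of the data available to it
    """
    if not 10 <= validation_pct <= 50:
        raise ValueError("validation_pct must be between 10 and 50")
    if not 0 < sample_size_pct <= 100:
        raise ValueError("sample_size_pct must be in the range (0, 100]")

    n_validation = round(n_features * validation_pct / 100)
    n_training = n_features - n_validation
    available_per_tree = round(n_training * sample_size_pct / 100)
    approx_sampled_per_tree = round(available_per_tree * 2 / 3)

    return {
        "validation": n_validation,
        "training": n_training,
        "available_per_tree": available_per_tree,
        "approx_sampled_per_tree": approx_sampled_per_tree,
    }


# With the documented defaults and 1,000 input features.
default_counts = training_counts(1000)
```

With 1,000 features and the defaults, this estimates 100 features held out for validation and about 600 sampled by each tree. Lowering the per-tree percentage reduces the last two counts, which is why it speeds up the tool on very large datasets.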

Derived Output

Label | Explanation | Data Type
Output Uncertainty Raster Layers

When the Calculate Uncertainty parameter is checked, the tool will calculate a 90 percent prediction interval around each predicted value of the Variable to Predict parameter.

Raster Layer
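
To make the idea of a 90 percent prediction interval concrete, the sketch below takes the 5th and 95th percentiles of a set of candidate values for one prediction and reports them under field names ending in _P05 and _P95. This is a simplified illustration and not the tool's implementation: how the tool derives the candidate values is not described in this section, and the function name and the PRICE field name are hypothetical.

```python
import statistics


def prediction_interval_90(values):
    """Return the (5th percentile, 95th percentile) of the given values.

    statistics.quantiles with n=20 returns 19 cut points at 5 percent steps,
    so the first and last cut points bound the central 90 percent.
    """
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]


def interval_fields(field_name, values):
    """Build a mapping that mimics the _P05 and _P95 output field suffixes."""
    lower, upper = prediction_interval_90(values)
    return {f"{field_name}_P05": lower, f"{field_name}_P95": upper}


# Illustrative candidate values 0, 1, ..., 100 for a hypothetical PRICE field.
example = interval_fields("PRICE", list(range(101)))
```

Here the lower bound is 5.0 and the upper bound is 95.0, so 90 percent of the candidate values fall inside the interval.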

arcpy.stats.Forest(prediction_type, in_features, {variable_predict}, {treat_variable_as_categorical}, {explanatory_variables}, {distance_features}, {explanatory_rasters}, {features_to_predict}, {output_features}, {output_raster}, {explanatory_variable_matching}, {explanatory_distance_matching}, {explanatory_rasters_matching}, {output_trained_features}, {output_importance_table}, {use_raster_values}, {number_of_trees}, {minimum_leaf_size}, {maximum_depth}, {sample_size}, {random_variables}, {percentage_for_training}, {output_classification_table}, {output_validation_table}, {compensate_sparse_categories}, {number_validation_runs}, {calculate_uncertainty})
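
As a rough illustration of the signature above, the following sketch assembles keyword arguments for a train-only run and passes them to arcpy.stats.Forest. The dataset path, field names, and helper functions are hypothetical. The TRAIN, CATEGORICAL, and NUMERIC keywords are taken from the parameter descriptions in this section and should be verified against your ArcGIS Pro version. The arcpy import sits inside run_forest so that the argument-building step can be tried without an ArcGIS installation.

```python
def build_forest_args(in_features, variable_predict, explanatory_fields,
                      categorical_target=False, number_of_trees=100,
                      output_trained_features=None):
    """Assemble keyword arguments for a train-only run (hypothetical helper).

    explanatory_fields is a list of (field_name, is_categorical) pairs.
    """
    args = {
        "prediction_type": "TRAIN",
        "in_features": in_features,
        "variable_predict": variable_predict,
        "treat_variable_as_categorical":
            "CATEGORICAL" if categorical_target else "NUMERIC",
        # Value table rows follow the [[Variable, Categorical], ...] layout.
        "explanatory_variables": [
            [name, "CATEGORICAL" if is_categorical else "NUMERIC"]
            for name, is_categorical in explanatory_fields
        ],
        "number_of_trees": number_of_trees,
    }
    if output_trained_features is not None:
        args["output_trained_features"] = output_trained_features
    return args


def run_forest(args):
    """Run the tool; requires ArcGIS Pro, so arcpy is imported only here."""
    import arcpy
    return arcpy.stats.Forest(**args)


# Hypothetical layer and field names, for illustration only.
forest_args = build_forest_args(
    "training_points",
    "SPECIES",
    [("ELEVATION", False), ("LANDCOVER", True)],
    categorical_target=True,
)
```

Calling run_forest(forest_args) inside an ArcGIS Pro Python environment would start the training run. Keeping the arguments in a dictionary makes it easy to change a single setting, such as number_of_trees, between runs.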
Name | Explanation | Data Type
prediction_type

Specifies the operation mode of the tool. The tool can be run to train a model to only assess performance, predict features, or create a prediction surface.

  • TRAIN: A model will be trained, but no predictions will be generated. Use this option to assess the accuracy of the model before generating predictions. This option will output model diagnostics in the messages window and a chart of variable importance. This is the default.
  • PREDICT_FEATURES: Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature class, model diagnostics in the messages window, and an optional table and chart of variable importance.
  • PREDICT_RASTER: A prediction raster will be generated for the area where the explanatory rasters intersect. Explanatory rasters must be provided for both the training area and the area to be predicted. The output of this option will be a prediction surface, model diagnostics in the messages window, and an optional table and chart of variable importance.
String
in_features

The feature class containing the variable_predict parameter value and, optionally, the explanatory training variables from fields.

Feature Layer
variable_predict
(Optional)

The variable from the in_features parameter containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
treat_variable_as_categorical
(Optional)

Specifies whether the variable_predict value is a categorical variable.

  • CATEGORICAL: The variable_predict value is a categorical variable and the tool will perform classification.
  • NUMERIC: The variable_predict value is continuous and the tool will perform regression. This is the default.
Boolean
explanatory_variables
[[Variable, Categorical],...]
(Optional)

A list of fields representing the explanatory variables that help predict the value or category of the variable_predict value. Specify each variable as CATEGORICAL if it represents classes or categories (such as land cover or presence or absence) and NUMERIC if it is continuous.

Value Table
distance_features
[distance_features,...]
(Optional)

The explanatory training distance features. Explanatory variables will be automatically created by calculating a distance from the provided features to the in_features values. Distances will be calculated from each of the input distance_features values to the nearest in_features value. If the input distance_features values are polygons or lines, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
explanatory_rasters
[[Variable, Categorical],...]
(Optional)

The explanatory training variables extracted from rasters. Explanatory training variables will be automatically created by extracting raster cell values. For each feature in the in_features parameter, the value of the raster cell is extracted at that exact location. Bilinear raster resampling is used when extracting the raster value unless it is specified as categorical, in which case nearest neighbor assignment is used. Specify the raster as CATEGORICAL if it represents classes or categories such as land cover or presence or absence and NUMERIC if it is continuous.

Value Table
features_to_predict
(Optional)

A feature class representing locations where predictions will be made. This feature class must also contain any explanatory variables provided as fields that correspond to those used from the training data.

Feature Layer
output_features
(Optional)

The output feature class containing the prediction results.

Feature Class
output_raster
(Optional)

The output raster containing the prediction results. The default cell size will be the maximum cell size of the raster inputs. To set a different cell size, use the Cell Size environment setting.

Raster Dataset
explanatory_variable_matching
[[Prediction, Training],...]
(Optional)

A list of the explanatory_variables values specified from the in_features parameter on the right and corresponding fields from the features_to_predict parameter on the left, for example, [["LandCover2000", "LandCover2010"], ["Income", "PerCapitaIncome"]].

Value Table
explanatory_distance_matching
[[Prediction, Training],...]
(Optional)

A list of the distance_features values specified for the in_features parameter on the right and corresponding feature sets from the features_to_predict parameter on the left.

The distance_features values that are more appropriate for the features_to_predict parameter can be provided if those used for training are in a different study area or time period.

Value Table
explanatory_rasters_matching
[[Prediction, Training],...]
(Optional)

A list of the explanatory_rasters values specified for the in_features parameter on the right and corresponding rasters from the features_to_predict parameter or output_raster parameter to be created on the left.

The explanatory_rasters values that are more appropriate for the features_to_predict parameter can be provided if those used for training are in a different study area or time period.

Value Table
output_trained_features
(Optional)

The explanatory variables used for training (including sampled raster values and distance calculations), as well as the observed variable_predict field and accompanying predictions that will be used to further assess performance of the trained model.

Feature Class
output_importance_table
(Optional)

The table that will contain information describing the importance of each explanatory variable (fields, distance features, and rasters) used in the model created.

Table
use_raster_values
(Optional)

Specifies how polygons will be treated when training the model if the in_features values are polygons with a categorical variable_predict value and only explanatory_rasters values have been specified.

  • TRUE: The polygon will be divided into all of the raster cells with centroids falling within the polygon. The raster values at each centroid will be extracted and used to train the model. The model will no longer be trained on the polygon; it will be trained on the raster values extracted for each cell centroid. This is the default.
  • FALSE: Each polygon will be assigned the average value of the underlying continuous rasters and the majority for underlying categorical rasters.
Boolean
number_of_trees
(Optional)

The number of trees that will be created in the forest model. More trees generally result in more accurate model prediction, but the model will take longer to calculate. The default number of trees is 100.

Long
minimum_leaf_size
(Optional)

The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5 and the default for classification is 1. For very large data, increasing these numbers will decrease the run time of the tool.

Long
maximum_depth
(Optional)

The maximum number of splits that will be made down a tree. Using a large maximum depth, more splits will be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.

Long
sample_size
(Optional)

The percentage of the in_features values that will be used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.

Long
random_variables
(Optional)

The number of explanatory variables that will be used to create each decision tree.

Each of the decision trees in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting the model, particularly if there are one or more dominant variables. A common practice is to use the square root of the total number of explanatory variables (fields, distances, and rasters combined) if the variable_predict value is numeric or divide the total number of explanatory variables (fields, distances, and rasters combined) by 3 if the variable_predict value is categorical.

Long
percentage_for_training
(Optional)

The percentage (between 10 percent and 50 percent) of the in_features values that will be reserved as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted values. The default is 10 percent.

Double
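
The holdout described above can be pictured with a minimal sketch that randomly reserves a percentage of feature IDs for validation and keeps the rest for training. The function name, the rounding rule, and the seed are assumptions for illustration only; the tool performs its own sampling internally.

```python
import random


def split_for_validation(feature_ids, percent_excluded, seed=None):
    """Randomly reserve percent_excluded percent of the features as test data.

    Hypothetical helper: returns (training_ids, test_ids).
    """
    if not 0 <= percent_excluded < 100:
        raise ValueError("percent_excluded must be in the range [0, 100)")
    shuffled = list(feature_ids)
    random.Random(seed).shuffle(shuffled)
    # Rounding to the nearest whole feature is an assumption of this sketch.
    n_test = round(len(shuffled) * percent_excluded / 100)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_for_validation(range(200), percent_excluded=10, seed=42)
```

The model would be fit on train_ids only, and its predictions for test_ids would then be compared with the observed values.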
output_classification_table
(Optional)

A confusion matrix for classification summarizing the performance of the model created. This table can be used to calculate other diagnostics in addition to the accuracy and sensitivity measures the tool calculates in the output messages.

Table
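
As an example of the additional diagnostics a confusion matrix supports, the sketch below derives overall accuracy plus per-class sensitivity and precision. The nested-dictionary layout, the class names, and the counts are hypothetical; the schema of the table written by the tool may differ, so the values would first need to be read into this shape.

```python
def classification_diagnostics(matrix):
    """Compute accuracy and per-class diagnostics from a confusion matrix.

    matrix[actual][predicted] holds a count (hypothetical layout).
    """
    classes = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[c].get(c, 0) for c in classes)
    per_class = {}
    for c in classes:
        true_pos = matrix[c].get(c, 0)
        actual_total = sum(matrix[c].values())
        predicted_total = sum(matrix[a].get(c, 0) for a in classes)
        per_class[c] = {
            # Share of actual members of the class that were found.
            "sensitivity": true_pos / actual_total if actual_total else None,
            # Share of predictions for the class that were correct.
            "precision": true_pos / predicted_total if predicted_total else None,
        }
    return correct / total, per_class


confusion = {
    "Forest": {"Forest": 40, "Water": 10},
    "Water": {"Forest": 5, "Water": 45},
}
accuracy, per_class = classification_diagnostics(confusion)
```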
output_validation_table
(Optional)

If the number_validation_runs value is greater than 2, this table creates a chart of the distribution of R² for each model. This distribution can be used to assess the stability of the model.

Table
compensate_sparse_categories
(Optional)

Specifies whether each category in the training dataset, regardless of its frequency, will be represented in each tree.

  • TRUE: Each tree will include every category that is represented in the training dataset.
  • FALSE: Each tree will be created based on a random sample of the categories in the training dataset. This is the default.
Boolean
number_validation_runs
(Optional)

The number of iterations of the tool. The distribution of R² across the runs can be displayed using the output_validation_table parameter. When this is set and predictions are being generated, only the model that produced the highest R² value will be used for predictions.

Long
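
To make the selection rule concrete, the following sketch scores several hypothetical validation runs and picks the one with the highest R². It assumes the standard coefficient of determination (1 minus the ratio of residual to total sum of squares); the exact formula the tool uses is not stated here. For simplicity, every run is scored against the same made-up observed values.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have no variance")
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]
# Hypothetical predictions from three validation runs, keyed by run number.
runs = {
    1: [1.0, 2.0, 3.0, 5.0],
    2: [1.5, 2.0, 3.0, 4.0],
    3: [2.0, 2.0, 3.0, 3.0],
}
scores = {run: r_squared(observed, preds) for run, preds in runs.items()}
best_run = max(scores, key=scores.get)  # run with the highest R-squared
```

A wide spread in scores across runs suggests that the model is sensitive to which features are held out.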
calculate_uncertainty
(Optional)

Specifies whether prediction uncertainty will be calculated when training, predicting to features, or predicting to raster.

  • TRUE: A prediction uncertainty interval will be calculated.
  • FALSE: Uncertainty will not be calculated. This is the default.
Boolean

Derived Output

Name | Explanation | Data Type
output_uncertainty_raster_layers

When calculate_uncertainty is set to TRUE, the tool will calculate a 90 percent prediction interval around each predicted value of the variable_predict parameter.

Raster Layer

Code sample

Forest example 1 (Python window)

The following Python script demonstrates how to use the Forest function.

import arcpy
arcpy.env.workspace = r"c:\data"

# Train a forest-based model only. All data comes from a single polygon
# feature class. The tool excludes 10% of the input features from training
# and uses these features to validate the model.

prediction_type = "TRAIN"
in_features = r"Boston_Vandalism.shp"
variable_predict = "VandCnt"
explanatory_variables = [["Educat", "false"], ["MedAge", "false"], 
    ["HHInc", "false"], ["Pop", "false"]]
output_trained_features = "TrainingFeatures.shp"
number_of_trees = 100
sample_size = 100
percentage_for_training = 10

arcpy.stats.Forest(prediction_type, in_features, variable_predict, None,
    explanatory_variables, None, None, None, None, None, None, None, None,
    output_trained_features, None, True, number_of_trees, None, None, sample_size, 
    None, percentage_for_training)

Forest example 2 (stand-alone script)

The following Python script demonstrates how to use the Forest function to predict to features.

# Import system modules
import arcpy

# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True

# Set the workspace to a file geodatabase
arcpy.env.workspace = r"C:\Data\BostonCrimeDB.gdb"

# Forest-based model taking advantage of both distance features and 
# explanatory rasters. The training and prediction data have been manually
# split, so the percentage_for_training parameter is set to 0. A variable importance
# table is created to help assess results, and advanced options are used
# to fine-tune the model.

prediction_type = "PREDICT_FEATURES"
in_features = r"Boston_Vandalism_Training"
variable_predict = "Vandalism_Count"
treat_variable_as_categorical = None
explanatory_variables = [["EduClass", "true"], ["MedianAge", "false"],
    ["HouseholdIncome", "false"], ["TotalPopulation", "false"]]
distance_features = r"Boston_Highways"
explanatory_rasters = [["LandUse", "true"]]
features_to_predict = r"Boston_Vandalism_Prediction"
output_features = r"Prediction_Output"
output_raster = None
explanatory_variable_matching = [["EduClass", "EduClass"], ["MedianAge", "MedianAge"], 
    ["HouseholdIncome", "HouseholdIncome"], ["TotalPopulation", "TotalPopulation"]]
explanatory_distance_matching = [["Boston_Highways", "Boston_Highways"]]
explanatory_rasters_matching = [["LandUse", "LandUse"]]
output_trained_features = r"Training_Output"
output_importance_table = r"Variable_Importance"
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = 2
maximum_depth = 5
sample_size = 100
random_variables = 3
percentage_for_training = 0

arcpy.stats.Forest(prediction_type, in_features, variable_predict,
    treat_variable_as_categorical, explanatory_variables, distance_features,
    explanatory_rasters, features_to_predict, output_features, output_raster,
    explanatory_variable_matching, explanatory_distance_matching, 
    explanatory_rasters_matching, output_trained_features, output_importance_table,
    use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth,
    sample_size, random_variables, percentage_for_training)

Forest example 3 (stand-alone script)

The following Python script demonstrates how to use the Forest function to create a prediction surface.

# Import system modules
import arcpy

# Set property to overwrite existing outputs
arcpy.env.overwriteOutput = True

# Set the workspace to a file geodatabase
arcpy.env.workspace = r"C:\Data\Landsat.gdb"

# Use a forest-based model to classify a Landsat image. The TrainingPolygons feature 
# class was created manually and is used to train the model to 
# classify the remainder of the Landsat image.

prediction_type = "PREDICT_RASTER"
in_features = r"TrainingPolygons"
variable_predict = "LandClassName"
treat_variable_as_categorical = "CATEGORICAL" 
explanatory_variables = None
distance_features = None
explanatory_rasters = [["Band1", "false"], ["Band2", "false"], ["Band3", "false"]]
features_to_predict = None
output_features = None
output_raster = r"PredictionSurface"
explanatory_variable_matching = None
explanatory_distance_matching = None
explanatory_rasters_matching = [["Band1", "Band1"], ["Band2", "Band2"], ["Band3", "Band3"]]
output_trained_features = None
output_importance_table = None
use_raster_values = True
number_of_trees = 100
minimum_leaf_size = None
maximum_depth = None
sample_size = 100
random_variables = None
percentage_for_training = 10

arcpy.stats.Forest(prediction_type, in_features, variable_predict,
    treat_variable_as_categorical, explanatory_variables, distance_features,
    explanatory_rasters, features_to_predict, output_features, output_raster,
    explanatory_variable_matching, explanatory_distance_matching, 
    explanatory_rasters_matching, output_trained_features, output_importance_table,
    use_raster_values, number_of_trees, minimum_leaf_size, maximum_depth,
    sample_size, random_variables, percentage_for_training)
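
The positional calls above pad every unused parameter with None, which makes it easy to misplace a value. As an alternative, the sketch below collects the settings from example 1 as keyword arguments, assuming the keyword names match the parameter names listed in this topic. The helper functions build_training_arguments and run_forest are hypothetical and exist only for illustration; arcpy is imported inside run_forest so the argument dictionary can be inspected without an ArcGIS Pro environment.

```python
def build_training_arguments():
    """Return keyword arguments equivalent to the positional call in example 1."""
    return {
        "prediction_type": "TRAIN",
        "in_features": "Boston_Vandalism.shp",
        "variable_predict": "VandCnt",
        "explanatory_variables": [["Educat", "false"], ["MedAge", "false"],
                                  ["HHInc", "false"], ["Pop", "false"]],
        "output_trained_features": "TrainingFeatures.shp",
        "use_raster_values": True,
        "number_of_trees": 100,
        "sample_size": 100,
        "percentage_for_training": 10,
    }


def run_forest(arguments, workspace):
    """Run the tool with keyword arguments (hypothetical wrapper)."""
    import arcpy  # requires an ArcGIS Pro Python environment

    arcpy.env.workspace = workspace
    return arcpy.stats.Forest(**arguments)


training_arguments = build_training_arguments()
# In an ArcGIS Pro Python environment:
# run_forest(training_arguments, r"c:\data")
```

Parameters left out of the dictionary fall back to their defaults, so optional settings such as minimum_leaf_size only need to be listed when they are changed.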

Environments

Random number generator

The Random Generator Type used is always Mersenne Twister.

Parallel Processing Factor

Parallel processing is only used when predictions are being made.

Licensing information

  • Basic: Limited
  • Standard: Limited
  • Advanced: Limited

Related topics