Splits a dataset into a training and a test set and runs multiple calibrations using an increasing number of observations from the training data. Model accuracy is calculated for each model on the hold-out test set. Can be used to assess the effect of sample size on model performance. Should support most (numeric) regression models implemented in caret (no guarantees, though: not tested). [Experimental]
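The procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the function's actual implementation: lm() stands in for caret::train, the training subsets are drawn at random rather than by the kmeans-based selection, RMSE is used as an example accuracy metric, and the toy data and object names are hypothetical.

```r
# Sketch only: lm() stands in for caret::train(), and training subsets are
# random (no kmeans-based selection). All names and data are hypothetical.
set.seed(123)

# Toy data: one predictor x, one response y
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.2)

train_ratio <- 0.7
min_samples <- 30
sample_step <- 10

# Hold-out split: the test set stays fixed for every model
train_idx <- sample(n, size = floor(n * train_ratio))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Training sizes from min_samples up to the size of the training set
sizes <- seq(min_samples, nrow(train_set), by = sample_step)

# Fit one model per size and score each on the same test set
rmse <- vapply(sizes, function(k) {
  fit  <- lm(y ~ x, data = train_set[seq_len(k), ])
  pred <- predict(fit, newdata = test_set)
  sqrt(mean((test_set$y - pred)^2))
}, numeric(1))

learning_curve <- data.frame(n_train = sizes, rmse = rmse)
print(learning_curve)
```

Plotting rmse against n_train gives a learning curve from which the effect of sample size on model performance can be read.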

train_sample_size(
  data,
  Xr = "spc_sg_snv_rs4",
  Yr = "TOC",
  trans = none,
  trans_rev = none,
  train_ratio = 0.7,
  kmeans_pc = 0.99,
  min_samples = 30,
  sample_step = 10,
  method = "glm",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = NULL,
  save_all = F,
  save_models = F,
  output_folder = "models_folder",
  return_all = T,
  seed = 123,
  ...
)

Arguments

data

Dataset for model calibration. Can be a tibble or data.frame. Must contain the column named in Yr and the column or nested tibble/data.frame/matrix named in Xr.

train_ratio

Train/test split: fraction of the data (between 0 and 1) used for training. Defaults to 0.7.

min_samples

Smallest number of samples used for training.

sample_step

Step size by which the training sample size is incremented from min_samples to nrow(data)*train_ratio.

trControl

model fitting parameters handed to caret::train

output_folder

Path of the folder in which models will be saved if save = TRUE. If the folder does not exist (dir.exists(output_folder) is FALSE), it will be created.

...

additional arguments passed to caret::train

Xr

Predictor variable(s). Vector column or nested tibble/data.frame/matrix. Will be converted into a nested matrix.

Yr

Target variable. Currently only a single column is supported.

trans

Transformation applied to Yr by match.fun(trans)(Yr). Set to none for no operation.

trans_rev

Reverse operation to trans. Applied by match.fun(trans_rev)(Yr). Set to none for no operation.

kmeans_pc

Cumulative explained variance (cumvar) of the principal components retained for the kmeans step. Ignored when Xr has fewer than 3 columns (kmeans is then applied directly to Xr).

method

Model type (handed to caret::train). Make sure the required packages are installed and loaded (this is not yet automated).

tuneGrid

Tuning parameter grid handed to caret::train.

save

Should the individual models be saved?

return_all

If TRUE, the result object is returned to the console.

seed

Seed for the initial train/test split, for reproducibility.
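The way trans and trans_rev are applied through match.fun can be illustrated with a short sketch. The none function defined here is a hypothetical stand-in (an identity function) for whatever none refers to in the package, and log/exp are only an example pair of transformations.

```r
# Hypothetical stand-in for `none`: returns its input unchanged
none <- function(x) x

y <- c(0.5, 1.2, 3.4)   # example target values

# Forward and reverse transformation, looked up via match.fun()
trans     <- "log"
trans_rev <- "exp"

y_trans <- match.fun(trans)(y)            # values a model would be fitted on
y_back  <- match.fun(trans_rev)(y_trans)  # back on the original scale

# The reverse operation should recover the original values
all.equal(y, y_back)

# With `none`, the values pass through untouched
identical(match.fun(none)(y), y)
```

match.fun accepts either a function or its name as a character string, so both trans = log and trans = "log" would resolve to the same function in this sketch.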

Value

If return_all = TRUE, returns an object containing a list 'models' with all calibrated models (caret::train objects with additional details appended to object$documentation) and val_stats containing the evaluation metrics. The function currently relies on a local function from evaluate_model_adjusted.R.

Note

In the future, the kmeans sampling function should be extracted as a standalone function.
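As a rough idea of what such a standalone function might look like, the sketch below runs kmeans on the principal component scores that reach a given cumulative variance and picks one sample per cluster. Everything here is an assumption for illustration: kmeans_sample is a hypothetical name, the use of prcomp, the handling of matrices with fewer than 3 columns, and the rule of taking the sample closest to each cluster centre are not confirmed by the documentation above.

```r
set.seed(123)

# Hypothetical predictor matrix: 100 samples, 20 correlated variables
latent <- matrix(rnorm(100 * 3), ncol = 3)
X <- latent %*% matrix(rnorm(3 * 20), nrow = 3) +
  matrix(rnorm(100 * 20, sd = 0.05), ncol = 20)

# Hypothetical helper, not part of the package
kmeans_sample <- function(X, k, pc = 0.99) {
  X <- as.matrix(X)
  if (ncol(X) >= 3) {
    # Keep the leading principal components up to cumulative variance `pc`
    pca    <- prcomp(X, center = TRUE, scale. = FALSE)
    cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    n_pc   <- which(cumvar >= pc)[1]
    scores <- pca$x[, seq_len(n_pc), drop = FALSE]
  } else {
    # Too few variables for a decomposition: cluster X directly
    scores <- X
  }
  km <- kmeans(scores, centers = k, nstart = 5)
  # Assumption: pick, per cluster, the sample closest to the cluster centre
  vapply(seq_len(k), function(i) {
    members <- which(km$cluster == i)
    d <- rowSums(sweep(scores[members, , drop = FALSE], 2, km$centers[i, ])^2)
    members[which.min(d)]
  }, integer(1))
}

idx <- kmeans_sample(X, k = 10)
length(idx)
```

The returned row indices could then be used to subset the training data, so that each training set of a given size covers the predictor space rather than being a purely random draw.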