train_sample_size.Rd
Splits the dataset into a train and a test set and runs multiple calibrations using an increasing number of observations from the training data.
Model accuracy is calculated for each created model using the hold-out test set.
Can be used to assess the effect of sample size on model performance. Should support most (numeric) regression models implemented in caret, although this is untested and not guaranteed.
train_sample_size(
data,
Xr = "spc_sg_snv_rs4",
Yr = "TOC",
trans = none,
trans_rev = none,
train_ratio = 0.7,
kmeans_pc = 0.99,
min_samples = 30,
sample_step = 10,
method = "glm",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = NULL,
save_all = F,
save_models = F,
output_folder = "models_folder",
return_all = T,
seed = 123,
...
)
Dataset for model calibration. Can be a tibble or a data.frame. Must contain the column Yr and a column or nested tibble/data.frame/matrix Xr.
Train/test split: fraction of the data (between 0 and 1) used for training. Default is 0.7.
Lowest number of samples used for training.
Step size used to increment the training sample size from min_samples to nrow(data) * train_ratio.
model fitting parameters handed to caret::train
Folder path in which models are saved when saving is enabled (see save_models and save_all). If output_folder does not exist, it is created.
additional arguments passed to caret::train
Predictor variable(s). Vector column or nested tibble/data.frame/matrix. Will be converted into a nested matrix.
Target variable. Currently only single column supported.
Transformation applied to Yr by match.fun(trans)(Yr). Set to none for no operation.
Reverse operation to trans. Applied by match.fun(trans_rev)(Yr). Set to none for no operation.
Cumulative variance retained by the principal component decomposition used for kmeans sampling. Ignored when Xr has fewer than 3 columns (kmeans is then applied directly to Xr).
Model type (handed to caret::train). Make sure to install and load the required packages yourself; this is not yet automated.
tuning parameter grid handed to caret::train
should individual models be saved?
Should the results be returned to the console?
Seed for the initial train/test split, for reproducibility.
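The descriptions of trans, trans_rev, min_samples and sample_step above can be illustrated with a small base R sketch. It is not taken from the function's source: `none` is written here as a hypothetical identity helper, the toy values are made up, and whether the real function rounds nrow(data) * train_ratio down or always includes the final training size is an assumption.

```r
# Hypothetical stand-in for the `none` helper: an identity function.
none <- function(x) x

# Toy target values, made up for illustration.
Yr <- c(0.5, 1.2, 3.4, 7.8)

# Apply a transformation and its reverse the way the argument
# descriptions state: match.fun(trans)(Yr) and match.fun(trans_rev)(Yr).
trans <- log1p
trans_rev <- expm1
Yr_trans <- match.fun(trans)(Yr)
Yr_back <- match.fun(trans_rev)(Yr_trans)

# Training sample sizes from min_samples up to nrow(data) * train_ratio.
n_obs <- 100          # assumed nrow(data)
train_ratio <- 0.7
min_samples <- 30
sample_step <- 10
n_train <- floor(n_obs * train_ratio)   # rounding down is an assumption
sizes <- seq(min_samples, n_train, by = sample_step)

print(sizes)
```

With these toy settings one model would be calibrated per entry of `sizes`, and trans_rev should recover the original Yr values up to floating point error.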
If return_all = TRUE, returns an object containing a list 'models' with all calibrated models (caret::train objects with additional details appended to object$documentation) and val_stats containing the evaluation metrics. Note: currently uses a local function from evaluate_model_adjusted.R.
In future, extract the kmeans sampling function as a standalone function.
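One possible shape for such a standalone helper is sketched below in base R. This is not the package's implementation: the name `kmeans_sample`, its arguments, and the rule of keeping the observation closest to each cluster centre are all assumptions made for illustration. It follows the kmeans_pc description by clustering on principal component scores up to a cumulative variance threshold, and by clustering Xr directly when it has fewer than 3 columns.

```r
# Hypothetical helper: select n representative rows of X via k-means.
kmeans_sample <- function(X, n, pc = 0.99, seed = 123) {
  X <- as.matrix(X)
  set.seed(seed)
  if (ncol(X) >= 3) {
    # Keep the leading principal components that reach the
    # requested cumulative variance.
    pca <- prcomp(X, center = TRUE, scale. = FALSE)
    cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    n_pc <- which(cumvar >= pc)[1]
    scores <- pca$x[, seq_len(n_pc), drop = FALSE]
  } else {
    # Fewer than 3 columns: cluster the raw predictors directly.
    scores <- X
  }
  km <- kmeans(scores, centers = n, nstart = 5)
  # For each cluster, keep the row closest to the cluster centre
  # (an assumed selection rule).
  idx <- vapply(seq_len(n), function(k) {
    members <- which(km$cluster == k)
    centre <- matrix(km$centers[k, ], nrow = length(members),
                     ncol = ncol(scores), byrow = TRUE)
    d <- rowSums((scores[members, , drop = FALSE] - centre)^2)
    members[which.min(d)]
  }, integer(1))
  sort(idx)
}

# Usage on random toy data: pick 10 of 200 rows.
set.seed(1)
X_toy <- matrix(rnorm(200 * 5), ncol = 5)
idx <- kmeans_sample(X_toy, n = 10)
```

Because each selected row comes from a different cluster, the returned indices are distinct, which makes the result usable directly as a row subset of the training data.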