
Sample size effect on model accuracy estimator
Splits a dataset into train/test sets and runs multiple calibrations using an increasing number of observations from the training data.
Model accuracy is calculated for each created model using the hold-out test set.
Can be used to assess the effect of sample size on model performance. Should support most (numeric) regression models implemented in caret, although this is untested and not guaranteed.
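The procedure described above can be illustrated with a minimal base-R sketch. This is not the function's own code: the toy data, the use of lm instead of caret::train, and RMSE as the accuracy measure are all assumptions made for illustration. Training subsets here are simply the first k rows of the shuffled training set; whatever sample selection the function performs (see kmeans_pc) is not reproduced.

```r
# Minimal base-R sketch of a sample-size learning curve (illustrative only).
set.seed(123)                                   # reproducible data and split

# Hypothetical toy data: one predictor x, one target y
n   <- 200
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.1)

# Train/test split, mirroring train_ratio = 0.7
train_ratio <- 0.7
train_idx   <- sample(seq_len(n), size = floor(n * train_ratio))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Training sizes from min_samples up to the full training set
min_samples <- 30
sample_step <- 10
sizes <- seq(min_samples, nrow(train), by = sample_step)

# Fit one model per size and score each on the same hold-out test set
rmse <- vapply(sizes, function(k) {
  fit  <- lm(y ~ x, data = train[seq_len(k), ])
  pred <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))
}, numeric(1))

curve <- data.frame(n_train = sizes, rmse = rmse)
print(curve)
```

Plotting rmse against n_train gives a learning curve; the point where it flattens suggests how many training samples are enough for the chosen model.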
Usage
train_sample_size(
data,
Xr = "spc_sg_snv_rs4",
Yr = "TOC",
trans = none,
trans_rev = none,
train_ratio = 0.7,
kmeans_pc = 0.99,
min_samples = 30,
sample_step = 10,
method = "glm",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = NULL,
save_all = F,
save_models = F,
output_folder = "models_folder",
return_all = T,
seed = 123,
...
)
Arguments
- data
Dataset for model calibration. Can be a tibble or data frame. Must contain the column Yr and the column or nested tibble/data.frame/matrix Xr.
- train_ratio
Train/test split ratio: the fraction of data (between 0 and 1) used for training. Default is 0.7.
- min_samples
Minimum number of samples used for training.
- sample_step
Step size used to increment the training sample size from min_samples to nrow(data)*train_ratio.
- trControl
model fitting parameters handed to caret::train
- output_folder
Folder path in which models are saved if save = TRUE. If the folder does not exist (dir.exists(output_folder) is FALSE), it is created.
- ...
additional arguments passed to caret::train
- Xr="spc_sg_snv_rs4"
Predictor variable(s). Vector column or nested tibble/data.frame/matrix. Will be converted into a nested matrix.
- Yr="TOC"
Target variable. Currently only single column supported.
- trans=none
Transformation applied to Yr by match.fun(trans)(Yr). Set to none for no operation.
- trans_rev=none
Reverse operation to trans. Applied by match.fun(trans_rev)(Yr). Set to none for no operation.
- kmeans_pc=.99
Cumulative variance retained by the principal component decomposition used for k-means. Ignored when Xr has fewer than 3 columns (k-means is then applied directly to Xr).
- method="glm"
Model type (handed to caret::train). Make sure to install and load the required packages; this is not yet automated.
- tuneGrid=NULL
tuning parameter grid handed to caret::train
- save=F
should individual models be saved?
- return_all=T
Should results be returned to the console?
- seed=123
Seed for the initial train/test split, for reproducibility.
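The trans and trans_rev arguments are described as being applied through match.fun, which accepts either a function or its name as a string. A small sketch of that mechanism follows. The none helper defined here is a hypothetical identity function standing in for the package's default, and the log/exp pair is just one example of a transformation with a matching reverse.

```r
# Hypothetical identity helper standing in for the `none` default
none <- function(x) x

Yr <- c(0.5, 1.2, 3.4)                    # made-up target values

trans     <- "log"   # match.fun() resolves a function name given as a string
trans_rev <- "exp"   # must undo `trans` so results return to the original scale

y_trans <- match.fun(trans)(Yr)           # forward transformation
y_back  <- match.fun(trans_rev)(y_trans)  # reverse transformation

# The round trip recovers the original values (up to floating-point error)
stopifnot(isTRUE(all.equal(y_back, Yr)))

# With `none`, values pass through unchanged
stopifnot(identical(match.fun(none)(Yr), Yr))
```

If trans_rev does not invert trans, values transformed back this way would no longer be on the original scale of Yr.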