R - Why is caret train taking up so much memory?
When I train just using glm, everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this... any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search, because there's no grid for glm's, I believe.)
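For concreteness, the contrast is between a plain glm() fit and the equivalent call through caret; a minimal sketch with made-up placeholder data (the real dataset is much larger):

require(caret)
## placeholder data standing in for the (much larger) real dataset
set.seed(42)
dat <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
## plain glm(): nowhere near exhausting memory
fit_glm <- glm(y ~ x1 + x2, data = dat)
## the same model through caret, which resamples and stores extra copies
fit_trn <- train(y ~ x1 + x2, data = dat, method = "glm")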
The problem is two fold. i) train() doesn't just fit a model via glm(), it bootstraps that model, so even with the defaults, train() will do 25 bootstrap samples, which, coupled with problem ii), is the (or a) source of the problem, and ii) train() simply calls the glm() function with its defaults. Those defaults are to store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame style. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also has a copy of the actual data.
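To make point i) concrete, the default resampling scheme corresponds to 25 bootstrap samples via trainControl(); as a hedged sketch (argument behaviour may differ between caret versions), fewer resamples means fewer temporary copies during fitting:

require(caret)
## the defaults are equivalent to trainControl(method = "boot", number = 25);
## reducing the number of resamples reduces the number of temporary copies
ctrl <- trainControl(method = "boot", number = 5)
## m0 <- train(lot1 ~ log(u), data = clotting, family = Gamma,
##             method = "glm", trControl = ctrl)   # clotting is defined below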
At this point, simply running glm() via train() will be producing 25 copies of the fully expanded model.frame and the original data, all of which need to be held in memory during the resampling process - whether these are held concurrently or consecutively is not immediately clear from a quick look at the code, as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.
Once resampling is finished, the returned object will contain 2 copies of the raw data and a full copy of the model.frame. If your training data is large relative to the available RAM or contains many factors to be expanded in the model.frame, then you could easily be using huge amounts of memory just carrying copies of the data around.
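As a rough illustration of why factor-heavy data blows up: the model frame is a full copy of the variables used, and the design matrix expands each factor into dummy columns. The data here are made up purely for this sketch:

set.seed(1)
df <- data.frame(y = rnorm(1000),
                 f = factor(sample(letters, 1000, replace = TRUE)))
object.size(df)                              # the raw data
object.size(model.frame(y ~ f, data = df))   # the copy a fitted glm() would carry
object.size(model.matrix(y ~ f, data = df))  # the dummy-expanded design matrix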
If you add model = FALSE to your train call, it might make a difference. Here is a small example using the clotting data in ?glm:
clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)
Then:
> m1 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = TRUE)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = FALSE)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data=clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data=clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes
So there is a size difference in the returned object, and memory use during training will be lower. How much lower will depend on whether the internals of train() keep all the copies of the model.frame in memory during the resampling process.
The object returned by train() is also considerably larger than that returned by glm() - as mentioned by @DWin in the comments, below.
To take this further, either study the code more closely, or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.
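Regarding the trainControl() part of the question: recent versions of caret expose arguments that control what gets stored in the returned object. The sketch below uses argument names from those versions, so check ?trainControl in your installation before relying on them:

ctrl <- trainControl(method = "boot", number = 25,
                     returnData = FALSE,         # don't keep a copy in $trainingData
                     returnResamp = "final",     # keep only the summarised resampling results
                     savePredictions = "final")  # keep hold-out predictions for the chosen model only
## m5 <- train(lot1 ~ log(u), data = clotting, family = Gamma,
##             method = "glm", model = FALSE, trControl = ctrl)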