progress indicator? #81

Open
AlJohri opened this issue Aug 5, 2019 · 1 comment

@AlJohri commented Aug 5, 2019

@owlas @arokem I'm running fci.random_forest_error on a fairly large dataset.

train shape: (3334431, 200)
test shape: (13703350, 200)

(train is smaller after undersampling)

I'm trying to use both the memory_constrained version and the low_memory version (#74).
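
For reference, my call looks roughly like the sketch below, shown on tiny synthetic data so it actually runs (the real data is far too large to include); the memory_constrained/memory_limit keywords are as I understand them from the docs, so treat the exact signature as my assumption:

```python
import forestci as fci
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in; the real shapes are train (3334431, 200) and test (13703350, 200).
X_train, y_train = make_classification(n_samples=2000, n_features=200, random_state=0)
X_test, _ = make_classification(n_samples=5000, n_features=200, random_state=1)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# memory_constrained run: memory_limit caps the size (in MB) of the intermediate matrices.
errors = fci.random_forest_error(
    rf, X_train, X_test,
    memory_constrained=True,
    memory_limit=100000,  # 100 GB, matching the m5.24xlarge run described below
)
```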

I ran the memory_constrained version on an m5.24xlarge EC2 instance with 384 GiB of memory and 96 vCPUs, giving it a memory_limit of 100000 MB (100 GB). This only utilized about half of the memory on the instance, and it ran for over 48 hours until I finally terminated the instance.

I'm currently running the low_memory option on an m5.12xlarge (192 GiB memory, 48 vCPUs), which has been going for 15 hours straight and hasn't finished yet. Using top, I can see that all of the CPUs are being utilized at 100%.

I have a few questions:

  1. How can I estimate the ideal value for memory_limit? I understand from the docs that it's the maximum size of the intermediate matrices, but it wasn't clear to me how many intermediate matrices are created at a time. Is the computation sequential, i.e. should I just give it all of the available RAM?

  2. Is the memory_constrained version faster than the low_memory option, given a large enough memory_limit? It wasn't clear to me which of the two I should expect to complete first.

  3. Is there any way to show a progress indicator (even if I have to hack in a print for now)? I'd like to know how close I am to completion on jobs that seem to take multiple days to run. (See the first sketch after this list for the kind of hack I have in mind.)

  4. Overall, I'm looking to precompute as much as possible and then run this model on live predictions one at a time. Looking at the code, I believe this should be possible (see the second sketch after this list); does this sound doable to you?
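
For question 3, since random_forest_error doesn't appear to expose any progress hook, the hack I have in mind is to chunk X_test myself and call it once per chunk, printing progress in between. The variance estimate for each test point should be independent of the other test points, but I'm not sure whether the calibration step operates across the whole test set, so I pass calibrate=False here to sidestep that. A minimal sketch, reusing rf, X_train and X_test from the snippet above:

```python
import numpy as np
import forestci as fci

def random_forest_error_chunked(forest, X_train, X_test, chunk_size=100_000, **kwargs):
    """Call fci.random_forest_error one chunk of X_test at a time so progress can
    be printed. With calibrate=True the calibration would be done per chunk, which
    may change the numbers slightly, hence calibrate=False is the safer choice."""
    n_chunks = int(np.ceil(len(X_test) / chunk_size))
    errors = []
    for i in range(n_chunks):
        chunk = X_test[i * chunk_size:(i + 1) * chunk_size]
        errors.append(fci.random_forest_error(forest, X_train, chunk, **kwargs))
        print(f"finished chunk {i + 1}/{n_chunks}", flush=True)
    return np.concatenate(errors)

errors = random_forest_error_chunked(
    rf, X_train, X_test,
    chunk_size=1000,
    calibrate=False,
    memory_constrained=True,
    memory_limit=100000,
)
```

For question 4, the main piece I can see to precompute is the in-bag count matrix: fci.calc_inbag builds it from the fitted forest, and random_forest_error accepts it through its inbag argument, so at prediction time only the per-tree work on the new sample should remain. Again a sketch based on my reading of the code, not something I've verified:

```python
# Precompute the in-bag counts once, offline, and persist them alongside the model.
inbag = fci.calc_inbag(X_train.shape[0], rf)

# At serving time: variance estimate for a single live prediction.
# calibrate=False because calibrating over a one-row "test set" isn't meaningful.
x_live = X_test[:1]
err_live = fci.random_forest_error(rf, X_train, x_live, inbag=inbag, calibrate=False)
```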

Any other advice you can offer would be most helpful.

Thanks!

EDIT: I'm trying the memory_constrained option again with a memory_limit of 300000 MB (300 GB). That limit does indeed appear to be applied sequentially: the memory slowly crawls up to the max, all the cores kick in for a few minutes, and then the memory comes back down again.

Notably, while the memory is slowly filling up, only one core is being used. Only once the memory has filled all the way do most of the cores kick into action, for about 2 minutes. It then takes about 1 minute for the memory to come back down, again with a single core in use. The cycle then starts over.

Is there perhaps some optimization that would allow all the cores to be used more efficiently? It seems the majority of the time is currently spent filling memory up to the limit, and only then does any real computation occur.

As mentioned before, the low_memory option, on the other hand, is constantly using all cores at 100%.

@el-hult (Contributor) commented Jun 20, 2024

Interesting work. Did you manage to get your estimates?

In the original paper by Wager & Athey, they mention that the number of trees needed to get reliable variance estimates is Theta(n); see page 14, right under equation 16, in the discussion of bias correction (link). So for a large dataset, you likely need a very large random forest. Did you gather any experience on how large a forest you needed for this dataset? I work with a dataset of ca. 100k training and test samples, and it seems a forest of size 1000 is far too small.
