progress indicator? #81

Open
AlJohri opened this issue Aug 5, 2019 · 1 comment

@AlJohri commented Aug 5, 2019

@owlas @arokem I'm running fci.random_forest_error on a fairly large dataset.

train shape: (3334431, 200)
test shape: (13703350, 200)

(train is smaller after undersampling)

I'm trying to use both the memory_constrained version and the low_memory version (#74).
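
For reference, my call looks roughly like the sketch below, shown on tiny synthetic data so it actually runs (the real data is far too large to include); the memory_constrained/memory_limit keywords are as I understand them from the docs, so treat the exact signature as my assumption:

```python
import forestci as fci
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in; the real shapes are train (3334431, 200) and test (13703350, 200).
X_train, y_train = make_classification(n_samples=2000, n_features=200, random_state=0)
X_test, _ = make_classification(n_samples=5000, n_features=200, random_state=1)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# memory_constrained run: memory_limit caps the size (in MB) of the intermediate matrices.
errors = fci.random_forest_error(
    rf, X_train, X_test,
    memory_constrained=True,
    memory_limit=100000,  # 100 GB, matching the m5.24xlarge run described below
)
```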

I ran the memory_constrained version on an m5.24xlarge EC2 instance with 384 GiB of memory and 96 vCPUs, giving it a memory_limit of 100000 MB (100 GB). This only utilized about half of the memory on the instance, and it ran for over 48 hours until I finally terminated the instance.

I'm currently running the low_memory option on an m5.12xlarge (192 GiB memory, 48 vCPUs), which has been going for 15 hours straight and hasn't finished yet. Using top, I can see that all of the CPUs are being utilized at 100%.

I have a few questions:

  1. How can I estimate the ideal value for memory_limit? I understand from the docs that it's the maximum size of the intermediate matrices, but it wasn't clear to me how many intermediate matrices are created at a time. Is the computation sequential, i.e. should I just give it all of the available RAM?

  2. Is the memory_constrained version faster than the low_memory option, given a large enough memory_limit? It wasn't clear to me which of the two I should expect to complete first.

  3. Is there any way to show a progress indicator (even if I have to hack in a print for now)? I'd like to know how close I am to completion on jobs that seem to take multiple days to run. (See the first sketch after this list for the kind of hack I have in mind.)

  4. Overall, I'm looking to precompute as much as possible and then run this model on live predictions one at a time. Looking at the code, I believe this should be possible (see the second sketch after this list); does this sound doable to you?
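
For question 3, since random_forest_error doesn't appear to expose any progress hook, the hack I have in mind is to chunk X_test myself and call it once per chunk, printing progress in between. The variance estimate for each test point should be independent of the other test points, but I'm not sure whether the calibration step operates across the whole test set, so I pass calibrate=False here to sidestep that. A minimal sketch, reusing rf, X_train and X_test from the snippet above:

```python
import numpy as np
import forestci as fci

def random_forest_error_chunked(forest, X_train, X_test, chunk_size=100_000, **kwargs):
    """Call fci.random_forest_error one chunk of X_test at a time so progress can
    be printed. With calibrate=True the calibration would be done per chunk, which
    may change the numbers slightly, hence calibrate=False is the safer choice."""
    n_chunks = int(np.ceil(len(X_test) / chunk_size))
    errors = []
    for i in range(n_chunks):
        chunk = X_test[i * chunk_size:(i + 1) * chunk_size]
        errors.append(fci.random_forest_error(forest, X_train, chunk, **kwargs))
        print(f"finished chunk {i + 1}/{n_chunks}", flush=True)
    return np.concatenate(errors)

errors = random_forest_error_chunked(
    rf, X_train, X_test,
    chunk_size=1000,
    calibrate=False,
    memory_constrained=True,
    memory_limit=100000,
)
```

For question 4, the main piece I can see to precompute is the in-bag count matrix: fci.calc_inbag builds it from the fitted forest, and random_forest_error accepts it through its inbag argument, so at prediction time only the per-tree work on the new sample should remain. Again a sketch based on my reading of the code, not something I've verified:

```python
# Precompute the in-bag counts once, offline, and persist them alongside the model.
inbag = fci.calc_inbag(X_train.shape[0], rf)

# At serving time: variance estimate for a single live prediction.
# calibrate=False because calibrating over a one-row "test set" isn't meaningful.
x_live = X_test[:1]
err_live = fci.random_forest_error(rf, X_train, x_live, inbag=inbag, calibrate=False)
```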

Any other advice you can offer would be most helpful.

Thanks!

EDIT: I'm trying the memory_constrained option again with a memory_limit of 300000 MB (300 GB). That limit does indeed appear to be applied sequentially: the memory slowly crawls up to the max, all the cores kick in for a few minutes, and then the memory comes back down again.

Notably, while the memory is slowly filling up, only one core is being used. Only once the memory has filled all the way do most of the cores kick into action, for about 2 minutes. It then takes about 1 minute for the memory to come back down, again with a single core in use. The cycle then starts over.

Is there perhaps some optimization that would allow all the cores to be used more efficiently? It seems the majority of the time is currently spent filling memory up to the limit, and only then does any real computation occur.

As mentioned before, the low_memory option, on the other hand, is constantly using all cores at 100%.

@el-hult (Contributor) commented Jun 20, 2024

Interesting work. Did you manage to get your estimates?

In the original paper by Wager & Athey, they mention that the number of trees needed to get reliable variance estimates is Theta(n); see page 14, right under equation 16, in the discussion of bias correction (link). So for a large dataset, you likely need a very large random forest. Did you gather any experience on how large a forest you needed for this dataset? I work with a dataset of ca. 100k training and test samples, and it seems a forest of size 1000 is far too small.
