WIP: Improve scantime #212


Open
wants to merge 21 commits into master

Conversation


@rollator rollator commented Aug 11, 2016

Update 1.2

Reworked the distance calculations; they should be seriously faster now. To achieve that I had to add scipy, numpy, and pyproj as dependencies.

I would appreciate it if some folks with bigger datasets than mine (>2k spawn points) could test it.

Update 1.1

Did some refactoring to improve code quality and fixed something which may very well have led to an infinite loop.

Update 1

What do you need to run the PR at this stage?

  1. Set a new variable in the config file: SAMPLES_PER_POINT = 5. Increasing the value increases the computation time needed before we can start, but results in a smaller set of scanning points. However, increasing it beyond 10 won't improve the result significantly.
  2. The area needs to be pre-scanned with the normal version, as you need data about the spawn points.

Will it reduce server requests?

No, not unless you reduce the grid parameters in the config file. You can expect the number of points to scan to drop by about 50%, so you could use 50% fewer workers to get the same results as before, which in turn would lead to fewer server requests.

Still missing:

  • better integration into the worker cycle. Will start working on this part when PR [v0.6.0] coroutines #216 is merged.
  • a setting variable to decide whether we should scan everything normally first or can directly work with the reduced scan set.
  • code quality is seriously lacking, will be fixed soon. fixed
  • scanning efficiency can be improved even further. But as this also requires changes on the worker side, I'll stall until PR [v0.6.0] coroutines #216 is merged.

TL;DR: set SAMPLES_PER_POINT = 5 in config.py and have the area pre-scanned. By using these changes you will have to do 50% fewer scans.

edit: I just realized I was talking bullshit about the efficiency gain. It wasn't 30% more effective for me; it reduced the scanning points to 30% of the original, so the speedup is way higher. Corrected the corresponding passages.


Reduce the scan time by calculating a set of scan points which covers all distinct spawns. This leads to a reduced number of points to scan without missing a single pokemon.

As the title states, it's still work in progress. What's missing at this point:

  • A proper integration into the worker cycle, so that every 24 hours it runs a 2-hour session in which it makes an extensive scan sweep to detect possible changes in spawn points. Right now there are no sweeps at all; it just starts directly with the reduced set. I'm not quite sure how to accomplish this and would love some help here.
  • a setting with which one can inform the program that the area has already been scanned (so that it can start right away with the reduced scan point set)
  • code quality: It's a prototype which was hacked together as fast as possible, so there will be improvements here xD

lon <= {lon_end}
GROUP BY spawn_id;
'''.format(
lat_end=config.MAP_START[0],

You might want to change this to min(start[0], end[0]) and so on, in case the rectangle is defined the other way around.
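For illustration, a minimal sketch of that normalization (assuming config.MAP_START and config.MAP_END hold the rectangle corners as (lat, lon) pairs; not the PR's actual code):

    lat_start = min(config.MAP_START[0], config.MAP_END[0])
    lat_end   = max(config.MAP_START[0], config.MAP_END[0])
    lon_start = min(config.MAP_START[1], config.MAP_END[1])
    lon_end   = max(config.MAP_START[1], config.MAP_END[1])

This way the query's bounds hold no matter which corner the user lists first.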

Author

Thanks, I had forgotten about that. I had to switch start and end as it was the other way round for me, and I wanted to generalize it for others, but it looks like I didn't :s


Getting an error in db.py:

import db
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "db.py", line 449
    return results
SyntaxError: 'return' outside function

Author

That's weird. Which Python version are you using?


2.7.6
Don't have any errors on my backup server that runs the basics without this pull added.

Author

And which file did you try to execute? Could you attach the full traceback?


Not 100% sure how to add a full traceback, still learning a lot.

I'm getting that error when I go to import the db on a clean scanner server setup, using these commands:

python -i
import db

Author

I tried my best but I can't reproduce this error. May I ask how you applied this pull request?

@SapitoSucio

Would this mean a reduction of requests sent to Niantic servers?

@rollator
Author

That depends on the user. If you don't adjust the grid parameters in the config file it will still make more or less the same number of requests, but each point will be scanned more often.

As there are fewer points to scan (500 instead of 1500 for me), you could use fewer accounts to scan the same area, or keep the number of accounts and scan a larger area. Or don't change anything at all and have a higher refresh rate :)

Right now I'm working with only 2/3 of the workers I used without these changes, but I've also reduced the workload for every worker.

@crhbetz

crhbetz commented Aug 12, 2016

Hello @rollator,
just to be clear: as this is right now, I need a previously filled database, and your change will draw the spawn points that need to be scanned from there, right?
Also, what is the definition of a "sample" and a "point", as in SAMPLES_PER_POINT?

@rollator
Author

rollator commented Aug 12, 2016

Oh yes, sorry, I should have mentioned that:

  • As of right now you need a pre-filled database for the area you want to scan. (This will certainly be changed in the future.)
  • SAMPLES_PER_POINT is a value used in the calculation of the set of scanning points. As the exact computation would be infeasible by all means, the minimal needed set can only be approximated. First, all distinct spawn points are pulled from the database; these are the "points". As a second step, we visit each of these points, generate a specific number of circles (SAMPLES_PER_POINT) around the point such that the point lies inside each generated circle, and add all of them to a collection of circles. In a third step we iterate over the circles and choose those which cover the most points, until all points are covered. As the generation of circles is random, more circles lead to a better set to choose from; however, it's also computationally expensive to calculate the distances from points to circle centers. (A toy sketch of the whole scheme follows below.)
    For example, if we have 2000 spawn points and SAMPLES_PER_POINT = 5, we'll generate 5 * 2000 = 10,000 circles. Then we need to check the distances from 10,000 circles to 2000 points, which is 10,000 * 2000 = 20,000,000 pairs to calculate (the real number is actually a bit lower thanks to fancy data structures called interval trees). So increasing SAMPLES_PER_POINT leads to fewer scan points, but at some point there won't be a better solution, and it drastically increases the computation time.
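To make the three steps concrete, here is a toy sketch (illustrative names only, with flat-plane distances instead of the PR's geodesic and interval-tree machinery):

    import math
    import random

    SCAN_RADIUS = 70.0  # metres; a scan sees every spawn within this radius

    def sample_circles(points, samples_per_point):
        # For each spawn point, draw candidate circle centres at most
        # SCAN_RADIUS away, so each candidate circle still covers its point.
        circles = []
        for x, y in points:
            for _ in range(samples_per_point):
                r = SCAN_RADIUS * math.sqrt(random.random())  # uniform in the disc
                theta = random.uniform(0, 2 * math.pi)
                circles.append((x + r * math.cos(theta), y + r * math.sin(theta)))
        return circles

    def covers(circle, point):
        return math.hypot(circle[0] - point[0], circle[1] - point[1]) <= SCAN_RADIUS

    def greedy_cover(points, circles):
        # Step three: repeatedly pick the circle that covers the most
        # still-uncovered points until every point is covered.
        uncovered = set(points)
        chosen = []
        while uncovered:
            best = max(circles, key=lambda c: sum(covers(c, p) for p in uncovered))
            chosen.append(best)
            uncovered = {p for p in uncovered if not covers(best, p)}
        return chosen

This is the naive all-pairs version; the interval trees mentioned above exist precisely to cut down that distance-checking step.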

@vDrag0n

vDrag0n commented Aug 13, 2016

With my data set, I'm getting:

Traceback (most recent call last):
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\worker.py", line 378, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\worker.py", line 308, in spawn_workers
    points = utils.calculate_minimal_pointset(spawn_points)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\utils.py", line 210, in calculate_minimal_pointset
    foo = map_points_to_circles(interval_of_circles, interval_of_points)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\utils.py", line 106, in map_points_to_circles
    points = IntervalTree(map(lambda point, circles=circles: map_point(point,circles),points))
  File "C:\Python27\lib\site-packages\intervaltree\intervaltree.py", line 254, in __init__
    self.top_node = Node.from_intervals(self.all_intervals)
  File "C:\Python27\lib\site-packages\intervaltree\node.py", line 64, in from_intervals
    node = node.init_from_sorted(sorted(intervals))
  File "C:\Python27\lib\site-packages\intervaltree\interval.py", line 185, in __lt__
    return self.__cmp__(other) < 0
  File "C:\Python27\lib\site-packages\intervaltree\interval.py", line 170, in __cmp__
    return -1 if self.data < other.data else 1
RuntimeError: maximum recursion depth exceeded in cmp

@rollator
Author

rollator commented Aug 13, 2016

@vDrag0n I just added a small commit which could fix it for you, but I'm not really sure. Could you try it with the new version and tell me whether it works now, and what your number of spawn points was? (It will be printed in the console before the most resource-intensive part, so you'll have plenty of time to take note :)

@Aiyubi

Aiyubi commented Aug 13, 2016

  • you use variable names like "foo", "meh" or "anything", which should really be cleaned up
  • in utils, around line 220, you start a while True loop with a single break. Easy fix: "while condition" (see the sketch below this list).
  • you sometimes import something like math inside a function without using it
  • you have a function with something like 100 lines of code. Maybe break that up
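For the second bullet, a tiny made-up illustration of the suggested fix (not the PR's actual loop):

    items = list(range(10))

    # before: the loop header hides the real condition
    while True:
        if not items:
            break
        items.pop()

    # after: the condition is the loop header
    while items:
        items.pop()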

lat <= {lat_end} AND
lon >= {lon_start} AND
lon <= {lon_end}
GROUP BY spawn_id;

We don't know if spawn_id is unique. Better to GROUP BY lat, lon?

Author

Actually spawn_id is unique, but I can change it anyway, as it shouldn't really make a difference :)

@Aiyubi Aiyubi Aug 13, 2016

267k records in one db, 562k in the other; 505 duplicate spawn ids.

Source: https://www.reddit.com/r/pokemongodev/comments/4vazmk/encounter_id_is_not_unique/
The title says encounter_id, but it really is spawn_id if you read it.

Author
@rollator rollator Aug 13, 2016

Hm, I'm not really convinced he's talking about the id of spawn points. He only mentions "spawn id" once, and from the context it seems like he's talking about the encounter_id field.
In our db, spawn_id is the spawn point id, not the encounter_id, right?


Right, we don't save the encounter id at all.


We do save encounter_id on the sightings table.

@rollator
Author

you use variable names like "foo", "meh" or "anything" which should really be cleaned up
in utils, around line 220, you start a while True loop with a single break. Easy fix: "while condition".
you sometimes import something like math inside a function without using it
you have a function with something like 100 lines of code. Maybe break that up

Thanks for the styling tips, but as I already stated in the very first post, I'm very much aware that code quality is seriously lacking at this point. It's a prototype and more of a POC than code I'd actually want to be merged. As the title suggests, this PR is still very much work in progress and nowhere near production quality.

@vDrag0n

vDrag0n commented Aug 13, 2016

Using your spawn point retrieval SQL, I'm pulling 26k records from a 1.2 million row database.

I've tried changing the recursion limit of Python, and also just setting SAMPLES_PER_POINT to 1.

edit:
Tried the small fix. Takes about 20 mins to calculate the pointset, but it looks like it works now.

@rollator
Author

That did the trick? Great, thank you for testing :D
Would you mind sharing some data about the scanning efficiency gain, for example how many points you had to scan before versus now? It would be really interesting to see how much (if at all) one can expect to gain on average, as for now I only have my own numbers (~500 instead of 1700).

The long runtime isn't really unexpected, as the calculations are computationally intensive, and calculating distances on Earth takes quite some time compared to a flat plane :/
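As a rough illustration of that cost (a sketch, not the PR's code): pyproj, one of the new dependencies, solves the inverse geodesic problem on the WGS84 ellipsoid, which is far more work per pair of points than a flat-plane hypot:

    from pyproj import Geod

    geod = Geod(ellps='WGS84')
    # returns forward azimuth, back azimuth, and the ellipsoidal distance in metres
    _, _, dist = geod.inv(8.5417, 47.3769, 8.5500, 47.3800)  # lon1, lat1, lon2, lat2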

@vDrag0n

vDrag0n commented Aug 13, 2016

It all depends on how dense your spawn points are and where you're scanning. Since I'm scanning a large area with parks, schools, and inlets, I'm saving about 50%.

For the speed problem, it's only using one core. I'm sure it could be quicker if you just multithreaded it.
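(A side note on that: CPython's GIL keeps CPU-bound pure-Python threads on a single core, so a parallel version would more likely use multiprocessing. A minimal sketch with made-up data, not the PR's code:)

    from multiprocessing import Pool

    POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]    # stand-in spawn points
    CIRCLES = [(0.5, 0.5), (1.5, 1.5)]               # stand-in circle centres

    def distances_for_circle(circle):
        # stand-in for the real per-circle work: distance from one
        # circle centre to every spawn point
        cx, cy = circle
        return [((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 for px, py in POINTS]

    if __name__ == '__main__':
        with Pool() as pool:  # one worker process per core, sidestepping the GIL
            results = pool.map(distances_for_circle, CIRCLES)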

@rollator
Author

It all depends on how dense your spawn points are and where you're scanning.

Exactly, and that's why I'm asking: so we can have a rough estimate which doesn't depend on just one scan location.

As for the speed problem, that's definitely something I'll be looking into.

@Aiyubi

Aiyubi commented Aug 13, 2016

tested this:
circles = IntervalTree(map(lambda interval_circle, points=points: map_circle(interval_circle, points), circles))
this line has been running for 4h and still hasn't finished :/
btw: 16047 spawnpoints

@rollator
Author

what's your SAMPLES_PER_POINT value?

@rollator
Author

@gunawanputra right now you won't get any information about the location of a pokemon outside the 70m radius, so you'd still need to scan all the points within the cell even when you know that there are pokemon there.

@Aiyubi

Aiyubi commented Aug 13, 2016

@rollator update: turns out that, with the way Python performs the map, it is ~1h including the tree creation. Will need to do some more debugging to find out where it spends all the other hours.
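(For tracking down where those hours go, the standard-library profiler is usually enough; an illustrative invocation, not something this PR ships:

    python -m cProfile -s cumtime worker.py

This sorts functions by cumulative time, so the hot spot shows up at the top.)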

@rollator
Author

@Aiyubi I would guess that the 3 hours are spent in that huge horrible while True: loop :/
Right now I'm refactoring the code a bit; while tinkering around to improve things I'd just had enough of working with this horrible code.

@Aiyubi

Aiyubi commented Aug 14, 2016

Sooo it finished with samples = 1:
16047 spawnpoints
4482 optimized scanpoints
7884 non-optimized scanpoints

@rollator
Author

@Aiyubi could you test it again, please? It should run MUCH faster now, which could enable you to increase the sample size and decrease the number of optimized scanpoints.

@theresthatguy

I have a db with over 300k entries. Will be trying this soon.

@rollator
Author

@theresthatguy I wasn't talking about db entries, I was talking about spawnpoints. My db contains ~350k entries, so we're pretty similar in that sense; I guess it should take no more than a few seconds for you, and you can even set SAMPLES_PER_POINT to a higher value (~40 was fine for me) to achieve better results :)

@Aiyubi

Aiyubi commented Aug 14, 2016

Number of spawnpoints: 16047
Generating samples...
Starting distance calculations
Distance calculations finished
Calculating best scan points
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scanpoints.py", line 345, in calculate_minimal_pointset
    circle_indexes = choose_best_scan_points(inside_matrix)
  File "scanpoints.py", line 276, in choose_best_scan_points
    mult_matrix = scipy.sparse.diags(identity_array[0])
TypeError: diags() takes at least 2 arguments (1 given)

@vDrag0n

vDrag0n commented Aug 15, 2016

I've tested it and it works much quicker after the refactor. Only the checking of spawn points takes a long time.

@rollator
Author

rollator commented Aug 15, 2016

@Aiyubi what Python & scipy version are you using? This looks like an older scipy version.

@vDrag0n by checking of spawn points you mean the "safety check"? That part will be removed asap; however, I first wanted confirmation from other testers that the part where it prints 'lonely circle found' never gets executed.
I haven't removed it because, although it shouldn't ever execute, I wanted to be really sure.
If you want, you can change line 359 in scanpoints.py from scan_points = _check_scan_point_set_for_safety(points, scan_points) to #scan_points = _check_scan_point_set_for_safety(points, scan_points) and it won't do this slow check. How long did it take for you to run it now?

@vDrag0n

vDrag0n commented Aug 15, 2016

About 5 mins with 10 samples, before the safety check.
It's hard to check whether it gets printed or not. There's no user input required, nor logging if there is one, since the worker goes straight to scanning. But I did purposely add a pause after completing the calculations, and I did not see any lonely circles with my data set.

@rollator
Author

You're right. Added a commit to log the results to minimal_scan_calculations.log.

@Aiyubi

Aiyubi commented Aug 15, 2016

@rollator you were right. Did pip for the wrong Python version and had an old scipy.

  • you are missing pyproj in the requirements
  • I realized that this does not take gyms into account. Is it unlikely enough that we can assume we don't need this? I mean, for my ~100 km² that would be just 449 extra points (with an obviously bigger range)
  • seems to be working pretty well

@rollator
Author

Thanks for the feedback. About the gyms: I completely forgot about them, will add them soon.

@Aiyubi

Aiyubi commented Aug 16, 2016

So I added spawnpoints to my map (#241). What I noticed is that there are still some circles that could be optimized away.

See this: [screenshot of overlapping scan circles]

The middle one has no unique points and could be optimized away.

I wonder if the Espresso algorithm could be used to minimize this (since Quine–McCluskey has exponential runtime).
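Independent of Espresso, a cheap post-pass along those lines would be to drop any chosen circle whose covered points are all covered by other chosen circles. A sketch with hypothetical names (not the PR's code):

    def prune_redundant(chosen, coverage):
        # chosen: list of circles; coverage: dict mapping each circle
        # to the set of spawn points it covers
        kept = list(chosen)
        for circle in list(kept):
            others = set().union(*(coverage[c] for c in kept if c is not circle))
            if coverage[circle] <= others:  # no unique points -> drop it
                kept.remove(circle)
        return kept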

@crhbetz

crhbetz commented Aug 17, 2016

Do we need to consider gyms, though? It seems to me that workers find gyms in quite a huge radius as it is, so for a gym to be missed there would have to be a huge void of spawn points surrounding it. Is that ever the case? Aren't gyms "defined" to be places where "a lot is going on"?

@rollator
Author

It wouldn't be much of a hassle to add them; it's just a matter of adding some columns to the matrix.

@Aiyubi I can't really tell, as I don't know this algorithm and can't find adequate resources on it. Perhaps if you could help me out there I certainly could try it, but I guess that would have to wait ~2 weeks, as I just don't have the time right now :/

@nhumber

nhumber commented Aug 19, 2016

I decided to give this a go today, and it's a no go. I tried with a database from both 0.5.4 and one from 0.5.2 and I get different errors with both. Any ideas on something I can try?

Number of spawnpoints: 0
Generating samples...
Starting distance calculations
Traceback (most recent call last):
  File "worker.py", line 402, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "worker.py", line 332, in spawn_workers
    points = scanpoints.calculate_minimal_pointset(spawn_points)
  File "/root/maps/pokedev/scanpoints.py", line 324, in calculate_minimal_pointset
    tree_of_points = KDTree(euclid_points)
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/kdtree.py", line 235, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 1 value to unpack

@rollator
Author

@nhumber try it with python 3

@nhumber

nhumber commented Aug 20, 2016

I'm stuck on 2.7; I've tried to get my VPS up to 3.5, but I seem to mess it up every time and have to revert.
I saw a mention of using pyenv, but I couldn't understand the guide I found. Do you know any simple way to get Python 3.5 on Ubuntu 14.04 without messing everything up? Thanks, and sorry for the newbieness.

@fourbytes

fourbytes commented Aug 20, 2016

With 5763 spawn points, the safety check seemed to take quite a long time, but "A sad lonely circle has been found :(" was not printed, so if nobody else is getting it, I think it'd be safe to remove. Awesomely faster now, though!

@rollator
Author

@nhumber virtualenvwrapper is pretty easy to set up (https://virtualenvwrapper.readthedocs.io/en/latest/)
@fourbytes Yes, that will be removed in the next version

@nhumber

nhumber commented Aug 22, 2016

Thank you for the suggestion; I searched around, found this guide, and followed it:

http://blog.dwaynecrooks.com/post/136668432442/installing-python-351-from-source-on-ubuntu

I got up and running and away from Python 2.7 prior to the big coroutines update!! If a newb like me can do it, anyone can ;-)

edit ** Looks like I spoke too soon. I have Python 3.5 working in an env, but now I'm getting this error:

(env) user@MAIN:~/maptest/pm$ python worker.py
Number of spawnpoints: 0
Generating samples...
Starting distance calculations
Traceback (most recent call last):
  File "worker.py", line 402, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "worker.py", line 332, in spawn_workers
    points = scanpoints.calculate_minimal_pointset(spawn_points)
  File "/home/user/maptest/pm/scanpoints.py", line 324, in calculate_minimal_pointset
    tree_of_points = KDTree(euclid_points)
  File "/home/user/maptest/env/lib/python3.5/site-packages/scipy/spatial/kdtree.py", line 235, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: not enough values to unpack (expected 2, got 1)

I see the mention of the scipy version above, but it seems I'm already up to date.

Any ideas on what I could have missed?

also

I run the project in combination with a Twitter bot to notify on rare/ultra-rare spawns, first for the notifications and secondly for data collection, with the hope that at some point we can get to predicting the spawns rather than scanning for them.

This is slightly OT, but has anyone tried narrowing down spawn points to remove points that never spawn anything rare? I realize this is not within the scope of the project, but do we know if certain spawn points NEVER spawn anything rare/good? If so, can those points just be eliminated?

I was thinking about sorting the db by spawn point location and then again by pokemon ID, and removing any points that never spawn anything but common pokemon (a sketch of what that query might look like is below).

Good idea or bad idea?
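A hedged sketch of what that filter might look like, in the style of the PR's db queries (MySQL-flavoured; the sightings, spawn_id, and pokemon_id names are assumptions about the schema, so adjust to your db):

    RARE_IDS = '(3, 6, 9, 131, 143, 149)'  # whichever ids you consider rare

    query = '''
        SELECT spawn_id
        FROM sightings
        GROUP BY spawn_id
        HAVING SUM(pokemon_id IN {rare_ids}) = 0;
    '''.format(rare_ids=RARE_IDS)
    # every spawn_id returned has never produced a rare pokemon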
