WIP: Improve scantime #212


Open
wants to merge 21 commits into master

Conversation


@rollator rollator commented Aug 11, 2016

Update 1.2

Reworked the distance calculations; they should be seriously faster now. To achieve that I had to add scipy, numpy, and pyproj as dependencies.

I would appreciate it if some folks with bigger datasets than mine (>2k spawn points) could test it.

Update 1.1

Did some refactoring to improve code quality and fixed something which may very well have led to an infinite loop.

Update 1

What do you need to run the PR at this stage?

  1. Set a new variable in the config file: SAMPLES_PER_POINT = 5. Increasing the value increases the computation time needed before we can start, but results in a smaller set of scanning points. However, increasing it beyond 10 won't improve the result significantly.
  2. The area needs to be pre-scanned with the normal version, as you need data about the spawn points.

Will it reduce server requests?

No, not unless you reduce the grid parameters in the config file. You can expect the number of points to scan to drop by about 50%, so you could use 50% fewer workers to get the same results as before, which in turn would lead to fewer server requests.

Still missing:

  • better integration into the worker cycle. Will start working on this part when PR [v0.6.0] coroutines #216 is merged.
  • a setting variable to decide whether we should scan everything normally first or can directly work with the reduced scan set.
  • code quality is seriously lacking, will be fixed soon. fixed
  • scanning efficiency can be improved even further. But as this also requires changes on the worker side, I'll stall until PR [v0.6.0] coroutines #216 is merged.

TL;DR: set SAMPLES_PER_POINT = 5 in config.py and have the area pre-scanned. By using these changes you will have to do 50% fewer scans.

edit: I just realized I was talking bullshit about the efficiency gain. It wasn't 30% more effective for me; it reduced the scanning points to 30% of the original, so the speedup is way higher. Corrected the corresponding passages.


Reduce the scan time by calculating a set of scan points which covers all distinct spawns. This leads to a reduced number of points to scan without missing a single pokemon.

As the title states, it's still work in progress. What's missing at this point:

  • A proper integration into the worker cycle, so that every 24 hours it runs a 2-hour session in which it makes an extensive scan sweep to detect possible changes in spawn points. Right now there are no sweeps at all; it just starts directly with the reduced set. I'm not quite sure how to accomplish this and would love some help here.
  • a setting with which one can inform the program that the area has already been scanned (so that it can start right away with the reduced scan point set)
  • code quality: It's a prototype which was hacked together as fast as possible, so there will be improvements here xD

lon <= {lon_end}
GROUP BY spawn_id;
'''.format(
lat_end=config.MAP_START[0],

You might want to change this to min(start[0], end[0]) and so on, in case the rectangle is defined the other way around.
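For illustration, a minimal sketch of that normalization (assuming config.MAP_START and config.MAP_END hold the rectangle corners as (lat, lon) pairs; not the PR's actual code):

    lat_start = min(config.MAP_START[0], config.MAP_END[0])
    lat_end   = max(config.MAP_START[0], config.MAP_END[0])
    lon_start = min(config.MAP_START[1], config.MAP_END[1])
    lon_end   = max(config.MAP_START[1], config.MAP_END[1])

This way the query's bounds hold no matter which corner the user lists first.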

Author

Thanks, I had forgotten about that. I had to switch start and end as it was the other way round for me, and I wanted to generalize it for others, but it looks like I didn't :s


Getting an error in db.py:

import db
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "db.py", line 449
    return results
SyntaxError: 'return' outside function

Author

That's weird. Which Python version are you using?


2.7.6
Don't have any errors on my backup server that runs the basics without this pull added.

Author

And which file did you try to execute? Could you attach the full traceback?


Not 100% sure how to add a full traceback, still learning a lot.

I'm getting that error when I go to import the db on a clean scanner server setup, using these commands:

python -i
import db

Author

I tried my best but I can't reproduce this error. May I ask how you applied this pull request?

@SapitoSucio

Would this mean a reduction of requests sent to Niantic servers?

@rollator
Author

That depends on the user. If you don't adjust the grid parameters in the config file it will still make more or less the same number of requests, but each point will be scanned more often.

As there are fewer points to scan (500 instead of 1500 for me), you could use fewer accounts to scan the same area, or keep the number of accounts and scan a larger area. Or don't change anything at all and have a higher refresh rate :)

Right now I'm working with only 2/3 of the workers I used without these changes, but I've also reduced the workload for every worker.

@crhbetz

crhbetz commented Aug 12, 2016

Hello @rollator,
just to be clear: as this is right now, I need a previously filled database, and your change will draw the spawn points that need to be scanned from there, right?
Also, what is the definition of a "sample" and a "point", as in SAMPLES_PER_POINT?

@rollator
Author

rollator commented Aug 12, 2016

Oh yes, sorry, I should have mentioned that:

  • As of right now you need a pre-filled database for the area you want to scan. (This will certainly be changed in the future.)
  • SAMPLES_PER_POINT is a value used in the calculation of the set of scanning points. As the exact computation would be infeasible by all means, the minimal needed set can only be approximated. First, all distinct spawn points are pulled from the database; these are the "points". As a second step, we visit each of these points, generate a specific number of circles (SAMPLES_PER_POINT) around the point such that the point lies inside each generated circle, and add all of them to a collection of circles. In a third step we iterate over the circles and choose those which cover the most points, until all points are covered. As the generation of circles is random, more circles lead to a better set to choose from; however, it's also computationally expensive to calculate the distances from points to circle centers. (A toy sketch of the whole scheme follows below.)
    For example, if we have 2000 spawn points and SAMPLES_PER_POINT = 5, we'll generate 5 * 2000 = 10,000 circles. Then we need to check the distances from 10,000 circles to 2000 points, which is 10,000 * 2000 = 20,000,000 pairs to calculate (the real number is actually a bit lower thanks to fancy data structures called interval trees). So increasing SAMPLES_PER_POINT leads to fewer scan points, but at some point there won't be a better solution, and it drastically increases the computation time.
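To make the three steps concrete, here is a toy sketch (illustrative names only, with flat-plane distances instead of the PR's geodesic and interval-tree machinery):

    import math
    import random

    SCAN_RADIUS = 70.0  # metres; a scan sees every spawn within this radius

    def sample_circles(points, samples_per_point):
        # For each spawn point, draw candidate circle centres at most
        # SCAN_RADIUS away, so each candidate circle still covers its point.
        circles = []
        for x, y in points:
            for _ in range(samples_per_point):
                r = SCAN_RADIUS * math.sqrt(random.random())  # uniform in the disc
                theta = random.uniform(0, 2 * math.pi)
                circles.append((x + r * math.cos(theta), y + r * math.sin(theta)))
        return circles

    def covers(circle, point):
        return math.hypot(circle[0] - point[0], circle[1] - point[1]) <= SCAN_RADIUS

    def greedy_cover(points, circles):
        # Step three: repeatedly pick the circle that covers the most
        # still-uncovered points until every point is covered.
        uncovered = set(points)
        chosen = []
        while uncovered:
            best = max(circles, key=lambda c: sum(covers(c, p) for p in uncovered))
            chosen.append(best)
            uncovered = {p for p in uncovered if not covers(best, p)}
        return chosen

This is the naive all-pairs version; the interval trees mentioned above exist precisely to cut down that distance-checking step.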

@vDrag0n

vDrag0n commented Aug 13, 2016

With my data set, I'm getting:

Traceback (most recent call last):
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\worker.py", line 378, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\worker.py", line 308, in spawn_workers
    points = utils.calculate_minimal_pointset(spawn_points)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\utils.py", line 210, in calculate_minimal_pointset
    foo = map_points_to_circles(interval_of_circles, interval_of_points)
  File "C:\Users\vDrag0n\Desktop\pokeminer-master\utils.py", line 106, in map_points_to_circles
    points = IntervalTree(map(lambda point, circles=circles: map_point(point,circles),points))
  File "C:\Python27\lib\site-packages\intervaltree\intervaltree.py", line 254, in __init__
    self.top_node = Node.from_intervals(self.all_intervals)
  File "C:\Python27\lib\site-packages\intervaltree\node.py", line 64, in from_intervals
    node = node.init_from_sorted(sorted(intervals))
  File "C:\Python27\lib\site-packages\intervaltree\interval.py", line 185, in __lt__
    return self.__cmp__(other) < 0
  File "C:\Python27\lib\site-packages\intervaltree\interval.py", line 170, in __cmp__
    return -1 if self.data < other.data else 1
RuntimeError: maximum recursion depth exceeded in cmp

@rollator
Author

rollator commented Aug 13, 2016

@vDrag0n I just added a small commit which could fix it for you, but I'm not really sure. Could you try it with the new version and tell me whether it works now, and what your number of spawn points was? (It will be printed in the console before the most resource-intensive part, so you'll have plenty of time to take note :)

@Aiyubi

Aiyubi commented Aug 13, 2016

  • you use variable names like "foo", "meh" or "anything", which should really be cleaned up
  • in utils, around line 220, you start a while True loop with a single break. Easy fix: "while condition" (see the sketch below this list).
  • you sometimes import something like math inside a function without using it
  • you have a function with something like 100 lines of code. Maybe break that up
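For the second bullet, a tiny made-up illustration of the suggested fix (not the PR's actual loop):

    items = list(range(10))

    # before: the loop header hides the real condition
    while True:
        if not items:
            break
        items.pop()

    # after: the condition is the loop header
    while items:
        items.pop()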

lat <= {lat_end} AND
lon >= {lon_start} AND
lon <= {lon_end}
GROUP BY spawn_id;

We don't know if spawn_id is unique. Better to GROUP BY lat, lon?

Author

Actually spawn_id is unique, but I can change it anyway, as it shouldn't really make a difference :)

@Aiyubi Aiyubi Aug 13, 2016

267k records in one db, 562k in the other; 505 duplicate spawn ids.

Source: https://www.reddit.com/r/pokemongodev/comments/4vazmk/encounter_id_is_not_unique/
The title says encounter_id, but it really is spawn_id if you read it.

Author
@rollator rollator Aug 13, 2016

Hm, I'm not really convinced he's talking about the id of spawn points. He only mentions "spawn id" once, and from the context it seems like he's talking about the encounter_id field.
In our db, spawn_id is the spawn point id, not the encounter_id, right?


Right, we don't save the encounter id at all.


We do save encounter_id on the sightings table.

@rollator
Author

you use variable names like "foo", "meh" or "anything" which should really be cleaned up
in utils, around line 220, you start a while True loop with a single break. Easy fix: "while condition".
you sometimes import something like math inside a function without using it
you have a function with something like 100 lines of code. Maybe break that up

Thanks for the styling tips, but as I already stated in the very first post, I'm very much aware that code quality is seriously lacking at this point. It's a prototype and more of a POC than code I'd actually want to be merged. As the title suggests, this PR is still very much work in progress and nowhere near production quality.

@vDrag0n

vDrag0n commented Aug 13, 2016

Using your spawn point retrieval SQL, I'm pulling 26k records from a 1.2 million row database.

I've tried changing the recursion limit of Python, and also just setting SAMPLES_PER_POINT to 1.

edit:
Tried the small fix. Takes about 20 mins to calculate the pointset, but it looks like it works now.

@rollator
Author

That did the trick? Great, thank you for testing :D
Would you mind sharing some data about the scanning efficiency gain, for example how many points you had to scan before versus now? It would be really interesting to see how much (if at all) one can expect to gain on average, as for now I only have my own numbers (~500 instead of 1700).

The long runtime isn't really unexpected, as the calculations are computationally intensive, and calculating distances on Earth takes quite some time compared to a flat plane :/
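As a rough illustration of that cost (a sketch, not the PR's code): pyproj, one of the new dependencies, solves the inverse geodesic problem on the WGS84 ellipsoid, which is far more work per pair of points than a flat-plane hypot:

    from pyproj import Geod

    geod = Geod(ellps='WGS84')
    # returns forward azimuth, back azimuth, and the ellipsoidal distance in metres
    _, _, dist = geod.inv(8.5417, 47.3769, 8.5500, 47.3800)  # lon1, lat1, lon2, lat2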

@vDrag0n

vDrag0n commented Aug 13, 2016

It all depends on how dense your spawn points are and where you're scanning. Since I'm scanning a large area with parks, schools, and inlets, I'm saving about 50%.

For the speed problem, it's only using one core. I'm sure it could be quicker if you just multithreaded it.
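(A side note on that: CPython's GIL keeps CPU-bound pure-Python threads on a single core, so a parallel version would more likely use multiprocessing. A minimal sketch with made-up data, not the PR's code:)

    from multiprocessing import Pool

    POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]    # stand-in spawn points
    CIRCLES = [(0.5, 0.5), (1.5, 1.5)]               # stand-in circle centres

    def distances_for_circle(circle):
        # stand-in for the real per-circle work: distance from one
        # circle centre to every spawn point
        cx, cy = circle
        return [((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 for px, py in POINTS]

    if __name__ == '__main__':
        with Pool() as pool:  # one worker process per core, sidestepping the GIL
            results = pool.map(distances_for_circle, CIRCLES)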

@rollator
Author

It all depends on how dense your spawn points are and where you're scanning.

Exactly, and that's why I'm asking: so we can have a rough estimate which doesn't depend on just one scan location.

As for the speed problem, that's definitely something I'll be looking into.

@Aiyubi

Aiyubi commented Aug 13, 2016

tested this:
circles = IntervalTree(map(lambda interval_circle, points=points: map_circle(interval_circle, points), circles))
this line has been running for 4h and still hasn't finished :/
btw: 16047 spawnpoints

@rollator
Author

what's your SAMPLES_PER_POINT value?

@rollator
Author

@gunawanputra right now you won't get any information about the location of a pokemon outside the 70m radius, so you'd still need to scan all the points within the cell even when you know that there are pokemon there.

@Aiyubi

Aiyubi commented Aug 13, 2016

@rollator update: turns out that, with the way Python performs the map, it is ~1h including the tree creation. Will need to do some more debugging to find out where it spends all the other hours.
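(For tracking down where those hours go, the standard-library profiler is usually enough; an illustrative invocation, not something this PR ships:

    python -m cProfile -s cumtime worker.py

This sorts functions by cumulative time, so the hot spot shows up at the top.)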

@rollator
Author

@Aiyubi I would guess that the 3 hours are spent in that huge horrible while True: loop :/
Right now I'm refactoring the code a bit; while tinkering around to improve things I'd just had enough of working with this horrible code.

@Aiyubi

Aiyubi commented Aug 14, 2016

Sooo it finished with samples = 1:
16047 spawnpoints
4482 optimized scanpoints
7884 non-optimized scanpoints

@rollator
Author

@Aiyubi could you test it again, please? It should run MUCH faster now, which could enable you to increase the sample size and decrease the number of optimized scanpoints.

@theresthatguy

I have a db with over 300k entries. Will be trying this soon.

@rollator
Author

@theresthatguy I wasn't talking about db entries, I was talking about spawnpoints. My db contains ~350k entries, so we're pretty similar in that sense; I guess it should take no more than a few seconds for you, and you can even set SAMPLES_PER_POINT to a higher value (~40 was fine for me) to achieve better results :)

@Aiyubi

Aiyubi commented Aug 14, 2016

Number of spawnpoints: 16047
Generating samples...
Starting distance calculations
Distance calculations finished
Calculating best scan points
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scanpoints.py", line 345, in calculate_minimal_pointset
    circle_indexes = choose_best_scan_points(inside_matrix)
  File "scanpoints.py", line 276, in choose_best_scan_points
    mult_matrix = scipy.sparse.diags(identity_array[0])
TypeError: diags() takes at least 2 arguments (1 given)

@vDrag0n

vDrag0n commented Aug 15, 2016

I've tested it and it works much quicker after the refactor. Only the checking of spawn points takes a long time.

@rollator
Author

rollator commented Aug 15, 2016

@Aiyubi what Python & scipy version are you using? This looks like an older scipy version.

@vDrag0n by checking of spawn points you mean the "safety check"? That part will be removed asap; however, I first wanted confirmation from other testers that the part where it prints 'lonely circle found' never gets executed.
I haven't removed it because, although it shouldn't ever execute, I wanted to be really sure.
If you want, you can change line 359 in scanpoints.py from scan_points = _check_scan_point_set_for_safety(points, scan_points) to #scan_points = _check_scan_point_set_for_safety(points, scan_points) and it won't do this slow check. How long did it take for you to run it now?

@vDrag0n

vDrag0n commented Aug 15, 2016

About 5 mins with 10 samples, before the safety check.
It's hard to check whether it gets printed or not. There's no user input required, nor logging if there is one, since the worker goes straight to scanning. But I did purposely add a pause after completing the calculations, and I did not see any lonely circles with my data set.

@rollator
Author

You're right. Added a commit to log the results to minimal_scan_calculations.log.

@Aiyubi

Aiyubi commented Aug 15, 2016

@rollator you were right. Did pip for the wrong Python version and had an old scipy.

  • you are missing pyproj in the requirements
  • I realized that this does not take gyms into account. Is it unlikely enough that we can assume we don't need this? I mean, for my ~100 km² that would be just 449 extra points (with an obviously bigger range)
  • seems to be working pretty well

@rollator
Author

Thanks for the feedback. About the gyms: I completely forgot about them, will add them soon.

@Aiyubi

Aiyubi commented Aug 16, 2016

So I added spawnpoints to my map (#241). What I noticed is that there are still some circles that could be optimized away.

See this: [screenshot of overlapping scan circles]

The middle one has no unique points and could be optimized away.

I wonder if the Espresso algorithm could be used to minimize this (since Quine–McCluskey has exponential runtime).
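Independent of Espresso, a cheap post-pass along those lines would be to drop any chosen circle whose covered points are all covered by other chosen circles. A sketch with hypothetical names (not the PR's code):

    def prune_redundant(chosen, coverage):
        # chosen: list of circles; coverage: dict mapping each circle
        # to the set of spawn points it covers
        kept = list(chosen)
        for circle in list(kept):
            others = set().union(*(coverage[c] for c in kept if c is not circle))
            if coverage[circle] <= others:  # no unique points -> drop it
                kept.remove(circle)
        return kept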

@crhbetz

crhbetz commented Aug 17, 2016

Do we need to consider gyms, though? It seems to me that workers find gyms in quite a huge radius as it is, so for a gym to be missed there would have to be a huge void of spawn points surrounding it. Is that ever the case? Aren't gyms "defined" to be places where "a lot is going on"?

@rollator
Author

It wouldn't be much of a hassle to add them; it's just a matter of adding some columns to the matrix.

@Aiyubi I can't really tell, as I don't know this algorithm and can't find adequate resources on it. Perhaps if you could help me out there I certainly could try it, but I guess that would have to wait ~2 weeks, as I just don't have the time right now :/

@nhumber

nhumber commented Aug 19, 2016

I decided to give this a go today, and it's a no go. I tried with a database from both 0.5.4 and one from 0.5.2 and I get different errors with both. Any ideas on something I can try?

Number of spawnpoints: 0
Generating samples...
Starting distance calculations
Traceback (most recent call last):
  File "worker.py", line 402, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "worker.py", line 332, in spawn_workers
    points = scanpoints.calculate_minimal_pointset(spawn_points)
  File "/root/maps/pokedev/scanpoints.py", line 324, in calculate_minimal_pointset
    tree_of_points = KDTree(euclid_points)
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/kdtree.py", line 235, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 1 value to unpack

@rollator
Author

@nhumber try it with python 3

@nhumber

nhumber commented Aug 20, 2016

I'm stuck on 2.7; I've tried to get my VPS up to 3.5, but I seem to mess it up every time and have to revert.
I saw a mention of using pyenv, but I couldn't understand the guide I found. Do you know any simple way to get Python 3.5 on Ubuntu 14.04 without messing everything up? Thanks, and sorry for the newbieness.

@fourbytes

fourbytes commented Aug 20, 2016

With 5763 spawn points, the safety check seemed to take quite a long time, but "A sad lonely circle has been found :(" was not printed, so if nobody else is getting it, I think it'd be safe to remove. Awesomely faster now, though!

@rollator
Author

@nhumber virtualenvwrapper is pretty easy to set up (https://virtualenvwrapper.readthedocs.io/en/latest/)
@fourbytes Yes, that will be removed in the next version

@nhumber

nhumber commented Aug 22, 2016

Thank you for the suggestion; I searched around, found this guide, and followed it:

http://blog.dwaynecrooks.com/post/136668432442/installing-python-351-from-source-on-ubuntu

I got up and running and away from Python 2.7 prior to the big coroutines update!! If a newb like me can do it, anyone can ;-)

edit ** Looks like I spoke too soon. I have Python 3.5 working in an env, but now I'm getting this error:

(env) user@MAIN:~/maptest/pm$ python worker.py
Number of spawnpoints: 0
Generating samples...
Starting distance calculations
Traceback (most recent call last):
  File "worker.py", line 402, in <module>
    spawn_workers(workers, status_bar=args.status_bar)
  File "worker.py", line 332, in spawn_workers
    points = scanpoints.calculate_minimal_pointset(spawn_points)
  File "/home/user/maptest/pm/scanpoints.py", line 324, in calculate_minimal_pointset
    tree_of_points = KDTree(euclid_points)
  File "/home/user/maptest/env/lib/python3.5/site-packages/scipy/spatial/kdtree.py", line 235, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: not enough values to unpack (expected 2, got 1)

I see the mention of the scipy version above, but it seems I'm already up to date.

Any ideas on what I could have missed?

also

I run the project in combination with a Twitter bot to notify on rare/ultra-rare spawns, first for the notifications and secondly for data collection, with the hope that at some point we can get to predicting the spawns rather than scanning for them.

This is slightly OT, but has anyone tried narrowing down spawn points to remove points that never spawn anything rare? I realize this is not within the scope of the project, but do we know if certain spawn points NEVER spawn anything rare/good? If so, can those points just be eliminated?

I was thinking about sorting the db by spawn point location and then again by pokemon ID, and removing any points that never spawn anything but common pokemon (a sketch of what that query might look like is below).

Good idea or bad idea?
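A hedged sketch of what that filter might look like, in the style of the PR's db queries (MySQL-flavoured; the sightings, spawn_id, and pokemon_id names are assumptions about the schema, so adjust to your db):

    RARE_IDS = '(3, 6, 9, 131, 143, 149)'  # whichever ids you consider rare

    query = '''
        SELECT spawn_id
        FROM sightings
        GROUP BY spawn_id
        HAVING SUM(pokemon_id IN {rare_ids}) = 0;
    '''.format(rare_ids=RARE_IDS)
    # every spawn_id returned has never produced a rare pokemon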
