Home - General / All posts - This program will never make it - but why I use it anyway! (long post, sorry)
 adamw9,552 post(s) #01-Apr-21 17:12
Thanks! Very interesting!

For 1, extracting vector points from raster: are we talking about Trace? Were the points single-pixel or small clusters of pixels? Trace extracts areas, so if you needed points, there were a lot of unneeded computations that could have been eliminated. We could probably add an option to convert small areas into points; this would have helped the performance, likely significantly.

For 2, ST_SubDivide: we hear you. We are also planning to make it less necessary to run things like ST_SubDivide on large geoms - although if you are willing to deal with merging the results back, it would still make sense to run it.

In general, our focus is on making 9 a much better fit for your "smaller data" scenarios. That's what we are concentrating on currently.

Thanks again, a great write-up.
 artlembo3,156 post(s) #01-Apr-21 20:52
Extracting vector points from raster: as I mentioned, the raster is binary (1/0). So, any pixel with a 1 we'll turn into a point, and we ignore the 0s. Therefore, I wasn't interested in areas, but rather individual pixel centroids. I was attempting to use TileGeomToValues, but the 330 billion pixels were just too much to deal with. I even wrote a script to cycle through the st_subdivide geometries to break it into smaller chunks, but to no avail.
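A minimal sketch of the pixel-to-point idea Art describes (plain Python with made-up data, not the Manifold or PostGIS API): every 1-pixel becomes a point at its cell centroid, and 0-pixels are ignored.

```python
# Sketch: turn each 1-pixel of a binary raster into a point at the
# pixel centroid. The grid origin, cell size, and data are invented
# for illustration; a real 330-billion-pixel raster would stream
# tiles rather than hold the whole array in memory.

def binary_raster_to_points(rows, origin_x=0.0, origin_y=0.0, cell=1.0):
    """Return (x, y) centroids for every pixel equal to 1."""
    points = []
    for j, row in enumerate(rows):
        for i, value in enumerate(row):
            if value == 1:
                # centroid of cell (i, j): offset by half a cell
                points.append((origin_x + (i + 0.5) * cell,
                               origin_y + (j + 0.5) * cell))
    return points

raster = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]
print(binary_raster_to_points(raster))  # two points, one per 1-pixel
```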
 tjhb9,627 post(s) #01-Apr-21 22:14
Art, I'm curious, did you try a workflow that did not convert the raster(s) to centroids?

I don't know the details, but perhaps you could just decompose the monkey polygons into triangles, or into convex parts (a familiar step to both of us), then Join those smaller areas to the original rasters. (Then sum.)

Raster storage is massively efficient, especially in Manifold 9, and the Join dialog is flexible enough to allow raster-vector analysis without translation. Power.

Tim
 artlembo3,156 post(s) #01-Apr-21 22:46
Good advice. And yes, I did attempt to do that. The storage, as you say, is very efficient. I think the problem is when you need to unpack the data to find out which pixel actually falls in an area. It's the classic compression/decompression problem.

I think the Join command is awesome. But having 330 billion pixels really does clog the pipes up a bit :-)

Nonetheless, I hope you are as impressed as I am that the actual overlay got done in 17 minutes. I tried to do the overlay in PostGIS with the subdivided polygons, and it was taking hours and hours, so I just killed the process. 17 minutes, wow.
 tjhb9,627 post(s) #01-Apr-21 23:01 Yes I am just as impressed as you are. For two reasons. (1) You made something impossible into something possible. That's helpful! (2) At 17 minutes, you can afford to experiment or to tweak. At many hours to days, you have to get the thing right once, or you're either stuffed (if it didn't work) or quite grumpy (if it worked but suboptimally).In both cases, a difference not just in degree, but in kind.
 adamw9,552 post(s) #02-Apr-21 14:38
OK, so just to recap:

- We have a raster of 330 billion pixels (that's short scale, right? 330 * 10^9). In your example they are either 1 or 0, but I presume we can expect something like INT8 or INT16 and maybe conditions like BETWEEN 200 AND 300.
- We need to turn each pixel that satisfies the condition (in your example, equal to 1) into a geom.
- There are 61 million such pixels in total (that's from the first post).
- Using TileToValues (TileGeomToValues was likely a typo, right?) is too slow.

Correct?

61 million geoms is manageable, so if the numbers above are correct, a reasonably fast way to do this seems totally within reach.
 artlembo3,156 post(s) #02-Apr-21 15:06
Yes, you have that mostly correct. More specifically:

- The raster: 1,295,779 x 256,004.
- And yes, mine is such a limited use case that I think it is reasonable to expect that BETWEEN clause for other uses.
- TileToValues: I actually used TileGeomToValues, and yes, that was too slow. That gave me the X, Y, and Value. I did not actually try TileToValues. I think TileToValues would not work well on 330b pixels - it seems to want to eat the elephant in one bite. I thought TileGeomToValues would be better because we are narrowing the area that we are searching.

Also, in this case, the data is very sparse - only .0004 of the pixels have a value we are interested in. So, I'm guessing there are tons of tiles that have nothing in them. I'm not sure if that helps any.
 tjhb9,627 post(s) #03-Apr-21 19:21
Small anomaly? TileToValuesX3 in the first (simplest) query.

Also, I'm interested in the combined prefilter

WHERE TileValueMin(tile) <= 202 AND TileValueMax(tile) >= 202

which would not have occurred to me. Can you explain why, even though it is a composite test (scanning each tile twice?), it is so efficient?

And then... how about swapping the functions/conditions to

WHERE TileValueMin(tile) >= 202 AND TileValueMax(tile) <= 202

and omitting the outer WHERE clause, redundant now because 202 will uniquely satisfy the prefilter?

...Well, I think I can partly answer myself.

TileValueMin() <= x and TileValueMax() >= y will both short-circuit, i.e. return true as soon as they find any value satisfying their conditions. They will only scan the full tile if the condition either fails, or succeeds only on the very last value. But TileValueMin() >= x and TileValueMax() <= y will have to scan the full tile (twice) in all cases. So this swap is a very bad idea!

But I'm still interested in why your dual pre-scan is efficient. I wouldn't have thought of this.
 adamw9,552 post(s) #06-Apr-21 13:56
TileToValuesX3 should be TileToValues. (One of the tests was operating on a 3-channel image, taking one channel out using TileChannel; I mixed things up by mistake.)

TileValueMin ... AND TileValueMax ... does scan the pixels of each tile twice (once per function), but this happens in memory, which is fast. What would have ruined it is if the data set, which resides in a mix of memory and disk, was being scanned twice - but it only has to be scanned once.

I couldn't use WHERE TileValueMin(tile) >= 202 ... because in my test the tiles contained a mix of values on both sides of 202, so if a tile contained pixels with values of 150, 202, 250, that test would have thrown it away - while I needed it to stay, because 202 is there.

Yes, anything in AND can short-circuit. Keep in mind that the query engine can re-arrange or re-write conditions, though, if it thinks it can improve things (we mostly improve the blatantly obvious).
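The interval logic behind the prefilter can be shown in a few lines (plain Python, not Manifold's actual TileValueMin/TileValueMax): a tile can possibly contain the value v only when min(tile) <= v <= max(tile), while the swapped comparison only passes tiles where every pixel equals v.

```python
# Toy illustration of the tile prefilter logic: a tile *can* contain
# the value v iff the interval [min(tile), max(tile)] covers v.

def might_contain(tile, v):
    # Correct prefilter: keep any tile whose value range covers v.
    return min(tile) <= v and max(tile) >= v

def swapped_test(tile, v):
    # Swapped comparisons: passes only if *every* pixel equals v,
    # which wrongly discards mixed tiles that do contain v.
    return min(tile) >= v and max(tile) <= v

mixed = [150, 202, 250]           # contains 202 among other values
print(might_contain(mixed, 202))  # True: keep this tile
print(swapped_test(mixed, 202))   # False: would wrongly discard it
```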
 adamw9,552 post(s) #06-Apr-21 14:15 A small addition: scanning the pixels of each tile in memory twice improves the performance of the query because doing this is faster than converting each pixel into a record, then iterating on these records outside of the functions in the query engine. Converting anything to records - even virtual ones - and iterating on them is not free.
 artlembo3,156 post(s) #03-Apr-21 22:57 I’ll give this a try tonight, and report back what I find.
 artlembo3,156 post(s) #04-Apr-21 01:08
Adam, I'm willing to bet that it will go a little bit faster than you think. The reason is, you are computing information about 330 billion pixels, but a majority of the tiles are all zeros. So hopefully, 95% of the image will just be ignored and you'll only be processing the small portion of the globe that has the mangroves in it. I don't know whether those statistics are already calculated for the tiles or not. If not, then yes, you are probably going to have to search through all 330 billion pixels. But if they are precomputed as summary statistics, then the empty tiles can simply be ignored and none of that data needs to be processed.
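Art's point about per-tile summary statistics can be sketched like this (plain Python, invented tiles, not Manifold internals): with a max value kept per tile, all-zero tiles are skipped without touching their pixels.

```python
# Sketch: skip empty tiles of a sparse binary raster using a
# precomputed per-tile maximum. Tile keys and data are made up.

tiles = {
    (0, 0): [0, 0, 0, 0],   # empty tile: skippable
    (0, 1): [0, 1, 0, 1],   # has data: must be scanned
    (1, 0): [0, 0, 0, 0],   # empty tile: skippable
}

# Precomputed once (or maintained by the storage layer alongside
# each tile), so the scan never reads pixels of empty tiles.
tile_max = {key: max(pixels) for key, pixels in tiles.items()}

scanned = [key for key, m in tile_max.items() if m > 0]
print(scanned)  # only tiles that can contain a 1 get scanned
```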
 tjhb9,627 post(s) #04-Apr-21 02:04
Yes. So the tile prefilter you will apply is just

WHERE TileValueMax([tile])

right?
 artlembo3,156 post(s) #04-Apr-21 04:02
Exactly. I will be heading into lab until the afternoon, so I'll let you know how it turns out.
 artlembo3,156 post(s) #05-Apr-21 14:08
So this is interesting - it took 87 minutes. Again, that is likely due to the fact that most of the tiles are empty. So, 87 minutes is certainly acceptable to me. But here is another issue: it extracted 121 million points - double the expected number. I'm now going to see if 9 brought in too many points, or if the person who sent me the points from ArcGIS gave me too few.
 danb1,745 post(s) #05-Apr-21 20:35
I am not sure if this is relevant to your case, but one time when I was putting together a multistep workflow involving large binary mask grids, I found it beneficial to do a pre-scan using (I think) TileCompare to identify all tiles that matched my JSON definition of a blank tile (there are probably better ways to do this). I added an additional column to the image table and set an attribute to mark tiles that I could safely ignore in all further steps of the process, and this made a big difference to the performance of the workflow as a whole.

Landsystems Ltd ... Know your land | www.landsystems.co.nz
 tjhb9,627 post(s) #05-Apr-21 20:51
Nice! Besides TileCompare, you could equally use TileMax or TileMin, I think? That might be faster.
 danb1,745 post(s) #06-Apr-21 00:11
Thanks for the pointer. This does make more sense thinking about it.
 adamw9,552 post(s) #06-Apr-21 14:06
Well, I am not sure how the query with TileToValues can extract duplicate points. It's pretty straightforward. I would guess the extraction performed in ArcGIS was not 1 pixel = 1 point; perhaps it was merging points for small clusters of pixels. That's just a guess though.

Anyway, I can report that we did manage to entirely eliminate the gap between the ideal processing time (based on IO rate) and the actual processing time.

Here's a modified version of the query that you will be able to write in the upcoming build:

--SQL9
SELECT GeomMakePoint(VectorMakeX2(tx*128+x, ty*128+y)) AS [geom]
INTO [geom202]
FROM (
  SELECT x AS tx, y AS ty,
    SPLIT CALL TileToValues(TileMaskRange(tile, 202, 202, TRUE, TRUE), FALSE)
  FROM [large]
);

There are two changes: (1) the new function TileMaskRange takes a tile and a range of values (here, min=202, max=202, so just 202), checks for each pixel whether its value falls into the range, then turns that pixel invisible depending on the result (here, pixels with values falling into the range are kept visible and all others are turned invisible); (2) the new argument to TileToValues skips reporting invisible pixels.

With these changes, the time to process the large image (which used to be 3448 seconds) drops to 1091 seconds, which is pretty much the lower bound expected for processing 50 GB of data (1000 seconds). No threads needed, as this is already bound by IO as it stands.
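The two-step pipeline in the query above can be mimicked in plain Python (an illustrative analogue only, not the SQL9 internals): first mask pixels outside [lo, hi] as invisible, then report only the visible pixels in global pixel coordinates.

```python
# Plain-Python analogue of the TileMaskRange + TileToValues pipeline:
# mark out-of-range pixels invisible (None here), then extract only
# the visible pixels. Tile size and data are invented for the demo.

def mask_range(tile, lo, hi):
    """Return the tile with out-of-range pixels replaced by None."""
    return [[v if lo <= v <= hi else None for v in row] for row in tile]

def to_values(tile, tx, ty, size=128):
    """Return (x, y, value) for visible pixels, in global coords."""
    out = []
    for y, row in enumerate(tile):
        for x, v in enumerate(row):
            if v is not None:  # skip invisible pixels
                out.append((tx * size + x, ty * size + y, v))
    return out

tile = [[201, 202],
        [202, 250]]
masked = mask_range(tile, 202, 202)        # keep only value 202
print(to_values(masked, tx=0, ty=0, size=2))
```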
 artlembo3,156 post(s) #06-Apr-21 15:47
Thanks. This is great; I look forward to trying it out. I ran the overlay with the 121M points, and the overlay worked in 24 minutes (up from 17 minutes with 61M points). So, this is plenty fast.

I decided to rerun the query with the original monkey polygons (as opposed to the ST_SubDivide version). It has now been over an hour, and it is still processing. So, the ST_SubDivide was a big win in this case.
 tjhb9,627 post(s) #06-Apr-21 18:09 GeomToConvex(Par) would probably do just as well as ST_Subdivide? And might even be faster with appropriate use of threads?
 artlembo3,156 post(s) #06-Apr-21 18:32 maybe. I’ll test it out. It’s just nice to be able to define the number of vertices.
 artlembo3,156 post(s) #06-Apr-21 19:20
Still running after 4 hours, so I killed it. Therefore, ST_SubDivide is definitely necessary. I'm going to try Tim's suggestion about convex parts next. We'll see how that works. It creates many small polygons, but it also leaves some really large ones as well. So, if I were a betting man, I'd say the two approaches should be close. Hopefully, I'll have an answer in about a half hour.
 artlembo3,156 post(s) #06-Apr-21 20:18
Doing the overlay with the convex parts took 34 minutes vs. 24 minutes with the ST_SubDivide version. Given that the problem was virtually unsolvable without splitting the polygons, ConvexParts is perfectly acceptable - that is, what's 10 minutes? And it doesn't require you to move data in and out of Postgres to perform the ST_SubDivide. Nonetheless, I still think that for big data analysis, ST_SubDivide and the ability to create a hexagonal grid are important tools to have.
 tjhb9,627 post(s) #06-Apr-21 20:28 Excellent! Well said.If we had the equivalent of ST_Subdivide I would certainly use it as well, especially on contour-derived and landcover data.I'm glad that convex parts is comparatively not too bad though! Good test Art. This whole thread increases community knowledge.
 danb1,745 post(s) #07-Apr-21 00:37
I'd always been laboring under the impression that something mathematically magical happened regarding point-in-polygon operations as soon as convex polygons came into play. Is this not the case? Is simply splitting complex areas into many simpler parts enough to gain the additional performance?
 tjhb9,627 post(s) #07-Apr-21 01:11
I think ST_Subdivide (also) guarantees convexity. All the resulting parts are convex, I think? [No.] So the magic applies here too. [I am wrong. Concavity is allowed in the result of ST_Subdivide.]
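For what it's worth, the "magic" dan asks about is real for convex polygons: with counter-clockwise vertices, a point is inside iff it lies on the left of (or on) every edge, a simple same-side test, whereas concave polygons need the more general ray-casting or winding-number test. A small illustrative sketch (plain Python, not any GIS engine's actual implementation):

```python
# Point-in-convex-polygon via the same-side test: for a polygon with
# counter-clockwise vertices, the point must be left of (or on) every
# edge. This short-circuits on the first failing edge.

def cross(o, a, b):
    # z-component of the cross product (a - o) x (b - o)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def inside_convex(poly_ccw, p):
    n = len(poly_ccw)
    return all(cross(poly_ccw[i], poly_ccw[(i + 1) % n], p) >= 0
               for i in range(n))

square = [(0, 0), (2, 0), (2, 2), (0, 2)]  # counter-clockwise
print(inside_convex(square, (1, 1)))  # True
print(inside_convex(square, (3, 1)))  # False
```

Simply splitting a complex area into many small parts also helps on its own, since each candidate point is tested against far fewer vertices after a bounding-box filter.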
 tjhb9,627 post(s) #07-Apr-21 02:15
It would be interesting to know the maximum number of vertices Art specified for ST_Subdivide (and how sensitive the result was to this).
 tjhb9,627 post(s) #06-Apr-21 21:05
Both parts of

(1) the new function called TileMaskRange takes a tile and a range of values..., checks for each pixel whether its value falls into the range, then turns that pixel invisible depending on the result..., (2) the new argument to TileToValues skips reporting invisible pixels.

are going to be fantastically useful. Every day, brilliant.

As a general question, how much slower would it be to use explicit string arguments (which can be visibly meaningful) rather than Boolean flags (which require reference to the manual)? I would often prefer the former if they were not significantly slower.

(At school - referencing the other great recent thread - we used not punch cards but the kind you could block out with a graphite pencil. Face to face with Turing, a great way to learn, as Ron said. Just not sure we still always need Booleans as args.)
 adamw9,552 post(s) #07-Apr-21 10:03
Agree, booleans aren't great: they are hard to distinguish from each other, and it is hard to remember what TRUE means for each. String parameters would be slow, however. We are considering packing parameters into JSON for big functions like Kriging, but doing this for small functions is too expensive.

There are two other options:

(1) Replace boolean flags with named integer constants. If there are multiple flags, combine them using BITOR. Eg: GeomClip(geom, clip, GeomClipInner, 0). Pros: works right now; adding a new flag is better than adding a new parameter, as the function prototype does not change. Cons: BITOR might feel too technical for SQL; there are tons of named constants; one can mistakenly use the constants for function A in a call to function B.

(2) Extend SQL to allow calls with named parameters. Eg: TileMake(cx = 100, cy = 100, value = 5). Pros: all parameters can be annotated well; no restrictions on function design; works for user-defined functions. Cons: not traditional SQL; changing the name of a parameter breaks callers; parameter names become unlocalizable.

Not sure about either of them. (I mean, 2 is tempting, but seems too brittle with parameter names having to stay the same. No idea how this would work out in practice; it could be bad.)
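Option (1) in practice looks something like this (an illustrative Python analogue with hypothetical names - GeomClip and the flag constants here are made up for the demo, not actual Manifold identifiers):

```python
# Named integer flags combined with bitwise OR (the Python analogue
# of BITOR), instead of positional booleans. All names here are
# hypothetical, invented to illustrate the pattern.

CLIP_INNER  = 1  # 1 << 0
CLIP_KEEP_Z = 2  # 1 << 1

def geom_clip(geom, clip, flags):
    """Hypothetical call whose behavior switches on named flag bits."""
    inner  = bool(flags & CLIP_INNER)
    keep_z = bool(flags & CLIP_KEEP_Z)
    return inner, keep_z

# The call site now reads meaningfully, unlike (geom, clip, TRUE, FALSE):
print(geom_clip("geom", "clip", CLIP_INNER | CLIP_KEEP_Z))
```

The usual downside adam mentions is visible here too: nothing stops a caller from passing CLIP_INNER to some unrelated function that also takes integer flags.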
 tjhb9,627 post(s) #07-Apr-21 13:25 Thanks for explaining that!Not really worth improving on the current situation then.It's not broken after all, just requires thought and memory and, well, checking. None of that is bad.Thanks again.
 Dimitri6,511 post(s) #07-Apr-21 14:04 Agree booleans aren't great,Respectfully disagree. The best thing about booleans is that even if you are wrong, you're only off by a bit. :-)
 danb1,745 post(s) #02-Apr-21 05:30
In general, our focus is on making 9 a much better fit for your "Smaller data" scenarios. That's what we are concentrating on currently

I am really glad to hear this is the case, Adam. I love the power of M9 for carving up large datasets, but I would estimate that I probably have an 80:20 split between smaller tasks and those involving large datasets with millions of geometries. While I appreciate that you are likely having to design for potentially massive datasets, it would be nice if we could gain some of the fluidity of M8 workflows for those more mundane GIS tasks, or those where there is just a bit more data than M8, Arc or Q would normally be comfortable with. For me it is mostly about getting familiar with my data as a workflow develops: rapid filtering and selection, looking for things that don't look quite right, and of course ViewBots, which I seem to mention quite a bit.
 danb1,745 post(s) #02-Apr-21 05:21
You can see my logic for subdividing the data here. After completing the join, I knew how many points were in each subdivided polygon, and simply did a sum grouped by the monkey polygon id.

I use this strategy quite regularly, decomposing complex polygons to convex parts. In fact, the attached screenshot shows an example from just yesterday, where I am trying to build the equivalent of ArcMap's Eliminate function. I need to see what class of polygons is adjacent to the small polygons to be eliminated (purple). If I decompose the larger polygons to convex parts first, the adjacency test is much, much faster.

Attachments: Clipboard-1.png