Manifold 9 shredding through big data
artlembo


3,400 post(s)
#16-Oct-18 18:09

for those interested, I've been working more with the NYC taxi data to perform some big data analytics. The following blog post shows Manifold 9's performance when working with 6GB of data for a spatial overlay. It is certainly holding its own against things like ArcPro, ArcGIS GeoAnalytics Server, and Postgres. I've sent the data to Manifold engineering, and they are looking at further optimizations. Enjoy!

tjhb
10,094 post(s)
#17-Oct-18 00:50

This is great Art. What a spread of timings!

I would also be interested to see timings (on same data and system) for a "naive" Manifold 9 query like

--SQL9

SELECT

    s.[mfd_id], s.[Geom], s.[zone],

    COUNT(o.[fare_amount]) AS [numrides],

    AVG(o.[fare_amount]) AS [avgfare],

    SUM(o.[fare_amount]) AS [sumfare]

INTO [sumtable]

FROM

    (

    SELECT

        s.[mfd_id], s.[Geom], s.[zone],

        o.[fare_amount]

    FROM

        [taxi_zones] AS s

        LEFT JOIN

        [pickup Drawing] AS o

        ON GeomContains(s.[Geom], o.[Geom], 0)

    THREADS 5 -- possibly faster than SystemCpuCount()

    )

GROUP BY s.[mfd_id], s.[Geom], s.[zone]

THREADS 1 -- default for GROUP

;

...and for a postgreSQL query like your second one, but excluding the second join (unnecessary I think).

artlembo


3,400 post(s)
#17-Oct-18 01:27

I’ll give it a shot tomorrow.

I think the second join in Postgres is necessary due to the GROUP BY. Postgres does not have a FIRST clause.

In the past I’ve tried it without the GeomOverlayContained, but it is way slower.

BTW, why do you think the explicit THREADS statement is faster? That would be cool if true.

tjhb
10,094 post(s)
#17-Oct-18 02:18

Thanks Art, great.

For postgreSQL, it looks to me as if the second join would be redundant if you did

GROUP BY taxizones.zone, taxizones.Geom

but I won't try to write the whole postgreSQL syntax, I'll get it wrong.
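
The shape of the query I have in mind is roughly this - an untested sketch only, with the table and column names (taxizones, pickups, geom, fare_amount) assumed to match the blog data:

-- untested PostGIS sketch: a single spatial join, grouping directly on the zone attributes
SELECT
    z.zone,
    COUNT(p.fare_amount) AS numrides,
    AVG(p.fare_amount)   AS avgfare,
    SUM(p.fare_amount)   AS sumfare
FROM taxizones AS z
LEFT JOIN pickups AS p
    ON ST_Contains(z.geom, p.geom)
GROUP BY z.zone, z.geom;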

On THREADS, it's not that an explicit statement is faster, but rather that Manifold can saturate physical cores so well that trying to allocate virtual cores beyond that is soon outweighed by thread latency. Intel's hyperthreading can only barely improve on Manifold's pipeline.

For me the sweet spot with 4 physical and 8 virtual cores, assuming they can all be saturated*, is usually 5 threads, though sometimes 4 or 6. Partitioning between all 8 virtual cores is always slower. Intel can't keep up. Which is fair enough: they designed hyperthreading for software that is not already thread-optimised.

(*That is critical. Where no more than x threads can get simultaneous work from the queue, it's always slower to add more: they just wait. Sometimes one or two threads are optimal for a node, especially if it is transport-bound. This is a reason to separate work into subqueries or functions having different THREADS specifiers--or to avoid separating.)

tjhb
10,094 post(s)
#17-Oct-18 02:46

The THREADS stuff above is all paraphrasing Adam, unless I have it wrong.

I am most interested in the timing comparison between GeomOverlayContained[Par] and simple GeomContains. Here we have a very simple case, points over areas. GeomContains is logically sufficient. Is it faster or slower?

I would suspect faster.

artlembo


3,400 post(s)
#17-Oct-18 02:53

Couple of things:

1. Including the geom as part of the GROUP BY clause was super slow. Adding the second join criteria didn’t add any overhead. Maybe it is an overlooked inefficiency in PostGIS.

2. I prefer the simpler SQL with contains, so I hear you. but it was very slow with that. There is something in those special functions Manifold provides.

3. I can’t wait to try the query with an explicit number of threads. I’ll let you know how it turns out.

artlembo


3,400 post(s)
#17-Oct-18 13:51

I just installed PG 10, which theoretically can perform some parallel processing - we'll see how it goes.

But, I have this huge database in PG 9 so I was scratching my head as to how to move the data. The conventional wisdom is to create a dump of the data from PG9, then reimport the dump back into PG10. Ugh, I really didn't want to be messing around with reading/writing so many gigabytes of data.

I thought about just abandoning the idea because I didn't want to be bothered. Then I thought to myself:

wait, fire up Manifold 9, connect to PG9 and PG10, and then just drag and drop the table

That in-and-of itself is a pretty good use-case to justify the $425 for Manifold 9 alone! Ha, Ha!

artlembo


3,400 post(s)
#17-Oct-18 17:22

I gave it a shot with 5 THREADS: no measurable difference.

I think a big issue with Postgres was the CLUSTER, and having similar objects near one another on the disk.

artlembo


3,400 post(s)
#17-Oct-18 18:53

another improvement that might be pertinent to 9: having the ability to subdivide an area based on a maximum vertex count might make for a nice transform.

Dimitri


7,413 post(s)
#19-Oct-18 13:21

Great blog post! I hope you don't mind that I took the liberty of reposting it to the gis subreddit. There are two things that come to mind reading your blog post:

1. What would it be like with 9 using a larger number of real cores, like 8, 12, 16, or 32 cores? How would that scale? How about a more DBMS-style hardware architecture, where instead of one storage device there are two or four? 12 GB of RAM is way too little. What if you had 32 or 48 GB of RAM?

2. Regarding the ESRI GeoAnalytics Server scenario - it would be great to peek into that black box: how many machines were running, and what were those machines? What was the setup time? By the latter, I am curious what it costs, in time, to distribute the work to multiple machines.

I ask the latter point not because I'm against distributing processing (on the contrary, all for it...) but because part of the problem with distributed processing sometimes is the time you spend copying data to all the distributed machines. I'm curious how ESRI does it in this case.

Could you describe how that GeoAnalytics was set up? Do you need an ESRI license on all of the machines? How many machines were used? What were they, in terms of processors? Did you have to dispatch all or part of the 5 GB data set to each machine? If so, how long did that take?

I'm curious to see more of what is in the GeoAnalytics Server black box, and to learn from that. :-)

artlembo


3,400 post(s)
#19-Oct-18 16:54

thanks for sharing the post.

1. I did try it on a friend's computer with 20 cores. But, I did it with GeomContainingPar, rather than GeomContainedPar. So, I just asked him to rerun things. I'm very interested in what he discovers (if you guys bought me a 20 core PC, I'd be happy to do it myself!).

2. As I indicated in my blog post, a friend actually ran that. He said that GeoAnalytics Server is, in fact, a black box. But, in his case, he did it on his desktop (quad core). Don't remember the RAM. I'll have to check. All the data and software was on his desktop. I asked him because I was actually quite impressed as to how fast it ran. As far as setup, he just loaded it into BDFS. But, I showed him this post, and asked if he would chime in.

I have one other thing I'm excited to try, and that is to leverage the distributed processing with Postgres and Manifold. In this post, I talk about the EXECUTE statement in 9. So, I want to put one month of data on each of, say, 12 computers in my lab - maybe January on one computer, February on another, etc. Then, I can have each computer tuned to give me that 1m 3s result. Then, I can use the EXECUTE statement in 9 to write a single query on a 13th computer to call Postgres on the other 12 computers. Each of those computers should finish their query in 1m 3s, and then using UNION ALL, I'll gather the returned tables of 263 polygons and do another summation and GROUP BY. So, I should be able to process an entire year's worth of the taxi data in about 2 minutes!
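
In plain SQL, the final "reduce" step on the 13th computer would look roughly like this (a sketch only; month_summary_jan etc. are made-up names for the 263-row tables that EXECUTE would pull back from each remote server):

-- sketch of the reduce step over the per-month results gathered from the remote servers
SELECT zone,
       SUM(numrides) AS numrides,
       SUM(sumfare)  AS sumfare,
       SUM(sumfare) / SUM(numrides) AS avgfare  -- recompute the average from sums and counts
FROM (
    SELECT zone, numrides, sumfare FROM month_summary_jan
    UNION ALL
    SELECT zone, numrides, sumfare FROM month_summary_feb
    -- ... one UNION ALL branch per month / computer
) AS parts
GROUP BY zone;

The key point is that you can't just average the 12 per-month averages; the average has to be recomputed from the summed fares and ride counts.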

This is why I'd love to see Manifold act as its own server - so that other Manifold installs can call Manifold on other computers. Basically, this is a cheap hack for the map/reduce functionality that you see in Hadoop, but I think it will be much easier.

Dimitri


7,413 post(s)
#19-Oct-18 19:46

in his case, he did it on his desktop (quad core). Don't remember the RAM. I'll have to check. All the data and software was on his desktop.

It sounds like your friend might not have realized that the desktop machine he was using was a client to a network of more than one machine. As ESRI puts it:

ArcGIS GeoAnalytics Server enables distributed analysis across multiple ArcGIS Server machines. By harnessing the compute power of multiple machines, your analysis can be performed more quickly and with larger quantities of data than could previously be computed on a single machine.

So... he might not have realized that when he ran the task it was running on multiple machines and not just on his local machine.

Setting up GeoAnalytics Server is non-trivial. Again, from ESRI:

To use ArcGIS GeoAnalytics Server, you must have Portal for ArcGIS in your deployment. For details on how to set up your deployment to enable GeoAnalytics Server, see Set up ArcGIS GeoAnalytics Server.

If you follow that link above, you see it ends up being a fairly complex deployment involving more than one machine.

I am not criticizing, as it is good ESRI is doing this, but there is a lot to be learned from the details of how they approach the matter.

Your idea to leverage distributed processing with hybrid use of PostgreSQL and 9 on multiple machines is also cool. The task for all of us in the coming months is to get our heads around how that can best be done in various distributed, parallel processing scenarios and how relatively small tweaks in 9 could make such things a lot easier.

Manifold will, for sure, act as its own server in a variety of ways. First, there are commitments to rounding out cartography and closing some holes in vector editing; then builds can shift gears into rolling out more server functionality while steadily continuing to crank out "smaller" things in rasters and making other connections where filling in a few holes really helps cumulative functionality.

As you've already reckoned, that 9 itself is parallel makes it a lot easier to wire up distributed computing over multiple 9 servers. There really isn't that much difference between having a task running on 9 distributed among multiple cores on multiple CPUs that happen to be in the same box, or having a task running on 9 distributed among multiple cores on multiple CPUs that happen to be in different boxes.

artlembo


3,400 post(s)
#20-Oct-18 13:26

Actually, I thought the same thing because it was so fast. But in actuality, everything was on his computer.

Not only that, when I alerted him to it and sent him the data he just quickly loaded it into BDFS and told me the answer within the hour.

You are correct in that they do reach out to clusters to make things run even faster, and that will make the next stage of my posts even more interesting when I bring in more data than a single computer can handle. I think Geoanalytics server is a very good option for people with massive amounts of data. Hopefully I can find out more about the process for loading and running things. As I get more information, I will share it on my blog.

Dimitri


7,413 post(s)
#21-Oct-18 10:35

Not only that, when I alerted him to it and sent him the data he just quickly loaded it into BDFS and told me the answer within the hour.

I'm really curious to hear the exact, step-by-step process of the above, and how long each step took.

One way to test, for sure, if it is running locally is to a) download the data to a zip drive on some other machine, then b) disconnect the target machine that is running GeoAnalytics Server from all networks, local WAN and Internet, c) from the zip drive load the data into the BDFS (Big Data File Share) that is believed to be running on the local computer, and d) run the job. Still the same numbers?

Get it down to that level of detail and it's amazing how often people say, "well, gosh, I was sure my BDFS and processing was all local but I guess it's not...".

It would also be interesting to see performance monitor readout on how many cores are being used.

RonHendrickson
283 post(s)
#22-Oct-18 03:19

Art, I downloaded the geodatabase you linked to in your blog post, but it only had about 2.790MM points in it, not the 16MM you talked about. I was able to run the query in Manifold 9 in about 27 seconds, which is a lot less than you mentioned for Manifold 9. My machine is fast, but not that fast, I don't think.

Could your posted geodatabase be incorrect, at least as far as how many records are in it?

artlembo


3,400 post(s)
#22-Oct-18 12:50

Could you check the Geodatabase using ArcGIS? I wonder if Manifold isn’t reading everything.

My friend has the 16M points.

RonHendrickson
283 post(s)
#22-Oct-18 13:45

Sorry, I don't have ArcGIS.

RonHendrickson
283 post(s)
#22-Oct-18 14:01

Using the latest cutting edge build, I got the same results. As the .gdb file is loading, the status bar shows only about 1/6 loaded and then it abruptly stops loading, and I only get 2.790MM points. I have 16 GB of RAM, only Manifold 9 is running, and memory usage is 41% according to Task Manager.

artlembo


3,400 post(s)
#22-Oct-18 14:12

so, you are loading up until around 1/6 of the time - and that is 2.79M points.

1/6 * 16M = 2.7M

I'm guessing that 9 is not reading in the full amount of data. Maybe you could send the link to tech and have them look. I'm guessing it is possible that 9 is cutting out, but, maybe there is some other issue.

RonHendrickson
283 post(s)
#22-Oct-18 14:44

will do

Dimitri


7,413 post(s)
#22-Oct-18 15:46

I linked to the gdb.

The pickup table has 14522207 records in it. Copying and pasting that table as a local table in the project fails, I suppose because there are either duplicates or NULLs in the field used as a btree index. That's easy to deal with using a simple query to first build a new, local table:

INSERT INTO [pickup]

([OBJECTID], [Shape], [vendor_id], [pickup_dat], [dropoff_da], [passenger_], [trip_dista], [pickup_lon], [pickup_lat], [rate_code], [store_and_], [dropoff_lo], [dropoff_la], [payment_ty], [fare_amoun], [surcharge], [mta_tax], [tip_amount], [tolls_amou], [total_amou], [do_geometr], [tzone])

SELECT

[OBJECTID], [Shape], [vendor_id], [pickup_dat], [dropoff_da], [passenger_], [trip_dista], [pickup_lon], [pickup_lat], [rate_code], [store_and_], [dropoff_lo], [dropoff_la], [payment_ty], [fare_amoun], [surcharge], [mta_tax], [tip_amount], [tolls_amou], [total_amou], [do_geometr], [tzone]

FROM [taxi.gdb]::[pickup]

You can then add an index to it, say, on the mfd_id field. The zones table and drawing are small and easily copied/pasted from the linked gdb into the local project.

As you can see from the field names in the query above, the gdb in the zipfile does not match the SQL in the blog post: field names are different, so the query has to be re-written slightly. That might indicate version skew in what currently is in the zip file as compared to what is described in the blog.

Next, the query needs to be edited slightly to work with field names as they are in the gdb that is in the zip file. I haven't had time to do that today, but perhaps tomorrow.

RonHendrickson
283 post(s)
#22-Oct-18 17:08

Thanks I will run your query and then do the field name changes and indexing.

jsperr
143 post(s)
#22-Oct-18 18:46

As you can see from the field names in the query above, the gdb in the zipfile does not match the SQL in the blog post: field names are different.

I noticed that also -- the original data set seemed to have an analytical focus on Tropical Storm Sandy. I have had a hard time working with the queries and importing the data as well.

Bringing it in as a File: gdb Data Source resulted in a very slow Map display. I finally used QGIS to convert it to shape files, imported them to Manifold, and saved it in Manifold format. That gives me an acceptable speed in the map.

I'll try and play with it some more tonight when I get home and give more specific notes about the problems I am seeing.

Can you use the import command on a gdb file? I tried on two different machines and both quit after reading 50,000 records. Thanks -- trying to learn SQL and Python.

Manifold 9.0.168.3 (licensed)

Windows 10 pro

Lenovo D20 workstation with dual XEON six core processors running at 2.67 Ghz

96 GB ECC RAM

Nvidia 460 gpu card plus Nvidia Quadro Pro gpu card

Dimitri


7,413 post(s)
#22-Oct-18 19:32

Can you use the import command on a gdb file?

Sure. The issue with GDB is what happens when, while importing the GDB, due to index or other conversions you hit a disallowed value in a record on a field where a specific type of index is used. Btree indexes don't allow either duplicates or NULLs in the key field used in the index, for example.

The GDB I downloaded has a btree index, called FDO_OBJECTID, on the OBJECTID field. Suppose while importing the thing around the 50,000th record you hit a value of OBJECTID that's already been used. Btree indexes don't allow duplicates so when you hit that duplicate value the process stops with a "Can't add record" message.

That's easy to get around.

Besides importing the GDB, you can also link it, which leaves the data within a system managed by the Arc driver that's built into Manifold (ESRI's own code). One way to "import" a foreign file is to first link it in and then to copy the data from within the data source hierarchy and paste it into the main part of the project. You can then delete the linked data source, since you don't need it any more.

That's a technique illustrated in the importing a csv topic. It's what I tried to do first with the GDB, but it didn't work: as the records get pasted, when you hit whatever record is an illegal duplicate on the index field, you get the "can't add record" message again. In my case, it happened after around 10 million records were reported copied/pasted by the Paste progress bar.

The way to get around that is to link the GDB, and then use the SQL I posted to create a new table in the local part of the project that has all the data from the linked GDB but which does not ask the system to build a btree index on the OBJECTID field. Instead, once the new table is created, you just launch Edit - Schema and add an mfd_id_x index on the mfd_id field. Surprisingly, that takes just half a second to construct. You can then take another few seconds to create a drawing from the table and you're off to the races.
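
If you prefer SQL to the Edit - Schema dialog, something along these lines should do the same thing (a sketch from memory; it assumes the new local table is called [pickup] and already has an [mfd_id] field):

-- sketch: add a btree index on the mfd_id field of the new local table
ALTER TABLE [pickup] (
    ADD INDEX [mfd_id_x] BTREE ([mfd_id])
);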

There's no particular reason to have a btree index on the OBJECTID field since it's not used in the queries Art published. The spatial rtree index already used by the GDB on the Shape field (which is what the geom field is called in this GDB) is enough.

By the way, are you sure the entire data set didn't get read? When you open the table and look at the number of records reported in the Contents - Component pane, what does it say? Keep in mind the table window just shows you the first 50,000 records as a sample view. Read the section titled "Big Tables" in the Tables topic if that doesn't sound familiar.

RonHendrickson
283 post(s)
#23-Oct-18 14:18

I am happy to report that I was able to bring in all 14.5MM points by linking the geodatabase file as suggested above and then copying its pickup drawing into the Manifold project. After doing this, then deleting the linked file, and tweaking the names of fields in the query that Art used, the query was able to run. Just running the Dimitri query against the imported (not linked) geodatabase file was not enough; it still stopped at about 2.7MM points, I guess due to what Dimitri mentioned about missing data messing up the btree index. So linking to the geodatabase file is key.

I was disappointed that Art's blog query ran in slightly over 18 minutes. I am not sure if this is due to my relatively slow hard drive or not. I noticed that the query didn't go into parallel mode until the last couple of minutes. I am guessing that the first 15 minutes or so is just reading in the data, getting ready for the SQL operation; I just don't know. I could see the CPU was just loafing along while the hard drive hummed at 90% of capacity for about 15 minutes.

Perhaps others could try this and see if I am missing something. By the way, this is not dissing Manifold 9, I just don't know enough about the system to know if 18 min could be improved. My system may not be set up correctly. It would be fun to bring the data into SQL Server 2017 and have Manifold link to it. I might try it if I can.

Also, I tried running tjhb's query suggested earlier in the thread, but I could never get it to run. The fare_amount field kept coming up null in the inner select query.

Thanks to Art for posting, this has helped me to progress in my Radian technology quest.

Dimitri


7,413 post(s)
#23-Oct-18 16:02

Just running the Dimitri query against the imported (not linked) geodatabase file was not enough, it still stopped at about 2.7 MM points, I guess due to what Dimitri mentioned about missing data messing up the btree index. So linking to the geodatabase file is key.

I don't follow the above. It sounds like you ran the query against the linked data. Is that correct?

RonHendrickson
283 post(s)
#23-Oct-18 16:20

Sorry to be unclear. I first tried running your query against the data source geodatabase file, and got only 2.7MM points. Next, I followed your advice and linked the geodatabase file, and then ran your query against the linked file, which got all the points in.

Dimitri


7,413 post(s)
#23-Oct-18 16:36

Umm.. still don't get it. These two are the same thing:

running your query against the data source geodatabase file

and

linked the geodatabase file, and then ran your query against the linked file,

Creating a data source from a gdb and linking in the gdb are the same thing. In both cases, the gdb is linked into the project, not imported. So, something was different.

Also, when you ran the query... did you run it against the linked gdb or against tables and drawings in the local project?

RonHendrickson
283 post(s)
#23-Oct-18 17:52

I tried bringing in the database both via the File - Create - New Data Source (file: gdb) method and via File - Import of the gdb file by itself.

When I ran the query I ran it against the tables and drawings in the local project. I had previously deleted the linked geodatabase.

RonHendrickson
283 post(s)
#23-Oct-18 18:27

After reviewing my reply, I realized I forgot to mention that I also used the File - Link method on the gdb, which was the successful linking method mentioned above.

jsperr
143 post(s)
#23-Oct-18 18:26

Check the spelling on the fare column -- as I recall it should be fare_amoun -- without the "t" on the end.

Dimitri


7,413 post(s)
#23-Oct-18 05:20

I finally used QGIS to convert it to shape files

Forgot to mention: the above is not a good idea because shapefiles debase rich data.

DBF format (in which shapefiles store data attributes) does not have as rich a collection of data types as GDB or 9. If your data uses types that exist in GDB but not in DBF, the data gets dumbed down to the lesser types shapefiles use.

It's possible, of course, that the starting data uses only those very simple data types that shapefiles can manage. But, realistically, who goes looking through the schemas of many-field tables to accurately reckon what will be dumbed down and what will not when running the data through a shapefile?

The other issue is that shapefiles are limited in size to only 4 GB, so they're not a good choice if the data might exceed that.

Better to link the file and to then simply do a quick INSERT INTO a new table locally, without any round trips through shapefiles. It will also be quicker.

tjhb
10,094 post(s)
#24-Oct-18 00:54

This thread is getting ridiculous.

The answer is for Art to post a new link, here, to the data he has actually used for his latest tests. And perhaps to re-run them.

The cited tests have lost credibility, due to data problems, which does no one any good.

(And yes, I would like to post my own Manifold 9 timings, to blow postGIS out of the water--or not--but I can't, because the data for comparison is missing or, worse, contradictory.)

RonHendrickson
283 post(s)
#24-Oct-18 03:12

Well yes, Tim, but hopefully fun?

artlembo


3,400 post(s)
#24-Oct-18 03:30

No need for me to post a new link here. I have posted a link on my blog. Reading the taxi data in ArcGIS shows the same number of records that were in Postgres and Manifold. If for some reason Manifold is not reading all the records in the geodatabase, that’s not my problem. It also seems that Ron was able to bring in all the data by modifying his method.

I don’t see where this loses credibility.

tjhb
10,094 post(s)
#24-Oct-18 04:41

Then Art, you are not paying attention.

You should fix this.

Dimitri


7,413 post(s)
#24-Oct-18 08:09

Art, I think the issue is that the data in the link seems not to match the description in the blog or the queries. People can adapt to that, sure, but if they don't use the same data you did they cannot have faith they are making an apples to apples comparison.

They also may lack faith that they handled interoperability issues correctly. If you tell them the data has 16 million records and when it is imported into their GIS system they only have 14 million records, they spin their wheels looking for an error they made.

The data also seems to have numerous defects, for example, many bogus points that clearly have nothing to do with the New York area. On the one hand it is good that it is real world, messed up data, but on the other hand that may end up skewing performance comparisons because of the very apples-to-oranges way different systems may handle interoperability with messed up data originating in ESRI GDB.

People have a lot of respect for your work and thus they naturally want to learn from it, by duplicating what you've done on their own systems, by extending the comparison to different systems, and so on. To do that, they need the same data used in your tests.

tjhb
10,094 post(s)
#24-Oct-18 08:38

Exactly! I wish I could have put it half as well. That's exactly it.

This is such a worthwhile test that it must be exact. No blather.

mdsumner


4,260 post(s)
#24-Oct-18 10:39

Reproducible. That is all.


https://github.com/mdsumner

artlembo


3,400 post(s)
#24-Oct-18 10:56

Off to Denver to teach some workshops, but will reinvestigate this.

1. I imported the data from CSV to 9 to write my queries

2. Exported pickup locations from 9 into .shp, and imported into .gdb using ArcGIS.

3. Copied data from 9 into Postgres.

There are some bogus points (i.e. no X or Y in the record), but I left those in. NYC Taxi said they have that happen a lot, so I decided to just go with what they had.

I also copied the queries directly from 9 and Postgres right into my blog post, once I had them working.

But, as I said, I will re-download what I posted just to make sure. I should have some downtime in Denver where I can explore this. Perhaps I didn’t save the .map and had some different field names

RonHendrickson
283 post(s)
#25-Oct-18 16:17

I don't want to take away from what Art replies with, but I downloaded the original Oct 12 data in .csv format from the NYC Taxi website, and imported it into SQL Server 2017. The data was pretty messy as mentioned before, but anything that had an impossible value of lats or longs I just converted to zero. I created point geoms within SQL Server from the lats and longs. There were two geoms created, a pickup geom and a dropoff geom. I could have deleted the dropoff geom, but I left it in for future use possibly.
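
The point creation in SQL Server was along these lines (a simplified sketch; the lat/lon column names are illustrative, and geography::Point takes latitude, longitude, SRID):

-- sketch with illustrative column names: build the two point geographies from the lat/lon columns
ALTER TABLE dbo.nyctaxi ADD pickup_geom geography, dropoff_geom geography;

UPDATE dbo.nyctaxi
SET pickup_geom  = geography::Point(pickup_lat,  pickup_lon,  4326),
    dropoff_geom = geography::Point(dropoff_lat, dropoff_lon, 4326);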

Then in Manifold 9.0.168.3 I created a data source from the SQL Server 2017 database. It brought in the nyctaxi table and the pickup and dropoff drawings. I copied the nyctaxi table in the data source and then pasted it into the Manifold project. I was hoping it would work cleanly as I believed the table to be pretty clean. It did work, and I wound up with the same 14,522,315 records in both the nyctaxi table and the SQL Server table. No need to use the insert query that Dimitri suggested when the indexes could be messed up.

I then deleted the SQL Server data source.

I tried creating the pickup drawing from the table, but Manifold didn't like the spatial index that SQL Server had created, so I deleted that spatial index and created a new one, using the pickup_geom as the field. This worked well, and the new pickup drawing had the same 14,522,315 records as the table.

I then imported the .shp file from the NYC taxi website for the taxi zones.

Then I ran Art's query to get, per taxi zone, the sum of fares, the average fare, and the number of rides. This time it ran very fast, a total of 2 minutes and 53 seconds. I am showing the first few rows of the sumtable here, just to make sure I got the right totals. Please let me know if these are wrong:

mfd_id  sumfare     avgfare       zone                       numrides
1       72148.55    70.32022417   Newark Airport             1026
2       494.5       15.9516129    Jamaica Bay                31
3       880.5       17.96938776   Allerton/Pelham Gardens    49
4       578750.6    12.01799531   Alphabet City              48157
5       311.5       20.76666667   Arden Heights              15
6       854         29.44827586   Arrochar/Fort Wadsworth    29
7       474794.57   14.23970758   Astoria                    33343

I want to thank Art for starting this post, it has been a great learning experience for me. If my totals are wrong, I expect I will have more learning to do.

artlembo


3,400 post(s)
#26-Oct-18 03:52

Excellent job. Yes, those are the same numbers I got.

I think you did a good job by re-creating the indexes. Yours is a little bit faster than mine, and that might’ve been the reason.

Did you try to run it in SQL Server? I would love to see those numbers as well.

RonHendrickson
283 post(s)
#26-Oct-18 04:59

I wish you hadn't asked me about SQL Server; for some reason I can't get the SQL Server drawings to display. While in SQL Server 2017 (SSMS) I created an identity column in the table and made it the primary key. Then I created a spatial index within SQL Server, using the pickup_geom. It all works great within SSMS; it shows the points in the spatial results window.

But when I bring into Manifold 9 the SQL server database as a data source, the drawings with the geoms in them refuse to display in Manifold 9. No warning messages or anything. The geoms are created with the WGS 84, SRID 4326, the default in SQL Server. I have tried changing coordinate systems, repairing, etc., with no luck.

But when I bring in the same SQL server database within Manifold 8, they display OK. I know I am missing something, I have been reviewing Help and your Udemy tutorial on Radian, but so far nothing has helped. It may be the missing mfd_id field, but trying to add that hasn't worked either.

I will keep working on this tomorrow, I know it is something easy I have overlooked.

artlembo


3,400 post(s)
#26-Oct-18 12:42

Sorry to hear you have had troubles with it. But, I was mostly interested in whether you wrote any SQL in SQL server to see how fast you could do the query

RonHendrickson
283 post(s)
#27-Oct-18 20:48

The following query ran in SQL Server 2017 (not within Manifold 9) in about 25 seconds on average, the quickest being 19 sec. and the worst 30 sec. Quite fast, I must say; it looks like SQL Server 2017 is the winner by far in Art's table of database results:

set statistics time on

go

SELECT
    s.[OBJECTID],
    SUM(ISNULL(o.[fare_amount], 0)) AS [sumfare],
    AVG(ISNULL(o.[fare_amount], 0)) AS [avgfare],
    s.[zone],
    COUNT(ISNULL(o.[fare_amount], 0)) AS [numrides]
FROM
    [dbo].[newtaxizones] AS s
    LEFT JOIN [dbo].[geometric pickup] AS o
        ON [geom2].STContains([pickup_geom]) = 1
GROUP BY s.[OBJECTID], s.[zone]
ORDER BY s.[OBJECTID]

set statistics time off

go

I had to change the coordinate system of the taxi zones drawing from Lambert Conformal Conic to EPSG:4326 and then export the drawing within Manifold 9 to a new shape file. I then used the most helpful utility called "Shape2Sql.exe" (https://www.sharpgis.net/page/SQL-Server-2008-Spatial-Tools) to export the .shp file to my SQL Server database. For some reason SQL Server showed the coordinate system to be SRID 0 instead of SRID 4326, so I had to create a new geometry column called geom2 from the original geom, telling it to create it with SRID 4326.
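
The fix was roughly this (a sketch; I'm calling the original imported column geom here for illustration):

-- sketch: rebuild the zones geometry as geom2 with SRID 4326
ALTER TABLE dbo.newtaxizones ADD geom2 geometry;

UPDATE dbo.newtaxizones
SET geom2 = geometry::STGeomFromWKB(geom.STAsBinary(), 4326);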

I also had to re-import the original taxi pickups .csv file and recreate the spatial pickup_geoms using the geometry spatial class instead of the original geography class used before. I did this because SQL Server will not allow queries joining tables on two different spatial classes; both must be of the same class. Since the taxi zones table was created with geometry, it was easier to re-do the pickup drawing to match it.
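
Re-creating the pickup points as geometry was similar (again a sketch with illustrative lat/lon column names; note that geometry::Point takes x = lon, y = lat):

-- sketch: rebuild the pickup points as GEOMETRY so both sides of STContains use the same class
-- (assumes pickup_geom has been re-created as a geometry column)
UPDATE dbo.[geometric pickup]
SET pickup_geom = geometry::Point(pickup_lon, pickup_lat, 4326);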

The above SQL query got almost exactly the same results as the Manifold sumtable: a total of 18 (out of 263) taxi zones differed in total number of rides, by one or two more or less. The overall variance was a total of 3 rides, out of 14,522,315 rides in all. A very minor error, which I attribute to the conversion of the taxi zone drawing to a new SRID.

Also, the query results were obtained on my computer's localhost sql server. It might even be better if one had access to a standalone souped-up server.

tjhb
10,094 post(s)
#27-Oct-18 23:55

Well that's an additional challenge!

Does either of you (Ron or Art) feel like posting a link to data others can use for comparison? Is that hard for some reason?

Otherwise I will do the same as Ron and download from source.

artlembo


3,400 post(s)
#28-Oct-18 00:28

Not hard, just out of town. My mom died on Tuesday so I am headed out of town again. But, when I get back, I can post it. I’ll make sure it is a cleaned up version.

Would you prefer a .map? Do you want me to create the geometries, or do you want to do that?

tjhb
10,094 post(s)
#28-Oct-18 00:43

Oh Art! Don't worry about anything else, just your Mom. So sorry.

artlembo


3,400 post(s)
#28-Oct-18 00:32

Wow! I bow to you.

Great job.

RonHendrickson
283 post(s)
#28-Oct-18 01:20

Thanks, Art. My sympathies about your Mom's passing.

artlembo


3,400 post(s)
#28-Oct-18 14:48

I take it that you have a full version of SQL Server, not the free, express version. Is that correct? If you run it again, it would be nice to see you open up the performance monitor so that we can see what is happening with the CPUs.

I still can’t get over that you did it in 23 seconds.

RonHendrickson
283 post(s)
#28-Oct-18 15:42

I am using the SQL Server 2017 Developers edition, which is free. I know it doesn't seem possible, but I just now opened up SQL Server Management Studio, opened up the query and ran it fresh, and it ran in 22.7 seconds.

All 12 of my logical processors, from 6 cores, fired up. I have attached a screenshot of the performance monitor.

I would encourage everyone who has an interest in database work to get the Developer edition. I use it in my data science work since MS has added machine learning capability to SQL Server using either the R language or Python. No longer are the machine learning algorithms limited to in-memory data; they can work on data as big as your database. The Developer edition is a full version, no limits. Lots of fun. I use R mostly, and there is a lot of spatial functionality in R (Python, too) that can be used in SQL Server now. In fact, MS has a tutorial on R in SQL Server that uses a modified and smaller version of the NY Taxi data to run R analyses. Check it out: https://docs.microsoft.com/en-us/sql/advanced-analytics/tutorials/sqldev-in-database-r-for-sql-developers?view=sql-server-2017 I have also attached a screenshot from the tutorial which shows an example of what R can do spatially. Not just pictures but a lot of GIS analysis too.

Attachments:
cpu action.jpg
spaital taxi image.jpg

RonHendrickson
283 post(s)
#28-Oct-18 16:02

Forgot to add that Tim's query in the first reply of this thread was very helpful in putting it together and shortening the time to make it work. Sometimes my query-building can be laborious. I made only a change or two.

RonHendrickson
283 post(s)
#28-Oct-18 21:15

Further update: running the same query as an inline query within Manifold 9, build 168.4, against the SQL Server database gives virtually the same speed (<30 sec) and same results.

artlembo


3,400 post(s)
#28-Oct-18 21:52

Can you get your hands on multiple computers with SQL Server on them? Then, you can use an in-line query on different computers – if you had 12 computers you could put one month of data on each of them. Then, union the results together. You might be able to do an entire year in under a minute. So, 2m 53s vs. 23s is impressive. In many ways, it doesn’t matter whether it is two minutes or not. Anyone can wait around for that. The real trick is to scale it up to an entire year's worth of data and then see what happens.

You should also try subdividing the data like I show in my blog post. That probably reduced the time another 30%. You might get your SQL query down to 16s.

Dimitri


7,413 post(s)
#27-Oct-18 06:13

Just a wild guess, but try the latest build 168.4, as that has added stuff for SQL Server.

RonHendrickson
283 post(s)
#27-Oct-18 14:40

Thanks will do.

adamw


10,447 post(s)
#29-Oct-18 12:59

I wish you hadn't asked me about SQL Server; for some reason I can't get the SQL Server drawings to display. While in SQL Server 2017 (SSMS) I created an identity column in the table and made it the primary key. Then I created a spatial index within SQL Server, using the pickup_geom. It all works great within SSMS; it shows the points in the spatial results window.

Regarding this, did you use GEOGRAPHY or GEOMETRY?

We currently have an issue where, if you use GEOGRAPHY, we may ask SQL Server to provide data in an area that is larger than the valid lat/lon range - SQL Server angrily rejects such a request and returns nothing, so instead of a map window with objects we get a long error message in the log window. We will fix this.

But if you used GEOMETRY, we'd appreciate more details on how to create a table with a geometry field and a spatial index on that field which we then fail to show.

RonHendrickson
283 post(s)
#29-Oct-18 13:59

In the earlier post that you quote, I was using the pickup_geom that was created using a GEOGRAPHY type. It would not display, as you said.

Subsequently, I re-created the pickup_geom using a GEOMETRY type and it displays well within Manifold 9, although it displays a lot slower if brought in from the SQL Server data source, which is expected. Copying the table and then creating the drawing within the Manifold project area displays almost instantly.

As you guys have touted, Manifold 9's ability to display almost instantly a huge number of points like this is amazing. And as Art has said, combining a fast database server with Manifold can create wonderful possibilities for the Manifold user.

adamw


10,447 post(s)
#29-Oct-18 12:21

Great blog and great thread!

Several notes regarding the performance right away:

1. I am happy with the query used for 9 - the variant with GeomContains + THREADS is simpler but the variant GeomOverlayXxxPar is faster.

2. We have several improvements planned for geometry operations - for future cutting edge builds - with some related to performance. So, even though it appears that 9 is already pretty competitive, this isn't the final say here. :-)

Also, a note based on other folks trying to repeat the test and running into issues related to the data along the way:

Importing the GDB fails / linking the GDB and then copying and pasting 'pickup' (table) fails. This happens because 'pickup' contains a FLOAT64 field named MFD_ID and that field contains values that are (a) zeros, and (b) duplicates. MFD_ID is a special field in tables in MAP files and it can only contain positive unique values; this is the reason why importing / pasting into the MAP file fails. Now, maybe we should adjust our pasting code somehow, but I am more interested in how specifically MFD_ID with zero values ended up in the GDB. Did you produce the GDB by exporting a drawing from Manifold (creating 2.8 million records with unique values of MFD_ID) and then import more objects into the created drawing (creating 11.8 million objects with MFD_ID of zero)? Maybe we should rename MFD_ID to something else during the export to GDB then.
