Subscribe to this thread
Home - General / All posts - Slow query speeds in Radian
pierredp
76 post(s)
#15-Aug-19 15:31

At work we finally have a real project with "big data" and I was super exited to see how I could show-off Manifold Radian (9.0.169) to my colleagues since they are mostly in the ESRI camp.

So, I have a PostgreSQL (v9.6) database which has a table with just under 211 million records in it. I have created BTREE indexes on al the fields that is needed in queries (see attached image). I then tried to do a simple select query:

SELECT [objectid_1][originid][destinationid] FROM [od_cost_matrix] WHERE [originid]=3316 OR [destinationid]=3316;

This query takes 73 minutes to complete, below is the results when running the query with Fullfetch On:

> !fullfetch

  fullfetch on

SELECT [objectid_1], [originid], [destinationid] FROM [od_cost_matrix] WHERE [originid]=3316 OR [destinationid]=3316;

  4412.244 sec (30,529 records, 477.0 KB)

I ran the query with Fullfecth Off as well and time wise it completed in 64 minutes. I even imported the table into Radian and tried to run the same query but without any major improvement (59minutes).

When I run the same query in ArcMap 10.6 it takes 178 seconds (just under 3 minutes) and the same query in PostgreSQL takes 42 seconds. All the queries were done from a "cold" starts.

My PC spec is:

OS: Windows 7 Enterprise with SP1

Processor: Intel Xeon E5-2620 2.10GHz

RAM 32 GB

HD: Seagate 512GB SSD

GPU: Nvidia Quadro K4000

I am really baffled by this and as Dimitri said in this thread "Whatever the case it is always something worth digging into as the Manifold team won't rest if a particular query is faster in some GIS than in Radian. :-)...."

Attachments:
Radian Table Schema.png


"Use time and words carefully because neither can be retrieved" - Jackson Brown

atrushwo59 post(s)
#15-Aug-19 16:47

Have you tried creating a composite index with both the origin and destination IDs? I imagine the OR is what's slowing you down.

Dimitri


5,620 post(s)
#15-Aug-19 18:01

That's baffling for sure, especially the result when you import it into Manifold. Pretty much anything you can do with a table and a query runs comparably in Manifold to the DBMS, and spatial stuff can run much faster. It's something to drill into.

1. How fast is it when you launch the query in Manifold's command window, but run in the PostgreSQL native query engine? (See how to do that in this topic.)

2. How fast is it in the current build, 9.0.169.7 ?

The good news with such radically different results is that it usually points to some gross error somewhere. That's good because when such things are found they usually are easy to fix and immediately yield dramatically better results. We just have to find whatever is the problem.

adamw


8,775 post(s)
online
#16-Aug-19 09:48

The problem is OR. If WHERE contained just one term or if the terms were connected with AND, it'd have been different. As it stands now, OR just disables optimizations for indexes. This could be made better, OR can still be optimized, just much less efficiently than AND. We will do this.

In the meantime, you can offload the job to PostgreSQL - you can run PostgreSQL code from a Manifold query via EXECUTE ... . For big amounts of data, it makes sense to consider offloading as much as you can to the server regardless of which particular optimizations our query engine has, because that reduces the number of records that has to be transported to the client.

adamw


8,775 post(s)
online
#16-Aug-19 10:18

Also, I'd like to say a couple of words on !fullfetch.

There were several threads lately with discussions on the performance of queries, and there were mentions of that mysterious !fullfetch command that we have, whether it was used or not, what the running time of the query at hand was if it was or was not used, etc. I'd like to clear all mysteries and just quickly recap what !fullfetch does so that we can all use it and make sense of the results.

Imagine we want to measure the performance of a query.

We take the text of the query, ask the query engine to run it, the query engine runs it and returns the result, and we measure how long it took the query engine to do this. What could be simpler.

And this is all there is to it, except that for some statements the query engine can postpone some or most of the actual computations and return the result sooner than all these computations are done. In particular, when the query is a SELECT ... , the result of running that query is a table, and the query engine can immediately return the table, and it might be only when you start actually fetching records from that table, that the computations will be done. Is this good? Of course. We get the result faster. More, we can start working with the first records without computing further records. More, if we look at the first records and decide that we found what we wanted we can decide not to compute further records. Etc, etc, etc. But while all this is good, all these optimizations mean that we cannot measure the performance of the query just by measuring how long it takes the query engine to return the result.

This is where !fullfetch comes in. It tells the command window that is operating the query engine to not just run each next command but also take the result table and read through all of it, forcing it to compute each record. That's all that happens. The result is exactly the same as normally, but the command window reads through it once after the query engine returns it, and the time to read through it is included in the measurement.

!fullfetch does not affect INSERT / DELETE / UPDATE, etc. It will faithfully read through the result table, but the result table of INSERT / DELETE / UPDATE is just a tiny table with a single number with the count of affected records, the time to read through it is tiny. !fullfetch does affect SELECT. And !fullfetch might affect EXECUTE, depending on what happens inside it.

You can never use !fullfetch at all and just do SELECT INTO instead of SELECT. This would take slightly longer than SELECT with !fullfetch, because SELECT INTO not only has to read through all data, it also has to write all data into a new table, writes takes time. But the effect would be similar to using !fullfetch, the result will be read through in its entirety, guaranteeing that each record is computed. If you are comparing the performance of two queries, you can just make them both SELECT INTO and that would make the comparison more or less apples to apples.

I hope this helps understand what !fullfetch does and how to use it.

adamw


8,775 post(s)
online
#16-Aug-19 11:44

One more thing you can do until we have optimizations for OR:

--SQL9

SELECT [objectid_1][originid][destinationid] FROM [od_cost_matrix]

WHERE [originid]=3316

  UNION

SELECT [objectid_1][originid][destinationid] FROM [od_cost_matrix]

WHERE [destinationid]=3316;

Having an optimization for OR would be better for several reasons, but the above should be much closer to the optimized OR than the current OR.

tjhb

8,958 post(s)
#17-Aug-19 07:00

I think UNION should be UNION ALL here.

UNION filters out duplicates, returning only distinct records. Adding ALL prevents the filtering.

It may be unlikely that there will be duplicate records in this data, but if there are then they should probably be returned (to match the original query in the first post).

Removing the filter also makes a query much faster. (That makes UNION ALL a better default choice, in my opinion.)

UNION (ALL) makes a huge difference though. Some timings:

Setup (10 million random pairs of integers):

--SQL9

CREATE TABLE [t] 

    (

    [mfd_id] INT64,

    [originid] INT32,

    [destinationid] INT32,

    INDEX [mfd_id_x] BTREE ([mfd_id]),

    INDEX [originid_x] BTREEDUP ([originid]),

    INDEX [destinationid_x] BTREEDUP ([destinationid]),

    INDEX [originid_destinationid_x] BTREEDUP ([originid][destinationid])

    );

INSERT INTO [t] ([originid][destinationid])

SELECT

    [Value],

    [Value] DIV ([Value] MOD 3 + 1)

FROM CALL ValueSequenceRandomInt(1e7, 2^8, 0);

(1) 20.305 sec (77,939 records)

--SQL9

SELECT [originid][destinationid]

FROM [t]

WHERE [originid] = 2^6

OR [destinationid] = 2^6

;

(2) 19.545 sec (77,939 records)

--SQL9

SELECT [originid][destinationid]

FROM [t]

WHERE CASE

    WHEN [originid] =  2^6 THEN TRUE

    WHEN [destinationid] =  2^6 THEN TRUE

    --ELSE FALSE -- implicit

END

;

(3) 16.389 sec (77,939 records)

--SQL9

SELECT [originid][destinationid]

FROM [t]

WHERE CASE 2^6 

    WHEN [originid] THEN TRUE

    WHEN [destinationid] THEN TRUE

    --ELSE FALSE -- implicit

END

;

(4) 17.093 sec (77,939 records)

--SQL9

SELECT [originid][destinationid]

FROM [t]

WHERE 2^6 IN ([originid][destinationid])

;

(5) 0.484 sec (2 records)

--SQL9

SELECT [originid][destinationid]

FROM [t]

WHERE [originid] = 2^6

UNION

SELECT [originid][destinationid]

FROM [t]

WHERE [destinationid] = 2^6

;

(6) 0.203 sec (77,939 records)

--SQL9

SELECT [originid][destinationid]

FROM [t]

WHERE [originid] = 2^6

UNION ALL

SELECT [originid][destinationid]

FROM [t]

WHERE [destinationid] = 2^6

;

This is local data. !fullfetch was on for each test.

adamw


8,775 post(s)
online
#17-Aug-19 14:19

Yes, the difference in performance between UNION and OR in this case is pretty big.

We want to remove duplicate records because the same record can match both conditions in OR and if that happens, we only want to have a single copy of that record in the output. I think in your tests, the results with UNION looked wrong because there *were* duplicate pairs of ORIGINID / DESTINATIONID and most of the records collapsed. Include MFD_ID into the SELECTs, that will remove duplicate records without removing duplicate pairs of values.

(My times when I do this on a test machine, same setup, query 1 adjusted to select MFD_ID as well: 42.769 sec, query 5 adjusted to select MFD_ID to make UNION work as it should: 1.013 sec, the tables contain the exact same records. To recap, that's a smaller table than in the first post, but the number of times individual IDs repeat is much higher.)

tjhb

8,958 post(s)
#17-Aug-19 15:27

That's two good points both of which I missed. Of course you are right. Thanks for explaining so well.

Manifold User Community Use Agreement Copyright (C) 2007-2019 Manifold Software Limited. All rights reserved.