Subscribe to this thread
Home - General / All posts - SQL contains() versus NOT contains()
yves61
438 post(s)
#13-Mar-15 13:41

I have two drawings Parcels (10200 objects) and Buildings (7700 objects).

I am using contains() to select built parcels like this:

--sql

select P1.id from 

Parcels as P1,

Buildings

where

contains (P1.id,centroid(Buildings.id))

This runs in less than than 2 sec.

Now I want to select the reverse (unbuilt parcels) by using sql. I was trying this code:

--sql

select P1.id from 

Parcels as P1,

Buildings

where

NOT contains (P1.id,centroid(Buildings.id))

Now this sql (not contains()) takes minutes to run !

a) I do not understand why NOT contains() takes so much more time.

b) Who can help me with a better/quicker sql coding to achieve the result (--> NOT contains()) ?

adamw


10,447 post(s)
#13-Mar-15 14:03

I take it with the second query you meant to select parcels which contain no buildings (the query, as written, does something else entirely, that's why it takes so long to run). If so, just use EXCEPT to subtract IDs returned by your first query from all IDs.

yves61
438 post(s)
#13-Mar-15 14:27

Like this ?

-- sql

select P2.id from Parcels as P2

except

(

select P1.id from 

Parcels as P1,

Buildings

where

contains (P1.id,centroid(Buildings.id))

)

Above coding returns in runs in less than 2 seconds too ! Great . Thank you.

firsttube


1,439 post(s)
#13-Mar-15 14:09

I was able to recreate the behaviour you described using your queries. One "quick and dirty" way would be to run the first query, then run this query:

--sql

select Parcels.id from 

Parcels

where

Parcels.[Selection (I)] <> True


"The blessing in life is finding the torture you are comfortable with." - Jerry Seinfeld, 6/26/2013

firsttube


1,439 post(s)
#13-Mar-15 14:19

Adam's "EXCEPT" method is far more elegant.

--sql

select Parcels.id from 

Parcels

except

(

 select P1.id from 

 

 Parcels as P1,

 

 Buildings

 

 where

 

 contains (P1.id,centroid(Buildings.id))

)


"The blessing in life is finding the torture you are comfortable with." - Jerry Seinfeld, 6/26/2013

yves61
438 post(s)
#13-Mar-15 14:29

I just sent my reply to Adam above. Has more or less the same syntax as yours. Yours is neater. See I do not need aliases too ! Thank you.

yves61
438 post(s)
#13-Mar-15 14:33

Had tried this code in the meantime, but it took 140 seconds to run!

--sql 

select P2.id from 

Parcels as P2

where P2.id not in

(select P1.id from 

Parcels as P1,

Buildings

where

contains (P1.id,centroid(Buildings.id))

)

So 'not in' in this sql coding is many times slower than Except !

tjhb
10,094 post(s)
#13-Mar-15 23:19

I almost never use EXCEPT. I'm going to follow Adam's lead here and look for opportunities to do that in future.

I would instinctively do this. It's not as pretty as EXCEPT, unless you have a particular fondness for NULLs. And there's every chance that it's slightly slower.

SELECT [D].[ID]

FROM

    (SELECT

        [D].[ID][E].[ID]

    FROM 

        [Parcels] AS [D]

        LEFT JOIN

        [Buildings] AS [E]

        ON Contains([D].[ID], Centroid([E].[ID]))

    )

WHERE [E].[ID] IS NULL;

I'd like to unpack Adam's

(the query, as written, does something else entirely, that's why it takes so long to run)

a little bit.

Take Yves' first query (in post 1).

--sql

select P1.id from 

Parcels as P1,

Buildings

where

contains (P1.id,centroid(Buildings.id))

What this says logically, in bald English, is:

The left set is the objects in Parcels

The right set is the objects in Buildings

For every element (X) in the left set:

    For every element (Y) in the right set:

        If X contains Y, list (X, Y)

Give me all the X values in that list

Now take Yves' second query.

--sql

select P1.id from 

Parcels as P1,

Buildings

where

NOT contains (P1.id,centroid(Buildings.id))

In English:

The left set is the objects in Parcels

The right set is the objects in Buildings

For every element (X) in the left set:

    For every element (Y) in the right set:

        If X does not contain Y, list (X, Y)

Give me all the X values in that list

Usually, the list for the second query is much, much smaller than the list for the first.

Why is the first query faster so much than the second? That's not the same thing.

I'm not smart enough to give a proper answer to that. (I think: spatial indexes.)

Anyway, what is required here is "something else entirely", as Adam puts it: list each X which contains no Y.

That's equivalent to: list each X which contains a Y, then invert the list (list only the other Xs).

(That could be rephrased as: list each X for which the set of of Ys that it does not contain is the all of the Ys.)

That's completely different from: for each X, list every Y that it does not contain, then throw away the Ys (leaving multiple repeated Xs).

tjhb
10,094 post(s)
#13-Mar-15 23:34

Correction (editing error).

Usually, the list for the second query is much, much smaller larger than the list for the first.

tjhb
10,094 post(s)
#15-Mar-15 00:01

Why is the first query faster so much faster than the second? That's not the same thing.

I'm not smart enough to give a proper answer to that. (I think: spatial indexes.)

Maybe part of the answer is different use of spatial indexes between the two cases, I don't know.

Despite that, part of the answer is much simpler: the quantity of data that must be fetched and listed. (I was too quick to dismiss that.)

In the case of the second query (NOT Contains(...)) I think I had it pretty straight in the last line of that post.

for each X, list every Y that it does not contain, then throw away the Ys (leaving multiple repeated Xs).

For a normal pattern of data, that's a very large set. It has N x (M - C(n)) members, where C(n) is the number of objects which object n contains. Contrast that to N x C(n) for the first query. Where M is much larger than C(n)--which is the usual case--that's a huge difference in quantity of data, regardless of any supposed different use of indexes.

adamw


10,447 post(s)
#15-Mar-15 08:02

Different use of spatial indexes and large amounts of data returned by NOT Contains(...) are related.

Suppose we have two drawings A and B with a spatial index on each, like in the original post.

If we do ... WHERE Contains(A.ID, B.ID), the query goes over all records in B and for each B.ID asks the index on A which A.IDs contain B.ID (technically, it asks which A.IDs might contain B.ID and then runs Contains on each returned candidate, but this doesn't matter much for the illustration). The index on A allows the query to ignore most of the objects in A for each B.ID.

If we do ... WHERE NOT Contains(A.ID, B.ID), the query goes over all records in B and for each B.ID does ...what? It might ask the index on A which A.IDs contain B.ID, but the user asked for A.IDs which *do not* contain B.ID, so even if the query quickly determines which A.IDs contain B.ID, it will still have to find out what all other A.IDs are, because it has to return them. The query has to go over all records in A for each B.ID, the index doesn't help.

yves61
438 post(s)
#15-Mar-15 13:42

Thanks for your contribution to each one of you, and thanks for the clear explanation.

So it is always best to avoid using * NOT ... * if there is a positive alternative to achieve the same goal (like using "Except" in the example above).

Where could I find basic information on optimising the speed of sql queries ?

adamw


10,447 post(s)
#16-Mar-15 06:09

Where could I find basic information on optimising the speed of sql queries ?

You can use pretty much any book on SQL, in parts which are vendor-neutral. The basics are more or less the same everywhere.

The most important vendor-specific thing to remember for Manifold 8 is that drawing tables have a spatial index on the geom and a traditional index on the ID field.

After that, it's searching and asking questions on this forum.

yves61
438 post(s)
#16-Mar-15 07:28

Thanks again.

Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.