Subscribe to this thread
Home - General / All posts - UPDATING from a second table
artlembo


2,916 post(s)
online
#17-Nov-17 15:50

I am issuing the following SQL to update a value from records in another table:

UPDATE (

    SELECT [tra8696_pts].[mfd_id], stormname, [tra8696_line].name 

    FROM [tra8696_line][tra8696_pts]

    WHERE [tra8696_pts].[YEAR_] = [tra8696_line].[YEAR_]

    AND [tra8696_pts].[STORM_NUM] = [tra8696_line].[YEAR_NUM]

)

SET stormname = name

but I'm getting an error message that says:

Schema should contain at least one unique index.

the mfd_id does is in fact a unique index.

I tested the inner part of the query, and it

These ordinarily work in 8, so I wonder what I am missing.

tjhb

7,545 post(s)
#17-Nov-17 16:17

Art, mfd_id has a BTREE index in the source, but not in the joined table, where it becomes BTREEDUP (because of the join—the compiler must assume that any row from the left table might be matched with more than one row from the right table).

I think Adam has already posted an example how to work with this situation, I will find it.

tjhb

7,545 post(s)
#18-Nov-17 22:42

I was wrong, there is no example that I know of to deal with this exact situation. (A situation which is commonplace.)

The threads I was thinking of are 9.0.141 and 9.0.142. Still well worth re-reading, even if they don't settle it.

This is what I think is the case, starting with the simplest example.

Data is as follows: table A has <field1> (and possibly other fields), table B has <field2> (and possibly other fields). Table

A is joined to B, either on <field1> matching <field2>, or on an expression including <field1> and <field2>. (Whether an exact match or an expression makes a difference, which I will come back to.)

Looking only at the syntax and the schemas (not at the data), the compiler determines in advance whether it is possible that there will be multiple matches for records from one side or the other (or both).

If it is possible for there to be multiple matches on <field1>, then any BTREE index on <field1> will be converted to BTREEDUP. The same on the other side: if it is possible for there to be multiple matches on <field2>, then any BTREE index on <field2> will be converted to BTREEDUP.

That is the usual case.

In fact, the only situation where it is impossible (without looking at the data) for there to be multiple matches for <field1> and/or <field2> is if (a) there are BTREE indexes on both <field1> and <field2>, since this guarantess that both are unique at source, and (b) neither <field1> nor <field2> is altered by an expression.

That seems to be the basic position. Subject to a possible wrinkle (to be dealt with below, something I don't yet understand), that is the only case where the result of a joined table can be directy updated in SQL9.

This is a very strong restriction on updating source tables.

The alternative, when we (but not the compiler) know that values in (say) <field1> will be unique given our expected data, is to write the joined data into a temporary table, then add an explicit BTREE index on it before the UPDATE. This is always possible but obviously more work--and fairly often it will remove the whole point of using an UPDATE query.

I will come back to the wrinkle when I have something sensible to say about it (if ever!).

adamw


7,307 post(s)
#22-Nov-17 08:15

All of the above is true.

To say it simpler, in the SELECT in the first post, a record from a table that has the "stormname" field may have multiple matching records from the other table. If we allow the UPDATE run, it is unclear what to set the "stormname" field to, each of the multiple matching records may have a different value. That's why we require that the join is 1 to 1. If you want to use the maximum / average / first / whatever of the values from the matching records, you can do this using a nested query, for example.

PS: 8 allows the UPDATE to run with the last set value winning.

artlembo


2,916 post(s)
online
#22-Nov-17 12:00

Can you give an example of the nested query to do this?

adamw


7,307 post(s)
#22-Nov-17 12:30

For example:

The setup, creates two tables and fills them with some values, we are going to be joining on a common field v:

--SQL9

CREATE TABLE lines (mfd_id INT64INDEX mfd_id_x BTREE (mfd_id),

  v INT32, name NVARCHAR);

INSERT INTO lines (v) VALUES (1), (2), (3);

CREATE TABLE points (mfd_id INT64INDEX mfd_id_x BTREE (mfd_id),

  v INT32, name NVARCHAR);

INSERT INTO points (v, name) VALUES (1, 'a'), (1, 'b'), (2, 'c'), (4, 'd');

The UPDATE query:

--SQL9

FUNCTION f(l_v INT32NVARCHAR AS (

  SELECT Max(name) FROM points WHERE v=l_v

END;

UPDATE (

  SELECT mfd_id, name, f(v) AS p_name FROM lines

SET name = p_name;

I used a function because that's simplest. The nested query can sometimes be inlined.

artlembo


2,916 post(s)
online
#22-Nov-17 12:50

Thanks. This is understandable.

For posterity, and for others who might come across this, can you post the nested version?

adamw


7,307 post(s)
#22-Nov-17 13:44

Not at the moment.

It should look like this:

--SQL9

UPDATE (

  SELECT mfd_id, name,

    (SELECT Max(name) FROM points WHERE points.v=lines.v) AS p_name

  FROM lines

SET name = p_name;

...but the SELECT complains and that might be a bug.

Inlining like this works:

--SQL9

SELECT mfd_id, name FROM lines

WHERE (SELECT mfd_id FROM points WHERE lines.v=points.v) > 0;

Inlining like this doesn't work:

--SQL9

SELECT mfd_id, name, (SELECT mfd_id FROM points WHERE lines.v=points.v)

FROM lines;

...and that might be an issue on our part. (Or I had too much coffee and it shouldn't work. If that's the case, I will say so later.)

artlembo


2,916 post(s)
online
#06-Dec-17 03:06

this actually works quite nicely. But, the next question of course is how to update multiple values. We are updating name = p_name. But, what if we wanted to update other fields. The function returns NVARCHAR. Would we instead return a table, some kind of array?

tjhb

7,545 post(s)
#06-Dec-17 04:01

In that case I think it would be worthwhile writing a temporary table (possibly on a temporary database) and adding an explicit BTREE for the unique joining field(s).

Alternatively, you could sometimes adjust Adam's approach above to use using a function returning a vector type (e.g. float64xN) if that happened to suit. Or sometimes a multipoint, or multi-branched geometry of any type. In other cases you could pack a string with separators. But... is this a bit clunky?

I can't quite see how to do it with COLLECT and/or a function returning a table. Maybe that can be done nicely.

A temporary table might be the best approach, I think.

Do you have a concrete example in mind Art?

adamw


7,307 post(s)
#06-Dec-17 09:42

All these issues occur when the fields used in joins are not specified to be unique via indexes. This creates an N to M matching which obviously raises the question during writes to any table - which specific value out of N or M potential values do you want?

The moment we add an index it all becomes easy.

Ie, in my example for lines vs points above, suppose we specify that line IDs (v) are unique:

--SQL9

CREATE TABLE lines (mfd_id INT64INDEX mfd_id_x BTREE (mfd_id),

  v INT32, name NVARCHARINDEX v_x BTREE (v));

INSERT INTO lines (v) VALUES (1), (2), (3);

CREATE TABLE points (mfd_id INT64INDEX mfd_id_x BTREE (mfd_id),

  v INT32, name NVARCHAR);

INSERT INTO points (v, name) VALUES (1, 'a'), (1, 'b'), (2, 'c'), (4, 'd');

Then this query that composes a name for each line out of the names of its points carries that uniqueness over to the result table:

--SQL9

SELECT lines.v, Max(points.name) AS p_name

FROM lines INNER JOIN points ON lines.v=points.v

GROUP BY lines.v;

...check the result table and you will see the index on lines.v.

And the only reason we can't update that just yet is that the result table does not contain the target field we want to write to. You have to join it by running a query on top of this one:

--SQL9

SELECT lines.v, lines.name, computed.p_name FROM lines INNER JOIN (

  SELECT lines.v, Max(points.name) AS p_name

  FROM lines INNER JOIN points ON lines.v=points.v

  GROUP BY lines.v

AS computed ON lines.v=computed.v;

And that can be updated directly:

--SQL9

UPDATE (

SELECT lines.v, lines.name, computed.p_name FROM lines INNER JOIN (

  SELECT lines.v, Max(points.name) AS p_name

  FROM lines INNER JOIN points ON lines.v=points.v

  GROUP BY lines.v

AS computed ON lines.v=computed.v

SET name=p_name;

No functions and no big tricks.

OK, maybe bringing fields to update via a second SELECT can qualify as a trick. But it doesn't seem too big.

tjhb

7,545 post(s)
#06-Dec-17 20:23

I like this syntax.

(1) It is very like what we are used to doing in Manifold 8, except for the need to ensure uniqueness on the join field(s) and add our own index to "declare" the uniqueness to the compiler.

In Manifold 8 we could simply allow multiple matches, and rely on the "last update wins" rule to sort this out. That was very convenient by the way, even if a bit lax. The laxness matters more in SQL9 where we can't control the order among multiple matches, given the possibility of multiple threads--meaning that given a "last update wins" rule we could not control what "last" means. (We can in Manifiold 8, when we need to.) So that would be too loose.

(2) The rule that a BTREE in the left table on the join field(s) in an inner join is inherited in the result table is clever and efficient.

(The same is true for a left join. For a right join the BTREE must become BTREENULL since some rows from the left table may not match a preserved row on the right. If I remember correctly, a table with only a BTREENULL index may still be updated, except that records with NULL values in the indexed field will be skipped.)

My instinct was to write the joined result into a temporary table, then replace the resulting BTREEDUP index on mfd_id taken from the left table with a new BTREE. That will still work I think (allowing a 1:1 join back to the original table, therefore allowing direct update), but it is more effort than your example.

adamw


7,307 post(s)
#07-Dec-17 07:05

The rule that a BTREE in the left table on the join field(s) in an inner join is inherited in the result table is clever and efficient.

In both tables, not just the left. INNER keeps uniqueness on both sides, LEFT and RIGHT on one side, OUTER on neither side. (For indexes built on fields participating in the join condition in the form of a.f=b.f [AND a.g=b.g ...], etc.)

tjhb

7,545 post(s)
#07-Dec-17 07:56

Thanks.

... FULL [OUTER] on neither side?

adamw


7,307 post(s)
#07-Dec-17 08:37

Yes, I meant FULL, sorry.

tjhb

7,545 post(s)
#07-Dec-17 08:55

I was thinking driving over the hill today: we should also think about the possibility of updating tables resulting from spatial joins (on GeomTouches, say), since in these cases there is no way for the compiler (nor usually the user) to predict unique representation of records from either source table in the joined result.

My first thought was that this looks like a blockage for using UPDATE to perform custom spatial overlays.

I didn’t get as far as a second thought but a quick try now: in that sort of case we can pass the joined table to a script function.

adamw


7,307 post(s)
#07-Dec-17 09:23

Just a very quick sketch, hope what is going on is clear from the names:

--SQL9

SELECT states.mfd_id, states.length, computed.r_length

FROM [States Drawing] AS states INNER JOIN (

  SELECT s_mfd_id, Sum(o_Length) AS r_length

  FROM CALL GeomOverlayTouching([States Drawing][Roads Drawing], 0)

  GROUP BY s_mfd_id

AS computed ON states.mfd_id=computed.s_mfd_id;

The overlay function returns a table with no indexes, GROUP computes source values and establishes an index, a SELECT added on top brings in target fields. The result is directly updatable.

In short: everything discussed is applicable to spatial joins as well. If there is uniqueness, you can use it. If there is no uniqueness, you can use functions like here. The overlay function is free to do whatever it wants, there are no hard requirements for its results table, and so it can use multiple threads (use the Par version), reorder records, and in general do whatever.

tjhb

7,545 post(s)
#07-Dec-17 09:41

Yes! That’s great Adam, thanks.

My favourite phrase there is

GROUP computes source values and establishes an index

That is, a new index for the grouping field(s) (a compound index if more than one field, here just one).

I had noticed this in other contexts, but had never thought of using it in this way.

Thanks for another great lesson!

artlembo


2,916 post(s)
online
#07-Dec-17 17:10

great ideas. Forgive me for throwing this query in here, but next time I want to do this, and search the forum, I want to find it :-)

As you know, I'm just getting used to using CALL, so this query accomplishes the same thing in a more traditional way. I have crime locations (points), and census tracts (areas). In this query, I want to update the [place] name from the crimes with the tract name (tractce10) from the census

:

-- $manifold$

UPDATE (

  SELECT [crimes].mfd_id , [T].[TRACTCE10],[place]

  FROM [crimes] INNER JOIN

 (SELECT  [crimes].mfd_id as mid, first([TRACTCE10]as tractce10

    FROM crimes, census WHERE GeomContains([census].[SHAPE],[crimes].[geometry],0)

  GROUP BY [crimes].mfd_id) AS T

 ON mfd_id = mid

)

SET [place] = tractce10

I like the little trick about getting the id value.

artlembo


2,916 post(s)
online
#07-Dec-17 19:45

For those of you who don't like using the INNER JOIN, this one accomplishes the same thing, but might be easier for a novice to read:

UPDATE (

  SELECT [crimes].mfd_id , [T].[TRACTCE10],[place]

  FROM [crimes] ,

 (SELECT  [crimes].mfd_id as mid, first([TRACTCE10]as tractce10

    FROM crimes, census WHERE GeomContains([census].[SHAPE],[crimes].[geometry],0)

  GROUP BY [crimes].mfd_id) AS T

 WHERE mfd_id = mid

)

SET [place] = tractce10 

adamw


7,307 post(s)
#08-Dec-17 07:09

This last query is perhaps the simplest it can get in terms of used constructs - it is just SELECT / WHERE / GROUP / UPDATE with minimal nesting.

Performing the overlay manually in WHERE instead of in a specialized function is slower, but the simplicity is valuable as well.

artlembo


2,916 post(s)
online
#08-Dec-17 15:24

Just for fun, I gave this a shot:

running our first idea with the Function:

FUNCTION f(p geom) NVARCHAR AS (

  SELECT [TRACTCE10] FROM census WHERE GeomContains([census].[SHAPE],p,0)

END;

UPDATE (

  SELECT mfd_id, place, f(geometry) AS p_name FROM crimes

SET place = p_name;

that runs in 43KB/s

then, I ran my query from above:

UPDATE (

 SELECT [crimes].mfd_id , [T].[TRACTCE10],[place]

 FROM [crimes] ,

 (SELECT [crimes].mfd_id as mid, first([TRACTCE10]as tractce10

 FROM crimes, census WHERE GeomContains([census].[SHAPE],[crimes].[geometry],0)

 GROUP BY [crimes].mfd_id) AS T

 WHERE mfd_id = mid

)

SET [place] = tractce10

that ran in 180 KB/s

So, the function seems much slower. Finally, I used my query with the INNER JOIN:

UPDATE (

  SELECT [crimes].mfd_id , [T].[TRACTCE10],[place]

  FROM [crimes] INNER JOIN

 (SELECT  [crimes].mfd_id as mid, first([TRACTCE10]as tractce10

    FROM crimes, census WHERE GeomContains([census].[SHAPE],[crimes].[geometry],0)

  GROUP BY [crimes].mfd_id) AS T

 ON mfd_id = mid

)

SET [place] = tractce10

and that ran in 280 KB/s

So, using the INNER JOIN is definitely much faster.

Also, just for fun, I tried inserting THREADS 4, but each time, the KB/s was actually lower. Is there an effective way to use threads in this case?

tjhb

7,545 post(s)
#08-Dec-17 15:50

There are at least two three ways.

(1) Use CALL GeomOverlayTouchingPar, following Adam’s suggestion.

(2) Use THREADS on the inner SELECT (not enough work in the outer), adjusting your GeomTouches query to first use a very coarse GeomWithin for the join, then a GeomTouches filter in the SELECT list itself.

(3) Add THREADS to the version using a function, to the function itself, but rewrite so that the WHERE GromTouches filter is instead in the SELECT list.

Will try these when I am awake.

adamw


7,307 post(s)
#08-Dec-17 16:20

Measure time to complete, not speed. With queries, measuring speed is mostly useful for when you are running the exact same query against different data sets - the speed is affected by the amount of data a particular stage gets, a faster query can easily be getting less data flowing through the stages that end up reporting speed, that would make the reported speed lower even though the query is faster. In this particular case, you get the speed for UPDATE with the exact same data in all three cases, which makes the comparisons fair, but generally, measure time to complete.

(I would expect the function query to perform worse than others, especially when the number of hits per area is very low - because running one large and many small SELECTs is more expensive than running two big SELECTs.)

(In terms of performance, nothing is going to beat GeomOverlayXxxPar. If anything does, let us know and we will make GeomOverlayXxxPar even faster. :-))

artlembo


2,916 post(s)
online
#08-Dec-17 18:18

Just did some reruns on a dataset with 76,000 records:

Using the FUNCTION route: 40s

Using the INNER JOIN route: 7s

Using the more simplified version: 7s

Using GeomOverlayXxxPar*: under 2s (wow!)

This is the query I used for it:

UPDATE(

 SELECT crimes.mfd_id, o_TRACTCE10, place FROM crimes,

 (SELECT s_mfd_id, first(o_tractce10) as o_tractce10 

  FROM CALL GeomOverlayContainingPar([crimes] ([mfd_id][geometry],place),

 [census] ([TRACTCE10] , [SHAPE]),

 0,ThreadConfig(SystemCpuCount()))

 GROUP BY s_mfd_id) AS computed 

 WHERE crimes.mfd_id = computed.s_mfd_id

    )

SET place = o_TRACTCE10

so, a lesson here for me at least:

if you want simplicity and can deal with the extra time, choose the simple queries I created. But, if you want things to run 7 times faster, use this last one :-)

artlembo


2,916 post(s)
online
#08-Dec-17 23:53

Just another follow up. I ran this query:

-- $manifold$

UPDATE(

 SELECT [pickup Drawing].mfd_id, o_zone, tzone FROM [pickup Drawing],

 (SELECT s_mfd_id, first(o_zone) as o_zone 

  FROM CALL GeomOverlayContainingPar([pickup Drawing] ([mfd_id][pu_geometry],tzone),

 [taxi_zones] ([zone] , [Geom]),

 0,ThreadConfig(SystemCpuCount()))

 GROUP BY s_mfd_id) AS computed 

 WHERE [pickup Drawing].mfd_id = computed.s_mfd_id

    )

SET tzone = o_zone

on the NYC taxi data that I have in my videos.

It ran on a 6GB file with 16M taxi locations and 300 areas in under 8 minutes. And, it was running at over 1.6MB/s. I did this on my Microsoft Surface. Can't wait to try it back in the office on Monday.

When I run it in an ordinary fashion it takes about 27 minutes.

This is definitely going into the workshop :-)

joebocop
297 post(s)
#08-Dec-17 17:01

For those of us more comfortable with PostgreSQL, it looks like CALL() is headed our way in that product as well, so I'd better get cracking on learning Manifold's implementation.

https://blog.2ndquadrant.com/postgresql-11-server-side-procedures-part-1/

adamw


7,307 post(s)
#09-Dec-17 07:54

That CALL is slightly different (we are using CALL to call a function that returns a table, they are using CALL to call a function that returns nothing = a procedure), but yeah, getting better at any flavor of SQL pays in that you are getting better at all of them. A lot of the concepts carry over.

joebocop
297 post(s)
#09-Dec-17 21:24

Oh, PostgreSQL CALL() returns nothing, good call. Pun intended there, big time.

adamw


7,307 post(s)
#06-Dec-17 08:24

Just use multiple functions.

The setup, I added a new field to both tables:

--SQL9

CREATE TABLE lines (mfd_id INT64INDEX mfd_id_x BTREE (mfd_id),

  v INT32, name NVARCHARpop INT32);

INSERT INTO lines (v) VALUES (1), (2), (3);

CREATE TABLE points (mfd_id INT64INDEX mfd_id_x BTREE (mfd_id),

  v INT32, name NVARCHARpop INT32);

INSERT INTO points (v, name, popVALUES

 (1, 'a', 100), (1, 'b', 150), (2, 'c', 200), (4, 'd', 250);

The update:

--SQL9

FUNCTION anyname(l_v INT32NVARCHAR AS (

  SELECT First(name) FROM points WHERE v=l_v

END;

FUNCTION sumpop(l_v INT32INT32 AS (

  SELECT Sum(popFROM points WHERE v=l_v

END;

UPDATE (

  SELECT mfd_id, name, anyname(v) AS p_name, pop, sumpop(v) AS p_pop FROM lines

SET name = p_name, pop = p_pop;

If you want, you can compute aggregates against the exact same subtable - eg, by having one more function that computes them all and returns them as a table, and converting individual functions to take that table and fetch the relevant value.

Manifold User Community Use Agreement Copyright (C) 2007-2017 Manifold Software Limited. All rights reserved.