I'm bad at deletes
joebocop
514 post(s)
#02-Aug-18 20:22

My SQL deletes are slow; is there a better way of writing them?

I have a table

CREATE TABLE [dest] (
  [record_id] NVARCHAR,
  [updated_at] FLOAT64,
  [value] NVARCHAR,
  INDEX [record_id_x] BTREE ([record_id]),
  INDEX [updated_at_x] BTREEDUPNULL ([updated_at])
);

[dest] has about 10k records in it.

I have data arriving every day, destined for [dest]. These data are either added to [dest], or they replace existing records in [dest].

Data "arrive" from [incoming_data], which has the same schema and indices as [dest].

To get the "replace existing records in [dest]" behaviour, I do this DELETE

DELETE FROM [dest] WHERE [record_id] IN 
(
  SELECT [i].[record_id] FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at]
);

With about 500 records in [incoming_data], this query takes ~3.5 minutes to run. That's true even when the inner subquery producing the table of record_ids returns 0 records, and itself runs in less than 1s.
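For completeness: after the DELETE, I add the incoming rows that [dest] doesn't already have, to finish the add-or-replace step. Something like this sketch (assuming identical schemas, as noted above):

--SQL9
INSERT INTO [dest] ([record_id], [updated_at], [value])
SELECT [record_id], [updated_at], [value]
FROM [incoming_data]
WHERE [record_id] NOT IN (SELECT [record_id] FROM [dest]);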

What can I be optimizing here? Should I be trying to delete records more like this?

DELETE FROM [dest] As [d] INNER JOIN [incoming_data] As [i] ON [d].[record_id] = [i].[record_id] WHERE [i].[updated_at] > [d].[updated_at]

My thanks yet again for sharing your expertise.

tjhb
10,094 post(s)
#02-Aug-18 22:53

The indexes look good (though there is a tweak you can try). That's the most important thing.

The SQL can be improved. What you have currently is a join, followed by a WHERE filter.

It would be more efficient to use a join with two conditions, and no subsequent filter.

--SQL9

DELETE FROM [dest] WHERE [record_id] IN 
(
  SELECT [i].[record_id] FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  AND [i].[updated_at] > [d].[updated_at]
);

That's just swapping WHERE for AND, but it is a significant change in meaning. It should be much faster.

Now, both fields used in the join are separately indexed, which is good. Would it be even better to create a composite index including both of them?

I don't know in this case. It might be clearer if we could also use the tuple ([record_id], [updated_at]) in the join condition, but here we can't,* because of the '>' test: tuple comparison is lexicographic, so a greater [record_id] would satisfy it regardless of the [updated_at] values (e.g. ('b', 1) > ('a', 9)), and that would be wrong.

[*As I write this, I think it may be wrong. I'll come back to it.]

If you have time to test and compare timings, I would be very interested to know.

--SQL9

ALTER TABLE [dest] (
  ADD INDEX [record_updated_x] BTREE ([record_id], [updated_at])
);
ALTER TABLE [[incoming_data] (
  ADD INDEX [record_updated_x] BTREE ([record_id], [updated_at])
);

DELETE FROM -- ...as above

We know these indexes can be of type BTREE: since field [record_id] contains unique values in each table, the tuple ([record_id], [updated_at]) must also be unique in each table.

joebocop
514 post(s)
#03-Aug-18 00:23

Thanks tjhb.

1st test: Changing WHERE to AND in the subquery had no impact on performance. That subquery on its own still returns in less than 1s, but wrap it in the delete, and the whole thing takes close to 4 minutes to run.

2nd test: I left the changes in 1st test intact, and added the composite index to both tables. Performance is unchanged; about 4 minutes to run.

3rd test: I removed the original BTREE and BTREEDUPNULL indexes from the two tables, leaving only the composite BTREE index from the 2nd test on each table. Performance is much worse; ~10 minutes to run.

My impression (wish we had EXPLAIN ANALYZE here) is that the subquery is run once for each row in [dest]. As [dest] grows, I expect this query plan to become unusable.

Is a better approach to UPDATE a dummy field in a subquery that JOINs [incoming_data] and [dest], and then delete from [dest] where dummy_field = 1 or something like that? This greasy little code snippet runs in less than 1s.

ALTER TABLE [dest] (
  ADD [del_me] BOOLEAN,
  ADD INDEX [del_me_x] BTREEDUPNULL ([del_me])
);
UPDATE (
  SELECT
    [d].[del_me], [d].[record_id]
  FROM
    [dest] AS [d] INNER JOIN [incoming_data] AS [i]
      ON [d].[record_id] = [i].[record_id] AND [i].[updated_at] > [d].[updated_at]
)
SET [del_me] = 1;
DELETE FROM [dest] WHERE [del_me];
ALTER TABLE [dest] (
  DROP INDEX [del_me_x],
  DROP [del_me]
);

tjhb
10,094 post(s)
#03-Aug-18 00:33

That's interesting, not what I expected. Thanks for testing.

My impression (wish we had EXPLAIN ANALYZE here) is that the subquery is run once for each row in [dest]. As [dest] grows, I expect this query plan to become unusable.

You might be right, but I don't think that's how it should be, since there is no correlation...

...Hold on...

I just noticed the field you are using to list items for deletion. It's from the other table! This means, I think, that the source index cannot be used in the outer DELETE statement. The lookup becomes unindexed, which is slow.

Can you try changing to

--SQL9

DELETE FROM [dest] WHERE [record_id] IN 
(
  SELECT [d].[record_id]
  FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  AND [i].[updated_at] > [d].[updated_at]
);

A related idea: use an outer join, since we only care about testing one table. E.g.

--SQL9

DELETE FROM [dest] WHERE [record_id] IN 
(
  SELECT [d].[record_id]
  FROM [dest] AS [d] LEFT JOIN [incoming_data] AS [i]
  ON [d].[record_id] = [i].[record_id]
  AND [d].[updated_at] > [i].[updated_at]
);

joebocop
514 post(s)
#03-Aug-18 00:43

Thank you.

I just noticed the field you are using to list items for deletion. It's from the other table! This means, I think, that the source index cannot be used in the outer DELETE statement.

Interesting, and good catch.

I modified the subquery to expose the field from the same table. Performance is, again, around 4 minutes. I made sure to re-create the BTREE and BTREEDUPNULL indexes on [record_id] and [updated_at], respectively, before running this test.

I'll try the LEFT JOIN this evening. Kids home soon, supper needs to be ready!

Thank you again; I always learn a tonne on threads where you weigh in. Much appreciated.

tjhb
10,094 post(s)
#03-Aug-18 00:49

Well, bother.

I don't think the LEFT JOIN will help much at all since, as you say, the join itself is already lightning fast.

I can't see why the DELETE statement should be slow. If I have any more startling thoughts to waste your time, I'll let you know.

Bon appetit!

tjhb
10,094 post(s)
#03-Aug-18 00:56

OK, for pudding. I often forget this.

The join will transform the BTREE on [d].[record_id] into BTREEDUP (for INNER JOIN) or BTREEDUPNULL (for LEFT JOIN). This is because it is possible for each record in [d] to be matched with multiple records from [i] or, in the case of LEFT JOIN, no records from [i].

(You can see the converted index type by running just the subquery then pressing Ctrl-E for its schema.)
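For instance, run just the inner join from the query above on its own, then check the schema of its result:

--SQL9
SELECT [i].[record_id]
FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
ON [i].[record_id] = [d].[record_id]
AND [i].[updated_at] > [d].[updated_at];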

So the lookup in the DELETE is matching a BTREE in the original DEST table with a BTREEDUP or BTREEDUPNULL from the subquery.

Does that slow down the lookup? Can the different index types be used (efficiently), or do we fall back to an unindexed search?

tjhb
10,094 post(s)
#03-Aug-18 03:49

Two quick comments on my last post.

1.

I was wrong about BTREE being converted to BTREEDUP or BTREEDUPNULL in this case. The engine knows that the [record_id] fields in both tables are unique, because they both have BTREEs, so multiple matches are not possible.

In the case of INNER JOIN, this means a BTREE index can be kept in the output, and it is.

In the case of LEFT JOIN, the same. There might be NULLs on the right, but no duplicate matches; again the BTREE from the left can be kept, and it is.

(For a RIGHT JOIN I would have expected conversion to a BTREENULL index, but instead we get no output index at all. To me this looks like a misjudgement on the part of the engine, but I have been wrong about so many things in this thread already that I will stay quiet about it.)

2.

It turns out that it is *much* better here to list records for deletion by their [mfd_id] than by some other field that carries a BTREE index (here [record_id]).

The [mfd_id] field is somehow privileged. If there is no [mfd_id], it seems worth adding one for this purpose.
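A sketch of what I mean, assuming [dest] keeps its default [mfd_id] key:

--SQL9
DELETE FROM [dest] WHERE [mfd_id] IN 
(
  SELECT [d].[mfd_id]
  FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  AND [i].[updated_at] > [d].[updated_at]
);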

tjhb
10,094 post(s)
#03-Aug-18 04:07

[Need to check some things.]

joebocop
514 post(s)
#03-Aug-18 04:10

WRT mfd_id being more efficient: my [record_id] field is type NVARCHAR, but the values themselves are UUID strings like 'c0f96f4d-6707-4715-bcf8-681c6c1c00da'. I see there is a field type UUID in 9, but I had imagined it was only useful as a calculated field using UuidMakeNew().

When I'm back at the desk tomorrow, I'll try converting these [record_id] NVARCHAR fields to type UUID and see if that speeds things up. I'll also try deleting where mfd_id IN (...).
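Probably something along these lines for the conversion (a sketch; I haven't verified that an NVARCHAR-to-UUID cast works, so treat the CAST as an assumption):

--SQL9
ALTER TABLE [dest] (
  ADD [record_uuid] UUID
);
-- assumes the engine can cast UUID-formatted strings to UUID
UPDATE [dest] SET [record_uuid] = CAST([record_id] AS UUID);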

tjhb
10,094 post(s)
#03-Aug-18 04:14

Good thinking. This is all fun, but I'm sorry to be getting so much wrong, and it all makes my head hurt!

Thanks for your patience.

adamw
10,447 post(s)
#04-Aug-18 08:39

Missed this in the reply below.

It turns out that it is *much* better here to list records for deletion by their [mfd_id], than by some other field that carries a BTREE index (here [record_id]).

The special treatment for mfd_id should be limited to auto-generating values. For scans it should make no difference whether the indexed field is mfd_id or something else.

Perhaps it's the type difference: as joebocop says above, INT64 is more efficient than NVARCHAR.

For that matter, UUID is more efficient than NVARCHAR as well. (And INT64 and UUID are largely the same except UUID is bigger so it's slower to that extent.)

adamw
10,447 post(s)
#04-Aug-18 08:35

Hard to choose which post to reply under, will reply here.

The indexes are all good, like Tim says.

SELECT [i].[record_id] ... and SELECT [d].[record_id] ... both work because both [d] and [i] have all necessary indexes.

The original DELETE ... WHERE [record_id] IN (<join>) is not wrong: the join does expose the index that IN can use, and IN ends up using that index, but there is overhead related to seeding the join for each record. We do not re-evaluate the join for each record, of course, but there are internal structures which need to be re-initialized, and because there are many records that ends up costing too much. We will look into what we can do to improve this.

What helps is splitting the task into two parts: compute which records have to be deleted and put the result somewhere static, then delete the records. Setting a boolean field and then deleting records based on that field would perhaps work, but here is a more straightforward solution:

The setup, 50,000 records in dest, 10,000 in incoming_data, all "new":

--SQL9

CREATE TABLE [dest] (
  [record_id] NVARCHAR,
  [updated_at] FLOAT64,
  [value] NVARCHAR,
  INDEX [record_id_x] BTREE ([record_id]),
  INDEX [updated_at_x] BTREEDUPNULL ([updated_at])
);
INSERT INTO [dest] ([record_id], [updated_at], [value])
SELECT 'c' & CAST([value] AS NVARCHAR), 3, [value]
  FROM CALL ValueSequence(1, 50000, 1);

CREATE TABLE [incoming_data] (
  [record_id] NVARCHAR,
  [updated_at] FLOAT64,
  [value] NVARCHAR,
  INDEX [record_id_x] BTREE ([record_id]),
  INDEX [updated_at_x] BTREEDUPNULL ([updated_at])
);
INSERT INTO [incoming_data] ([record_id], [updated_at], [value])
SELECT 'c' & CAST([value] AS NVARCHAR), 4, [value]
  FROM CALL ValueSequence(20000, 29999, 1);

First, deleting using the query in the first post:

--SQL9

DELETE FROM [dest] WHERE [record_id] IN 
(
  SELECT [i].[record_id] FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at]
);
-- 5.591 sec

Then restore original data:

--SQL9

DELETE FROM [dest];
INSERT INTO [dest] ([record_id], [updated_at], [value])
SELECT 'c' & CAST([value] AS NVARCHAR), 3, [value]
  FROM CALL ValueSequence(1, 50000, 1);

Now deleting using a modified method: put the keys to delete into a temporary table, then delete all records whose keys are found in that table, then drop the temporary table:

--SQL9

SELECT [i].[record_id] INTO [temp]
  FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at];
DELETE FROM [dest] WHERE [record_id] IN [temp];
DROP TABLE [temp];
-- 0.945 sec

The second query is a minor restructuring of the original query, but runs noticeably faster.

Also, putting SELECT into a VALUE won't help for now, but we might add a function that will cache the table and then we could use VALUE and avoid creating temporary tables in the database. Like this:

--SQL9

VALUE @delete TABLE = CALL Cache( -- Cache is hypothetical; no such function yet
  SELECT [i].[record_id]
  FROM [incoming_data] AS [i] INNER JOIN [dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at]
);
DELETE FROM [dest] WHERE [record_id] IN @delete;

But first we'd like to check whether there is something we can do to reduce the amount of internal bookkeeping we have to do for the very first query. I will put this onto the list of things to do.

Hope this helps.

tjhb
10,094 post(s)
#04-Aug-18 21:11

Interesting that WHERE IN can operate so much faster on a fully-formed table (a temporary table) than on a virtual table from a subquery. I expect a corollary is that creating virtual tables can remain very fast, i.e. subquery (and FUNCTION... TABLE) overhead can remain very small.

The temporary table used can of course be added to a temporary root database.

On your last example, for the future: could you also consider allowing indexes, and perhaps properties, to be added and changed on a table attached to a VALUE? That would often be very useful (especially indexes). I suppose this would either require, or perhaps entail, caching?

adamw
10,447 post(s)
#05-Aug-18 10:28

We were thinking about maybe allowing ALTER TABLE @value (...), but we are not sure this is a good idea, because (a) we'd likely have to restrict ALTER TABLE on a value to disallow adding / dropping properties, and (b) neither CREATE TABLE @newvalue (...) nor DROP TABLE @value looks particularly good or desirable. This feels like the point where it is time for the query to bite the bullet and use a temporary database.

Cache(...) is different only because it is very straightforward and lets us have both table values that are lightweight and do not cache data and table values that do cache data, because these scenarios are both very frequent (if table values were currently caching data, we'd have been thinking about how to add the lightweight variant).

Maybe we should add some syntax for referencing temporary databases in a query, to make them easier to use. Instead of CREATE ROOT x / USE ROOT x / CREATE DATASOURCE y AS ROOT, and then writing queries that use tables on the temporary database as (name) and on the persistent database as (y::name), we could perhaps have just CREATE ROOT x and immediately allow queries to refer to tables on the persistent database as (name) and on the temporary database as (something like @y::name, maybe using a different character).
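Spelled out, the current pattern is roughly this (a sketch based on the description above; [x] and [y] are illustrative names):

--SQL9
CREATE ROOT [x];
USE ROOT [x];
CREATE DATASOURCE [y] AS ROOT;
-- unqualified names now refer to the temporary database;
-- tables on the persistent database are reached as [y]::name
SELECT [i].[record_id] INTO [temp]
  FROM [y]::[incoming_data] AS [i] INNER JOIN [y]::[dest] AS [d]
  ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at];
DELETE FROM [y]::[dest] WHERE [record_id] IN [temp];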

adamw
10,447 post(s)
#04-Aug-18 08:51

For completeness, I'll say that there is a variation of the first query that immediately came to mind when I saw it, but it also suffers from our current internal overhead:

--SQL9

DELETE FROM [dest] WHERE EXISTS
(
  SELECT [i].[record_id] FROM [incoming_data] AS [i]
  WHERE [i].[record_id] = [dest].[record_id]
  AND [i].[updated_at] > [dest].[updated_at]
);
-- 10.073 sec

This works and does use indexes, but converting to a join makes things faster.

tjhb
10,094 post(s)
#03-Aug-18 00:23

(Sorry for extra [ in second ALTER statement above.)

tjhb
10,094 post(s)
#02-Aug-18 23:35

*I worked through my further thought about using a tuple for the join, and the second thought was wrong. Right the first time.
