Hard to choose which post to reply under, so I will reply here.

The indexes are all good, as Tim says: SELECT [i].[record_id] ... and SELECT [d].[record_id] ... both work because both [d] and [i] have all the necessary indexes. The original DELETE ... WHERE [record_id] IN (<join>) is not wrong either; the join does expose the index that the IN can use, and IN ends up using that index. However, there is overhead related to seeding the join for each record. We do not re-evaluate the join for each record, of course, but there are internal structures that must be re-initialized, and with many records that cost adds up. We will look into what we can do to improve this.

What helps is splitting the task into two parts: first compute which records have to be deleted and put the result somewhere static, then delete those records. Setting a boolean field and then deleting records based on that field would perhaps work, but here is a more straightforward solution.

The setup: 50,000 records in [dest], 10,000 in [incoming_data], all "new":

--SQL9
CREATE TABLE [dest] (
  [record_id] NVARCHAR,
  [updated_at] FLOAT64,
  [value] NVARCHAR,
  INDEX [record_id_x] BTREE ([record_id]),
  INDEX [updated_at_x] BTREEDUPNULL ([updated_at])
);

INSERT INTO [dest] ([record_id], [updated_at], [value])
SELECT 'c' & CAST([value] AS NVARCHAR), 3, [value]
FROM CALL ValueSequence(1, 50000, 1);

CREATE TABLE [incoming_data] (
  [record_id] NVARCHAR,
  [updated_at] FLOAT64,
  [value] NVARCHAR,
  INDEX [record_id_x] BTREE ([record_id]),
  INDEX [updated_at_x] BTREEDUPNULL ([updated_at])
);

INSERT INTO [incoming_data] ([record_id], [updated_at], [value])
SELECT 'c' & CAST([value] AS NVARCHAR), 4, [value]
FROM CALL ValueSequence(20000, 29999, 1);

First, deleting using the query from the first post:

--SQL9
DELETE FROM [dest]
WHERE [record_id] IN (
  SELECT [i].[record_id]
  FROM [incoming_data] AS [i]
  INNER JOIN [dest] AS [d] ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at]
);
-- 5.591 sec

Then restore the original data:

--SQL9
DELETE FROM [dest];

INSERT INTO [dest] ([record_id], [updated_at], [value])
SELECT 'c' & CAST([value] AS NVARCHAR), 3, [value]
FROM CALL ValueSequence(1, 50000, 1);

Now delete using the modified method: put all keys to be deleted into a temporary table, delete all keys found in that table, then drop the temporary table:

--SQL9
SELECT [i].[record_id] INTO [temp]
FROM [incoming_data] AS [i]
INNER JOIN [dest] AS [d] ON [i].[record_id] = [d].[record_id]
WHERE [i].[updated_at] > [d].[updated_at];

DELETE FROM [dest] WHERE [record_id] IN [temp];

DROP TABLE [temp];
-- 0.945 sec

The second version is only a minor restructuring of the original query, but it runs noticeably faster.

Also, putting the SELECT into a VALUE won't help for now, but we might add a function that caches the table; then we could use a VALUE and avoid creating temporary tables in the database. Like this:

--SQL9
VALUE @delete TABLE = CALL Cache( -- the Cache() call does not exist yet
  SELECT [i].[record_id]
  FROM [incoming_data] AS [i]
  INNER JOIN [dest] AS [d] ON [i].[record_id] = [d].[record_id]
  WHERE [i].[updated_at] > [d].[updated_at]
);

DELETE FROM [dest] WHERE [record_id] IN @delete;

But first we would like to check whether there is something we can do to reduce the amount of internal bookkeeping in the very first query. I will put this on the list of things to do.

Hope this helps.
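For readers who want to try the materialize-then-delete pattern outside SQL9, here is a minimal sketch of the same two-step idea using SQLite via Python. This is an illustration of the general technique only, not SQL9 syntax; the table and column names mirror the setup above, and the temporary table name keys_to_delete is made up for this sketch.

```python
import sqlite3

# In-memory SQLite database mirroring the setup in the post.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE dest (record_id TEXT, updated_at REAL, value TEXT)")
cur.execute("CREATE INDEX dest_record_id_x ON dest (record_id)")
cur.execute("CREATE TABLE incoming_data (record_id TEXT, updated_at REAL, value TEXT)")
cur.execute("CREATE INDEX incoming_record_id_x ON incoming_data (record_id)")

# 50,000 rows in dest, and 10,000 overlapping rows in incoming_data
# that carry a newer updated_at, matching the setup above.
cur.executemany("INSERT INTO dest VALUES (?, ?, ?)",
                (("c%d" % v, 3.0, str(v)) for v in range(1, 50001)))
cur.executemany("INSERT INTO incoming_data VALUES (?, ?, ?)",
                (("c%d" % v, 4.0, str(v)) for v in range(20000, 30000)))

# Step 1: materialize the keys to delete into a static temporary table,
# so the join is evaluated exactly once.
cur.execute("""
    CREATE TEMP TABLE keys_to_delete AS
    SELECT i.record_id
    FROM incoming_data AS i
    INNER JOIN dest AS d ON i.record_id = d.record_id
    WHERE i.updated_at > d.updated_at
""")

# Step 2: delete against the static key set instead of the live join.
cur.execute("DELETE FROM dest WHERE record_id IN "
            "(SELECT record_id FROM keys_to_delete)")
cur.execute("DROP TABLE keys_to_delete")
con.commit()

remaining = cur.execute("SELECT COUNT(*) FROM dest").fetchone()[0]
print(remaining)  # 40000: the 10,000 outdated rows were deleted
```

Whether the two-step form is actually faster depends on the engine; SQLite may already flatten the original subquery well, so the timing benefit described above is specific to the SQL9 engine.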