Subscribe to this thread
Home - General / All posts - max value in a row of columns in m9
tgazzard
133 post(s)
online
#08-May-18 07:58

hi,

I would like to do the following (which works in m8):

update d set max1 = max(a,b,c,d,e,f,g);

In m9 it returns

"Invalid object reference"

I have looked at the help but it isn't obvious as to whether this is possible.

Also is there a way to select the 2nd highest number (or 3rd highest etc) in the row using a single query.

The table is 63 million rows long and it will be most efficient to write only one query rather than multiple iif statements.

A similar query is set out in the following location

http://www.georeference.org/forum/t105766.7#105775

Thanks

tjhb
8,089 post(s)
#08-May-18 09:18

It is not possible that way in 9. You can use nested pairwise ValueMax() functions, each taking just two arguments—that is a little bit better than sequential CASE tests, but not much.

Another way might be a pivot into multiple rows (using CASE) followed by a COLLECT statement. That might be very worthwhile for the second objective, i.e. finding the second or third ranked item (etc.).

I could help with this tomorrow (it is late here).

tgazzard
133 post(s)
online
#08-May-18 09:53

Thanks.

I will have a go at the valuemax approach. If possible could you provide an example of the pivot approach?

tg

tjhb
8,089 post(s)
#08-May-18 09:56

Yes, assuming I can make it work! More tomorrow (unless I am beaten to it).

adamw


7,968 post(s)
#08-May-18 10:16

Maybe something like this:

--SQL9

FUNCTION highestN(a1 INT32, a2 INT32, a3 INT32, a4 INT32, a5 INT32, a6 INT32,

    n INT32INT32 AS (

  SELECT v FROM (VALUES (a1), (a2), (a3), (a4), (a5), (a6) AS (v))

  ORDER BY v DESC OFFSET n FETCH 1

END;

You can then use this in your query.

#

> ? highestN(100, 500, 800, 200, 300, 700, 2)

int32: 500

The only gotcha is that the offset is zero-based (so, if you want the biggest number, pass 0).

tjhb
8,089 post(s)
#08-May-18 10:25

Oooh that is very nice! Much more elegant than what I had in mind. One to keep.

If we really wanted to change to a one-based index we could use

--SQL9

ORDER BY v ASC OFFSET 7 - n FETCH 1

but that is less readable.

tjhb
8,089 post(s)
#08-May-18 11:09

Oops, OFFSET 6 - n (not 7 - n)

tgazzard
133 post(s)
online
#09-May-18 06:30

This is very neat.

I have this working now using tjhb variation and added in some additional values (13 and changed to float 64). This is very nifty with the ability to choose the number of interest at will.

I was wondering if there is a way to speed this up. Takes about 2 minutes (on a high spec'd machine) to run through about 500000 records..... For example the use of pragma or threads....

PRAGMA ('gpgpu'='aggressive', 'gpgpu.fp'='32');

update d set max1 =

HighestN

(vari_20131218,vari_20140204,vari_20140815,vari_20141221,

vari_20150106,vari_20151005,vari_20151208,vari_20151224,

vari_20160109,vari_20160430,vari_20170228,vari_20170620,vari_20180420,

1);

Threads 8 Batch 10000

Does having the pragma and threads bits make any difference here? Still haven't got my head around whether they are needed... Or rather when are they needed?

tjhb
8,089 post(s)
#09-May-18 06:49

More later, but the PRAGMA here makes no difference, and THREADS... BATCH... might be better with much smaller BATCH (I would start with 1 in this case, then try increasing powers of 2).

To speed things up further it may be better to inline the function (there is some minor overhead).

tjhb
8,089 post(s)
#09-May-18 07:19

I don’t know (at all) if an inline cross join would be faster than a function per record.

No idea!

It would be really great to try with some same data.

(Not at all a trivial question, important.)

tgazzard
133 post(s)
online
#10-May-18 23:29

Hi tjbh,

Some data to have a look at. I tried putting 100000 rows in the dataset, but can't post the file at 20mb in size.

The data comes from a series of landsat 8 images. We are creating some live fuel moisture content outputs (lfmc).

The dataset is halfway through the lfmc process.

Attachments:
vari_10000_rows.mxb

adamw


7,968 post(s)
#10-May-18 14:18

If there are concerns about performance, then it is best to rewrite the function to do a series of ValueMax calls if you are looking for the highest value, or call a script if you are looking for the N-th highest value. The slowdown likely comes from wrapping numeric values into an intermediate table with VALUES, then sorting it as a table - doing this 500,000 times does have a cost.

THREADS should help. Rewrite the query to UPDATE (SELECT ...) and put THREADS into the SELECT. Eg: UPDATE (SELECT mfd_id, max1, highestN(...) AS val FROM ... THREADS 8) SET max1=val. BATCH does not matter much, although 10,000 is perhaps too high (no need to buffer that much data, this just eats memory for no reason), use 256 or so.

tgazzard
133 post(s)
online
#10-May-18 23:22

Hi Adam,

Thanks for the tips. I tried rewriting the query, but it has similar performance outcomes.

SQL9

update (

select mfd_id, max1, HighestN

(vari_20131218,vari_20140204,vari_20140815,vari_20141221,vari_20150106,

vari_20151005,vari_20151208,vari_20151224,vari_20160109,vari_20160430,

vari_20170228,vari_20170620,vari_20180420,

1) as val from d threads 8 batch 256

)

 set max1 = val;

This is the sql8 version that gets the outcome I was after. It was ultimately faster then using the function approach. But obviously requires additional writing time. Everything is a trade-off.

It would probably be worth applying sloots approach below.... This is probably likely to the fastest approach in 9 (from my basic understanding). Although nothing beats the basic max(a,b,c,d,e,f,g) query in 8.

SQL8

update d set max1 = max(vari_20131218, vari_20140204, vari_20140815, vari_20141221, vari_20150106, vari_20151005, vari_20151208,

 vari_20151224, vari_20160109, vari_20160430, vari_20170228, vari_20170620, vari_20180420);

update d set max2 = iif(vari_20131218=max1,vari_20140204,vari_20131218);

update d set max2 = iif(vari_20140204 < max1 and vari_20140204 > max2,vari_20140204,max2);

update d set max2 = iif(vari_20140815 < max1 and vari_20140815 > max2,vari_20140815,max2);

update d set max2 = iif(vari_20141221 < max1 and vari_20141221 > max2,vari_20141221,max2);

update d set max2 = iif(vari_20150106 < max1 and vari_20150106 > max2,vari_20150106,max2);

update d set max2 = iif(vari_20151005 < max1 and vari_20151005 > max2,vari_20151005,max2);

update d set max2 = iif(vari_20151208 < max1 and vari_20151208 > max2,vari_20151208,max2);

update d set max2 = iif(vari_20151224 < max1 and vari_20151224 > max2,vari_20151224,max2);

update d set max2 = iif(vari_20160109 < max1 and vari_20160109 > max2,vari_20160109,max2);

update d set max2 = iif(vari_20160430 < max1 and vari_20160430 > max2,vari_20160430,max2);

update d set max2 = iif(vari_20170228 < max1 and vari_20170228 > max2,vari_20170228,max2);

update d set max2 = iif(vari_20170620 < max1 and vari_20170620 > max2,vari_20170620,max2);

update d set max2 = iif(vari_20180420 < max1 and vari_20180420 > max2,vari_20180420,max2);

update d set max3 = iif(vari_20131218=max2,vari_20131218,-2);

update d set max3 = iif(vari_20140204 < max2 and vari_20140204 > max3,vari_20140204,max3);

update d set max3 = iif(vari_20140815 < max2 and vari_20140815 > max3,vari_20140815,max3);

update d set max3 = iif(vari_20141221 < max2 and vari_20141221 > max3,vari_20141221,max3);

update d set max3 = iif(vari_20150106 < max2 and vari_20150106 > max3,vari_20150106,max3);

update d set max3 = iif(vari_20151005 < max2 and vari_20151005 > max3,vari_20151005,max3);

update d set max3 = iif(vari_20151208 < max2 and vari_20151208 > max3,vari_20151208,max3);

update d set max3 = iif(vari_20151224 < max2 and vari_20151224 > max3,vari_20151224,max3);

update d set max3 = iif(vari_20160109 < max2 and vari_20160109 > max3,vari_20160109,max3);

update d set max3 = iif(vari_20160430 < max2 and vari_20160430 > max3,vari_20160430,max3);

update d set max3 = iif(vari_20170228 < max2 and vari_20170228 > max3,vari_20170228,max3);

update d set max3 = iif(vari_20170620 < max2 and vari_20170620 > max3,vari_20170620,max3);

update d set max3 = iif(vari_20180420 < max2 and vari_20180420 > max3,vari_20180420,max3);

tjhb
8,089 post(s)
#11-May-18 04:02

That is laudable, but nuts!

In M8, why make ~27 updates when one would be enough?

There is a real need for actual data here. Abstract questions only go so far.

tgazzard
133 post(s)
online
#11-May-18 06:38

Actual data has been provided. See above.

Happy to be shown another way to get the second and third max value.

The writing it out is pretty quick using excel. Just need to write one query and reference the column names.

Runs significantly faster then function approach (like hours). Sure maybe nuts, but efficient

tjhb
8,089 post(s)
#11-May-18 07:19

No, it is not efficient!

But I will have a look at the map file you attached, to try something better.

Sorry I missed it above. Thanks.

adamw


7,968 post(s)
#11-May-18 17:07

Thanks.

When you say you tried rewriting the query, did you change the internals of HighestN from the SELECT ... ORDER to a series of ValueMax? The latter should perform much better - as confirmed by Sloots below.

We agree we should have a way of just writing ValueMax(..., ..., ..., ......) with all values in a single call - and perhaps even a built-in version of HighestN with all values and an explicit index to select. That's something for the wishlist.

Sloots

423 post(s)
#08-May-18 10:18

I have created a table with four columns (a, b, c, d, an mfd_id and a field called mx). This query updates the mx field with the maximum value of a, b, c and d.

-- $manifold$

update (

select

[mfd_id], [mx],

ValueMax(

ValueMax([a], [b]),

ValueMax([c], [d])

) AS new_mx

from

[Table]

) set [mx] = [new_mx]


http://www.mppng.nl/manifold/pointlabeler

tgazzard
133 post(s)
online
#09-May-18 06:34

Thanks Sloots. I can see how this would work.... Given the number of columns being evaluated I have opted to go with the function approach outlined above by adamw and tjhb

Sloots

423 post(s)
#10-May-18 19:59

I did some timing on a 500K table with four columns.

AdamW's (elegant SQLish) solution:

2018-05-10 20:44:56 -- Query: [Query 2] (79.105 sec)

2018-05-10 20:47:25 -- Query: [Query 2] (87.099 sec)

2018-05-10 20:49:25 -- Query: [Query 2] (83.481 sec)

My (less flexible) nested function solution:

2018-05-10 20:35:07 -- Query: [Query] (2.727 sec)

2018-05-10 20:36:03 -- Query: [Query] (2.579 sec)

2018-05-10 20:41:37 -- Query: [Query] (2.510 sec)

Approx. 33 times faster. I often work with over 10 million records. Expected time 25 minutes vs 50 seconds. I have made up my mind ;-)


http://www.mppng.nl/manifold/pointlabeler

Manifold User Community Use Agreement Copyright (C) 2007-2017 Manifold Software Limited. All rights reserved.