TRANSFORM ... PIVOT using M9
apo
#19-Sep-18 07:17

Loving the power of M9 with huge datasets. I'm looking for a means to replicate the very useful SQL statement TRANSFORM.

I know we can replicate this using CASE WHEN, but it means we have to generate it manually for each item. My interest is more in finding a couple of functions in order to automate this. Is there a way to generate SQL using a SQL function? Or has anybody already given birth to such a tool in M9?

I would appreciate any idea.

tjhb

#19-Sep-18 07:29

Discussion here, with detailed explanation and teaching by Adam.

apo
#19-Sep-18 12:18

Thanks for the link to the discussion, but it is exactly what I want to avoid. As mentioned by Art Lembo in that discussion, it requires a priori knowledge of the fields I want to inject into the columns. This is the reason why I'm looking for a way to inject the result of the selection of the categories into the pivot SQL without having to type or copy/paste the list.

This will be useful for me for extracting data from my silos with a list of categories.

tjhb

#19-Sep-18 13:13

Well no, it requires a posteriori knowledge, and as Adam says, that can be discovered by you--or by a script function. The trick is to make it as reusable as possible, with zero typing on each use.

apo52 post(s)
#19-Sep-18 16:29

I might be asleep, but I can't find the a posteriori in the number of times I have to replicate the catx line! Could you explain your a posteriori?

I agree on the flexibility of Adam's approach, especially for the aggregation scheme differentiated by column. But it requires that I compose the query. In my case the problem is simpler, in that I just want to retrieve the values without aggregation, but for a lot of columns.

I have a cube with the following dimensions

  • around 6 data vintages
  • around 10 geographical levels
  • around 140'000 geometries from the various levels
  • around 2'000 variables related to each geometry in each vintage.

A normal extraction will for example be a table with

  1. for a given vintage

  • geometries of one level as rows
  • vars as columns (between 50 and 250)

For this reason, in MySQL I was using Vontobel's dynamic query for a crosstab approach, and I will try to mimic that in M9. This procedure requires the SQL to be automatically generated and run. I'm sure we can do this using a script to generate the SQL code and run it, but can we do this in a function called from the SQL?

    tjhb

    #20-Sep-18 03:32

    The a posteriori is in Adam's explanation.

    A TRANSFORM query must first inspect the data--using the equivalent of a SELECT DISTINCT. That result is the "evidence" it needs to build the PIVOT.

    We can do the same, but only by obtaining similar evidence in advance. The query structure is based on facts derived from the data (so, a posteriori).

    can we do this in a function called from the SQL

    Yes, after running a SELECT DISTINCT, and if we are prepared to do a lot of typing. (Lots of CASE and NULL expressions.)

    It's just much better to use a script.
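    The two-step pattern described here (first discover the categories, then build one CASE expression per category) can be sketched outside Manifold in a few lines of plain Python. This is only an illustration: the table and field names follow the thread's examples, and the discovered id list is hard-coded where a real script would run the SELECT DISTINCT.

```python
# Sketch of the "evidence first" approach: step 1 gathers the distinct
# categories, step 2 builds one SUM(CASE ...) column per category.
# Field/table names ([varid], [varvalue], [geomid], silo) follow the thread.

def build_pivot_sql(table, categories):
    """Build a CASE-based pivot query from an already-discovered category list."""
    cases = [
        "SUM(CASE WHEN [varid] = {0} THEN [varvalue] ELSE 0 END) AS [v{0}]".format(c)
        for c in categories
    ]
    return (
        "SELECT [geomid], " + ", ".join(cases) +
        " FROM [" + table + "] GROUP BY [geomid];"
    )

# Step 1 would normally be a query such as: SELECT DISTINCT [varid] FROM [silo]
distinct_varids = [1396, 1321, 1216]

sql = build_pivot_sql("silo", distinct_varids)
print(sql)
```

    The point is that the query text is a fact derived from the data, so a script (rather than static SQL) is the natural place to assemble it.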

    tjhb

    #20-Sep-18 03:56

    In 9, scripts can run SQL expressions and queries very economically.

    If they will be run more than once they can be pre-compiled.

    Just as queries can call script functions.

    Scripts and SQL are *both* first-class citizens in 9, and can call each other. (And yes, query within script within query... no problem.)

    This is a huge change. The situation in 8 was fundamentally different.

    Now the script-query choice is simply open.

    apo52 post(s)
    #20-Sep-18 07:18

    This highlights several ways I can now follow to interleave the queries and the script. To be continued in the next installment.

    Thanks

    tjhb

    #20-Sep-18 08:17

    Please keep us well posted.

    Your data model sounds like an education (at least for me).

    We need more of this. More cubes!

    [And if you can be bothered, please always correct my French—or even my English.]

    apo
    #20-Sep-18 10:44

    It is with pleasure that I will post my latest insights.

    I use the cubes with the named dimensions to extract data in mainly three situations:

    • fixed vintage (year), geom as rows and vars as cols
    • fixed geom, vars as rows and vintages as cols
    • fixed var, geom as rows and vintages as cols

    The transform was a must for me when juggling the dimensions, and I want to find it again in M9, mainly because of its speed. My test silo is made of 130 million records. To be honest, I love using M9 even for non-spatial databases.

    I'm targeting a SQL function in which I have to give

    • the input table
    • the fixed, rows and cols dimensions
    • and maybe an operator

    Far be it from me to correct anyone's English, or French, except my kids'.

    apo
    #21-Sep-18 07:48

    My first step is a test of the type of query I want to use to extract the data. In steps 2, 3... I will focus on the automatic creation of the query.

    Basically, two options at this stage: a CASE or a COLLECT approach. My cube table is as follows.

    --SQL

    CREATE TABLE [silo] (
      [mfd_id] INT64,
      [varid] INT32,
      [varname] NVARCHAR,
      [geomid] INT32,
      [levelid] INT32,
      [varvalue] INT32,
      [vintageid] INT32,
      INDEX [mfd_x] BTREE ([mfd_id]),
      INDEX [var_x] BTREEDUP ([varid]),
      INDEX [lvl_x] BTREEDUP ([levelid])
    );

    I started with a comparison based on a specific extraction of six variables, using the CASE approach first.

    --SQL

    SELECT [geomid],
        SUM(CASE WHEN [varid] = 1396 THEN [varvalue] ELSE 0 END) AS PTOT,
        SUM(CASE WHEN [varid] = 1321 THEN [varvalue] ELSE 0 END) AS PM,
        SUM(CASE WHEN [varid] = 1216 THEN [varvalue] ELSE 0 END) AS PF,
        SUM(CASE WHEN [varid] = 1079 THEN [varvalue] ELSE 0 END) AS P0014,
        SUM(CASE WHEN [varid] = 1103 THEN [varvalue] ELSE 0 END) AS P1539,
        SUM(CASE WHEN [varid] = 1133 THEN [varvalue] ELSE 0 END) AS P4064
    FROM silo
    WHERE [levelid] = 4
    GROUP BY [geomid];

    This approach is efficient at any level (from 1 geom to 120'000 geoms), but can be improved by injecting a constraint on [varid] into the WHERE. Here, two options:

    1. a list of [varid] = x OR [varid] = y ...
    2. a [varid] IN (...) clause, which pleased me as it proves to be more efficient (which is not the case in all database systems). Good job Manifold.
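    Both filter variants can be generated from the same id list; a quick sketch in plain Python (illustrative only, not Manifold code):

```python
# Building both WHERE variants from one list of ids (names are illustrative).
varids = [1396, 1321, 1216, 1079, 1103, 1133]

# Option 1: a chain of equality tests joined with OR
or_filter = " OR ".join("[varid] = {}".format(v) for v in varids)

# Option 2: the more efficient IN (...) clause
in_filter = "[varid] IN ({})".format(",".join(str(v) for v in varids))

print(or_filter)
print(in_filter)
```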

    You might have noticed that I didn't put any index on the [geomid] field. It was there initially, but I found that it slows everything down to hell for a reason I can't identify. The number of geoms involved doesn't change anything. If you have any information on this it would be helpful, because this is annoying. OK, for this query I can kill the index, but the idea of the cube is to rotate it, and if I have to remove/create an index according to the needs, it makes no sense (time-wise). My questions:

    • why does it slow the query?
    • is there a means to force which indexes to take profit from?

    Finally, my query looks like this, with an INTO to keep the results.

    --SQL

    SELECT [geomid],
        SUM(CASE WHEN [varid] = 1396 THEN [varvalue] ELSE 0 END) AS PTOT,
        SUM(CASE WHEN [varid] = 1321 THEN [varvalue] ELSE 0 END) AS PM,
        SUM(CASE WHEN [varid] = 1216 THEN [varvalue] ELSE 0 END) AS PF,
        SUM(CASE WHEN [varid] = 1079 THEN [varvalue] ELSE 0 END) AS P0014,
        SUM(CASE WHEN [varid] = 1103 THEN [varvalue] ELSE 0 END) AS P1539,
        SUM(CASE WHEN [varid] = 1133 THEN [varvalue] ELSE 0 END) AS P4064
    INTO test
    FROM silo
    WHERE [levelid] = 6
    AND [varid] IN (1396,1321,1216,1079,1103,1133)
    GROUP BY [geomid];

    apo
    #21-Sep-18 08:26

    I have to add a nuance to what I said on the index issue. The problem is always there when !fullfetch is on. OK, that might be linked to fullfetch, which should be avoided with large queries anyway.

    But it is also present with the INTO. The strange thing is that the progress window remains open until I close it with a cancel, but the records have long since been written to the table. Whether I cancel 10 minutes or 10 seconds after the run, the data are there!?! The problem is solved by removing the index on [geomid].

    adamw


    #24-Sep-18 14:36

    You might have noticed that I didn't put any index on the [geomid] field. It was there initially but I found that it slows everything down to hell for a reason I can't identify.

    We'll check what is going on here, this seems wrong.

    You cannot force the indexes to use, but you can have your query copy all data into a temporary table with just the indexes that you want - that way the indexes that slow things down won't be there. This is a nuclear option, but it is there as a last resort.

    apo
    #24-Sep-18 14:39

    Yes I could, but this is a bit slower compared to killing the index and reinjecting it afterwards, which is my substitute for switching it off.

    apo
    #26-Sep-18 19:31

    I don't have all the answers to my questions yet, especially on the index issues, but it doesn't matter for my next step, namely having a script to generate the SQL and run it.

    The following code works well in my experience, but can surely be improved. At this stage it retrieves the variable names based on the ids using a first query, generates the second query (CASE WHEN...) based on the first result, and finally generates the pivot table.

    // C#

    class Script
    {
        static Manifold.Context Manifold;

        static void Main()
        {
            Manifold.Application app = Manifold.Application;
            using (Manifold.Database db = app.GetDatabaseRoot())
            {
                string varidlist = "1396,1321,1216,1079,1103,1133";
                string levelid = "2";

                // First query: retrieve the variable names for the requested ids.
                string text = "SELECT * FROM varlist WHERE varid IN (" + varidlist + ")";
                using (Manifold.Table table = db.Run(text))
                using (Manifold.Sequence sequence = table.SearchAll(new string[] { "varid", "varname" }))
                {
                    if (sequence != null)
                    {
                        // Second query: one SUM(CASE ...) column per variable.
                        string newtext = "SELECT [geomid] ";
                        while (sequence.Fetch())
                        {
                            Manifold.ValueSet values = sequence.GetValues();
                            newtext = newtext + ", SUM(CASE WHEN [varid] = " + values[0].Data +
                                " THEN [varvalue] ELSE 0 END) AS [" + values[1].Data + "] ";
                        }
                        newtext = newtext + " INTO test ";
                        newtext = newtext + " FROM silo ";
                        newtext = newtext + " WHERE [levelid] = " + levelid + " ";
                        newtext = newtext + " AND [varid] IN (" + varidlist + ") ";
                        newtext = newtext + " GROUP BY [geomid]; ";
                        app.Log(newtext);

                        // Run the generated pivot query.
                        using (Manifold.Table newtable = db.Run(newtext))
                        {
                        }
                    }
                }
            }
        }
    }

    My next step is to embed this script in a SQL function, and call it from a query passing mainly three parameters at this stage: the silo table name, the level id, and the list of var ids to be pivoted. Any idea where I can find an example of a query -> function -> script call?

    tjhb

    #26-Sep-18 19:49

    There are exhaustive and up-to-date examples in the release notes for 9.0.163.10.

    There are also examples in the API docs [at least I think so; for now I can't find them].

    apo
    #26-Sep-18 20:12

    Great, exactly what I was looking for. My next step is on its way. Thanks.

    apo
    #26-Sep-18 21:38

    A few changes in the script to publish the function and put the return in the right place, and the third step is done.

    The script is contained in a script component named PivotScript; its code:

    // C#

    class Script
    {
        static Manifold.Context Manifold;

        public static int Pivot(string silo, string lvl, string vars)
        {
            Manifold.Application app = Manifold.Application;
            using (Manifold.Database db = app.GetDatabaseRoot())
            {
                string varidlist = vars;
                string levelid = lvl;

                // First query: retrieve the variable names for the requested ids.
                string text = "SELECT * FROM varlist WHERE varid IN (" + varidlist + ")";
                using (Manifold.Table table = db.Run(text))
                using (Manifold.Sequence sequence = table.SearchAll(new string[] { "varid", "varname" }))
                {
                    if (sequence != null)
                    {
                        // Second query: one SUM(CASE ...) column per variable.
                        string newtext = "SELECT [geomid] ";
                        while (sequence.Fetch())
                        {
                            Manifold.ValueSet values = sequence.GetValues();
                            newtext = newtext + ", SUM(CASE WHEN [varid] = " + values[0].Data +
                                " THEN [varvalue] ELSE 0 END) AS [" + values[1].Data + "] ";
                        }
                        newtext = newtext + " INTO test ";
                        newtext = newtext + " FROM " + silo + " ";
                        newtext = newtext + " WHERE [levelid] = " + levelid + " ";
                        newtext = newtext + " AND [varid] IN (" + varidlist + ") ";
                        newtext = newtext + " GROUP BY [geomid]; ";

                        // Run the generated pivot query.
                        using (Manifold.Table newtable = db.Run(newtext))
                        {
                        }
                    }
                }
                return 0;
            }
        }

        static void Main()
        {
            //Pivot("silo", "2", "1396,1321,1216,1079,1103,1133");
        }
    }

    The query to pivot a table is rather simple

    FUNCTION pivottable(@silo NVARCHAR, @levelid NVARCHAR, @vars NVARCHAR) INT32 AS SCRIPT [PivotScript] ENTRY 'Script.Pivot';

    SELECT pivottable('silo', '2', '1396,1321,1216,1079,1103,1133') FROM [silo];

    The next step is now to handle the rotations of the cube between the dimensions of it. La suite au prochain numéro.

    adamw


    #27-Sep-18 09:17

    You should perhaps change the last query to:

    --SQL9

    FUNCTION pivottable(...) ...;

    VALUE @p INT32 = pivottable(...);

    You want to run the function and you want to run it just once, correct? So, do that. :-)

    SELECT ... FROM [silo] may run the function as many times as there are records in [silo] and, more importantly, it only runs it as a side effect of the result table of SELECT ... being shown in the command window. If you add one more statement after SELECT ... FROM [silo];, the SELECT, as written, won't make any calls to the function at all.

    VALUE forces the right side to run.

    If the function was returning a table, forcing it to run would be even simpler:

    --SQL9

    FUNCTION pivottable(...) TABLE ...;

    TABLE CALL pivottable(...);

    Or you could continue to use VALUE.

    apo
    #27-Sep-18 09:37

    Marvelous, and thanks for the insights. It is exactly what I was facing with the multiple runs, boooh.

    The VALUE approach is perfect in my case, as I don't need anything back from the function, the pivot table being created directly by the script. But just in case... As said, I'm now upgrading the script to handle the rotation of the cube, as well as the possibility to select the aggregation mode...

    apo
    #21-Sep-18 07:58

    The second approach is a replication of Adam's COLLECT method mentioned above by tjhb, taking profit from the insights of the CASE method.

    FUNCTION countcat(@t TABLE) INT32 AS (
      SELECT Sum([varvalue]) FROM @t
    ) END;

    SELECT [geomid],
      countcat((COLLECT [varvalue] WHERE [varid] = 1396)) AS PTOT,
      countcat((COLLECT [varvalue] WHERE [varid] = 1321)) AS PM,
      countcat((COLLECT [varvalue] WHERE [varid] = 1216)) AS PF,
      countcat((COLLECT [varvalue] WHERE [varid] = 1079)) AS P0014,
      countcat((COLLECT [varvalue] WHERE [varid] = 1103)) AS P1539,
      countcat((COLLECT [varvalue] WHERE [varid] = 1133)) AS P4064
    FROM [silo]
    WHERE [levelid] = 4
    AND [varid] IN (1396,1321,1216,1079,1103,1133)
    GROUP BY [geomid];

    Using this approach the query is slightly slower, but nothing dramatic. The issue is elsewhere.

    This works perfectly for the first levels, with a restricted number of geoms (1-150), but once I want to extract the levels with 4000 or 120000 geoms I get an Invalid object reference in the face. My question at this stage: is there a limit on the size of the data passed to a function?

    adamw


    #24-Sep-18 14:43

    There are limits, yes: not on the amount of data passed to a function, but rather on the amount of data passed to an aggregate. But they are fairly large - 2 billion records per table at the very least (and frequently there is no such limit at all, depending on the aggregate) - so it is unlikely that you are hitting them. Something else is going on.

    Could you upload data somewhere (compress it into MXB, this will save space) and send a link to tech support? Or ask them where to upload data and they will direct you.

    apo
    #24-Sep-18 16:19

    Uploaded, and message sent to tech.

    tjhb

    #20-Sep-18 04:21

    By the way, in Adam's example where he used repeated (COLLECT...), we can now use multiple references to VALUE @t TABLE = SELECT..., which is neater (and probably faster, since cached, I think).

    tjhb

    #20-Sep-18 04:36

    Strike that, I was wrong. COLLECT... GROUP is still the correct form.

    Manifold User Community Use Agreement Copyright (C) 2007-2017 Manifold Software Limited. All rights reserved.