TRANSFORM ... PIVOT using M9
apo53 post(s)
#19-Sep-18 07:17

Loving the power of M9 with huge datasets, I'm looking for a means to replicate the very useful SQL statement TRANSFORM.

I know we can replicate this using CASE WHEN, but it means we have to generate it manually for each item. My interest is more in finding a couple of functions in order to automate this. Is there a way to generate SQL using a SQL function? Or has anybody already given birth to such a tool in M9?
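For reference, the TRANSFORM being replicated and its manual CASE WHEN replication look roughly like this (table and field names follow the silo example that appears later in this thread, so treat them as placeholders):

--SQL

-- Access/Jet crosstab: the column set is discovered from the data at run time
TRANSFORM SUM(varvalue)
SELECT geomid
FROM silo
GROUP BY geomid
PIVOT varname;

-- manual replication: one hand-written CASE per output column
SELECT geomid,
    SUM(CASE WHEN varname = 'PTOT' THEN varvalue ELSE 0 END) AS PTOT,
    SUM(CASE WHEN varname = 'PM' THEN varvalue ELSE 0 END) AS PM
FROM silo
GROUP BY geomid;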

I would appreciate any idea.

tjhb

8,399 post(s)
#19-Sep-18 07:29

Discussion here, with detailed explanation and teaching by Adam.

apo53 post(s)
#19-Sep-18 12:18

Thanks for the link to the discussion, but it is exactly what I want to avoid. As mentioned by Art Lembo in this discussion, it requires a priori knowledge of the fields I want to inject into the columns. This is the reason why I'm looking for a way to inject the result of the selection of categories into the pivot SQL without having to type or copy/paste the list.

This will be useful for me when extracting data from my silos with a list of categories.

tjhb

8,399 post(s)
#19-Sep-18 13:13

Well no, it requires a posteriori knowledge, and as Adam says, that can be discovered by you--or by a script function. The trick is to make it as reusable as possible, with zero typing on each use.

apo53 post(s)
#19-Sep-18 16:29

I might be asleep, but I can't find the a posteriori in the number of times I have to replicate the catx line! Could you explain your a posteriori?

I agree on the flexibility of Adam's approach, especially for the differentiated aggregation schema per column. But it requires that I compose the query. In my case the problem is simpler, in that I just want to retrieve the value without aggregation, but for a lot of columns.

I have a cube with the following dimensions

  • around 6 data vintages
  • around 10 geographical levels
  • around 140'000 geometries from the various levels
  • around 2'000 variables related to each geometry in each vintage.

A normal extraction will, for example, produce a table with

  1. for a given vintage

  • geometries of one level as rows
  • vars as columns (between 50 and 250)

For this reason, in MySQL I was using Vontobel's dynamic query for a crosstab approach (sketched below), and I will try to mimic that in M9. This procedure asks for SQL to be automatically generated and run. I'm sure we can do this using a script to generate the SQL code and run it, but can we do this in a function called from the SQL?
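For context, a rough sketch of that dynamic-crosstab pattern in MySQL (the varlist and silo names follow the schema used later in this thread; this is a generic sketch, not Vontobel's exact code):

--SQL

-- MySQL: build the column fragments from the data, then run the generated statement
SET SESSION group_concat_max_len = 1000000;

SELECT GROUP_CONCAT(
         CONCAT('SUM(CASE WHEN varid = ', varid,
                ' THEN varvalue ELSE 0 END) AS `', varname, '`'))
  INTO @cols
  FROM varlist;

SET @sql = CONCAT('SELECT geomid, ', @cols, ' FROM silo GROUP BY geomid');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;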

tjhb

    8,399 post(s)
    #20-Sep-18 03:32

    The a posteriori is in Adam's explanation.

    A TRANSFORM query must first inspect the data--using the equivalent of a SELECT DISTINCT. That result is the "evidence" it needs to build the PIVOT.

    We can do the same, but only by obtaining similar evidence in advance. The query structure is based on facts derived from the data (so, a posteriori).
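    Against the silo table described later in the thread, that evidence-gathering step would be roughly:

    --SQL9

    -- discover which categories exist, i.e. which columns the pivot will need
    SELECT DISTINCT [varid], [varname]
    FROM [silo]
    WHERE [levelid] = 4;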

    can we do this in a function called from the SQL

    Yes, after running a SELECT DISTINCT, and if we are prepared to do a lot of typing. (Lots of CASE and NULL expressions.)

    It's just much better to use a script.

    tjhb

    8,399 post(s)
    #20-Sep-18 03:56

    In 9, scripts can run SQL expressions and queries very economically.

    If they will be run more than once they can be pre-compiled.

    Just as queries can call script functions.

    Scripts and SQL are *both* first-class citizens in 9, and can call each other. (And yes, query within script within query... no problem.)

    This is a huge change. The situation in 8 was fundamentally different.

    Now the script-query choice is simply open.

    apo53 post(s)
    #20-Sep-18 07:18

    This highlights several ways I can now follow to nest the queries and the script. More in the next instalment.

    Thanks

    tjhb

    8,399 post(s)
    #20-Sep-18 08:17

    SVP de nous bien tenir au courant. [Please do keep us well posted.]

    Your data model sounds like an education (at least for me).

    We need more of this. More cubes!

    [And if you can be bothered, please always correct my French—or even my English.]

    apo53 post(s)
    #20-Sep-18 10:44

    It is with pleasure that I will post my latest insights.

    I use the cubes with the named dimensions to extract data in mainly three situations:

    • fixed vintage (year), geom as rows and vars as cols
    • fixed geom, vars as rows and vintages as cols
    • fixed var, geom as rows and vintages as cols

    The TRANSFORM was a must for me when juggling the dimensions, and I want to get that back in M9, mainly because of its speed. My test silo is made of 130 million records. To be honest I love using M9 even for non-spatial databases.

    I'm targeting a SQL function in which I have to give

    • the input table
    • the fixed, rows and cols dimensions
    • and maybe an operator

    Far be it from me to correct anyone's English, or French, except my kids'.

    apo53 post(s)
    #21-Sep-18 07:48

    My first step corresponds to a test of the type of query I want to use to extract the data. In steps 2, 3... I will focus on the automatic creation of the query.

    Basically there are two options at this stage: a CASE or a COLLECT approach. My cube table is as follows.

    --SQL

    CREATE TABLE [silo] (

      [mfd_id] INT64,

      [varid] INT32,

      [varname] NVARCHAR,

      [geomid] INT32,

      [levelid] INT32,

      [varvalue] INT32,

      [vintageid] INT32,

      INDEX [mfd_x] BTREE ([mfd_id]),

      INDEX [var_x] BTREEDUP ([varid]),

      INDEX [lvl_x] BTREEDUP ([levelid])

    );

    I started with the comparison based on a specific extraction of six variables

    using the CASE approach first

    --SQL

    SELECT [geomid],

        SUM(CASE WHEN [varid] = 1396 THEN [varvalue] ELSE 0 END) AS PTOT,

        SUM(CASE WHEN [varid] = 1321 THEN [varvalue] ELSE 0 END) AS PM,

        SUM(CASE WHEN [varid] = 1216 THEN [varvalue] ELSE 0 END) AS PF,

        SUM(CASE WHEN [varid] = 1079 THEN [varvalue] ELSE 0 END) AS P0014,

        SUM(CASE WHEN [varid] = 1103 THEN [varvalue] ELSE 0 END) AS P1539,

        SUM(CASE WHEN [varid] = 1133 THEN [varvalue] ELSE 0 END) AS P4064

    FROM silo

    WHERE [levelid] = 4

    GROUP BY [geomid];

    This approach is efficient for any level (from 1 geom to 120'000 geoms), but it can be improved by injecting a constraint on [varid] in the WHERE. Here there are two options:

    1. a list of [varid] = x OR [varid] = y...
    2. a [varid] IN (...) clause, which pleased me as it proves to be more efficient (which is not the case in every database system). Good job Manifold.

    You might have noticed that I didn't put any index on the [geomid] field. There was one initially, but I found that it slows everything down to hell for a reason I can't identify. The number of geoms involved doesn't change this at all. If you have any information on this it would be helpful, because this is annoying. OK, in the case of this query I can kill this index, but the idea of the cube is to rotate it, and if I have to remove/create an index according to the needs it makes no sense (time-wise). My questions:

    • why does it slow the query?
    • is there a way to force the indexes we want to benefit from?

    Finally my query looks like this, with an INTO to keep the results:

    --SQL

    SELECT [geomid],

        SUM(CASE WHEN [varid] = 1396 THEN [varvalue] ELSE 0 END) AS PTOT,

        SUM(CASE WHEN [varid] = 1321 THEN [varvalue] ELSE 0 END) AS PM,

        SUM(CASE WHEN [varid] = 1216 THEN [varvalue] ELSE 0 END) AS PF,

        SUM(CASE WHEN [varid] = 1079 THEN [varvalue] ELSE 0 END) AS P0014,

        SUM(CASE WHEN [varid] = 1103 THEN [varvalue] ELSE 0 END) AS P1539,

        SUM(CASE WHEN [varid] = 1133 THEN [varvalue] ELSE 0 END) AS P4064

    INTO test

    FROM silo

    WHERE [levelid] = 6

    AND [varid] IN (1396,1321,1216,1079,1103,1133)

    GROUP BY [geomid];

    apo53 post(s)
    #21-Sep-18 08:26

    I have to add a nuance to what I said on the index issue. The problem is always there when !fullfetch is on. OK, that might be linked to fullfetch, which should be avoided with large queries anyway.

    But it is also present with the INTO. The strange thing is that the process window remains open until I close it with a cancel, yet the records have long since been written to the table. Whether I cancel 10 minutes or 10 seconds after the run, the data are there!?! The problem is solved by removing the index on [geomid].

    adamw


    8,242 post(s)
    #24-Sep-18 14:36

    You might have noticed that I didn't put any index on the [geomid] field. There was one initially, but I found that it slows everything down to hell for a reason I can't identify.

    We'll check what is going on here, this seems wrong.

    You cannot force the indexes to use, but you can have your query copy all data into a temporary table with just the indexes that you want - that way the indexes that slow things down won't be there. This is a nuclear option, but it is there as a last resort.
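    A minimal sketch of that workaround against the [silo] schema above (the [silo_copy] name and the choice of fields and index are assumptions):

    --SQL9

    -- copy only the needed fields into a scratch table that carries just the
    -- index we want; the troublesome index on the source table no longer matters
    CREATE TABLE [silo_copy] (
      [geomid] INT32,
      [varid] INT32,
      [varvalue] INT32,
      INDEX [var_x] BTREEDUP ([varid])
    );

    INSERT INTO [silo_copy] ([geomid], [varid], [varvalue])
    SELECT [geomid], [varid], [varvalue] FROM [silo] WHERE [levelid] = 6;

    -- then run the pivot / GROUP BY [geomid] against [silo_copy] instead of [silo]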

    apo53 post(s)
    #24-Sep-18 14:39

    Yes I could, but this is a bit slower compared to killing the index and re-injecting it afterwards, which is my current substitute for switching it off.

    adamw


    8,242 post(s)
    #30-Oct-18 09:23

    Following up.

    We investigated and found why adding an index on the grouping field above harms performance. This is the case of the query engine currently being unable to select an optimal strategy out of multiple competing strategies.

    When we run SELECT ... WHERE ... GROUP ... on a table where both WHERE and GROUP can be optimized but using different indexes, the query engine has to choose whether to (a) do WHERE using the index it hits, reading only a subset of records from the table but losing the ordering from the second index, then do GROUP on the resulting records re-creating the ordering (the second index is not used), or (b) do GROUP using the ordering from the second index, and filter records using WHERE when reading them (the first index is not used). That's a simplified picture, but it is close enough.

    Which strategy is going to be faster depends on how selective WHERE is. If it throws away a lot of the records, (a) wins. If it does not throw away enough records, (b) wins. Without statistics, we cannot know whether WHERE is selective enough, so we currently end up being conservative and select (b), because that's safest.

    A simple example.

    Create a new MAP file. Open a command window. Run the following:

    --SQL9

    CREATE TABLE t (

      v INT32, INDEX v_x BTREEDUP (v),

      g INT32, INDEX g_x BTREEDUP (g)

    );

    INSERT INTO t (v, g)

    SELECT [value] MOD 10, [value] MOD 100 FROM CALL ValueSequence(1, 100000, 1);

    This creates a test table with 100,000 records, 2 fields, a BTREEDUP index on each field.

    Run these queries one by one and observe processing times in the log (I put processing times from my tests into the text):

    --SQL9

     

    SELECT g, Sum(v) INTO u1 FROM t WHERE v = 1

    GROUP BY g; -- 0.263

     

    SELECT g, Sum(v) INTO u2 FROM t WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8)

    GROUP BY g; -- 0.354

     

    SELECT Sum(v) INTO u3 FROM t WHERE v = 1

    GROUP BY g+1; -- 0.104

     

    SELECT Sum(v) INTO u4 FROM t WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8)

    GROUP BY g+1; -- 0.930

    All queries are SELECT ... INTO to force reading the whole table.

    The first query allows using indexes in both WHERE and GROUP, the query engine chooses to use the index in GROUP, WHERE selects 10% of all records and the query completes in 0.263 sec. It would make more sense to use the index in WHERE because that would reduce the number of processed records so much, but the query engine does not know that WHERE is this good and goes the safest route.

    The second query also allows using indexes in both WHERE and GROUP, the query engine again chooses to use the index in GROUP, WHERE selects 80% of all records and the query completes in 0.354 sec. You can see that while the number of records passed by WHERE increased a lot, the processing time didn't grow much. Using the index in GROUP is the right strategy here.

    The third query groups by an expression and so disallows using an index in GROUP. The query engine now turns to use the index in WHERE, WHERE selects 10% of all records (great) and the query completes in 0.104 sec. This is noticeably faster than the first query. Using the index in WHERE is the right strategy for this query and it pays off.

    The fourth query also groups by an expression and disallows using an index in GROUP. However, WHERE is not selective enough and keeps most of the records (80%), so not being able to use an index in GROUP costs significant time. The query completes in 0.930 sec, way slower than query 2. Using the index in WHERE is the wrong strategy here.

    In sum, adding an index can sometimes make a query slower. We aren't alone in this, it happens with other databases as well. If you care about performance, then after you make any changes to a table, you are supposed to re-test all queries that end up using that table to both see whether they still work and also to see how they now perform, that's not news. However, we agree we can help here. In general, we want to do two things. First, let the query engine make better use of field statistics so that it can make better judgement when making choices. Second, allow seeing what the query engine decided to do. Finally, we already have ample means to influence the decisions of the query engine with temporary tables and temporary databases, but we might add a couple more constructs / functions for that as well.

    tjhb

    8,399 post(s)
    #30-Oct-18 12:47

    This is great reading! So interesting.

    Worth adding perhaps: sometimes it could be worth separating the query into two nodes, either GROUP in an outer node over WHERE in a subquery, or occasionally the reverse, so that both indexes can be used (fully under user control).

    This might be a good strategy for apo's query here, with an index added to [geomid]?

    (I don't know if it would be a good strategy in Adam's example case because the data is small. The cost of the extra node might outweigh the benefit from using both indexes.)

    No, I am wrong, for the reason in Adam's second paragraph. Using WHERE removes the ordering from the index on the grouping column(s) that GROUP relies upon, so the GROUP would be unindexed (or would have to re-create the ordering). Adding a node does not prevent this AFAIK.

    tjhb

    8,399 post(s)
    #30-Oct-18 13:06

    I will have to make some tests tomorrow, on cases where a table having a composite index on the grouping fields is first filtered (in a subquery) by WHERE. I had thought that the composite index did survive the filter, my testing might have been wrong.

    Or is it possible that a BTREEDUP (or BTREEDUPNULL) will survive, while a BTREE would not?

    adamw


    8,242 post(s)
    #30-Oct-18 14:01

    If you have a table with two fields and a separate BTREExxx index on each field (does not matter whether duplicates are allowed or not, same for NULLs), filtering using one of the indexes returns records unordered with respect to the second field. There is no getting around that.

    If you mean a situation where one index is a superset of the other, that is, there are fields A and B, the first index is on A, and the second is on A and then B (the order is important), then, yes, filtering by the second index returns records already ordered by A and if we need records to be in the order of the first index, no further ordering is needed. This sometimes does happen, but usually it is even simpler - you have a single index on A and then B and in some parts of your query you use partial ordering (just by A) and in other parts full ordering (by A and then B). We should do the right thing if a situation like this happens with WHERE and GROUP like in the scenario above.

    tjhb

    8,399 post(s)
    #30-Oct-18 23:54

    Sorry I wasn't clear. Let me go back to

    I will have to make some tests tomorrow, on cases where a table having a composite index on the grouping fields is first filtered (in a subquery) by WHERE. I had thought that the composite index did survive the filter, my testing might have been wrong.

    and apply that to your examples. For your first query...

    --SQL9

    SELECT g, Sum(v) INTO u1 FROM t WHERE v = 1

    GROUP BY g; -- 0.263

    I get a typical time of 0.156s, with the same table and indexes.

    Putting the WHERE filter into a separate node...

    --SQL9

    SELECT g, Sum(v) INTO x1 

    FROM 

        (

        SELECT * FROM t

        WHERE v = 1

        )

    GROUP BY g;

    I get very close to the same time (typically 0.157s). I don't quite understand why.

    I would have expected the inner query to execute independently, with the index on v used for its WHERE filter, and (from the discussion above) for that to prevent the use of the index on g for GROUP in the outer node.

    The similar timings suggest (or I infer, perhaps wrongly) that the index on v is not used here; and that the index on g does survive and is still used.

    For your second query I typically get 0.219s. Again, I get the same 0.219s if it is rewritten with a wrap:

    SELECT g, Sum(v) INTO x2

    FROM 

        (

        SELECT * FROM t

        WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8)

        )

    GROUP BY g;

    Does the choice of index "look through" node boundaries?

    tjhb

    8,399 post(s)
    #31-Oct-18 00:39

    I get the same respective timings (0.156 and 0.219s) rewriting further as

    VALUE @u TABLE =

        (

        SELECT * FROM t WHERE v = 1

        );

    SELECT g, Sum(v) INTO x1 

    FROM @u

    GROUP BY g;

    and

    VALUE @u TABLE =

        (

        SELECT * FROM t

        WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8)

        );

    SELECT g, Sum(v) INTO x2

    FROM @u

    GROUP BY g;

    So here again it looks as if the index on g is used (for GROUP), but the index on v (for WHERE) is not. At first sight this seems more strange.

    adamw


    8,242 post(s)
    #01-Nov-18 08:12

    What you are seeing is that the query nodes are smart enough to work together to apply global optimizations. This is actually desired. :-)

    We want the user to be able to nest SELECTs logically without losing the ability to apply optimizations as much as possible. Some constructs necessarily kill some optimizations (eg, SELECT DISTINCT kills all optimizations based on indexes on non-DISTINCT fields), but in general we want to keep as many optimization options as possible for as long as possible, because then the query engine can make a more informed decision on what to do.

    There should be clear boundaries at which optimizations do stop though. Right now, one such boundary is tables in temporary databases - you can put an intermediate result there and then you have full control over what indexes it has, and can add or remove indexes freely, according to what you need in further steps of the query. We will likely provide other, terser ways to establish such boundaries as well (eg, something like TableCache(<table>) returning another table which would materialize the table; if you do GROUP on CALL TableCache(SELECT ... WHERE ...), then WHERE will do its own optimization inside and GROUP will do whatever it can with the materialized result outside).
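    For instance, a minimal sketch of such a boundary, using the temporary-database syntax from the tests below (the [filtered] and [grouped] names are assumptions):

    --SQL9

    CREATE ROOT [#temp];
    USE ROOT [#temp];
    CREATE DATASOURCE [#root] AS ROOT;

    -- materialize the WHERE result as a plain table in the temporary database
    SELECT g, v INTO [filtered] FROM [#root]::t WHERE v = 1;

    -- the GROUP below can only use whatever indexes [filtered] itself has,
    -- so the boundary is explicit and under user control
    SELECT g, Sum(v) INTO [grouped] FROM [filtered] GROUP BY g;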

    tjhb

    8,399 post(s)
    #02-Nov-18 01:11

    What you are seeing is that the query nodes are smart enough to work together to apply global optimizations. This is actually desired. :-)

    We want the user to be able to nest SELECTs logically without losing the ability to apply optimizations as much as possible.

    Two more tests.

    (1) Do the same optimizations survive if the subquery is moved into a separate query in the same datasource?

    Create separate queries...

    CREATE QUERY q1

        (PROPERTY 'Text' 'SELECT g, v FROM t WHERE v = 1;')

        ;

    CREATE QUERY q2 -- in root

        (PROPERTY 'Text' 'SELECT g, v FROM t WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8);')

        ;

    ...then call them:

    SELECT g, Sum(v) INTO u1 FROM q1

    GROUP BY g;

     

    SELECT g, Sum(v) INTO u2 FROM q2

    GROUP BY g;

    Answer: yes (judging by SELECT timings ±identical with previous tests).

    (2) How about if we move the separate queries into a temporary (Manifold) datasource?

    Call the same q1 and q2 as above (in the project root) from a temporary peer datasource:

    --SQL9

    CREATE ROOT [#temp];

    USE ROOT [#temp];

    CREATE DATASOURCE [#root] AS ROOT;

     

    SELECT g, Sum(v) INTO u1 -- in #temp

    FROM [#root]::q1

    GROUP BY g;

     

    SELECT g, Sum(v) INTO u2 -- in #temp

    FROM [#root]::q2

    GROUP BY g;

    Answer: yes again (same timings for the SELECT queries).

    In other words, in these cases too, the engine currently chooses to use the index on g (for GROUP) in favour of the index on v (for WHERE), optimizing across the division into separate queries (in 1) and even into separate Manifold datasources (in 2).

    (And when the optimization strategy becomes more discerning, we might see the new optimizations persist across such boundaries as well? The capacity is there.)

    This is pretty remarkable! One execution graph to rule them all.

    tjhb

    8,399 post(s)
    #02-Nov-18 02:07

    Correction to 3rd to last paragraph above:

    In other words, in these cases too, the engine currently chooses to use the index on g (for GROUP) in preference to (not "in favour of") the index on v (for WHERE), optimizing across the division into separate queries (in 1) and even into separate Manifold datasources (in 2).

    apo53 post(s)
    #01-Nov-18 20:26

    Many thanks for this very interesting explanation of the issue I was facing.

    Just to be sure: in your example with the A and B indexes, which letter is in the WHERE and which in the GROUP BY?

    adamw


    8,242 post(s)
    #02-Nov-18 06:33

    Do you mean this example:

    If you mean a situation where one index is a superset of the other, that is, there are fields A and B, the first index is on A, and the second is on A and then B (the order is important), then, yes, filtering by the second index returns records already ordered by A and if we need records to be in the order of the first index, no further ordering is needed.

    ?

    Then with respect to WHERE and GROUP we are talking about roughly this:

    --SQL9

    CREATE TABLE t (id INT32, a NVARCHAR, b NVARCHAR, INDEX a_b_x BTREE (a, b));

    INSERT INTO t (id, a, b) VALUES

      (1, 'eu', 'uk'),

      (2, 'eu', 'france'),

      (3, 'am', 'usa');

     

    SELECT a, count(b) FROM t GROUP BY a;

    -- uses a_b_x in GROUP

     

    SELECT a, count(b) FROM t WHERE a = 'am' GROUP BY a;

    -- uses a_b_x in GROUP and in WHERE

    There is only one index on A and B, but we could have a second one on just A, it wouldn't change anything because the system automatically assumes that an index on A and B can be used as an index just on A in most cases.

    The first SELECT groups by just A and the system is going to use the index because it understands that the ordering by A and then B is ordering by A as well.

    The second SELECT uses the same logic for GROUP and since ordering requested by GROUP is friendly to the criteria in WHERE, WHERE optimizes its filtering and skips most of records without ever analyzing them.

    There are situations other than GROUP and WHERE where optimizations with indexes combine in this way.

    apo53 post(s)
    #26-Sep-18 19:31

    I don't have all the answers to my questions yet, especially on the index issues, but it doesn't matter for my next step, namely having a script to generate the SQL and run it.

    The following code works well in my experience, but can surely be improved. At this stage it retrieves the variable names based on the ids using a first query, generates the second query (CASE WHEN...) from that result, and finally generates the pivot table.

    // C#

    class Script
    {
        static Manifold.Context Manifold;

        static void Main()
        {
            Manifold.Application app = Manifold.Application;
            using (Manifold.Database db = app.GetDatabaseRoot())
            {
                // variables to pivot and the level to extract
                string varidlist = "1396,1321,1216,1079,1103,1133";
                string levelid = "2";

                // first query: look up the variable names for the requested ids
                string text = "SELECT * FROM varlist WHERE varid IN (" + varidlist + ")";
                using (Manifold.Table table = db.Run(text))
                using (Manifold.Sequence sequence = table.SearchAll(new string[] { "varid", "varname" }))
                {
                    if (sequence != null)
                    {
                        // second query: build one SUM(CASE WHEN ...) column per variable
                        string newtext = "SELECT [geomid] ";
                        while (sequence.Fetch())
                        {
                            Manifold.ValueSet values = sequence.GetValues();
                            newtext = newtext + ", SUM(CASE WHEN [varid] = " + values[0].Data +
                                " THEN [varvalue] ELSE 0 END) AS [" + values[1].Data + "] ";
                        }
                        newtext = newtext + " INTO test ";
                        newtext = newtext + " FROM silo ";
                        newtext = newtext + " WHERE [levelid] = " + levelid + " ";
                        newtext = newtext + " AND [varid] IN (" + varidlist + ") ";
                        newtext = newtext + " GROUP BY [geomid]; ";
                        app.Log(newtext);

                        // run the generated query; the pivot result is written INTO test
                        using (Manifold.Table newtable = db.Run(newtext))
                        {
                        }
                    }
                }
            }
        }
    }

    My next step is to embed this script in a SQL function call and call it from a query, passing mainly three parameters at this stage: the silo table name, the level id and the list of var ids to be pivoted. Any idea where I can find an example of a query -> function -> script call?

    tjhb

    8,399 post(s)
    #26-Sep-18 19:49

    There are exhaustive and up-to-date examples in the release notes for 9.0.163.10.

    There are also examples in the API docs [at least I think so; for now I can't find them].

    apo53 post(s)
    #26-Sep-18 20:12

    Great, exactly what I was looking for. My next step is on its way. Thanks

    apo53 post(s)
    #26-Sep-18 21:38

    A few changes in the script to publish the function and put a return in the right place, and the third step is done.

    The script is contained in a script component named PivotScript, its code

    // C#

    class Script
    {
        // generates and runs the pivot query for the given table, level and variable ids
        public static int Pivot(string silo, string lvl, string vars)
        {
            Manifold.Application app = Manifold.Application;
            using (Manifold.Database db = app.GetDatabaseRoot())
            {
                string varidlist = vars;
                string levelid = lvl;

                // first query: look up the variable names for the requested ids
                string text = "SELECT * FROM varlist WHERE varid IN (" + varidlist + ")";
                using (Manifold.Table table = db.Run(text))
                using (Manifold.Sequence sequence = table.SearchAll(new string[] { "varid", "varname" }))
                {
                    if (sequence != null)
                    {
                        // build one SUM(CASE WHEN ...) column per variable
                        string newtext = "SELECT [geomid] ";
                        while (sequence.Fetch())
                        {
                            Manifold.ValueSet values = sequence.GetValues();
                            newtext = newtext + ", SUM(CASE WHEN [varid] = " + values[0].Data +
                                " THEN [varvalue] ELSE 0 END) AS [" + values[1].Data + "] ";
                        }
                        newtext = newtext + " INTO test ";
                        newtext = newtext + " FROM " + silo + " ";
                        newtext = newtext + " WHERE [levelid] = " + levelid + " ";
                        newtext = newtext + " AND [varid] IN (" + varidlist + ") ";
                        newtext = newtext + " GROUP BY [geomid]; ";

                        // run the generated query; the pivot table is written INTO test
                        using (Manifold.Table newtable = db.Run(newtext))
                        {
                        }
                    }
                }
                return 0;
            }
        }

        static Manifold.Context Manifold;

        static void Main()
        {
            //Pivot("silo", "2", "1396,1321,1216,1079,1103,1133");
        }
    }

    The query to pivot a table is rather simple

    FUNCTION pivottable(@silo NVARCHAR, @levelid NVARCHAR, @vars NVARCHAR) INT32 AS SCRIPT [PivotScript] ENTRY 'Script.Pivot';

    SELECT pivottable('silo', '2', '1396,1321,1216,1079,1103,1133') FROM [silo];

    The next step is now to handle the rotations of the cube between its dimensions. More in the next instalment.

    adamw


    8,242 post(s)
    #27-Sep-18 09:17

    You should perhaps change the last query to:

    --SQL9

    FUNCTION pivottable(...) ...;

    VALUE @p INT32 = pivottable(...);

    You want to run the function and you want to run it just once, correct? So, do that. :-)

    SELECT ... FROM [silo] may run the function as many times as there are records in [silo] and, more importantly, it only runs it as a side effect of the result table of SELECT ... being shown in the command window. If you add one more statement after SELECT ... FROM [silo];, the SELECT, as written, won't make any calls to the function at all.

    VALUE forces the right side to run.

    If the function was returning a table, forcing it to run would be even simpler:

    --SQL9

    FUNCTION pivottable(...) TABLE ...;

    TABLE CALL pivottable(...);

    Or you could continue to use VALUE.

    apo53 post(s)
    #27-Sep-18 09:37

    Marvelous, and thanks for the insights. It is exactly what I was facing with the multiple runs, boooh.

    The VALUE approach is perfect in my case as I don't need anything back from the function, the pivot table being created directly by the script. But just in case... As said, I'm now upgrading the script to handle the rotation of the cube, as well as the possibility to select the aggregation mode...

    apo53 post(s)
    #21-Sep-18 07:58

    The second approach is a replication of Adam's COLLECT method mentioned above by tjhb, taking advantage of the insights from the CASE method.

    FUNCTION countcat(@t TABLE) INT32 AS (

      SELECT Sum([varvalue]) FROM @t

    ) END;

    SELECT [geomid],

      countcat((COLLECT [varvalue] WHERE [varid] = 1396)) AS PTOT,

      countcat((COLLECT [varvalue] WHERE [varid] = 1321)) AS PM,

      countcat((COLLECT [varvalue] WHERE [varid] = 1216)) AS PF,

      countcat((COLLECT [varvalue] WHERE [varid] = 1079)) AS P0014,

      countcat((COLLECT [varvalue] WHERE [varid] = 1103)) AS P1539,

      countcat((COLLECT [varvalue] WHERE [varid] = 1133)) AS P4064

    FROM [silo]

    WHERE [levelid] = 4

    AND [varid] IN (1396,1321,1216,1079,1103,1133)

    GROUP BY [geomid]

    ;

    Using this approach the query is slightly slower, but nothing dramatic. The issue is somewhere else.

    This works perfectly for the first levels, with a restricted number of geoms (1-150), but once I want to extract the levels with 4000 or 120000 geoms I get an Invalid object reference in the face. My question at this stage: is there a limit to the size of data that can be passed to a function?

    adamw


    8,242 post(s)
    #24-Sep-18 14:43

    There are limits, yes - not on the amount of data passed to a function, but rather on the amount of data passed to an aggregate. But they are fairly large - 2 billion records per table at the very least (and frequently no such limit even, this depends on the aggregate) - so it is unlikely that you are hitting them. Something else is going on.

    Could you upload data somewhere (compress it into MXB, this will save space) and send a link to tech support? Or ask them where to upload data and they will direct you.

    apo53 post(s)
    #24-Sep-18 16:19

    Uploaded, and message sent to tech.

    tjhb

    8,399 post(s)
    #20-Sep-18 04:21

    By the way, in Adam's example, where he used repeated (COLLECT...), we can now use multiple references to VALUE @t TABLE = SELECT..., which is neater (and probably faster, since cached, I think).

    tjhb

    8,399 post(s)
    #20-Sep-18 04:36

    Strike that, I was wrong. COLLECT... GROUP is still the correct form.
