Subscribe to this thread
Home - General / All posts - SQL help with Group By
Mike Pelletier

2,122 post(s)
#08-Feb-18 23:51

The new help manual is great in there is much more info than with 8, but still having trouble. My goal with the data is to calculate the average annual growth rate of values for each account for the 7 years the data is provided and be able to use where clauses to run it on different subsets. In the attached mxb file "Query 3" attempts to do that but fails. Redoing the Group By portion and simplifying it as "Query" in the project is what is shown below and returns null values. Any help would be much appreciated.

select  [TAX AREA],  Avg((select [IMP VAL] 

from [Res Value by year]

where [TAX YEAR] = '2010')) 

from [Res Value by Year]

group by [TAX AREA]

Attachments:
Res Value by year.mxb

tjhb
10,094 post(s)
#09-Feb-18 00:27

I wouild help but.

tjhb
10,094 post(s)
#09-Feb-18 00:58

Blindly for now, the main problem is that you are repeating the averaging of [IMP VAL] (for the whole table) once for every record. So if there are 80 rows (I think there are more), then you are calculating the same average 80 times over (with caching).

tjhb
10,094 post(s)
#09-Feb-18 01:06

I think you want to do something like this.

SELECT

    [TAX AREA],

    AVG([IMP VAL]AS [average imp val]

FROM 

    (SELECT *

    FROM [Res Value by year]

    WHERE [TAX YEAR] = '2010'

    )

GROUP BY [TAX AREA]

;

(I can't quite see what Query 3 does yet. Might need help on that.)

tjhb
10,094 post(s)
#09-Feb-18 01:38

Here's a try at Query 3.

SELECT

    [TAX AREA]

    Pow(

        AVG([IMP VAL]) / Log(AVG([IMP VAL])),

        7

        ) AS [A]

    -- you might need other things here

FROM 

    (

    SELECT *

    FROM [Res Value by year]

    WHERE [TAX YEAR] = '2010' 

    AND [TAX AREA] = '100'

    --AND [QUALITY] = 'Average'

    )

GROUP BY [TAX AREA]

;

Mike Pelletier

2,122 post(s)
#09-Feb-18 22:01

Thanks Tim. Been working on this without much success. My goal is to take the average [imp val] for each account in 2010 and 2017 and figure out the compound annual growth rate. That formula I believe is Pow( ([2017 avg]/[2010 avg]) , (1/7) ) - 1. I've been trying to do it with joins and functions with not much luck. Here's a function approach (without the full growth formula) that just returns 0.

Function f1() Table As

(SELECT 

    [TAX AREA],

    AVG([IMP VAL]) AS [average imp val 2017]

FROM 

    (SELECT *

    FROM [Res Value by year] 

    WHERE [TAX YEAR] = '2017'

 and [tax area] = '100'

    )

GROUP BY [TAX AREA]

 ) End;

Function f2() Table As

(SELECT 

    [TAX AREA],

    AVG([IMP VAL]) AS [average imp val 2010], [average imp val 2017]

FROM call f1(),

    (SELECT *

    FROM [Res Value by year] 

    WHERE [TAX YEAR] = '2010'

 and [tax area] = '100'

    )

GROUP BY [TAX AREA]

 ) End;

Also, another task with this table is to weed out accounts that don't have values in all the years available in the table. I've tried some self joins with where clauses for the year, but I'm really at a loss how to approach this. Thanks in advance.

tjhb
10,094 post(s)
#09-Feb-18 22:30

If that's the whole query, then running it will only compile the two functions--since there is nothing else. A result of 0 just means that compilation is successful, i.e. there are no syntax errors. (Zero effectively means "well done".)

On the whole I think this can be simplified quite a bit. I'll have a go and be back soon.

tjhb
10,094 post(s)
#09-Feb-18 22:44

My goal is to take the average [imp val] for each account in 2010 and 2017 and figure out...

Can you expand on "for each account" Mike?

Take all the records for 2017. For this group, all values in [ACCOUNTNO] are unique. So averaging for each [ACCOUNTNO] would just be the average of one [IMP VAL].

The same for 2010.

Should it read

... to take the average [imp val] for each account all accounts in 2010 and 2017 ...

?

tjhb
10,094 post(s)
#09-Feb-18 23:04

Assuming that it should read "for all accounts" not "for each account", try this.

FUNCTION AvgImpVal(t TABLE, year NVARCHAR, area NVARCHAR) FLOAT64 AS

    (SELECT AVG([IMP VAL])

    FROM t

    WHERE [TAX YEAR] = year

    AND [TAX AREA] = area

    )

    END

;

SELECT

    [avg 2017], [avg 2010],

    Pow([avg 2017] / [avg 2010], 1 / 7) - 1 AS [compound rate]

FROM

    (

    VALUES ( 

        AvgImpVal([Res Value by Year], '2017', '100'),

        AvgImpVal([Res Value by Year], '2010', '100')

        )

    AS ([avg 2017], [avg 2010])

    )

;

Now the table needs prefiltering:

Also, another task with this table is to weed out accounts that don't have values in all the years available in the table.

Using a function means we can just plug the filtered table in place of the original.

Mike Pelletier

2,122 post(s)
#10-Feb-18 04:24

Sorry for poor phrasing on my part Tim. Your assumption is correct. Look forward to giving this a go later. There's a lot of learning in this query. Super helpful Tim!

tjhb
10,094 post(s)
#10-Feb-18 07:19

Great Mike. There might be tweaks and problems for the above, but to move onto the prefiltering...

The objective is keep the records for a given account only if they include records for all available years.

In other words (probably clearer), to remove the records for any account that does not have records for all years.

First list all available years. (Already mentioned above.)

SELECT DISTINCT [TAX YEAR]

FROM [Res Value by Year]

ORDER BY [TAX YEAR] ASC

;

This lists (2010, 2011, 2013, 2015, 2017). So records are available for 5 years.

We can count that total number directly, by wrapping the above:

SELECT COUNT(*) AS [n years]

FROM

    (

    SELECT DISTINCT [TAX YEAR]

    FROM [Res Value by Year]

    --ORDER BY [TAX YEAR] ASC

    )

;

The ORDER BY clause would make no difference here so it is commented out.

[Note that in Manifold 8 we could count distinct values without using a wrap.

SELECT COUNT(DISTINCT [TAX YEAR]AS [n years]

FROM [Res Value by Year]

;

We can't do the same in Manifold 9. I assume that is intentional--there will probably be a very good reason. I hadn't noticed this before but should should have.

]

So that is the total number of years for the whole table, combining all accounts together: 5.

Now for each account, if it has records available for 5 years, it is in. Is it has records available for fewer than 5 years, it is out. (It can't have records for more than 5 years, or the count for the whole table would also be more than 5.)

So now we need to know the number of years of data available for each account, taken separately.

In Manifold 8 we could have done this:

SELECT COUNT(DISTINCT [TAX YEAR]as [n years]

FROM [Res Value by Year]

GROUP BY [ACCOUNTNO]

;

In Manifold 9 we need* to use a wrap. [*Or see the alternative below.]

SELECT

    [ACCOUNTNO],

    COUNT([TAX YEAR]AS [n years] -- not COUNT(*)

FROM

    (

    SELECT DISTINCT [ACCOUNTNO][TAX YEAR]

    FROM [Res Value by Year]

    )

GROUP BY [ACCOUNTNO]

;

[*An alternative is to use COLLECT, passing the result to a custom aggregate function:

FUNCTION CountRecords(t TABLE) FLOAT64 AS

    (SELECT COUNT(*) FROM t)

    END

;

SELECT

    [ACCOUNTNO],

    CountRecords(

        (COLLECT DISTINCT [TAX YEAR])

        ) AS [n years]

FROM [Res Value by Year]

GROUP BY [ACCOUNTNO]

;

]

So now we have two results: the number of available years for all accounts taken together, and the number of available years for each record taken separately.

We want to exclude any records where the second number is less than the first.

adamw


10,447 post(s)
#10-Feb-18 15:03

For the last query, CountRecords is just ValueCount - we have that function built-in.

We don't have DISTINCT *inside* Count(...) because we want to avoid keywords inside of functions. We'd rather make it easy to do what keywords do via other means (eg, COLLECT). We want to avoid keywords inside of functions mostly because we want functions to be extensible and come from scripts / DLLs / etc, and we neither want custom functions to be second-class citizens and be unable to use keywords, nor do we want to provide means to define custom keywords for them.

tjhb
10,094 post(s)
#11-Feb-18 00:16

Thanks Adam, that's much cleaner.

Let's compare timings for the three versions, using Mike's dataset (fullfetch on, rough average of 3 successive runs for each version). Apologies for repeating queries from above, no other obvious way to be clear.

(1) Using DISTINCT -> wrap -> COUNT:

SELECT

    [ACCOUNTNO],

    COUNT([TAX YEAR]AS [n years]

FROM

    (

    SELECT DISTINCT [ACCOUNTNO][TAX YEAR]

    FROM [Res Value by Year]

    )

GROUP BY [ACCOUNTNO]

;

0.390s.

(2) Using COLLECT -> custom function:

FUNCTION CountRecords(t TABLE) FLOAT64 AS

    (SELECT COUNT(*) FROM t)

    END

;

SELECT

    [ACCOUNTNO],

    CountRecords(

        (COLLECT DISTINCT [TAX YEAR])

        ) AS [n years]

FROM [Res Value by Year]

GROUP BY [ACCOUNTNO]

;

0.750s. (Without recompilation the same, 0.750s.)

(3) Using COLLECT -> ValueCount (no custom function), Adam's suggestion:

SELECT

    [ACCOUNTNO],

    ValueCount(

        (COLLECT DISTINCT [TAX YEAR])

        ) AS [n years]

FROM [Res Value by Year]

GROUP BY [ACCOUNTNO]

;

0.422s.

Comment: COLLECT is very fast.


Thanks Adam also for explaining the general absence of <aggregate-function>(DISTINCT... ) constructs.

All the same, standard SQL aggregate functions are not quite the same as Manifold-specific aggregate functions* or other built-in functions (or custom functions of course).* They are just that: standard SQL.

I think the departure from standard warrants a note on the manual, along with guidance towards what to do instead. Viz., use DISTINCT then wrap, or DISTINCT inside COLLECT, as above (and maybe other things).

(*Or maybe they are the same, in SQL9. Maybe that's the point. Anyway a note would not hurt.)

Mike Pelletier

2,122 post(s)
#14-Feb-18 00:08

Tim this query and the prefilter query down below are working very well. Many thanks for that!

It seems interesting how that Function works. I suppose the function sets up a future table called "t" and the 2nd query somehow knows to put the VALUES statement into "t". I'm sure that is a very poor explanation. What would be a better way to explain it?

I created a new table with the value info, [n years], and a parcel geom that matches the accountno. Then created the following SQL which allows me to run the growth rate calculation on the selection in the parcels. Much better for exploring the data. Thanks!

Function f1() Table As

(select '1' as A, avg([land val]) as B

 from call selection([Res Value by Year with n year], true

    WHERE [TAX YEAR] = '2017'

 AND [N YEARS] = 5

)

End;

Function F2() Table As

(select '1' as AA, avg([land val]) as c

 from call selection([Res Value by Year with n year], true

    WHERE [TAX YEAR] = '2010'

 AND [N YEARS] = 5

)

END;

Select 

 A, 

 RoundDecs((Pow([B] / [C], 1 / 7) - 1) * 100, 2)  AS [compound rate]

from call f1()

join call f2()

on A = AA

adamw


10,447 post(s)
#14-Feb-18 07:24

You can drop artificial fields named 'A' and 'AA' and reword your join to: ... FROM CALL f1(), CALL f2();

In fact, you can make functions return values:

--SQL9

FUNCTION f1() FLOAT64 AS (

  SELECT avg([land val])

  FROM CALL selection([Res Value by Year with n year], true) 

  WHERE [TAX YEAR] = '2017' AND [N YEARS] = 5

END;

FUNCTION f2() FLOAT64 AS (

  SELECT avg([land val])

  FROM CALL selection([Res Value by Year with n year], true) 

  WHERE [TAX YEAR] = '2010' AND [N YEARS] = 5

END;

VALUES (

  RoundDecs((Pow(f1() / f2(), 1 / 7) - 1) * 100, 2)

);

Better yet, we can use a function with a descriptive name and parameters:

--SQL9

FUNCTION AvgLandVal(y NVARCHAR, n INT32FLOAT64 AS (

  SELECT avg([land val])

  FROM CALL selection([Res Value by Year with n year], true) 

  WHERE [TAX YEAR] = y AND [N YEARS] = n

END;

VALUES (

  RoundDecs(

    (Pow(AvgLandVal('2017', 5) / AvgLandVal('2010', 5), 1 / 7) - 1) * 100, 2)

);

(Used model data to test it quickly. The important bit is to not forget to select something.)

Mike Pelletier

2,122 post(s)
#14-Feb-18 17:51

Thanks Adam. I like the approach of using functions to break up queries into a series of steps. A longer query is a reasonable price to pay for making it easier to understand and test portions of the query. Fast queries are great but 99% of the time my brain is the hangup.

Also, I used that approach because in the example with descriptive names and parameters, I don't see how to do a more complicated condition like "n" = 3, 4, or 5. Is that possible?

adamw


10,447 post(s)
#15-Feb-18 07:03

... I don't see how to do a more complicated condition like "n" = 3, 4, or 5. Is that possible?

Do you mean filtering with [N YEARS] equaling not just 5, but, say, 3 or 4? If so, that's what the parameter already does: AvgLandVal('2017', 3) will compute the average on records with [TAX YEAR] equal to '2017' and [N YEARS] equal to 3.

Mike Pelletier

2,122 post(s)
#15-Feb-18 15:27

Sorry for not being clear Adam. Can the parameter somehow be a condition like in a list: IN ("3", "4", "5") ?

tjhb
10,094 post(s)
#16-Feb-18 03:22

That syntax is perfectly good in itself Mike. The question is how you would pass several values to the function.

First, if you know that there will be exactly three values, then you can pass them as arguments a, b, c, then use IN (a, b, c) inside the function. That's fine.

You could also pack them into an xN value, for compatible value types. (Not strings, although you could pack and unpack with extra work.)

Otherwise, what can we do?

We can't pass a list or tuple, as such, to a function.

But can pass a table, and a table with a single record resembles a tuple/list/vector, just with a different name.

A specific example might help.

Mike Pelletier

2,122 post(s)
#16-Feb-18 03:54

Thanks for the info Tim. I'm happy just sticking with function that returns values and allows for conditions using a list of values. More flexible, easier to understand (and remember!), and not much more writing. At least for what I'm currently doing :-)

tjhb
10,094 post(s)
#09-Feb-18 01:40

Note that you can easily adapt a query like this to use parameters, so you can easily run it on different subsets.

Mike Pelletier

2,122 post(s)
#09-Feb-18 02:51

Very cool Tim. Thanks for tackling this. Look forward to working on this tomorrow.

This got me thinking about a way to better understand the results of a query beyond just getting the number of records returned via the log. What about a button or keystroke to run stats on the results? Something like for each field the number of unique values, high/low values, etc. Maybe right click on a field heading in the result table and get that info for just that field would be better. Seems like it could be helpful in diagnosing problems with the query as well as understanding the result of good queries.

Getting more complicated, maybe a way to see relationships between the results and the original data, such as in your 2nd post above. Maybe be able to select a value in one field and do a special click in another field and get stats on the values in the 2nd field that match records (in the first field) from the first click. Maybe shut up and let Manifold work on other more important things :-)

Dimitri


7,413 post(s)
#09-Feb-18 06:28

Maybe shut up and let Manifold work on other more important things :-)

No, never shut up. :-) Good ideas are always great! Bad ideas are good too! They may not happen tomorrow, or next month, or even next year, but always it helps to smarten up the thinking that goes into the interface, into what is possible, what might be useful.

I think (thinking out loud...) where all this might belong is in the Component panel of the Contents pane, where maybe there is a analytics tab that provide such overall characteristics of tables. Or, maybe this is a role for some sort of new floating read-out instrument, like a viewbot, that reports what you want for whatever table (including, of course, a results table) has the focus.

dchall8
1,008 post(s)
#09-Feb-18 23:48

My goal with the data is to calculate the average annual growth rate of values for each account for the 7 years the data is provided and be able to use where clauses to run it on different subsets.

I can't comment on the query but I can comment on the logic. Here is how I interpret your goal:

My goal with the data is to calculate the average annual growth rate of IMP VAL and LAND VAL for each ACCOUNTNO for the 7 TAX YEAR (s) the data is provided and be able to use WHERE clauses to run it on different subsets.

I don't see you picking up each ACCOUNTNO. If you want average growth for the 7 years for each ACCOUNTNO, you find the newest value of IMP VAL for that ACCOUNTNO based on the most recent TAX YEAR, subtract the oldest value of IMP VAL for that ACCOUNTNO based on the oldest TAX YEAR, and divide by the difference between the newest and oldest IMP VAL. I don't see that you have that logic working, because...

1. you are not selecting IMP VAL data by ACCOUNTNO for the TAX YEAR,

2. the data is not clean. Not every ACCOUNTNO has 7 years coverage of data (some have 1 point (so no average growth), some have 2 years, 3, 5, and 7).

3. Not every ACCOUNTNO increases steadily. ACCOUNTNO R000960 starts at IMP VAL of 67,000, jumps to 116,865 in one year, falls to 74,448 two years later, falls slightly to 70,000 over the next two years, and doubles to 148,428 over the final two years. What kind of analysis are you doing that would be valid through wild value swings?

tjhb
10,094 post(s)
#10-Feb-18 03:03

Most of that is right--I had the same concern as you see, though I went the other way.

One small correction: the range of tax years is 2010 to 2017 (7), but the number of tax years for which there is any data is only 5, namely for 2010, 2011, 2013, 1015, 2017.

How to tell that quickly and accurately? Dimitri and I have both been trying to persuade you to give SQL a shot. Here is an ideal chance. Try running this query. The last line is just cosmetic--you can leave it out.

SELECT DISTINCT [TAX YEAR]

FROM [Res Value by Year]

ORDER BY [TAX YEAR] ASC

;

Maybe that will start to persuade you! (Yes, in Manifold 8 you can do the same in the Query bar, using Uniques, then clicking on the field header to do the sorting. Fair point.)

Anyway, getting back to Mike's goal--above all you are right that getting the purpose straight in English is most important.

Here is how I read it--writing it out in full the way you have done (which is great).

My goal with the data is to calculate a notional compounding annual growth rate of IMP VAL, averaged over the period 2010 to 2017, and over all accounts for which there is data for all of the 5 available years.

But it's not clear and, being more familiar with US property tax data than I will ever be (we don't have any property taxes in New Zealand), you are more likely to be right than I am.

Mike Pelletier

2,122 post(s)
#10-Feb-18 04:45

Correct interpretation Tim. The story is that taxes are assessed once every 2 years and 2010 is really just 2009 data.

Mike Pelletier

2,122 post(s)
#10-Feb-18 04:43

Thanks for spotting that. I hadn't yet looked at that level of detail but was going to after I figured out how to remove unwanted accounts that didn't have 5 years worth of data.

Tools for easily exploring the data would be helpful as Dimitri discussed a bit above.

dchall8
1,008 post(s)
#14-Feb-18 15:48

Would an Excel solution work for you? I put the table into Excel and did the following

  1. Added ten new columns (2010, 2011, 2013, 2015, 2017, All 5?, Avg Imp Growth, Avg Land Growth, AVG AVG Imp, and AVG AVG Land)

  • Sort table by ACCOUNTNO
  • Added a Filter on the top row (select top row then DATA menu > Filter)
  • Filtered TAX YEAR by 2010 and put an x into the 2010 column for selected accounts
  • Repeat step 4 filtering for each tax year column
  • Turned off the TAX YEAR filters

  • Put an IF statement into the All 5 ? column to flag those with x's in all five tax year columns

  • Put an IF statement into Avg Imp Growth column to average the growth over the 7 years for flagged records

  • Put an IF statement into Avg Land Growth column to average the growth over the 7 years for flagged records

  • Filtered the All 5? column for TRUE (There were 394 hits with all five tax years).

  • Put a SUBTOTAL statement into the top cell for the AVG AVG IMP and the AVG AVG Land columns

    The overall average growth for imps is -$301.94 per year and for land it's -$4,752.72 per year.

  • dchall8
    1,008 post(s)
    #14-Feb-18 16:13

    I'll try to attach the 3MB Excel file

    Attachments:
    Res Value by Year.xlsx

    Mike Pelletier

    2,122 post(s)
    #14-Feb-18 17:34

    Thanks for doing this. It helps show the pluses and minuses of learning the SQL approach. Using excel breaks the steps into little pieces, is easily repeatable in the future, and is less likely to suffer from a forgetful mind that doesn't use SQL often enough. On the flip side, SQL merges many steps and can rapidly be rerun on new datasets. The SQL knowledge can be used in other projects. It's the typewriter vs. word processor scenario. If your going to use it often enough, it's worth it. If not, then excel is a great alternative.

    dchall8
    1,008 post(s)
    #14-Feb-18 19:18

    Agree.

    I have a friend who used to be on the development team for a Windows database called Fox. He once told me to use a spreadsheet when it doesn't matter if the data gets corrupted. Use a database when a corrupted database is unforgivable. That made all kinds of sense to me. In other words, use spreadsheets for quick and dirty analysis. At the time (late 1980s) I was working in jet engine logistics using spreadsheets to predict parts failures. Programming the math in a spreadsheet was not hard. I don't see why it would be any harder to do in a database, but at that time we did not have a desktop database in the office.

    Also I did not get into filtering for your subset data. I'm not sure which subsets you want, but the filters are already there. This works a lot better if you insert a new row in Excel above the top row. Then move the SUBTOTAL cells into that new row above the columns they are totaling. Then you can filter, for example, TAX AREA = 200 to find the AVG AVG IMP and Land values to be $20,047.55 and -$11,862.36. I could further filter on OCCCODE = 30 and get new averages. I could further filter on IMP SF > 1,000 and get new averages.

    I'm not arguing against SQL. I'm getting a quick and dirty answer using the tool I am familiar with. Those who are SQL wizards do quick and dirty analysis all the time in preference to spreadsheets.

    Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.