Subscribe to this thread
Home - General / All posts - SQL to create pivot table showing cumulative sum of homes built by year
Mike Pelletier

2,122 post(s)
#26-Jan-10 14:57

I'm wondering if its possible to write SQL that will create a pivot table showing the cumulative total number of homes built in various areas by year. Below is a working SQL statement that provides a pivot table showing the number of homes built in EACH year by area. The area drawing is called "watershed". The drawing determining house location is called "taxparcel". The table with year built and residential structure information is "ASR_CAMA_DS_DATA_FF01".

'SQL

transform count(*)

select act_year_blt from watershed, taxparcel

join (select * from ASR_CAMA_DS_DATA_FF01

where model like "SFR%" OR MODEL IN ("CONDOS", "MANUFACTURED_H"))

on ParcelNumber = ParcelNB

where contains(watershed.id, centroid(taxparcel.id))

and taxparceltype in (0, 1, 4)

and ParcelNumber <> ""

group by act_year_blt

pivot [area]

Mike Pelletier

2,122 post(s)
#26-Jan-10 15:21

For those interested, attached is a basic map that might help simplify trying solutions.

Attachments:
Cumulative sum pivot.map

Gustavo Palminha

1,010 post(s)
#26-Jan-10 16:04

Nice. :)


Stay connected:

Linkedin >> http://www.linkedin.com/in/gustavopalminha

Twitter >> http://twitter.com/gustavopalminha

Other >> http://www.spatialmentor.com

tjhb
10,094 post(s)
#26-Jan-10 17:11

Not a pivot query in this version.

The first query counts the houses by year.

-- Query 1

SELECT

    [D1].[Name][D2].[Year],

    COUNT(*) AS [Houses]

FROM

    [Areas] AS [D1]

    LEFT JOIN

    [Houses] AS [D2]

    ON Contains([D1].[Geom (I)][D2].[Geom (I)])

GROUP BY [D1].[Name][D2].[Year];

The second query calls the first query twice. For each area, for each year, the number of houses built in the same area during that year or earlier is listed, then those rows are grouped for a cumulative sum.

-- Query 2

SELECT

    [T2].[Name][T2].[Year],

    SUM([T1].[Houses]AS [Houses]

FROM

    [Query 1] AS [T1]

    LEFT JOIN

    [Query 2] AS [T2]

    ON [T1].[Name] = [T2].[Name]

    AND [T1].[Year] <= [T2].[Year]

GROUP BY [T2].[Name][T2].[Year];

tjhb
10,094 post(s)
#26-Jan-10 17:52

[More natural to switch [T1] and [T2] around for the second query. But makes no difference.]

-- Query 2

SELECT

    [T1].[Name][T1].[Year],

    SUM([T2].[Houses]AS [Houses]

FROM

    [Query 1] AS [T1]

    RIGHT JOIN

    [Query 2] AS [T2]

    ON [T1].[Name] = [T2].[Name]

    AND [T1].[Year] >= [T2].[Year]

GROUP BY [T1].[Name][T1].[Year];

pcardoso


1,664 post(s)
#27-Jan-10 02:47

Query 2 circular reference.

tjhb
10,094 post(s)
#27-Jan-10 11:11

Yes oops, a bad typo. Both instances should say "Query 1" of course.

Mike Pelletier

2,122 post(s)
#27-Jan-10 09:37

Excellent Tim. With a little tweaking I got it working. Per your description in the second query, changing the second query reference from [Query 2] to [Query 1] gets rid of the "circular reference" Paulo mentions below. Also adding this 3rd query below transforms the result from the second query into the desired pivot table. Thanks again!

'SQL

TRANSFORM sum([houses]) select [year] from [query 2]

group by [year]

PIVOT [Name]

Attachments:
Cumulative sum pivot.map

Sergio40 post(s)
#19-May-16 16:42

Nice indeed.

I am looking for a way to build a final output where the cumulative value would remain constant, inheriting that of the previous year, when no change is taking place. Using the same example, something like:

Year ; A

2000 ; 3

2001 ; 3

2008 ; 4

2009 ; 5

instead of:

Year ; A

2000 ; 3

2001 ; NIL

2008 ; 4

2009 ; 5

As in the example above. Can anyone help me please ?


s

tjhb
10,094 post(s)
#19-May-16 22:12

Sergio,

where the cumulative value would remain constant, inheriting that of the previous year, when no change is taking place

On the assumptions in the example data above, that can't happen.

Either there some object (house) attributed with a given year (say, 2001) or there is none.

If there is some (one or more), then the count for that year is > 0, and the cumulative value must increase.

If there is none, then there is no group for that year--so, not a count of zero, but no count at all (even if we use Coalesce or a different counting method).

I think you must be thinking of a slightly different scenario. Can you post some sample data?

Sergio40 post(s)
#20-May-16 14:45

Sorry I did not manage to make myself clear. I think the "Cumulative sum pivot".map attached may well help illustrating the purpose. Start looking at the total number of houses built by area in each year.

In year 2000:

- 3 houses were built in area A

- No house was built in any the other area

Therefore the Cumulative sum of houses built for that year is 3 for A and zero for the other areas (assuming there were no houses built in any area prior to 2000).

In year 2001:

- 3 new houses in B

- No new house in any other area.

Therefore the calculated value of the Cumulative sum should be:

- 3 in area A (the nr of houses built in the previous years=3 plus zero, the amount of this year).

- 3 in area B (the nr of houses built in the previous years =0 plus 3, the amount of this year).

In subsequent years:

my assumption is that the cumulative number of houses built in a specific given area and given year should be the sum of the houses newly built in that given area and year plus the number of houses built in that area all previous years.

Attachments:
01 Total houses by area in each year.jpg
02 Cumulative sum of houses by area by year.jpg


s

Sergio40 post(s)
#20-May-16 15:04

... () I realise that the query adds the number of new houses to the previous one IF there is one house built in the area. What I am after, is a variation of the query that adds zero to that value if no house is built, so that the cumulative sum remains constant when there is no new house built.


s

tjhb
10,094 post(s)
#20-May-16 16:51

Thank you so much, that is very clear. I should have understood better before.

So yes: we need to make a list of all distinct years (for all points), and CROSS JOIN that to the table of areas before LEFT JOINing back to the points for grouping on Touches(). That will give an empty group for each area for each year in which no houses were built. Then we can do the cumulative sum as before.

I also think the original query had a bug (for its own purpose).

tjhb
10,094 post(s)
#20-May-16 23:38

So try this.

New query 1b:

-- Query 1b

SELECT

    [T].[Name][T].[Year],

    COUNT([D2].[ID]AS [Houses]

FROM

    (SELECT

         -- pivot areas by year

        [D1].[ID][D1].[Name],

        [P].[Year]

    FROM

        [Areas] AS [D1]

        CROSS JOIN

        (SELECT DISTINCT [Year]

        FROM [Houses]

        ) AS [P]

    ) AS [T]

    LEFT JOIN

    [Houses] AS [D2]

    ON [T].[Year] = [D2].[Year]

    AND Touches([T].[ID][D2].[ID]-- native e

    -- NB switch order of conditions

    -- for a very large dataset

GROUP BY [T].[Name][T].[Year]

;

New query 2b (calls 1b):

-- Query 2b

SELECT

    [Name][Year],

    [Houses]

FROM

    (SELECT

        [T1].[Name][T1].[Year],

        SUM([T2].[Houses]AS [Houses]

    FROM

        [Query 1b] AS [T1]

        RIGHT JOIN

        [Query 1b] AS [T2]

        ON [T1].[Name] = [T2].[Name]

        AND [T1].[Year] >= [T2].[Year]

    GROUP BY [T1].[Name][T1].[Year]

    )

ORDER BY

    [Name] ASC,

    [Year] ASC

;

Attachments:
Query 1b.txt
Query 2b.txt

tjhb
10,094 post(s)
#20-May-16 23:41

Revised versions of the original queries. The bug in query 1 is noted in comments. (It only arose for an area containing no points--not in the example data.)

-- Query 1a

SELECT

    [D1].[Name][D2].[Year],

    --COUNT(*) AS [Houses']

        -- incorrect with LEFT JOIN

        -- counts unmatched rows (if any)

    COUNT([D2].[ID]AS [Houses]

        -- does not count unmatched rows

FROM

    [Areas] AS [D1]

    LEFT JOIN

    [Houses] AS [D2]

    ON Contains([D1].[Geom (I)][D2].[Geom (I)])

GROUP BY [D1].[Name][D2].[Year]

;

.

-- Query 2a

SELECT

    [Name][Year],

    [Houses]

FROM

    (SELECT

        [T1].[Name][T1].[Year],

        SUM([T2].[Houses]AS [Houses]

    FROM

        [Query 1a] AS [T1]

        RIGHT JOIN

        [Query 1a] AS [T2]

        ON [T1].[Name] = [T2].[Name]

        AND [T1].[Year] >= [T2].[Year]

    GROUP BY [T1].[Name][T1].[Year]

    )

ORDER BY

    [Name] ASC,

    [Year] ASC

;

Attachments:
Query 1a.txt
Query 2a.txt

tjhb
10,094 post(s)
#20-May-16 23:42

Map file for both a and b versions.

Attachments:
Cumulative sum pivot ab.map

Sergio40 post(s)
#21-May-16 12:58

Thank you!

it works exactly as expected !


s

pcardoso


1,664 post(s)
#27-Jan-10 06:18

Let me give a R perspective, if i got the problem correctly.

if (!require(sp)) install.packages("sp",repos="http://cran.r-project.org");library(sp)

if (!require(maptools)) install.packages("maptools",repos="http://cran.r-project.org");library(maptools)

if (!require(ggplot2)) install.packages("ggplot2",repos="http://cran.r-project.org");library(ggplot2)

#! == Reading Shapefiles ===============

#! === Areas ===

areas <- readShapeSpatial('D:/R/Areas.shp') #! read shapefile

areasCentroids <- data.frame(coordinates(areas)) #! get centroids - just for ploting reasons

areasID <- data.frame(areas@data,idI=seq(1:nrow(areas@data))) #! Dataframe com a ordem ads quadículas na shapefile e o respectivo ID (código)

str(areasID) #! verify dataframe structure

#areasID <- cbind(areasID,areasCentroide)

#plot(grelhaID$X1,grelhaID$X2);text(grelhaID$X1,grelhaID$X2,labels=grelhaID$Codigo)

#! === Points ===

houses <- readShapeSpatial('D:/R/Houses.shp') #! read shapefile

houses$houseID <- row.names(houses) #! add an ID for each entry of the dataframe

plot(areas);plot(houses,add=T);text(areasCentroids,labels=areasID$Name,col="red");text(houses@ coords+8,labels=houses$houseID) #! ploting

#! == Overlay ==========================

##Note:

#### points on a polygon boundary and points corresponding to a polygon vertex are considered to be

#### inside the polygon

cross <- overlay(areas, houses);cross$houseID <- row.names(cross) #! Overlay points to areas

cross_house <- merge(cross,houses,by="houseID",all=T) #! associate houses data to overlay table

cross_house <- cross_house[order(cross_house$Name,cross_house$Year),] #! ordering things

cross_house

#str(cross_house)

#! == Cummulative Sums of houses by area along the years ===

TNumb <- ddply(cross_house,c("Name""Year"),summarise,Numb=length(Name))

cum_Sum <- ddply(TNumb, "Name", transform, NT = cumsum(Numb))

 

Mike Pelletier

2,122 post(s)
#27-Jan-10 09:43

Thanks for adding this Paulo. I haven't yet dived into R as I've been hoping Manifold 9 might make it easier to work between the two packages and seem to be able to keep finding answers inside Manifold. Looks like R might sometimes offer an easier solution than SQL.

pcardoso


1,664 post(s)
#27-Jan-10 11:21

As you see, the data manipulation "strictly" dependent of R is restricted to the last tow lines, which turns the problem very easy to solve with R. Any supported integration between R and Manifold would make things much easier and extend perspectives of data analysis

mdsumner


4,260 post(s)
#19-May-16 22:51

Hi pcardoso, some updates for current R version - also can use manifoldr for direct read from .map (this is optional - requires devtools for installation).

Attachments:
v_3.3.0.r


https://github.com/mdsumner

tjhb
10,094 post(s)
#19-May-16 22:57

Very nice. So like SQL. (To a foreigner.)

Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.