Reclassify image using GPU
tjhb
#13-Sep-19 00:35

In this thread of Volker's I mentioned in brackets that

(BTW in 9 this step [image reclassification] could be done all on GPU, using multiplication and addition to avoid a conditional.)

I thought I should say what I meant, now that I have tried doing this and got useful results.

I am using Volker's task from that thread as the example. That is, to reclassify an image showing aspect into two classes: where aspect is from -67 to 113° (through ±180° or south), class 1; and where aspect is from -67 to 113° (through 0° or north), class 2.

(I treat the boundaries slightly differently from Volker, using >= -68 and < 113 for class 1, and < -68 or >= 113 for class 2, dividing the compass exactly in half.)

The data I am using is an 8m DEM of the North Island of New Zealand, 73728 x 106496 pixels, single-precision floating-point (FLOAT32).


(1)

First, how would we do this in Manifold 8? We have two language interfaces: Spatial SQL and Surface Tools.

In SQL we might do this (remembering that Height (I) in this case encodes aspect):

--SQL

UPDATE [Surface]
  SET [Height (I)] =
  CASE
    WHEN [Height (I)] >= -67 AND [Height (I)] < 133 THEN 1
    WHEN [Height (I)] < -67 OR [Height (I)] >= 133 THEN 2
  END;

We don't really need the second condition, since if the first condition fails then the second follows necessarily. I include both for the sake of the example.

This query will execute entirely on CPU, since Spatial SQL is not GPU-enabled in Manifold 8.

Using Surface Tools, we can do it like this:

IIf([Surface] >= -67, 1,
  IIf([Surface] < 133, 1,
    IIf([Surface] < -67, 2,
      IIf([Surface] >= 133, 2,
        Null))))

Many Surface Transform functions in Manifold 8 are GPU-enabled. However, a conditional expression is always evaluated on CPU. That is because, while conditional statements can be evaluated on GPU (using CUDA C), there is no point in doing so: the necessary branch synchronization negates any advantage to be gained from GPU parallelism. CPUs are simply better at branching.
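
(An aside on the arithmetic trick the whole approach rests on: when a condition is materialized as 0 or 1 rather than branched on, a conditional select collapses into multiplication and addition, so every pixel runs the same straight-line instruction stream. A minimal sketch in Python, purely as an illustration of the identity, nothing to do with Manifold syntax:

# Python
# Branchless select: for c equal to 0 or 1,
#   iif(c, a, b) == c * a + (1 - c) * b
def branchless_iif(c, a, b):
    return c * a + (1 - c) * b

assert branchless_iif(1, 10, 20) == 10  # condition true  -> a
assert branchless_iif(0, 10, 20) == 20  # condition false -> b

This is exactly the shape the Manifold 9 query below will take: class values multiplied by 0-or-1 tests, then summed.)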

(As you'll see, I'm begging a question here. Why not use the approach used below for Manifold 9 in Manifold 8 as well? We'll come back to that at the end.)


(2)

Now, what are our options in Manifold 9?

In 9, images are made of tiles, and tiles are made of pixels. So one option would be to unpack all the tiles in the image, using the TileToValues function, then perform the classification on the pixels using SQL (much as in the example for Manifold 8), then rebuild all the pixels back into tiles.

That would work, but it would be a bad idea. First, because the unpacking and repacking of tiles is wasteful. Second, because none of the three processing stages can benefit from parallelism: the unpacking and repacking are just data transport, and classification using a conditional will be kept on the CPU.

Ideally in Manifold 9, we would like to operate on tiles per se, without unpacking.

OK, but how do we do that, if the objective is classification? We can't use the ordinary <, <=, >, >= or BETWEEN comparison operators on tile data. If we could, what would it mean? Would we mean tiles or pixels here?

Well, on the other hand, we can use basic arithmetic operations (+, -, *, /, DIV, MOD, ^) on tile data, plus bitwise operators and a large family of special Tile* functions, including powers, roots and logs, trigonometry, rounding and others. In all of these cases it is clear that we do mean to operate on pixel values--without unpacking their tiles. I think the reason these functions work for tiles while the standard comparison operators do not is, again, that comparison operators involve branching, or conditional statements, which are a poor fit for tile-by-tile operations just as they are a poor fit for GPGPU.

However, amongst the Tile* functions we also have some special comparison functions that do not branch, but simply store (or pass on) the result of their comparison. These functions are TileCompare, TileMin, TileMax, and TileSign. They can be nested, and in combination we can use them to build non-branching equivalents of the comparison operators <, <=, >, >= which we need for image classification.

OK, how? To cut a long story short, like this.

To test whether x < y or x > y (strictly less or greater than), we can take the difference x - y, compare it to zero, and check the sign.

iff x < y then
    -Sign(Min(x - y, 0)) = 1
    (else 0)

iff x > y then
    Sign(Max(x - y, 0)) = 1
    (else 0)

To test whether x <= y or x >= y, we use the fact that each is the negation of a strict test: since the strict tests return exactly 0 or 1, we can simply take the complement.

iff x <= y (that is, not x > y) then
    1 - Sign(Max(x - y, 0)) = 1
    (else 0)

iff x >= y (that is, not x < y) then
    1 - Sign(Max(y - x, 0)) = 1
    (else 0)

Since the results of these four tests are binary (either 0 or 1), we can multiply and add them together to produce compound tests, still without using a conditional expression or introducing branching.
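
To see the whole recipe work outside Manifold, here it is transcribed into Python with NumPy--an illustration of the arithmetic only, not Manifold code. np.sign, np.minimum and np.maximum stand in for TileSign, TileMin and TileMax; each operates elementwise on a whole array, much as the Tile* functions operate on a whole tile.

# Python
import numpy as np

def lt(x, y): return -np.sign(np.minimum(x - y, 0))     # 1 where x < y, else 0
def gt(x, y): return np.sign(np.maximum(x - y, 0))      # 1 where x > y, else 0
def le(x, y): return 1 - np.sign(np.maximum(x - y, 0))  # 1 where x <= y, else 0
def ge(x, y): return 1 - np.sign(np.maximum(y - x, 0))  # 1 where x >= y, else 0

# Volker's classification, branch-free: the product acts as AND, the
# sum as OR (safe here because the two ranges cannot both hold).
aspect = np.array([-180.0, -68.0, 0.0, 112.9, 113.0, 179.0])
classes = 1 * ge(aspect, -68) * lt(aspect, 113) \
        + 2 * (lt(aspect, -68) + ge(aspect, 113))
print(classes)  # [2. 1. 1. 1. 2. 2.]

# Agrees with the branching version:
assert (classes == np.where((aspect >= -68) & (aspect < 113), 1, 2)).all()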


So now we can rewrite the tests used in Manifold 8 for Manifold 9, operating on whole tiles rather than pixels. Moreover, because we have got rid of the conditional statements, we can throw the whole image's worth of tiles at the GPU for efficient massively parallel processing. That's pretty good!

Here is a query to do Volker's image classification in Manifold 9. We only need the expressions for < and >= in this case.

--SQL9

VALUE @target_table TABLE = [Aspect];
VALUE @target_image TABLE = [Aspect image];

PRAGMA (
    'gpgpu' = 'auto',
    --'gpgpu' = 'aggressive',
    --'gpgpu' = 'none',
    'gpgpu.fp' = '32'
    );

UPDATE
    (
    SELECT
        [mfd_id], [Tile],
        CASTV(
            -- when >= -68 and < 113
            1 *
                (1 - TileSign(TileMax(-68 - [Tile], 0)))
                    -- 1 if [Tile] >= -68 else 0
                *
                -TileSign(TileMin([Tile] - 113, 0))
                    -- 1 if [Tile] < 113 else 0
            +
            -- when < -68 or >= 113
            2 *
                (
                -TileSign(TileMin([Tile] - -68, 0))
                    -- 1 if [Tile] < -68 else 0
                +
                (1 - TileSign(TileMax(113 - [Tile], 0)))
                    -- 1 if [Tile] >= 113 else 0
                )
            AS FLOAT32
            ) AS [Tile']
    FROM @target_table
    THREADS SystemCpuCount()
    )
SET [Tile] = [Tile']
;

TABLE CALL TileUpdatePyramids(@target_image)
;

Note the nested SELECT in the UPDATE statement. We could more simply write

--SQL9

UPDATE @target_table
SET [Tile] = CASTV(...)

However, that would be suboptimal, because the UPDATE phase of the query will only make effective use of a single thread (since it is mainly transport). Putting the main work into a subquery with a THREADS directive allows that part of the query to utilize multiple CPU threads (which are used to feed GPU parallelism).

I'll come back soon with some timings (and a note about the question I was begging).

tjhb
#13-Sep-19 01:32

Sorry about the typos in the first section and in the Manifold 8 SQL and Surface Transform code!

Range boundaries should be -68 and < 113 to match Manifold 9 code, not (variously) -67 or 133. Very bad proofreading.

tjhb
#13-Sep-19 02:37

Here are some comparative timings for Manifold 9, with and without GPGPU.

Hardware is Intel i7-4790K (4 physical cores, 8 logical), NVIDIA GTX TITAN 6GB, 32GB RAM, TEMP on SSD.

Project is as mentioned in first post: aspect derived from 8m DEM of the North Island of New Zealand, 73728 x 106496 pixels, FLOAT32.

Total times are for 3 phases combined: nested SELECT, UPDATE, TileUpdatePyramids. CPU and GPU usage are given for the first phase (the only phase that is varied). CPU usage for the second phase is about 15%, and for the third phase about 100%.

(1) PRAGMA ('gpgpu' = 'auto')

  • (a) Using 5 threads. First phase CPU 70%, GPU 94%. Total time 158s (2mn 38s).
  • (b) Using 8 threads. First phase CPU 100%, GPU 94%. Total time 159s (2mn 39s).

(2) PRAGMA ('gpgpu' = 'aggressive')

  • (a) Using 5 threads. First phase CPU 70%, GPU 94%. Total time 150s (2mn 30s).
  • (b) Using 8 threads. First phase CPU 100%, GPU 94%. Total time 151s (2mn 31s).

(3) PRAGMA ('gpgpu' = 'none')

  • (a) Using 5 threads. First phase CPU 66%, GPU 0%. Total time 253s (4mn 13s).
  • (b) Using 8 threads. First phase CPU 98%, GPU 0%. Total time 225s (3mn 45s).

[Added.]

An interesting variation. I was initially surprised by this, so wanted to confirm it.

This is a variation to the nested SELECT:

SELECT
    [mfd_id], [Tile],
    CASTV(
        [Tile] * 0 -- replace current values
        +
        -- when >= -68 and < 113
        1 *
            (1 - TileSign(TileMax(-68 - [Tile], 0)))
                -- 1 if [Tile] >= -68 else 0
            *
            -TileSign(TileMin([Tile] - 113, 0))
                -- 1 if [Tile] < 113 else 0
        +
        -- when < -68 or >= 113
        2 *
            (
            -TileSign(TileMin([Tile] - -68, 0))
                -- 1 if [Tile] < -68 else 0
            +
            (1 - TileSign(TileMax(113 - [Tile], 0)))
                -- 1 if [Tile] >= 113 else 0
            )
        AS FLOAT32
        ) AS [Tile']
FROM @target_table
THREADS 5

With this change, using PRAGMA ('gpgpu' = 'aggressive') and 5 threads, I get a consistently improved time of ~133s (2mn 13s), which is significant. I imagine this might have to do with what data is copied to GPU or when. (But basically I don't know. Interesting.)

Next some timings for Manifold 8...

danb

#17-Sep-19 05:19

Here are my timings, same image as Tim.

'gpgpu' = 'auto'
2019-09-17 14:05:00 -- Query: [Query] (129.102 sec)

'gpgpu' = 'aggressive'
2019-09-17 14:08:30 -- Query: [Query] (128.827 sec)

'gpgpu' = 'none'
2019-09-17 14:12:33 -- Query: [Query] (196.838 sec)

I have two Quadro K2200s in my box. Both show a usage graph on Compute_0 similar in shape to the CPU graph, with ~80% utilization, and close to 100% on CPU for the first part of the transform execution.

6 physical cores, 12 logical

Temp also on SSD



Dimitri


#17-Sep-19 18:43

The data I am using is an 8m DEM of the North Island of New Zealand, 73728 x 106496 pixels, single-precision floating-point (FLOAT32).

Is that LINZ data? Do you have a link?

danb

#19-Sep-19 04:55

Hi Dimitri, it sure is. It was made by Tim and can be downloaded from:

https://data.linz.govt.nz/layer/51768-nz-8m-digital-elevation-model-2012/



tjhb
#19-Sep-19 05:56

Sorry for my silence on that. I put my sources on Dropbox today (they took a while...) and will post .mxb links very soon.

Dimitri


#19-Sep-19 08:57

Thanks for the link. I tried downloading it, but the page says at 16+GB it is over the size limit of 3.5GB for downloads. Must it be downloaded in sub 3.5GB portions?

tjhb
#19-Sep-19 22:38

The links are up now.

North Island (DEM 2.1 north m9 c.mxb, 5.5GB): public link.

South Island (DEM 2.1 south m9 c.mxb, 7.26GB): public link.

These DEMs differ slightly from the publicly available versions. The only difference I know of is that the as-produced versions have elevations (largely nominal) for known rocks and reefs in the sea, while the public version masks them to zero.

A replacement version has long been promised (by me) and still might happen. Expanses of low relief are currently poorly represented because hydrology data is ignored.

No guarantee as to how long the links will stay up.

tjhb
#13-Sep-19 05:48

I made a Manifold 8 test, then realised that my Surface Transform function was fundamentally incorrect in logic.

It should have been

IIf([Aspect] >= -68 And [Aspect] < 113, 1,
  IIf([Aspect] < -68 Or [Aspect] >= 113, 2,
    Null))

(Much more like the Manifold 8 and Manifold 9 SQL.)

I will test again.

tjhb
#13-Sep-19 07:46

The corrected Manifold 8 test completed in approximately 1h 15mn.

Approximately, because the History pane in Manifold 8 does not time Surface Transform operations fully: some initial setup is excluded. So as well as noting the logged time, we need to note start and finish wall time, which can be a bit hit and miss (just a question of noticing).

FWIW the time logged in the History pane was 2829s (47mn 9s), excluding initial setup.

There was no GPU usage, as expected since all expressions are conditional.

CPU usage was 15% or less, sometimes much less when the task became strongly transport-bound (shown as sustained heavy disk usage on the drive holding TEMP). Memory usage never approached maximum system RAM.

In all, this was a test at which Manifold 8 was never going to shine, and comparison with Manifold 9 almost seems unfair.

But there we are. ~2mn 30s versus ~1h 15mn--I can live with that.

What strikes me most, I think, is the broadly similar times recorded for Manifold 9 with and without GPU. Even when GPGPU is not in play, Manifold 9 appears to take best leverage from vectorized processing on CPU (Intel AVX). Fantastic.

Dimitri


#13-Sep-19 08:40

Manifold 9 appears to take best leverage from vectorized processing on CPU

It's easy to forget how powerful CPUs are at math. Given that we are on the edge of seeing huge price drops per core in modern CPUs, I think we'll see an even bigger role for parallel CPU computation, with fewer scenarios where it is worth dispatching to GPU.

AMD is really shaking up the per-core CPU market with the new Ryzen 9 generation and low prices on newer Threadrippers. There is still way more demand than supply for the 12 core / 24 thread Ryzen 9 3900x, so prices are above suggested retail, but as AMD gets more of them on the market, and when the 16 core / 32 thread Ryzen 9 3950x comes out at the end of September (rumor has it 30 September is the day), prices on the 3900x will decline fairly rapidly to the suggested retail of $499 or less.

That's the same price as an 8 core / 16 thread Core i9, so Intel has got to respond.

You can buy a second generation 24 core / 48 thread 2970WX Threadripper for about $900. If you figure that you'll spend around $500 anyway for the CPU in a new GIS machine for a working professional, that's only $400 extra to take you from 16 threads using an Intel CPU to 48 threads using the AMD Threadripper.

For compute-bound jobs, you'll likely be using an M.2 SSD, so it's suddenly maybe wiser to put the extra $400 into getting a zillion more threads in the CPU than into either a) more expensive or very large RAM, or b) a much more expensive GPU card.

Manifold has traditionally focused on showing how parallelism lets even very inexpensive configurations, like a $90 AMD FX CPU, perform better than exotic, expensive CPUs run single-core. I think that's good, but maybe it's time to move the needle upwards: Instead of doing comparisons using $90 CPUs, maybe it is time to start doing comparisons with what is becoming mainstream for professional users, Ryzen 9 and Threadrippers that can deliver 24, 32, or 48 threads.

tjhb
#13-Sep-19 09:00

I will be buying a Ryzen 9 3950x as soon as they are available. Thanks for the heads up re 30 September.

But I'm also very much in favour of the solo-chiplet 3700x, which might be a sweet-spot benchmark CPU, since it is so affordable and widely available.

Dimitri


#16-Sep-19 11:42

I will be buying a Ryzen 9 3950x as soon as they are available. Thanks for the heads up re 30 September.

Unfortunately, it could be months after 30 Sept that the 3950x will be available in sufficient quantities for prices to get down to recommended retail. But I agree, it's a great CPU to buy if you can get one anywhere near the rumored $700 list price.

Only now is availability on the 3900x (12 core / 24 threads) loosening up, with prices starting to come down. At $499 or less the 3900x is a cosmic deal. I *love* the 3900x at that price. :-)

I've been playing around with a 3900x machine running off fast M.2 SSD, and Manifold on it is supernaturally fast. It's wild to see 24 cores in use 100%. Below is a screenshot with a very small GPU card (384 cores, a GeForce 1030 just used for booting), so all 24 cores go to 100% use computing a curvature.

But what's also interesting is being able to work with larger vectors, like all roads in the United States, and have everything happen in real time. In fact, 9 working with all roads in the US, even re-projecting them on the fly from lat/lon into Pseudo Mercator in a map, can pan and zoom much faster than Bing can serve tiles. It's basically realtime--no delay panning, zooming, etc--with over 22 million vector objects.

24 cores at 100%:

Attachments:
24 cores at 100 percent.png

tjhb
#14-Sep-19 04:31

Returning to the question--

Why not use the approach used below for Manifold 9 in Manifold 8 as well?

--the answer is that we can, but execution time is not improved.

Using the following Surface Transform in Manifold 8, on the same hardware and test data as used for all the above tests--

1 *
    (1 - Sgn(Max(-68 - [Aspect], 0)))
    *
    -Sgn(Min([Aspect] - 113, 0))
+
2 *
    (
    -Sgn(Min([Aspect] - -68, 0))
    +
    (1 - Sgn(Max(113 - [Aspect], 0)))
    )

--I get a total time of 1h 28mn. That includes 52mn logged to the History pane, plus 36mn uncounted setup time at the beginning.

CPU usage was generally 1-2%, with occasional spikes to a full core's usage (10-15%). Disk usage on the TEMP drive was high, varying from 15 to 80% (active time), most commonly around 40%. AFAIK, no GPGPU was used, unless usage was less than 1% and could not be detected in Task Manager. So this task in Manifold 8 is disk-bound.

This is no surprise and no criticism of Manifold 8, which was built for much smaller data with software tools from another time, without ground-up parallelism.

The comparison just shows how far Manifold has come with version 9. Happy days.

volker

#15-Sep-19 11:13

Thank you Tim for all of this



tjhb
#16-Sep-19 17:57

Unlikely perhaps, but I have more notes to add to the thread tomorrow.

Thank you to anyone who has bothered to read this far. I will be brief.

tjhb
#20-Sep-19 00:17

Further questions:

(1) Is updating an existing table the best way to do reclassification, or is it better to make a new table and image?

(2) How about halfway between those two options, i.e. writing a new Tile field in the existing table, with a new image drawn from it?


Answers:

(1) Writing a new table with its drawing is much faster than updating the existing table.

Relative timings on same hardware and with same test data as before (always updating pyramids, and discarding the first trial to mitigate caching differences):

Update existing field in source table (as earlier examples): 148s.

Write new table with new drawing: 96s. Code below.

--SQL9

VALUE @source_table TABLE = [Aspect];
VALUE @source_image TABLE = [Aspect image];

PRAGMA (
    'gpgpu' = 'aggressive', -- slightly faster than auto
    'gpgpu.fp' = '32'
    );

CREATE TABLE [Classified]
    (
    [mfd_id] INT64,
    [X] INT32,
    [Y] INT32,
    [Tile] TILE,
    INDEX [mfd_id_x] BTREE ([mfd_id]),
    INDEX [X_Y_Tile_x] RTREE ([X], [Y], [Tile] TILEREDUCE 'INDEXED' TILESIZE (128,128) TILETYPE UINT8),
    PROPERTY 'FieldCoordSystem.Tile' ComponentCoordSystem(@source_image),
    PROPERTY 'FieldTileSize.Tile' '[ 128, 128 ]',
    PROPERTY 'FieldTileType.Tile' 'uint8'
    );

CREATE IMAGE [Classified image]
    (
    PROPERTY 'FieldTile' 'Tile',
    PROPERTY 'FieldX' 'X',
    PROPERTY 'FieldY' 'Y',
    PROPERTY 'Rect' '[ 0, 0, 73728, 106496 ]',
    PROPERTY 'StylePixel' '{ "Channel": 0, "Value": 6842472, "Values": { "1": 6842472, "2": 13421772 } }',
    PROPERTY 'Table' '[Classified]'
    );

INSERT INTO [Classified]
    (
    --[mfd_id],
    [X], [Y], [Tile]
    )
SELECT
    --[mfd_id],
    [X], [Y],
    CASTV(
        [Tile] * 0
        +
        -- when >= -68 and < 113
        1 *
            (1 - TileSign(TileMax(-68 - [Tile], 0)))
                -- 1 if [Tile] >= -68 else 0
            *
            -TileSign(TileMin([Tile] - 113, 0))
                -- 1 if [Tile] < 113 else 0
        +
        -- when < -68 or >= 113
        2 *
            (
            -TileSign(TileMin([Tile] - -68, 0))
                -- 1 if [Tile] < -68 else 0
            +
            (1 - TileSign(TileMax(113 - [Tile], 0)))
                -- 1 if [Tile] >= 113 else 0
            )
        AS UINT8
        )
FROM @source_table
THREADS 5
--THREADS SystemCpuCount()
;


(2) Writing to a new field in an existing table is almost as good, again much better than updating the existing field.

Write to new field in existing table: 102s. Code below.

--SQL9

VALUE @source_image TABLE = [Aspect image];

PRAGMA (
    'gpgpu' = 'aggressive', -- slightly faster than auto
    'gpgpu.fp' = '32'
    );

ALTER TABLE [Aspect]
    (
    ADD [Tile 2] TILE,
    ADD INDEX [X_Y_Tile_2_x] RTREE ([X], [Y], [Tile 2] TILEREDUCE 'INDEXED' TILESIZE (128,128) TILETYPE UINT8),
    ADD PROPERTY 'FieldCoordSystem.Tile 2' ComponentCoordSystem(@source_image),
    ADD PROPERTY 'FieldTileSize.Tile 2' '[ 128, 128 ]',
    ADD PROPERTY 'FieldTileType.Tile 2' 'uint8'
    );

VALUE @source_table TABLE = [Aspect];
    -- must follow (or be repeated after) ALTER TABLE
    -- if schema has changed

CREATE IMAGE [Classified image]
    (
    PROPERTY 'FieldTile' 'Tile 2',
    PROPERTY 'FieldX' 'X',
    PROPERTY 'FieldY' 'Y',
    PROPERTY 'Rect' '[ 0, 0, 73728, 106496 ]',
    PROPERTY 'StylePixel' '{ "Channel": 0, "Value": 6842472, "Values": { "1": 6842472, "2": 13421772 } }',
    PROPERTY 'Table' '[Aspect]'
    );

UPDATE
    (
    SELECT
        [mfd_id],
        [Tile 2],
        CASTV(
            [Tile] * 0 -- replace current values
            +
            -- when >= -68 and < 113
            1 *
                (1 - TileSign(TileMax(-68 - [Tile], 0)))
                    -- 1 if [Tile] >= -68 else 0
                *
                -TileSign(TileMin([Tile] - 113, 0))
                    -- 1 if [Tile] < 113 else 0
            +
            -- when < -68 or >= 113
            2 *
                (
                -TileSign(TileMin([Tile] - -68, 0))
                    -- 1 if [Tile] < -68 else 0
                +
                (1 - TileSign(TileMax(113 - [Tile], 0)))
                    -- 1 if [Tile] >= 113 else 0
                )
            AS UINT8
            ) AS [Tile']
    FROM @source_table
    THREADS 5
    --THREADS SystemCpuCount()
    )
SET [Tile 2] = [Tile']
;


I would say that with Radian storage technology, two hands are better than one.

These timings are all with CUDA enabled. I will start a separate thread on how to manage that, since it is unexpectedly fragile.

danb

#20-Sep-19 01:48

Writing a new table, GPGPU aggressive:

All threads:
2019-09-20 12:42:35 -- Query: [QryNewTbl] (67.133 sec)

5 threads:
2019-09-20 12:45:10 -- Query: [Qry NewTbl] (56.043 sec)



tjhb
#21-Sep-19 23:39

I like the cut of your jib there, Dan.

Since your Xeon E5-1650 has 6 physical cores (12 virtual), and is also pumping dual Maxwell GPUs, I would guess that performance for your machine might peak at 8 or 9 threads.

48s or less and your refurbished box would be twice as fast as mine.

danb

#22-Sep-19 21:49

Hi Tim,

I can't quite trim enough to reach 48 seconds ...

10 Threads
2019-09-23 08:38:29 -- Query: [Qry NewTbl] (55.402 sec)

9 Threads
2019-09-23 08:31:30 -- Query: [Qry NewTbl] (55.033 sec)

8 Threads
2019-09-23 08:34:17 -- Query: [Qry NewTbl] (54.918 sec)

7 Threads
2019-09-23 08:44:59 -- Query: [Qry NewTbl] (56.249 sec)



tjhb
#22-Sep-19 22:34

Thanks!

It is really interesting how similar those times are. Out of my depth but out of curiosity I would infer that Intel is just so good at allocating shared execution over hyperthreads that it almost doesn't matter whether we use only physical cores or all logical cores. There does seem to be a sweet spot in allocating enough threads to use all physical cores plus one or two to mop up the last crumbs of latency. That is, given that Manifold 9 is such an efficient data pump itself. (Using more threads would presumably benefit less efficient software more.)

It would/will be interesting to see whether the same holds for AMD processors. Do they rely more on sharing the load cleverly across cores, or on core independence? In the latter case performance might scale more linearly than with Intel.

Dimitri


#23-Sep-19 10:04

Thanks for sharing the data! I created an Aspect table and Aspect image using radius 3 Aspect, and then ran your query, the one where you reported 96 seconds. Here are some timings:

Win 10 x64, Ryzen 9 3900x (12 cores / 24 threads), GTX 1060 6GB, 1TB SSD, 64GB RAM

GPU Aggressive, 24 threads:
2019-09-23 10:59:50 -- Query: [Write new table] (77.724 sec)

GPU Auto, 24 threads:
2019-09-23 11:04:29 -- Query: [Write new table] (77.416 sec)

No GPU, 24 threads:
2019-09-23 11:07:25 -- Query: [Write new table] (109.233 sec)

My guess on this (it will be interesting to see what adamw has to say when he surfaces from the latest build...) is that the query is not math-intensive enough for either GPU or CPU to get a workout; instead the dominating factor is disk/SSD access for reads and writes. So variations in the number of threads and the speed of each thread may play a role, but if the bottleneck really is overall data access (however you parallelize it), then the results of varying the number of threads may not be as different as one might expect. But it's still fun to see Performance Monitor show 24 CPU cores 100% busy (image at the end). :-)

About the GPU card: there are variations of the GTX 1060 GPU. Cards with 3GB of memory use chips with 1152 CUDA cores and can be had for $170. With 6GB of memory you get 1280 CUDA cores at a price around $210. On-card memory is less important these days, since the most recent CUDA editions (now used by Manifold) can utilize main RAM for GPU memory. Manifold leverages that, but you must have (of course) the latest driver for your card to be sure you have up-to-date CUDA that can use main memory.

Therefore, I don't think the difference in cores between the 3GB and 6GB cards is that big a deal if you want to save $40 or $50 on your card, for GPU use with Manifold. But... on-card memory may be valuable for games and other applications that can use GPU.

Do they rely more on sharing the load cleverly across cores, or on core independence?

Don't know. That's an interesting question. With earlier AMD CPUs, like the FX, I've always run just the physical cores (8) with great results. For such an inexpensive chip, at $90, getting 8 physical cores is really super. The new, third generation chips are just wonderful in terms of high core count, high speed, and low power. They also provide very fast fourth generation PCIe interfaces, so they pair very well with GPU cards and fast, inexpensive M.2 storage. It's really dazzling what can be done at impressively low cost.

What is also interesting is that part of AMD's announcement that the 16 core Ryzen 9 3950x won't be out until November (to give them a chance to get caught up with demand and try to tamp down price gouging) is an announcement of third generation, 7nm process Epyc processors in November, with at least 24 cores.

Epyc is thought of as a server chip, but for parallel spatial data work it is a premium desktop chip as well. You can buy a second gen 64 core Epyc for $4500, which is not a bad price for 64 cores and 128 threads, and there are motherboards out there with two Epyc sockets for 128 cores / 256 threads on the desktop. It's true that Epyc versions designed for dual-socket use are more expensive, at about $6500 for the 64 core version, but still... $13,000 for 128 cores attached to the same local RAM, with 256 lanes of PCIe 4 connectivity, is a very good price compared to the cost of renting cloud time, where you can spend thousands of dollars on a single run that involves a few terabytes of data. The high speed links between Epyc and GPU made possible by so many very high speed connections are, by the way, one reason AMD has won 100% of the new supercomputer projects announced in 2019, like Oak Ridge's Frontier project.

All that is very exotic for now, but I think we're less than a year away from 16 or 32 cores on the desktop being totally routine at very accessible prices, like $500, for the CPU. The third gen EPYC chips use the same socket as second gen, so as data centers swap out to third gen to get the benefits of profoundly lower operating cost (7nm process means the chips take way less power), we'll have a flood of used second gen parts coming on the market. Probably could pick up a 64 core second gen Epyc for under $500. :-)

Screenshot: 24 cores busy running Tim's query...

Attachments:
timings.png

tjhb
#23-Sep-19 10:30

Thanks Dimitri, great discussion. BTW there seems to be a suggestion of new Threadripper CPUs coming also in November (or thereabouts).

It's great to see timings for the 3900X. I don't suppose you have time to test it also with just 12 threads (and, say, 14 threads) on the same task (throwing away the first pass to factor out the effect of Windows caching)?

That would be very interesting too.

Dimitri


#23-Sep-19 12:49

Additional timings:

Ryzen 9 3900x (12 cores / 24 threads), GTX 1060 6GB, 1TB SSD, 64GB RAM

Auto GPU, 24 threads, second of two runs:
2019-09-23 14:03:31 -- Query: [Write new table] (77.785 sec)

Auto GPU, 14 threads, second of two runs:
2019-09-23 14:09:06 -- Query: [Write new table] (75.002 sec)

Auto GPU, 12 threads, second of two runs:
2019-09-23 14:16:24 -- Query: [Write new table] (74.710 sec)

A difference of two or three seconds isn't significant. I think what we are seeing is that the task depends a lot on the speed of disk, and then after that is so dominated by GPU that it doesn't matter much how many CPU threads you launch (see additional comments below).

I tried running it on a conventional hard disk (no SSD) with weaker CPU and weak GPU and here is what I got:

Core i7 960, GT 1030 + GT 710 384+192=576 cores, 24GB RAM

GPU Auto, 8 threads, first of two runs:
2019-09-23 14:33:04 -- Query: [Write new table] (401.826 sec)

GPU Auto, 8 threads, second of two runs:
2019-09-23 14:41:26 -- Query: [Write new table] (386.271 sec)

Slower disk makes the biggest difference, I suspect, plus use of weaker GPU.

About that "doesn't matter much how many CPU threads": Going from 16 threads to 128 and more threads no doubt will change how one might want to optimize use of those threads for maximum performance. Manifold does OK out of the box with many threads, so if you launch more threads than can effectively be used on the margin (due to being storage bound, or whatever) nothing bad happens. But I expect that can be improved to get the most use possible out of very many threads.

Biospatial
#24-Sep-19 05:30

Thanks for this useful thread and test dataset. I've recently purchased a high-spec computer, but it probably isn't very good bang for buck with Manifold.

AMD Ryzen Threadripper 2950X 16-Core 32-thread Processor 3.5GHz

128GB installed RAM

2x Nvidia RTX 2080 Ti GPUs

4x 2TB SSDs

Utilizing 4 threads performed best @ 75 seconds.

I purchased this system for a different software package, knowing it doesn't perform as well as Intel processors there (have a look at Puget Systems' testing info), but I can run multiple instances of the software across all 30-odd threads with a small drop in performance for each instance, achieving overall improvements in processing/workflow efficiency.

I tried opening 4 instances of the same Manifold map project (under different names) and ran them all simultaneously, utilizing 4 threads each. Curiously, each instance took 4x as long to process, except for the last, which took 2x as long.

Attachments:
Instances.JPG
Threads.JPG

Dimitri


#24-Sep-19 09:17

I could run multiple instances of the software

That's not parallelism, which is running one instance of the software that decomposes a large task into parallel slices, executes each, and assembles the result. For example, a web server that executes a different web session for each client on a different core isn't running parallel. If one of those users has a huge job, it runs on only one core and is not split up to run on many cores in parallel.

As always, when analyzing the best use of parallelism and other resources, you have to look at the specifics of what is being done. What are you doing? Some things benefit from parallelism, and others don't.

Likewise for machine specs: there is a sweet spot between funding more CPU cores, more GPU cores, more RAM and faster "disk", but the balance point will differ greatly depending on what is being done. For example, if your tasks are disk-bound, spending more money on GPU will not help. If your work is very intensively mathematical in a way that works well with GPU, then more/faster GPU cores will help, etc.

tjhb
#20-Sep-19 02:08

Sorry, for “drawing” above please read “image” where necessary. I thought I had corrected my mistake but must have been out of time.

jsperr
#06-Oct-19 16:20

Writing to a new table

Lenovo D20 Windows 10 workstation -- 96GB ram -- Dual XEON x5650 2.67 GHz 6 core processors -- Nvidia 6GB GeForce Titan GTX in 64 bit floating point mode -- Dual 2 TB SAS HDD in Striped Raid Array.

THREADS 5 2019-10-06 09:38:03 -- Query: [Query] (127.104 sec)

THREADS 12 2019-10-06 09:40:33 -- Query: [Query] (106.721 sec)

THREADS SystemCpuCount() 2019-10-06 09:43:03 -- Query: [Query] (107.050 sec)

THREADS 8 2019-10-06 09:45:51 -- Query: [Query] (103.230 sec)

THREADS 5 2019-10-06 09:48:18 -- Query: [Query] ( 98.598 sec)

THREADS 6 2019-10-06 09:53:14 -- Query: [Query] ( 99.584 sec)

THREADS 4 2019-10-06 09:56:02 -- Query: [Query] (102.608 sec)

For the fastest query, the second run of Threads 5, the following was reported:

GPU load 70% -- upper limit, as the card is throttled in FP64 mode for thermal control?

CPU load 28 %

Ram used 29.2 GB

This is a workstation I purchased used on eBay for $475. To that I added the $150 SAS raid array, and recently had to upgrade from the Quadro 2000 card to the Titan GTX for another $125. Under $1000 has got me a solid Manifold 9 setup.

Special thanks to Tim for steering me to the GK110 GPU cards and all his work here -- I learn a ton of stuff reading the forum every day.

Dimitri


#07-Oct-19 04:47

That's a great report! Some comments:

THREADS 5 2019-10-06 09:38:03 -- Query: [Query] (127.104 sec)

and

THREADS 5 2019-10-06 09:48:18 -- Query: [Query] ( 98.598 sec)

The difference in times from the same configuration is probably from better caching in RAM by Windows. If you re-run a job and times go down like that, it's a sign of data moving into RAM where there is faster access than from disk.

This is a workstation I purchased used on eBay for $475.

That is so totally super-cool! 12 mighty processor cores and 96GB of RAM for $475!

What you should consider next, as prices come down, is an SSD. Fast SSDs are becoming very affordable.

adamw


#24-Sep-19 16:46

Some notes, replying to the whole thread.

First off, great to see the effect of a per-pixel IIf in M8 being achieved by a non-branching combination of sign / min / max / other arithmetic - this is clever, and the elimination of branches helps improve performance a lot, not just here.

On to the results - the expression in 9, however complex it might look in the query, is blazing fast on the GPGPU. So, yes, as Dimitri said, the performance is mostly in the storage + threads. The GPGPU still helps, but it is not dominant. 8 loses big because it has slower storage and fewer threads.

It is not clear why the performance improves when you add an extra term of [Tile] * 0 - we'll look into the GPGPU profile.

Inserting records into a new table is faster than updating values in an existing table because of data access patterns - sequential access in one case, (essentially) random access in another. Updating values of a new (blank?) field in an existing table likely was faster than updating values of a filled field in an existing table for a similar reason - there was just more random access in the second case than in the first.
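
The access-pattern effect is easy to reproduce outside Manifold. A rough sketch in Python, standard library only: it writes the same volume of data once sequentially (like inserting into a new table) and once at shuffled offsets in an existing file (like updating in place). OS write caching can narrow the gap on small volumes, so treat the numbers as indicative.

# Python
import os, random, time

BLOCK, COUNT = 64 * 1024, 2048          # 128 MB total
payload = os.urandom(BLOCK)

t = time.perf_counter()
with open('seq.bin', 'wb') as f:        # sequential: append in order
    for _ in range(COUNT):
        f.write(payload)
seq = time.perf_counter() - t

offsets = [i * BLOCK for i in range(COUNT)]
random.shuffle(offsets)
t = time.perf_counter()
with open('rnd.bin', 'wb') as f:        # random: pre-size, then overwrite out of order
    f.truncate(BLOCK * COUNT)
    for off in offsets:
        f.seek(off)
        f.write(payload)
rnd = time.perf_counter() - t

print(f'sequential {seq:.2f}s, random {rnd:.2f}s')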

We are planning to add query functions to work with tile masks which would make expressions like the one used here much simpler to write, too.

Great thread! :-)

tjhb
#16-Oct-19 04:52

For 9.0.169.9, one of the things I was most interested in was the use of a new toolset, mentioned by Adam here.

Something that bothered me above was that timings on relatively new AMD CPUs seemed too slow. I had wondered whether AVX or AVX2 optimizations were only enabled for Intel up to now, since older AMD CPUs did not support them.

Partly for selfish reasons (I am thinking of buying a new AMD CPU with more cores quite soon), I would be very interested to see how new AMD CPUs perform this task under 9.0.169.9.

For what it is worth, for the same task I get a small speedup on the same Intel hardware as before. Using 5 cores, 9.0.169.9 shaves 2 full seconds off my best time above (96s to 94s). That is not large but not trivial--and of course, much better than a slowdown.

In the meantime I had also experimented with 4 cores (the number of actual physical cores on my Intel CPUs), both with and without hyperthreading enabled in BIOS. In sum, hyperthreading helps just slightly with efficient code (naturally since it is designed to mop up slack in inefficient code), and helps best if we use only the same number of threads as there are physical cores. Adding more threads does not help--but turning off hyperthreading does not help either.

It may be very different on AMD.

It will be great if the new toolset has made AMD CPUs significantly faster.

adamw


#16-Oct-19 09:48

We are not using AVX / AVX2 / AVX-512 because not all CPUs on the operating systems that we support have these instruction sets. We also did not do any serious testing to determine performance benefits from using these instruction sets. It is pretty clear that we do have code that should benefit, but it is unclear to what extent.

We will try to do at least some testing, and if the results are promising enough, maybe we can ship an extra set of binaries with AVX support and select which binary to use at runtime.

jsperr
#22-Oct-19 12:47

I too ran my numbers again and picked up a few seconds (~ 2.5%) with the newest build.

With 9.0.169.9 my new best time is 96.170 seconds with 6 threads.

Prior fastest run with 9.0.169.8 was 98.598 seconds with 5 threads.
