New video - 48 Thread Parallel Watershed Areas
Dimitri


5,695 post(s)
#30-Dec-19 09:16

There's a new video on the Videos page. This one shows a 24-core / 48-thread AMD Threadripper doing a watershed computation in 22 seconds on a 1 GB terrain elevation raster, with Windows Task Manager open and showing utilization of all 48 threads.

The Threadripper in the video is a second generation 2970WX. Despite the complex NUMA (non-uniform memory access) architecture of the 2970WX, Manifold still makes effective use of many threads that have different performance characteristics, without threads interfering with each other and causing slowdowns.

artlembo

3,118 post(s)
#31-Dec-19 01:41

I can use my own DEM, but do you have a link to the data?

tjhb
9,045 post(s)
#01-Jan-20 02:56

I would enjoy a link to the data too, so I can do my own tests.

Even more, it would be great to see some comparative timings on an older system with four cores (four or eight threads), using the same data.

Likewise timings for AMD 3900X or 3950X, same data.

The pointer to NUMA architectures was an eye-opener. I should have known this before. Thanks, Dimitri.

Dimitri


5,695 post(s)
#01-Jan-20 09:11

"The pointer to NUMA architectures was an eye-opener,"

One task for current builds is to run on a NUMA processor without many threads causing a slowdown. I know that doesn't sound particularly ambitious, but it is a good first step, with optimizations for NUMA still ahead.

The video is mainly useful as a technical exploration of using many threads in a mixed performance environment where some threads have better RAM bandwidth and where disk access is a factor as well. That's a useful model for exploring what happens when you distribute parallel tasks across multiple machines, where varying RAM bandwidth in the Threadripper and disk access latency are useful proxies for varying machine performance and varying network latencies in local networks or clouds. There is a lot more work to be done in this area. But at least for now it provides a better setting for throwing many threads at those GIS tasks where, all other things (disk access, etc) being equal, more threads can help.

You can see the effect of faster disk access by running the same job on a 3900x with a faster SSD. The system in the video uses a pretty old X399 motherboard with a SATA III, 500 GB Samsung EVO SSD that provides a maximum of about 500 MB/s for sequential reads... not anywhere near as fast as a more modern M.2 NVMe SSD.

To compare this to a Ryzen 9 3900x with faster data access, I first re-ran the calculation on the Threadripper using the latest build, 9.0.170.2, which was published after that video. The 170.2 build has an improved raster shortest path algorithm, which resolves plateaus (there are many in that data set due to the lakes/reservoirs in the Montara region) faster and better. The timings, run three times to average out Windows cache effects:

AMD Threadripper 2970WX 48 threads X399 motherboard

(Samsung EVO SSD SATA III 550 MB/s read) 

2020-01-01 09:53:33  -- Transform (Watershed Areas): [Montara] (18.391 sec)

2020-01-01 09:54:58  -- Transform (Watershed Areas): [Montara] (18.516 sec)

2020-01-01 09:55:52  -- Transform (Watershed Areas): [Montara] (18.406 sec)

Compare that to a system that has ten times faster data access (PCIe 4.0 M.2 SSD), using a faster CPU with uniform (non-NUMA) memory access:

AMD Ryzen 9 3900x 24 threads  X570 motherboard

(PCIe 4.0, M.2 NVMe SSD, Sabrent Rocket 5000 MB/s read) 

2020-01-01 09:57:47  -- Transform (Watershed Areas): [Montara] (11.006 sec)

2020-01-01 09:58:35  -- Transform (Watershed Areas): [Montara] (10.753 sec)

2020-01-01 09:59:09  -- Transform (Watershed Areas): [Montara] (10.596 sec)

The Ryzen 9 test ran at 91% to 99% overall CPU utilization.

Despite having only half as many threads, the system with ten times faster data storage does the job in 11 seconds or less, compared to about 18.5 seconds on the Threadripper. Storage is not the only difference, though. The Ryzen 9 has a base clock about 22% faster than the older Threadripper. It has about 35% faster main memory in the X570 motherboard (DDR4-3600) compared to the older X399 (DDR4-2666). And it has a uniform memory architecture, compared to the NUMA architecture of the older Threadripper, where half the cores cannot reach memory directly but must go through another chiplet.
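
As a rough check on the comparison above, here is a minimal Python sketch (not part of Manifold) that averages the three runs reported for each machine and computes the overall speedup. The timings are copied from the log lines in this post; the variable names are only illustrative.

```python
from statistics import mean

# Transform (Watershed Areas) timings in seconds, copied from the logs above.
threadripper_2970wx = [18.391, 18.516, 18.406]  # 48 threads, SATA III SSD
ryzen_3900x = [11.006, 10.753, 10.596]          # 24 threads, PCIe 4.0 NVMe SSD

tr_mean = mean(threadripper_2970wx)
rz_mean = mean(ryzen_3900x)

# Speedup of the Ryzen 9 system relative to the Threadripper system.
speedup = tr_mean / rz_mean

print(f"Threadripper mean: {tr_mean:.3f} sec")
print(f"Ryzen 9 mean:      {rz_mean:.3f} sec")
print(f"Speedup:           {speedup:.2f}x")
```

The result is an overall speedup of about 1.7 times. This says nothing about how much of that comes from storage, clock speed, or memory, since all of them differ between the two systems.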

The data for the above can be downloaded from: http://www.manifoldgis.com/files/Montara_SRTM3.map (288,192 KB)

The 2970WX Threadripper in the video at $900 is a pretty good deal for 48 threads. The equivalent 3rd gen 48 thread Threadripper is out of stock, but typically priced at around $1500 to $1700.

Once they become available and prices come down, a Ryzen 9 3950x with 32 threads will be great, but for now the sweet spot I think is a 24 thread Ryzen 9 3900x for around $490. Add a 1 TB Sabrent Rocket PCIe 4.0 M.2 SSD for around $120 and you get many fast threads with 5000 MB/s "disk" data access.

There are obviously many moving parts in all this, but in general I think it's wonderful we're entering 2020 with opportunities to get faster/better pretty much everywhere you look.

Manycore processors are getting cheaper and more available, 1 TB super-fast 4th gen M.2 SSDs are getting absurdly cheap, NVIDIA keeps churning out better and better GPUs at a lower price per core, and there is steady progress within Manifold for using more CPU cores as well as more GPU cores.

There's also interplay between areas, where progress in one (like the new raster shortest path work) helps in others (faster plateau reckoning with watersheds). Connecting up the dots so GPU is used in mixed vector / raster applications like watersheds, visibility zones, etc., will also be helpful. It's nice to see that the original investment into 9 as a platform that facilitates reliable expansion and upgraded function is paying off.

StanNWT
151 post(s)
#01-Jan-20 12:59

Once I have my new system it will be interesting to try this data set with the same operations at home. Got a second ASUS Strix 2080Ti OC ordered on sale, so I'll have two of them in the 3970X build along with 128GB DDR4 3200MHz RAM. I'm buying the ASUS NVLink bridge for gaming but if it interferes with Manifold I'll disable SLI when using Manifold. Still waiting on parts...

tjhb
9,045 post(s)
#02-Jan-20 01:32

Here are some results on older Intel processors for comparison.

Manifold 9.0.170.2, Windows 10 1809. Both machines use a 1 TB Samsung EVO SATA3 drive for TEMP. Using default Min flow parameter of 100.

Intel Core i7-4790K (8 threads) desktop

2020-01-02 14:18:33  -- Open: D:\Downloads\Montara_SRTM3.map (0.016 sec)

2020-01-02 14:18:36     Render: [Montara] (0.110 sec)

2020-01-02 14:19:23  -- Transform (Watershed Areas): [Montara] (38.091 sec)

2020-01-02 14:20:09  -- Transform (Watershed Areas): [Montara] (38.668 sec)

2020-01-02 14:20:53  -- Transform (Watershed Areas): [Montara] (39.622 sec)

2020-01-02 14:21:49  -- Transform (Watershed Areas): [Montara] (38.840 sec)


Intel Core i7-4800MQ (8 threads) notebook

2020-01-02 14:16:56  -- Open: D:\Downloads\Montara_SRTM3.map (0.000 sec)

2020-01-02 14:16:58     Render: [Montara] (0.031 sec)

2020-01-02 14:17:57  -- Transform (Watershed Areas): [Montara] (48.609 sec)

2020-01-02 14:18:59  -- Transform (Watershed Areas): [Montara] (49.590 sec)

2020-01-02 14:20:00  -- Transform (Watershed Areas): [Montara] (49.972 sec)

2020-01-02 14:20:55  -- Transform (Watershed Areas): [Montara] (50.979 sec)

2020-01-02 14:22:19  -- Transform (Watershed Areas): [Montara] (48.851 sec)

(Is it worth upgrading the desktop machine to AMD 3950X? Greater than 3x speedup on a real-world task -> definitely.)
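
For anyone collecting timings from the log window, the averaging can be scripted. Below is a minimal Python sketch (not part of Manifold) that pulls the Transform timings out of pasted log text and averages them. The regular expression is an assumption based only on the log lines shown in this thread, and `transform_times` is a hypothetical helper name.

```python
import re
from statistics import mean

# Log text pasted from the i7-4790K desktop runs above.
LOG = """\
2020-01-02 14:18:33  -- Open: D:\\Downloads\\Montara_SRTM3.map (0.016 sec)
2020-01-02 14:18:36     Render: [Montara] (0.110 sec)
2020-01-02 14:19:23  -- Transform (Watershed Areas): [Montara] (38.091 sec)
2020-01-02 14:20:09  -- Transform (Watershed Areas): [Montara] (38.668 sec)
2020-01-02 14:20:53  -- Transform (Watershed Areas): [Montara] (39.622 sec)
2020-01-02 14:21:49  -- Transform (Watershed Areas): [Montara] (38.840 sec)
"""

# ASSUMPTION: Transform lines look like
#   "Transform (<name>): [<component>] (<seconds> sec)"
# as in the lines quoted in this thread.
PATTERN = re.compile(r"Transform \(.*?\): \[.*?\] \((\d+\.\d+) sec\)")


def transform_times(log_text):
    """Return the Transform timings (in seconds) found in log_text."""
    return [float(m.group(1)) for m in PATTERN.finditer(log_text)]


times = transform_times(LOG)
print(f"{len(times)} runs, mean {mean(times):.3f} sec")
```

Open and Render lines are skipped because they do not match the pattern, so only the four Transform runs are averaged.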

tjhb
9,045 post(s)
#02-Jan-20 01:58

@Dimitri,

It would also be cool to see timings using threads equal to the number of physical cores. I.e. 24 threads for the 2970WX, 12 threads for the 3900X.

On the i7-4790K I get ~74s specifying 4 physical cores (in two places in the SQL).

From past experience I was expecting a slightly faster result using 4 cores than using all 8 logical cores. That expectation now proves wrong.

Recent Manifold builds have changed the equation significantly.

hugh
170 post(s)
#02-Jan-20 02:30

similar result on my i7

12 threads:

Transform (watershed Areas): [Montara] (32.100 sec) flow 100

6 threads:

Transform (watershed Areas): [Montara] (38.035 sec) flow 100

on CPU i7-8700T, 6 cores / 12 threads

Lenovo ThinkStation P330 Tiny, 32 GB DDR4-2666 SDRAM, NVIDIA Quadro P1000 (Pascal)

Attachments:
All_threads_used_P330.jpg

hugh
170 post(s)
#02-Jan-20 03:39

Looking at Resource Monitor, it seems my 6-thread attempt still ran on all threads.

I ran the sql generated by the watershed areas transform with changes in this part:

--SQL9

PRAGMA ('progress.percentnext' = '40');

-- first place the thread count is set
VALUE @watersheds TABLE = CALL TileWatershedMakePar([Montara], false, ThreadConfig(6));

PRAGMA ('progress.percentnext' = '100', 'progress.percentinsertsource' = '80');

INSERT INTO [Montara Table Watershed Areas 3] (
  [Geom], [Stream], [Target], [OrderShreve], [OrderStrahler], [Value], [ValueSum]
) SELECT
  [Geom], [Stream], [Target], [OrderShreve], [OrderStrahler], [Value], [ValueSum]
-- second place the thread count is set; 100 is the Min flow value
FROM CALL TileWatershedAreasPar(@watersheds, 100, ThreadConfig(6));

Should I have done something different?

The difference in timing is the same, but Resource Monitor does not show the threads as so saturated, and the little box's fan was not working so loudly to cool it down.

Attachments:
ThreadConfig(6).jpg

tjhb
9,045 post(s)
#02-Jan-20 04:29

That is normal.

If you have (say) 6 physical cores and 12 logical cores, and 6 heavy threads of tasks, then Windows will try to distribute those 6 tasks evenly over the 12 available logical cores using internal switching.

Thus all 12 logical cores will be shown as saturated. In fact what is happening is that the 6 physical cores are saturated twice.

That is why it could previously be faster to use just 6, not 12 cores. (At least I think so.) Same physical saturation, less context switching.

But now... it seems Manifold has done something cleverer. The rule now seems to be: always allocate all logical cores if a task is parallel.

Dimitri


5,695 post(s)
#02-Jan-20 05:42

First, a small note: you're not running the same test as in the video, because the watersheds in the query are set to use a Min flow value of 100, not 500 as in the video. Instead of

TileWatershedAreasPar(@watersheds, 100, ThreadConfig(6))

it should be

TileWatershedAreasPar(@watersheds, 500, ThreadConfig(6))

I repeated the video test (min flow 500) on the Ryzen 9 3900x in three ways: the default query using ThreadConfig(SystemCpuCount()), 12 threads using ThreadConfig(12), and six threads using ThreadConfig(6).

With all 24 threads (default), task manager shows all virtual cores going straight to full utilization.

With 12 threads, although all virtual cores are active, only around 12 of them seem to be intensively utilized.

With 6 threads, although there is some action on more than 6 virtual cores, only about six of them are intensively utilized.

To use just physical cores and not virtual cores, you have to turn off hyperthreading in the BIOS, I think. Otherwise, if you say to use 12 threads on a 12 core processor, you can still end up using 12 virtual cores rather than sticking to 12 physical cores.

Attachments:
threads_06.png
threads_12.png
threads_24.png

tjhb
9,045 post(s)
#02-Jan-20 06:13

One important thing that Dimitri gets right which I got wrong is that we cannot specify the number of physical cores to use. We can only specify logical cores. Threads can migrate.

To limit logical cores to physical cores, we must disable hyperthreading in the BIOS, as Dimitri says.

Given a 12- or 16-core processor, I would disable hyperthreading for Manifold, at least for testing. I can test this using a 4-core processor of course.

adamw


8,842 post(s)
#08-Jan-20 11:14

Regarding physical cores vs logical cores.

In general, we wouldn't ever want to specify how many physical cores an operation should use. This is just too low-level a hint. It is much better to tell the operating system (directly or indirectly) how many logical cores we want, and let it decide how to translate that into physical cores. The operating system can do this better than anyone else because it has much more information about what the machine is busy with in general, information that exists only in kernel mode.

Now, for testing, there is a way to limit operation to specific physical cores. Task Manager - Details (assuming Windows 10 / Windows Server 2016+) - find the Manifold process, right click it - Set Affinity - pick cores to limit the process to.
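
The checkboxes in the Set Affinity dialog correspond to bits in a processor affinity mask, one bit per logical processor. The following Python sketch only does the arithmetic for such a mask and does not change any process settings. It assumes that the two logical processors of each physical core are numbered as adjacent pairs (0 and 1, 2 and 3, and so on). That numbering is an assumption and should be checked on the actual machine. Both function names are hypothetical.

```python
def affinity_mask(logical_cpus):
    """Return a bitmask with one bit set per logical processor index."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical processor index must be non-negative")
        mask |= 1 << cpu
    return mask


def one_per_physical_core(physical_cores):
    """Pick the first logical processor of each physical core.

    ASSUMPTION: sibling logical processors are numbered in adjacent
    pairs (0,1), (2,3), ... which may not hold on every system.
    """
    return list(range(0, 2 * physical_cores, 2))


# Example: a 6-core / 12-thread CPU, one logical processor per core.
cpus = one_per_physical_core(6)
mask = affinity_mask(cpus)
print(cpus, hex(mask))  # prints [0, 2, 4, 6, 8, 10] 0x555
```

Under that assumption, ticking CPU 0, 2, 4, 6, 8 and 10 in the dialog would keep the process on one logical processor per physical core for a test run.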

jsperr
71 post(s)
#01-Jan-20 23:19

Windows 10 Pro 64 bit, Manifold 9.0.170.2 full license, Lenovo D-20 4158 workstation, Dual 6 core Xeon X5650 @ 2.67 GHz (24 threads), 96 GB ECC RAM, NVIDIA GeForce GTX Titan in Float64 mode, 2 TB SAS Striped HDD Array.

Results of three successive runs:

2020-01-01 17:30:44 -- Transform (Watershed Areas): [Montara] (21.264 sec)

2020-01-01 17:31:23 Render: [Map] (1.204 sec)

2020-01-01 17:35:04 Render: [Map] (0.970 sec)

2020-01-01 17:36:09 -- Transform (Watershed Areas): [Montara] (20.241 sec)

2020-01-01 17:37:04 Render: [Map] (0.852 sec)

2020-01-01 17:37:34 -- Transform (Watershed Areas): [Montara] (20.363 sec)

Amazing -- I love going into the office and telling the GIS people in the Planning Department what they are missing by being locked into the ESRI environment.

I will try it with Viewer on my old Windows 7 Pro machine next.

tjhb
9,045 post(s)
#01-Jan-20 03:23

It's only now that I am fully appreciating the chasm of difference (not only logical but physical) between (a) parallelism amongst tasks and (b) parallelism within a task with data contention. Thanks again.

BTW if I could vote for a new forum feature, it would be for short programming essays on topics like these. For those of us who are interested.

In other words, how you have done it, in terms we can repeat.

Good marketing too.

Manifold User Community Use Agreement Copyright (C) 2007-2019 Manifold Software Limited. All rights reserved.