Calculating Mean Curvature on very large data sets
StanNWT
196 post(s)
#14-Mar-19 21:57

I'm wondering if this is just pushing things too far and it would be better to do it on a smaller DEM?

I have a 30m DEM that has dimensions of:

X: 327,603

Y: 115,217

Data type: float32

This means a 37,745,434,851 pixel raster @ 4 bytes of data per pixel.

If I want to do a mean curvature with a radius of 3, is this just going to blow up my computer?

Making some rough assumptions, with the 49 pixels in a 7x7 matrix that's 1,849,526,307,699 pixel reads.

At 4 bytes per pixel that would be 7,398,105,230,796 bytes. Am I wrong in assuming it would be working through roughly 6.73 TiB of data?

Just for reference the BigTIFF file of the DEM is around 78 GB, but the ArcGIS arc binary grid is 140.61 GB.

I have a dual Xeon Gold 6128 (3.4 GHz, 6 cores/CPU), 128 GB RAM, and a 2 TB Samsung Pro OS drive where the page file and temp folders are; there's roughly 1.4 TB free space on it. I also have dual Quadro P4000 cards. The page file is 320 GB in size, which isn't listed as free space; it's part of the used space. My storage drive, where the (map) file of the data is stored, has 12 TB free space; it's a direct attached storage (USB 3.0 RAID 6) box.

tjhb
10,094 post(s)
#15-Mar-19 03:13

You have great hardware! It could be improved by moving OS, software and page file to a smaller, cheaper, slower SSD, then using that fantastic Samsung 2TB SSD/NVMe drive for both TEMP and current working data (including this project), with the USB3 RAID6 left for static user data. But that is not the question.

The project is easily doable with Manifold 9; the total amount of data is in a sense immaterial because of the tiled image model and the Radian storage model.

The only question is how long it will take. Just for fun I would guess 25 minutes, but I have no idea if that is even within an order of magnitude. (I'm just not so well endowed.)

Please let us know the actual code, and the actual time? Try assigning, say, 18 threads.

StanNWT
196 post(s)
#15-Mar-19 04:29

Well, my storage will be moving to a QNAP TVS-1283T3, with 8 x 8 TB WD UltraStar HDs, 4 x 4 TB Samsung 860 Pro SSDs, and 2 x 1 TB M.2 SATA Samsung 860 Pro cache drives, since the TVS-1283T3 doesn't take NVMe drives. I'll be connecting via Thunderbolt 3 to my workstation. The theoretical max speed of the TVS-1283T3, uncompressed, according to documentation and benchmarks, is 1600 MB/s. As far as code goes, I'm just trying the default transform for "curvature - mean". I was thinking maybe 5 hrs; it had been running 1.5 hrs when I left work, but I have no idea if it will work. CPU utilization for Manifold was fluctuating between 6-20% across those 24 logical cores with default settings, RAM usage was 28 GB total system usage, and GPU usage was near zero. The dialog box said it was inserting records at about 600/s @ 7 MB/s, though it did start at several thousand per second. My RAID box in ATTO disk bench does give me 250 MB/s write and 225 MB/s read. Disk IOPS is usually up to 1100 but today was only around 110. The RAID has a 512 GB Samsung 840 mSATA drive for its cache.

I'll know tomorrow morning if it locked up, crashed, failed, completed or is still running. If you haven't guessed by now, I like testing big data, not that a DEM that size is all that big, but it's bigger than what most people have other than large LiDAR data sets.

tjhb
10,094 post(s)
#15-Mar-19 06:35

"Well my storage will be moving to..."

Possibly a waste of money. Just make sure your most important disk is as fast as possible. There is currently only one type of drive that meets that requirement, and it's neither external nor RAID. Don't be fooled by alphabet soup.

As far as the current task is concerned, clearly something is wrong, so the more precisely you can specify the task, the better we can all help find what it is.

StanNWT
196 post(s)
#15-Mar-19 14:44

The curvature mean transform finished and the raster layer was created successfully.

It took 8839.753 seconds to finish ~ 2 hrs 27 mins

Saving took 410.774 seconds ~ 6 mins 51 seconds.

The transform log function is below:

2019-03-14 17:16:25 -- Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8839.753 sec)

I just set the radius to 3 when I ran it.

BTW, I love that Manifold 9 can run 100% CPU utilization.

Dimitri


7,413 post(s)
#15-Mar-19 18:42

GPU usage was near zero

If you are looking at Task Manager performance, don't forget to switch the GPU reporting mode to compute, or Windows will not report the near-total use of GPU. See the speed demo video with 1280 cores to see how it's typical to get 90%+ saturation of GPU.

StanNWT
196 post(s)
#18-Mar-19 19:34

I'm wondering: if the data set is at least an order of magnitude bigger than the available VRAM on an Nvidia graphics card, or just bigger by some factor, would Manifold 9 choose to do more processing on the CPUs and via system RAM as opposed to the GPU? I'm running the curvature profile transform and there is 0% GPU usage and only between 6% - 20% CPU usage. This data set is very large. I have previously sent tjhb a Dropbox download link for the big DEM; it's about 152 GB in Manifold. The transform will likely complete, it's just not maxing out CPU or GPU at the (inserting records / scanning data) stage so far.

tjhb
10,094 post(s)
#18-Mar-19 20:55

I have previously sent tjhb a dropbox download link for the big DEM, it's about 152 GB in Manifold.

I'm glad it's this data, cool. But the download never worked for me--we tried several times and gave up.

If it's still there on Dropbox then I will try again.

StanNWT
196 post(s)
#18-Mar-19 21:19

Yes it's still there. I suppose I could try to upload it again and send you a new link? But try the download of the current file first.

Personally I'd like all of Canada and Alaska as a single 30 m DEM data set. It would be interesting to see whether a 7x7 or 9x9 would run on that. I've yet to run out of RAM so far. These are likely less memory intensive than that decompose-to-points exercise I ran over a year ago on the older workstation.

tjhb
10,094 post(s)
#18-Mar-19 21:36

I need a new link please. I think I have misfiled the original.

StanNWT
196 post(s)
#19-Mar-19 17:27

I've resent the same link Tim.

tjhb
10,094 post(s)
#19-Mar-19 22:57

Thanks. I have the data downloaded and open. (By the way, it took well over an hour for the .mxb file to be opened, decompressed and saved to .map format. Maybe plain .map format, compressed with 7-Zip, would be better for distributing enormous files.)

StanNWT
196 post(s)
#20-Mar-19 01:45

I myself was surprised when I found that 7-Zip files were larger than the .mxb files with this and similar large map files that I've made. I haven't determined the unpacking time versus the unzipping time.

tjhb
10,094 post(s)
#20-Mar-19 02:01

Seems worth a test... one day.

I'll have fun repeating your curvature test on between one and three machines, with comparative timings, along with screenshots showing GPU activity.

StanNWT
196 post(s)
#20-Mar-19 23:05

Any luck using that data at all? I'm not specifically talking about the mean curvature or profile curvature.

tjhb
10,094 post(s)
#20-Mar-19 23:59

Agreed. But the important thing is to ensure, and prove, that GPGPU is working for tasks like this. On my machine(s) and yours. And if not, to follow it up carefully.

StanNWT
196 post(s)
#21-Mar-19 00:19

You're right. But in any event, being able to preview 37.7 billion 32-bit float pixels with a 7x7 matrix in 2.5 hrs is impressive. And ArcGIS Pro doing a 3x3 high pass filter is still single threaded. I watched it running, and there are likely a lot of still-single-threaded tools in ArcGIS Pro at this stage. And forget about using the Python package manager in Pro to get CUDA working on many or most Python packages.

StanNWT
196 post(s)
#21-Mar-19 07:07

I meant to say process not preview 37.7 billion pixels.

StanNWT
196 post(s)
#25-Mar-19 20:18

Any luck with trying to run the same operation?

tjhb
10,094 post(s)
#26-Mar-19 06:02

Yes. Several tests, with full notes.

At default settings, M9 saturates my fastest GPU.

Like Dimitri, I suspect you have been measuring GPGPU incorrectly, but that is easily fixed.

Details tomorrow.

StanNWT
196 post(s)
#27-Mar-19 15:17

Hi Tim,

I found the drop-down arrow to switch to "Compute_0" and "Compute_1". Since I have two Quadro P4000s, I assume that's why I have a "_0" and a "_1".

This morning I tried the "limit low" transform to fill in some gaps in that DEM; you'll notice them when zoomed into some of the offshore areas, where squared-off bits representing the original 1 degree x 1 degree tile edges are still present. In the preview I see no compute activity, but in the first 10 seconds after "Add Component" GPU usage went up to 90%; now I only get about 5% activity every 10 seconds, at a completely regular interval.

I've attached a screen grab of the GPU usage.

I've also attached a screen grab of the DEM that shows the gap I was mentioning offshore which also happens to show the progress dialog box.

Attachments:
Manifold_GPU_Compute_Usage_on_Limit_Low_Mar27_2019.jpg
Manifold_Processing_Limit_Low_Transform_Mar27_2019.jpg

Dimitri


7,413 post(s)
#27-Mar-19 17:49

Limit Low is not Curvature, Mean. There's no computation in Limit Low... it's just moving bits between storage. Nothing for a GPU to do there.

How does Curvature, Mean work for you and GPU?

StanNWT
196 post(s)
#27-Mar-19 18:48

I'm going to reproduce the curvature mean and curvature profile runs to check now that I have the compute_0 and compute_1 drop downs set up.

However, I did notice brief times when compute_0 was showing 95%+ usage for 30 seconds or so at the start and near the end of the transform processing.

It did take about 2 hrs 15 minutes to do the "limit low" on that same DEM.

Just an fyi the log window reports:

2019-03-27 11:21:10 -- Transform (Limit Low): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8180.886 sec)

Seems like, given the same sized DEM, curvature mean, curvature profile and limit low all take 2 hrs 15 mins to 2 hrs 27 mins.

Attachments:
Manifold_GPU_Compute_Usage_on_Limit_Low_v2_Mar27_2019.jpg

tjhb
10,094 post(s)
#27-Mar-19 20:30

There's no computation in Limit Low... it's just moving bits between storage. Nothing for a GPU to do there.

The compute work is trivial, and most of the work is in priming the GPU with data, but it does look as if GPGPU is being used all the same, and from Stan's timings this seems right. (Right in the sense that the same task would take several times longer on CPU only.)

I gather that CUDA PTX has built-in Max and Min functions for all supported data types, so a limit/clamp operation doesn't need a conditional.

I'm posting my notes on using Curvature, Mean with Stan's data below.

StanNWT
196 post(s)
#27-Mar-19 21:48

Hi Tim,

Were you going to post time to process on your various computers, or is that still running?

I'm going to re-run the mean curvature and profile curvature on that DEM and see if I can see GPU compute_0 running, as well as compute_1. I have a funny feeling that only compute_0 will show any processing; I'm hoping not, though, as I'd want to use both GPUs for processing. Note: I have my Nvidia Quadro settings set to "Use for graphics and compute needs" for both GPUs, under "Manage GPU Utilization". It would be awesome if there was a Manifold entry under the Program Settings of the Quadro 3D settings. I know Nvidia would charge a fortune to test and verify their drivers against Manifold 9 in general, let alone various builds. However, my default settings are to use CUDA on all GPUs. It is interesting to note that my PhysX setting is set to use the GPU that isn't connected to the displays, and the GPU that is doing the calculations in Manifold is the GPU that is connected to the displays. I could set PhysX to be on the CPU, but I highly doubt there is any relation between PhysX settings and how Manifold uses CUDA. Just an observation, and maybe something Tim can test as well since he has a Quadro card in a laptop.

tjhb
10,094 post(s)
#27-Mar-19 20:50

I found the drop down arrow to switch to "Compute_0" and "Compute_1", since I have two Quadro P4000s I assume that's why I have a "_0" and "_1".

That's it... with variations.

My fastest GPU is the GTX TITAN (more on that below), installed as a single GPU. On that machine, I also have counters for both Compute_0 and Compute_1. Only Compute_0 shows any GPGPU activity (naturally enough).

On the other hand, this laptop has a tiny Quadro K1100M GPU in conjunction with Intel Graphics 4600. Here the drop-down list of "engines" (as Microsoft mysteriously calls them) does not include Compute_0, but only Compute_1, which again shows no activity when GPGPU is in use. Instead, GPGPU is shown for the "engine" named 3D. This could be a hybrid graphics issue, or a Windows or driver bug, or who knows what. (For interest, the Intel graphics adapter, listed as GPU 0, has "engines" named 3D and Engine 5, but nothing beginning with Compute.)

So we may have to "shop around" amongst the available "engines" named in Task Manager to monitor GPGPU activity correctly.

StanNWT
196 post(s)
#27-Mar-19 21:50

I'd love to have two RTX Titans on a Threadripper 2950X with 128 GB RAM, but hey...

Maybe you can try those variations of PhysX to see if there's a difference, see my post above. I know it's likely to have nothing to do with it, but who knows.

dyalsjas
157 post(s)
#27-Mar-19 22:05

I could do this with a GTX 1080ti.

I don't have any Titans.

tjhb
10,094 post(s)
#27-Mar-19 22:58

Here are my results and notes for processing the same data as in Stan's original post on my fastest GPU. [Coming in a separate post.]

The GPU is a GeForce GTX TITAN (GK110), which is an interesting card, for the following reasons. (Here comes a long digression...)

Usually the CUDA capability of an NVIDIA adapter is described by its generation (Curie, Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, Turing...) and its associated "compute capability" (the range of CUDA programming features encoded in its hardware), plus its number of "CUDA cores". That last number involves an important simplification, in two respects.

First, CUDA cores are not individual processors. They are grouped into streaming multiprocessors (variously called SM, SMX or SMM depending on the generation), each of which is an individual processor, responsible for scheduling parallel work on its own cores. For example, the GTX TITAN has 14 Kepler SMX units, each driving 192 32-bit CUDA cores, for a total of 2688 32-bit cores. (Naturally, the 2688 number is more useful for marketing than the 14 or the 192!) The laptop I am writing this on has a Quadro K1100M, which has 2 Kepler SMX units, again with 192 32-bit cores each, for a total of 384. So in one sense, the TITAN card is about 7 times more powerful than the Quadro, having 14 streaming multiprocessors to schedule simultaneous work to CUDA cores, rather than just 2.

Secondly, the number of "marketing" cores given is for 32-bit processing. This comes about because the "day job" of most GPUs is to process graphics. When they are not doing compute work, these same cores are used for processing pixels or triangular facets for display, and for that purpose 32-bit representation is absolutely enough. But for processing data, mathematical analysis? Not so much.

So since the Tesla generation of NVIDIA cards, each streaming multiprocessor also contains some dedicated 64-bit processing units, designed especially for compute functions.

How many is "some"? Generally, from the Kepler generation onwards, the number of 64-bit units is either 1/24 or 1/32 the number of 32-bit units on (nearly) all NVIDIA graphics cards, whether consumer (GeForce) or professional (Quadro). On cards designed primarily for compute work (mainly Tesla* cards), the ratio is usually 1/3 and occasionally 1/2. (*That is Tesla the marketing brand, not to be confused with the Tesla chip generation. Thanks NVIDIA.) That is a huge difference in capability, and naturally there is a corresponding difference in price.

Well, why does this matter to us? Because by default Manifold 9 performs GPGPU work in 64-bit mode not 32-bit mode. As it should: it is doing maths. (We can switch to use 32-bit mode by a PRAGMA directive, although in initial testing this currently seems to disable GPGPU entirely. A question for another day.)
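For reference, the directive in question is a one-line PRAGMA placed at the top of the query text. A minimal sketch (the value strings '32' and '64' are my recollection, not verified against the documentation):

--SQL9
-- ask for 32-bit floating-point GPGPU math for the statements that follow
-- (value strings assumed; check the docs before relying on them)
PRAGMA ('gpgpu.fp' = '32');
-- the default, 64-bit math, would be:
-- PRAGMA ('gpgpu.fp' = '64');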

So the effective number of CUDA compute units available to Manifold 9 is (normally) not the number of "marketing" cores, but the number of 64-bit units sprinkled among them. Unfortunately, for a given NVIDIA adapter it can be quite difficult to find this smaller number out.

The K1100M in this laptop has 384 32-bit units, and among them 16 FP64 units (8 per SMX), a ratio of 1/24.

A GeForce GTX 1060 Pascal (6 GB) card has 1280 32-bit units and 40 64-bit units (1/32).

[Added for dyalsjas:] A GTX 1080Ti has 3584 32-bit cores and 112 64-bit cores (1/32).

Stan's two Quadro P4000 Pascal cards each have 1792 32-bit cores and 56 64-bit cores (3584 and 112 in total), again 1/32.

The GTX TITAN was marketed as a hybrid card, which can be optimized either for graphics or for compute. It has 2688 FP32 units, and by default uses 112 FP64 units, a ratio of 1/24. But it can be switched (by a driver setting) to a ratio of 1/3, to utilise all 896 64-bit units it has on board. (In this case the surrounding 32-bit units are run slightly more slowly to keep the card cool.)

There were two other TITAN cards in the same generation (Kepler) with a switchable 64-bit ratio: the TITAN Black and the TITAN Z. Since then, with only three exceptions AFAIK, all GeForce cards (including TITAN models) and all Quadro cards have had a fixed ratio of 1/32. Even the latest RTX and TITAN RTX cards use 1/32. [NB Stan, above.] The three exceptions are the TITAN V ($US 3000), and the Quadro GP100 ($4500) and GV100 ($9000), all of which have a ratio of 1/2. Ideal Manifold 9 cards, for those who can afford them!

So after that massive digression, how does the Kepler TITAN stack up against Stan's pair of Quadro P4000s on the same data? And what difference do 896 64-bit units make, compared with 112?

StanNWT
196 post(s)
#28-Mar-19 01:00

Hi Tim,

I've been fairly well versed in the differences between single and double precision floating point on GeForce vs. Quadro vs. Tesla cards since I first used Manifold 8 in 2009. My biggest hope for RTX Titan cards is the 24 GB of RAM on them, but given Tim's observation of how little RAM Manifold appears to be using it might not make a difference. One thing the Quadro P4000 has going for it is that each card is single slot, allowing more flexibility in how and where you install other PCI-Express cards depending on the motherboard layout. Plus each card only uses 105 W, which is about the lowest wattage per CUDA core per slot I think you can get. So for a workstation with tested driver optimization for things like ArcGIS, Adobe and many other apps, it is I think the best card; maybe not for the money, as an RTX 2080 would be in the same price range, but that's a 2-3 slot card with much higher power usage.

adamw


10,447 post(s)
#01-Apr-19 15:48

(Finally found the time to read through the thread, it's a great one.)

Regarding this:

(We can switch to use 32-bit mode by a PRAGMA directive, although in initial testing this currently seems to disable GPGPU entirely. A question for another day.)

We should use GPGPU for both 64-bit and 32-bit floating-point math.

Is the above suspicion that GPU does not seem to get engaged for 32-bit math based on observations of GPU activity? The activity of GPU with 32-bit math could easily be lower than with 64-bit - exactly because GPU can do way more 32-bit math than 64-bit, so if computation times are dominated by transporting the data around and 64-bit math does not max out the GPU, then 32-bit math won't max it either and will load it noticeably less in relative terms.

(In any case, I just ran a test transform in both modes and GPGPU seems to get engaged in both cases, with the activity being lower for the 32-bit math. So, on first sight, this seems to work as expected.)

StanNWT
196 post(s)
#01-Apr-19 16:05

Hi Adam,

I'm not sure; I've not tried the PRAGMA statement to force single precision on the GPU. However, it seems that I might be the only one of the respondents to this thread that is using two GPUs, let alone two Quadro cards, let alone a dual Xeon setup (Gold 6128). Do you think there's anything about a system setup like mine that is interfering with the GPU getting maxed out for the entire run? My run time is 2.5 hrs, Tim's is 40 minutes; one would hope a more souped-up system would get the work done comparably fast if not faster, or am I just stewing in my own frustration? Sorry for the food puns.

adamw


10,447 post(s)
#01-Apr-19 17:40

Tim's 40 minutes are on the test where he uses his card's rare ability to use a humongous number of fp64 cores; we should compare to his other test that uses a more normal number of fp64 cores and completes in 51 minutes. Now, why on his system the test completes in 51 minutes whereas on your system it completes in 2.5 hours is a little puzzling, yes.

I doubt it's related to your system having two cards - or, more precisely, it might be related to that, but if it is related to the system having two cards, that's almost certainly on the driver (so, might perhaps be helped by using a later version sometime when they notice the issue).

I think the difference is related to the performance of the disk subsystem. You report that after you upgraded it, saves got a significant speed-up but the transform got a much smaller one; that's likely related to different access patterns. It might help to run the transform on both your and Tim's systems without GPU, just on CPU, to see how the numbers compare. I think I said that we might have a couple of things planned for the future which will make better use of system configurations like yours, too - with access patterns straightened to be more like those during save.

StanNWT
196 post(s)
#01-Apr-19 18:15

The disk subsystem wasn't upgraded. I simply switched where the (map) files were stored for the test. Normally (map) files are stored on a USB 3.0 Drobo 5D, economical dual-redundancy storage that gets me 250 MB/s write and 225 MB/s read. The <C:> drive is a Samsung 960 Pro 2 TB drive that, when initially set up as the <C:> drive, was getting 3000 MB/s read and 2500 MB/s write. Obviously the disk IO is vastly different as well: a max of 1600 IOPS for the Drobo and possibly 90,000+ IOPS for the 960 Pro. I was also getting actual write speeds of 250 MB/s max in Task Manager for the Drobo when saving the file while it was located there, and upwards of 2 GB/s when the (map) file was located and saved on the Samsung 960 Pro drive. Obviously NVMe drives are preferable, but there's no RAID option for me there, and I do prefer fault tolerance for GIS and remote sensing data.

tjhb
10,094 post(s)
#28-Mar-19 00:21

OK the timings! Both tests using GeForce GTX TITAN GPU, Kepler generation, compute capability 3.5, 14 SMX, 6 GB graphics RAM; installed on an ASUS Z97-based system, Intel i7-4790K 4 GHz, 32 GB system RAM. Windows pagefile is with OS on SSD1 (256 GB). TEMP and the source .map project are both on SSD2 (480 GB). Other (spinning) drives are not used in the tests.

I had Task Manager set up like this (this is during data copying, prior to testing--no GPGPU load yet).

Here is the test query, as written by the Curvature, Mean transform template, with one manual adjustment: the number of threads is reduced from SystemCpuCount() (in this case 8) to 6 [and with the 'FieldCoordSystem.Tile' string truncated to '...'].

--SQL9

CREATE TABLE [USGS_NED_Fused_DEM Curvature, Mean] (
  [X] INT32,
  [Y] INT32,
  [Tile] TILE,
  [mfd_id] INT64,
  INDEX [mfd_id_x] BTREE ([mfd_id]),
  INDEX [X_Y_Tile_x] RTREE ([X], [Y], [Tile] TILESIZE (128, 128) TILETYPE FLOAT32),
  PROPERTY 'FieldCoordSystem.Tile' '...',
  PROPERTY 'FieldTileSize.Tile' '[ 128, 128 ]',
  PROPERTY 'FieldTileType.Tile' 'float32'
);

CREATE IMAGE [USGS_NED_Fused_DEM Curvature, Mean Image] (
  PROPERTY 'Table' '[USGS_NED_Fused_DEM Curvature, Mean]',
  PROPERTY 'FieldTile' 'Tile',
  PROPERTY 'FieldX' 'X',
  PROPERTY 'FieldY' 'Y',
  PROPERTY 'Rect' '[ 0, 0, 327603, 115217 ]'
);

PRAGMA ('progress.percentnext' = '100');

VALUE @scales FLOAT64X3 = ComponentCoordSystemScaleXYZ([USGS_NED_Fused_DEM]);

INSERT INTO [USGS_NED_Fused_DEM Curvature, Mean] (
  [X], [Y],
  [Tile]
)
SELECT
  [X], [Y],
  CASTV ((TileRemoveBorder(TileCurvMean(TileCutBorder([USGS_NED_Fused_DEM], VectorMakeX2([X], [Y]), 3), 3, @scales), 3)) AS FLOAT32)
FROM [USGS_NED_Fused_DEM]
--THREADS SystemCpuCount()
THREADS 6
;

TABLE CALL TileUpdatePyramids([USGS_NED_Fused_DEM Curvature, Mean Image]);

(1) The first test is using all 896 64-bit CUDA cores (ratio 1/3), with the GPU set up like this. (See post above for an explanation.)

GPGPU usage went straight to 98-99%. CPU usage went to 75% (meaning that 6 of 8 virtual cores were fully saturated; all 6 allocated threads were fully supplied with work). Those figures both stayed close to constant throughout the test, until the TileUpdatePyramids phase at the end, when GPGPU went to 0% and CPU usage dropped to about 28%.

Graphics RAM usage was constant at 6 GB (that is, all of it). System RAM usage was initially about 6 GB (pinned/shared?), rising to 30 GB during the TileUpdatePyramids phase.

SSD2 was normally about 24% busy, rising to 75% during the TileUpdatePyramids phase. The project used a maximum of 151 GB of TEMP space. There was ~0% activity on SDD1 (so no significant pagefile use).

Here is a typical progress shot.

And during the TileUpdatePyramids phase.

Total processing time using all 896 cores was 2566.518 sec, 42mn 47s.

(2) The second test used 112 64-bit cores (ratio 1/24), otherwise the same. The statements about utilization for the first test also apply with no substantial differences.

Total processing time using 112 cores was 3076.059 sec, 51mn 16s.

This compares with Stan's total time of 8839.753 sec, 2h 27mn, using 112 64-bit cores (if both cards were used) or 56 cores (if not).

So my first test was 3.44x faster than Stan's test, the second 2.87x faster.

Of my two tests, the first (896 cores) was 1.20x faster than the second (112 cores).

Tentative conclusions to come in a separate post.


One more thing: I did set up a test on this laptop. Again GPGPU usage went straight to ~98%, and CPU use was pegged at ~75%. Only, the test turned the laptop into a hairdryer, and I cancelled the test after a few minutes rather than continue to torture it.

I will make a separate test on a Geforce GTX 1060 in the next few days.

Attachments:
GTX TITAN settings.png
progress 1.png
progress 2.png
Task Manager GPGPU.png

tjhb
10,094 post(s)
#28-Mar-19 00:42

Correction:

Graphics RAM usage was constant at 0.6 GB, not 6 GB ("that is, all of it") as I wrote.

I misread the graph. Why is so little GPU memory used? Buh. Well, if all CUDA cores were saturated (if this is roughly what Compute_0 98% means) then Manifold clearly knows what it is doing.

BTW notice the somewhat unhelpful reading near the bottom of Task Manager window showing "Utilisation 1%", although Compute_0 is at 98%. Evidently "Utilisation" does not include compute. Misleading by default.

StanNWT
196 post(s)
#28-Mar-19 01:12

Hi Tim,

I wonder if having the data type for the DEM being float32 would be better performing than if the data type was float64, given the performance penalty of 32-bit vs. 64-bit on Nvidia cards depending on Quadro, GeForce, Tesla or GP/GV editions of those cards. I'd love a Titan V due to its better double precision performance but it has half the RAM of the Titan RTX.

I think the show stopper compared to your runs is due to the storage medium I'm reading the data from. When it gets moved over to the QNAP RAID it will improve, I'm sure. Disk IO on my current storage usually tops out at 1100 IOPS, with 225 MB/s reads and 250 MB/s writes in ATTO disk bench. When reading or writing with Manifold it's usually about 117 IOPS; saving is usually around 10 MB/s, but I've seen 270 MB/s while the transform dialog showed the progress of whatever I'm running. My new storage should be 8x faster for reads and writes; not sure about the disk IO levels.

I really appreciate your help testing this Tim. If nothing else you have a very large DEM to test things with.

tjhb
10,094 post(s)
#28-Mar-19 01:41

I wonder if having the data type for the DEM being float32 would be better performing than if the data type was float64, given the performance penalty of 32-bit vs. 64-bit on Nvidia cards depending on Quadro, GeForce, Tesla or GP/GV editions of those cards.

I'm not sure I follow this exactly, but this might be a useful comment (maybe not). By default Manifold 9 processes all data on GPGPU as FP64, even FP32 as in the case of your massive image. So it only uses the FP64 cores, however many or few there are.

We can use PRAGMA to set 'gpgpu.fp', but in my limited testing the effect of that is not what I had expected, which was to force processing onto the FP32 CUDA cores instead, but rather to pull processing back to the CPU. If so it could be for a good reason, but in any case I need to check this.

I wouldn't regard the FP32 vs FP64 question as involving a performance penalty. We just need to be aware of the facts: that Manifold 9 uses (strongly prefers) FP64 processing, on GPU as on CPU; that the number of FP64 cores on an NVIDIA card is significantly smaller than the number of 32-bit cores which are advertised; and that the difference (the FP64:FP32 ratio) varies greatly between different cards.

Having said that, the difference in my tests between 896 and 112 cores was not 896/112, nothing like it. It was only 1.20x. That is useful, but it's not a reason to spend an extra zillion dollars in my opinion.

The great thing is that Manifold can saturate whatever you've got--both GPU and CPU, together, and constantly. I think that's amazing. I hadn't expected nearly such a good result.

I'd love a Titan V due to its better double precision performance but it has half the RAM of the Titan RTX.

The current test suggests that for GPGPU with Manifold 9, the amount of on-board graphics RAM is of little importance. I was surprised by that (after I corrected my dumb mistake). At least for this test, it basically doesn't matter at all. Would it be more important for a more complex function, e.g. using lots of nested functions? Maybe--we should test that.

Graphics RAM matters for other things, of course. E.g. for editing large images in Photoshop (which also uses CUDA and/or OpenCL), and I think for CAD. And for running multiple and/or very large monitors.

I think the show stopper compared to your runs is due to the storage medium I'm reading the data from.

I agree.

When it gets moved over to the QNAP RAID it will improve I'm sure.

Not so sure about that! Other things being equal, internal storage is always much faster than external storage. (No controller can beat the chipset, since everything has to go through that eventually anyway.) Before you spend the money, why don't you run a trial with the .map file on your drive C:, the lovely large SSD? See how much that accelerates the performance.

tjhb
10,094 post(s)
#28-Mar-19 02:15

The current test suggests that for GPGPU with Manifold 9, the amount of on-board graphics RAM is of little importance.

I see you've already covered that above.

Thinking about it a bit more, it's possible that the amount of memory used is not a Manifold choice, but a CUDA driver choice, under the Unified Memory model introduced from Kepler onwards, which migrates data between system and graphics RAM on demand. Manifold might still do it the hard way when it really matters, but I bet they would rely on the built-in memory management when it just works.

The evidence that matters is the constant GPGPU saturation. In this test the GPU is getting all the data it can possibly eat.

StanNWT
196 post(s)
#28-Mar-19 18:29

Hi Tim,

I've already got the QNAP box. I've had it for many months; my normal everyday workload just hasn't allowed me to spend the time transferring 10 TB of data from my two current external RAIDs to the new Thunderbolt 3 connected QNAP box.

I'll copy that DEM project file (.map) onto the Samsung 960 Pro 2 TB drive; I've got over 1 TB free space on it. I'll try running the curvature mean and curvature profile from there and monitor the compute_0 and compute_1 charts to see if they max out, or at least rise above marginal usage, and whether either sustains that throughout the transform's progress.

tjhb
10,094 post(s)
#28-Mar-19 20:22

Thanks, will keep watching.

You could do a manual TRIM before the tests, just to make sure the SSD doesn't do one automatically (slowing processing to a crawl) partway through.

StanNWT
196 post(s)
#29-Mar-19 16:05

Morning Tim,

I'm running the mean curvature on the big DEM project file. I chose to open a blank Manifold project; I'm using 9.0.168.10. I added the big DEM project file as a data source, not copying it into the new project, just processing it so that the results land in the new project to keep it smaller. The attached screen grab is a mosaic of several separate grabs: the read speed from my Drobo 5D RAID as I copy to the Samsung 960 Pro 2TB; the write speed of the 960 Pro; the GPU load at the start of the transform processing, with a temporally coincident grab of the Manifold window showing the records-per-second read rate and MB/s; a grab of the end of the heavy GPU load, with a temporally coincident grab of the Manifold project window; and finally a temporally coincident <C:> drive read/write rate and CPU usage percentage. You can see that the GPU only sustains a load for a minute, then settles into that very spiky, saw-toothed utilization illustrated in the grab of the heavy GPU utilization. I was glad to see it max out, but disappointed that it didn't sustain it.

I also attached a screen grab of the Nvidia control panel illustrating that I don't have an option for double precision like you do with your Geforce drivers and your Titan card. I'm assuming most Geforce drivers at least for Titan cards would show this if there's double precision compute performance?

Attachments:
Composite_Screen_Grabs_Processing_Mean_Curvature_from_Samsung_960_Pro_2TB _v2_Mar29_2019.jpg
Quadro_P4000_Manage_3D_Settings_Mar29_2019.jpg

StanNWT
196 post(s)
#29-Mar-19 16:24

Attached is the screen grab of the Manage GPU utilization section under the workstation settings of the Nvidia Control Panel as well as the real time Nvidia GPU utilization graph overlain on the task manager. You can see that the spikes are coincident.

Attachments:
Quadro_P4000_GPU_Utilization_Graph_Overlain_On_Task_Manager_Mar29_2019.jpg
Quadro_P4000_Manage_GPU_Utilization_Settings_Mar29_2019.jpg

StanNWT
196 post(s)
#29-Mar-19 18:24

OK, it finished. I've attached the screen grabs taken during processing illustrating the CPU usage, GPU usage, the <C:> drive performance, the time to create the curvature mean with a radius of 3, and the time to save the file.

You can see that near the end the CPU and GPU usage maxed out, but during the mid-section of processing the small spikes in GPU performance were reflected in the CPU performance and the <C:> read/write speed. Fortunately the Nvidia GPU monitoring graph that can be activated from the Nvidia Control Panel does seem to be a better indicator of GPU performance than the built-in Windows tracking of it in the Task Manager's "Compute_0" panel option.

Render: [Map] (1.264 sec)

2019-03-29 11:40:46 -- Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM]::[USGS_NED_Fused_DEM] (8003.733 sec)

Save: C:\Manifold_Temp_Data\USGS_NED_Fused_1_and_2_arc_second_DEM_Mean_Curvature_Profile_Mar29_2019.map (252.091 sec)

My previous save took 410 sec with just a curvature mean data set in the Manifold project with a similar data source for the original DEM.

The save time is roughly 2x faster. I monitored the range of write speeds on the 960 Pro 2TB between a high of 1.9 GB/s and as low as 60 MB/s.

Previous curvature mean took:

Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8839.753 sec)

to complete.

As a point of reference the curvature profile radius 5 took:

Transform (Curvature, Profile): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8667.657 sec)

However, once I had both a curvature mean and a curvature profile surface inside the map document, the save took:

USGS_NED_DEM_Curvature_Mean_and_Curvature_Profile_Mar19_2019.map (6364.526 sec)

These save times and processing times were with the data on Drobo 5Ds. Not the fastest USB 3.0 RAIDs, but all I could afford that provides simplicity, dual redundancy and capacity. Now that I have a QNAP RAID solution I'll have a much higher performing system.

However, I'm dismayed that the computational time is only 10% faster on the Samsung 960 Pro 2TB vs a Drobo 5D.

Attachments:
Composite_Screen_Grabs_Processing_Mean_Curvature_from_Samsung_960_Pro_2TB _v3_Mar29_2019.jpg

adamw


10,447 post(s)
#01-Apr-19 16:48

The images attached to this and a couple of other posts not loading seems to be a problem with the forum code (the file names are too long). We'll get that fixed.

Regarding the performance on different hard drives and the saves getting accelerated much more than the transform - this has to do with different access patterns. We have a couple of wishlist items that will have a positive effect here - an option to use more virtual memory, first and foremost.

Dimitri


7,413 post(s)
#29-Mar-19 18:22

I added the big DEM project file as a data source. Not copying it into the new project,

Why leave the data in a slow format if you want to measure and report performance, or keep GPU fed with data? DEM is not as fast a format as MAP. A key benefit of using MAP is to be able to get at data as quickly as possible without the bottlenecks of slower formats. See the discussion in the Importing and Linking topic.

StanNWT
196 post(s)
#29-Mar-19 18:29

The original data source in this context is an existing Manifold project file, which I also sent to Tim. So there shouldn't be a performance penalty.

That's the great thing you and Adam and Tim have accurately professed for ages about the (map) file: just use existing (map) files as sources for other projects, no need to import.

tjhb
10,094 post(s)
#29-Mar-19 21:08

All the same, there could be exceptions or limits.

It would be worth testing the more neutral scenario, where you open the project containing the source data, and run the transform/query there. For testing you do not need to save the result--and bear in mind that the GPGPU filter functions cancel very well, so that you can just proceed with the test as long as you need to, to check the performance.

My test scenario was with a single project. I will also make a quick test taking data from a linked child project, to see if that throttles processing for me.

If we get no difference (either way) then we should look at the remaining differences between your setup and mine.

Some obvious differences: you have a pair of GPUs, I have just one per machine; your GPUs are Pascal, mine are both Kepler (but I also have a smaller Pascal GPU still to test).

You are using NVIDIA Quadro driver version 25.21.14.1917, which matches the version I am using on this laptop, while on the machine with the TITAN card I am using GeForce driver 25.21.14.1935. I get full, constant GPU saturation on both machines.

What else should we look at? But first let's eliminate the first thing: using or not using a linked child datasource.

Another thing we should cover off is what GPGPU performance you get outside Manifold 9. One good testing tool is CUDA-Z. If you open the Performance tab it will test memory throughput (both directions) and CUDA performance (single- and double-precision, plus integer types), per GPU. We can compare results.


NB two of the images you have attached above

Composite_Screen_Grabs_Processing_Mean_Curvature_from_Samsung_960_Pro_2TB _v2_Mar29_2019.jpg

Composite_Screen_Grabs_Processing_Mean_Curvature_from_Samsung_960_Pro_2TB _v3_Mar29_2019.jpg

will not download, at least for me. I get 'Bad Request'. Maybe the names are too long? The other three images download correctly.


[Added:]

I also attached a screen grab of the Nvidia control panel illustrating that I don't have an option for double precision like you do with your Geforce drivers and your Titan card. I'm assuming most Geforce drivers at least for Titan cards would show this if there's double precision compute performance?

Sorry, I should have been clearer. That setting is specific to the Kepler TITAN card, and is the means of switching FP64 performance ratio between 1/3 FP32 (as shown in the screenshot) and 1/24 FP32 (by changing the value shown to 'None'). So it doesn't apply to your situation, and its absence is completely normal.


One more thing. Can you show your SLI settings?

StanNWT
196 post(s)
#29-Mar-19 22:08

One thing I realized after I ran the test this morning is that in the past I was running a radius of 2 for mean curvature. This morning I ran it with a radius of 3. So I reran the test with a radius of 2 and got:

Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM]::[USGS_NED_Fused_DEM] (8042.007 sec)

This is the time with a Radius of 3:

Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM]::[USGS_NED_Fused_DEM] (8003.733 sec)

The fact that a much more complex mean curvature computation got pretty much equivalent results is interesting.

I have two GPUs but not in SLI. None of the applications I use would benefit from SLI.

I can look at CUDA-Z.

I do have GPU-Z, and the CUDA specs state a 1:32 SP to DP ratio. Processor count of 14, cores per processor of 128.

Perhaps the size of the photos is the problem?

Attachments:
Composite_Screen_Grabs_Processing_Mean_Curvature _v2_Mar29_2019 - Copy.jpg
Composite_Screen_Grabs_Processing_Mean_Curvature _v3_Mar29_2019 - Copy.jpg
GPU-Z_Screen_Grab_Mar29_2019.jpg

tjhb
10,094 post(s)
#29-Mar-19 22:41

The fact that a much more complex mean curvature computation got pretty much equivalent results is interesting.

It's not really much more complex to run the calculation with a radius of 3 rather than 2. Once the data is on the GPU, everything is almost equally easy. It's getting the data to and fro, efficiently, that takes the time.

Please can you try testing with a single project? I.e. executing (or at least beginning) the transform in the same project that contains the data (without using a child project). Let's rule out whether that makes a difference.

I can read your images fine now thanks! It was probably the long(er) filenames.

I wonder what would happen if you disabled (CUDA on) one of your GPUs. Maybe the NVIDIA CUDA driver is trying too hard to share the load (Unified Memory), and this is wasting transport and synchronisation overhead. Just speculation at this stage.

[Added.] You can tell Manifold to use only one GPU using the 'gpgpu.device' directive. To do this you would set up the curvature transform, but press the Edit Query button rather than the Add Component button. Then edit the query text to insert

PRAGMA ('gpgpu.device' = '0'); -- or '1'

somewhere near the top. Then run it.
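For example, the first lines of the edited query would then read like this (just a sketch; everything after the pragma stays exactly as the transform template wrote it):

--SQL9
PRAGMA ('gpgpu.device' = '0'); -- or '1' to target the other CUDA device
-- ...the rest of the generated query (CREATE TABLE, CREATE IMAGE,
-- VALUE @scales, INSERT ... THREADS ...) follows unchanged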

StanNWT
196 post(s)
#29-Mar-19 23:01

Will using that PRAGMA statement override the settings for each GPU in the Nvidia Control Panel? In that control panel I can set "graphics only" or "graphics and compute"; currently both cards are set to graphics and compute. However, I could set the card that doesn't have my two 32" 4K monitors attached as the graphics and compute card, and then in Manifold set device 1 (the second GPU, without monitors attached) as the one for compute? Should that work better, so that Manifold isn't fighting against the driver configuration from the control panel?

I will try running mean curvature directly in the main project file that has the DEM. I am splitting the results up into different files since having everything in one file makes the (map) file really huge, and creating MXB files takes much longer when everything is in one file. I'm trying to save disk space in the long run.

tjhb
10,094 post(s)
#29-Mar-19 23:28

(1) To your first para I can mainly answer in three words: I don't know. I have never (yet) tried using a system with two CUDA-capable cards attached. (Actually I think I did do some tests with Manifold 8, but not with 9.) And I never use two monitors!

But all of that detail seems material. You have two huge monitors attached (which of course consume heaps of graphics RAM), both attached to one GPU, and no monitor attached to the other GPU, making it "headless"--which BTW used to prevent GPGPU work on the GPU in question; we might need to check whether this still creates driver problems for CUDA.

With that setup, why not set the first GPU to graphics only, the second to compute only? At least for testing.

Now, in that case, would you also need to tell Manifold to use only the second GPU for GPGPU using the PRAGMA directive? It seems at least plausible, possibly likely.

I feel like we're starting to get somewhere. Simplifying the GPU setup first seems like the best approach.

(2) On your second para, what you say about storage makes perfect sense for the long run. But for testing I think it's important to test the simpler arrangement (a single project). As I've said, you can just cancel when you've seen all you need to see (especially whether performance still drops off a cliff after some minutes or seconds).

A question: do both GPUs go to ~100% on Compute_0 at the start of processing? How long do they stay there? I might not be reading your screenshots closely enough.

StanNWT
196 post(s)
#29-Mar-19 23:36

The Nvidia control panel doesn't provide me with an option for compute only, just graphics and compute.

The only GPU compute that shows any response in the Windows Task Manager is compute_0; however, the Nvidia GPU utilization graph shows both GPUs having activity, usually at roughly the same usage. This strange variation in reporting is something to consider.

tjhb
10,094 post(s)
#29-Mar-19 23:54

(1)

The Nvidia control panel doesn't provide me with an option for compute only, just graphics and compute.

Thanks, got it. I've checked your helpful screenshot. So you could set the first GPU to "Dedicate to graphics tasks", the second to "Graphics and compute" (then reboot?). Then you probably wouldn't need to tell Manifold to use only the second GPU for CUDA, it should just pick that up. (Maybe the Manifold 9 About pane would help confirm that?) But no doubt the PRAGMA would do no harm, even if it's unnecessary.

(2)

I suspect that Compute_0 and Compute_1 are there to allow for (rare) cards that contain two GPUs back-to-back, in a single slot.

If that's right then for each GPU, we only need to care about its own Compute_0 readout. So you should see activity in both GPU 0 -> Compute_0 and GPU 1 -> Compute_0 when they are both firing.

(3)

I have done a quick test and I think the "single project" thing is a red herring. Not 100% sure but I think so, for two reasons.

First because when we add your .map file as a child project, then run a curvature transform using the GUI, the result is written to that child project, not to the containing project. We can adjust that by editing the SQL, but you haven't been doing that (I think), so AFAIK that's where your result image ended up. (Then you could move it to the parent, and save just that.) So there is no communication necessary between the two Manifold databases during computation.

Secondly because when I do that on this laptop, I still get fully saturated performance on both GPU and CPU.

So from my side, that really doesn't seem to be the blockage.

It seems more fruitful to try to isolate CUDA to one of your GPUs, and see how that looks.

StanNWT
196 post(s)
#30-Mar-19 00:39

I set card 1 as graphics only, since card 2 has the attached monitors. Card 1 is likely compute_0 and card 2 compute_1. I used your pragma statement

PRAGMA ('gpgpu.device' = '0');

to set the compute device to '0' as you suggested. I did see a burst of activity, then it dropped to the saw-tooth utilization as before. In the Nvidia GPU utilization monitor the GPUs now have distinct graphs; since they're not both dedicated to graphics, this makes sense.

I did try CUDA-Z.

I got 6000 Gflop/s for single precision

191 Gflop/s for double precision

out of both cards, even when I picked the heavy load test.

I noticed that in the Windows Task manager when card 1 was set for compute and graphics, which is device 1 I saw the GPU utilization in the Windows task manager when looking at compute_0 but not compute_1. When card 2, which is device 1, because it has graphics cards attached to it, that device shows no GPU usage on the windows task manager when in CUDA-Z when card 2 is the only card for graphics and compute.

GPU utilization in Manifold for compute_0 fluctuates between 60% and 90% for 1 min 6 sec, then drops down to that saw-toothed pattern as before. This time, in the Nvidia GPU utilization monitor, only compute_0, which is card 1 (no monitors), shows the spike in GPU usage, unlike before when both were tasked with graphics and compute; however, after 3 min the sawtooth pattern, maybe 2 percent at the lowest and spiking to 30%, shows up again.

tjhb
10,094 post(s)
#30-Mar-19 00:55

CUDA-Z CUDA performance looks great. Wikipedia gives 5300 and 165.6 GFlops for the P4000--your cards are well ahead of that.

Card 1 is likely compute_0 and card 2 compute_1

I'm not sure about that. As I read it, both GPU 0 and GPU 1 should have Compute_0 readings*, both of which are relevant; and both also should have Compute_1 readings, neither of which is relevant (always blank). The two separate Compute_0 graphs should correspond, roughly, with the two NVIDIA GPU utilization graphs.

[*When they are both enabled for compute.]

Back to the substance, is your conclusion that performance is no better when only one GPU is allowed to do compute work? Compute_0 performance still falls off a cliff after about a minute? Previously for both GPUs, if I am right about the graphs--can you check this, with CUDA enabled on both GPUs again?--now for only one.

If we are lucky, someone else with dual CUDA GPUs will be able to do a comparative test, either with your massive test project (if you can share it again) or with something else.

(Or dual GPUs might be completely irrelevant. The blockage could be something else. But in any case, it is pretty striking.)

StanNWT
196 post(s)
#30-Mar-19 01:09

Sorry, I had a typo in my last post; I was rushing through trying to describe it before leaving work. It was 6:45 pm when I left.

It should read:

I noticed in the Windows Task Manager that when card 1 was set for compute and graphics (which is device 0), I saw GPU utilization under compute_0 but not compute_1. Card 2, which is device 1 because it has the monitors attached, shows no GPU usage in the Windows Task Manager from CUDA-Z when card 2 is the only card set for graphics and compute.

This applies when Manifold is running as well. If I specify the card that has the monitors attached as compute and graphics, and the card that has no monitors as graphics only, I get no compute_0 or compute_1 readings in the Windows Task Manager.

Only when I specify both cards as graphics and compute, or card 1 (no monitors) as graphics and compute, do I see compute_0 readings when running curvature mean.

When I use CUDA-Z to check the cards' usage in the Windows Task Manager, whether I set both cards or each card individually to graphics and compute or graphics only, I only ever see compute_0 usage in the Task Manager. But CUDA-Z sees either card as device 0 if that card is the only one with graphics and compute specified in the Nvidia control panel.

This is a bit of a confusing mess; am I making any sense?

You're being amazingly helpful, by the way. You must see this as a nice, interesting intellectual challenge, plus it helps fine-tune how Manifold works in the wild amongst various computer configs.

adamw


10,447 post(s)
#01-Apr-19 17:03

Regarding setting cards to graphics / graphics and compute in the NVIDIA control panel applet and then using PRAGMA ('gpgpu.device' ...):

The number of cards and their modes are somewhat removed from what the pragma manages. What happens is this: you tell the control panel which cards to use and how, and this info then gets to the graphics driver. The driver looks at the options you set and decides how many CUDA devices it is going to expose. It can expose a single device covering all cards if it wants to, or it can expose multiple devices even if there's only a single card in the system, it's all up to the driver. When you launch a Manifold query with the PRAGMA for which device to use, that's a CUDA device as exposed by the driver, not a card. You can check how many CUDA devices the driver decided to expose using SystemGpgpuCount() / SystemGpgpus(), see the query builder. Finally, if you try to make the query run on the CUDA device that does not exist, the query engine will just assume that you made a mistake and ignore the request. As in, if you only have a single device and try PRAGMA ('gpgpu.device' = 18), the query engine will think "well, that's an invalid device number, we don't have that many devices" and use the device that you have.
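For example, run in the Command Window (a sketch; SystemGpgpus() returns a table, so it is invoked with TABLE CALL, the same pattern as the TileUpdatePyramids call earlier in the thread):

--SQL9
-- how many CUDA devices the driver exposes to the query engine
VALUES (SystemGpgpuCount());

-- the devices themselves, listed as a table
TABLE CALL SystemGpgpus();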

StanNWT
196 post(s)
#01-Apr-19 18:08

Hi Adam,

I can confirm that when I looked at the number of CUDA devices, it indeed only shows 1, now that I have set the Nvidia control panel to use only the card that doesn't have monitors attached for compute and graphics; the other card, with the monitors attached, is graphics only. So Manifold only sees one GPU now.

Dimitri


7,413 post(s)
#02-Apr-19 05:55

now that I have set the Nvidia control panel to use only the card that doesn't have monitors attached for compute and graphics

Why turn off resources artificially and make the situation a) slower and b) more difficult to understand? To repeat:

And trust the optimizers: don't turn stuff on/off to try to game the system or give it hints. Turn everything on and let it use everything as it sees fit. :-)

You have expensive and powerful cards: allow Manifold to use them. It will do a better job than humans can, so turning off some cards won't improve performance, but it can lead to confusion.

tjhb
10,094 post(s)
#02-Apr-19 06:13

No one is trying to "game the system".

The current result is unexplained and strange. Something is wrong, we don't know what. The natural objective is differential diagnosis.

In light of Adam's helpful insights above (regarding how CUDA devices are really enumerated), it does seem better to turn both GPUs back on (then count how many Manifold sees...), and for both me and Stan to time the transform using CPU only. That might well help.

I will also test tomorrow on my smaller Pascal GPU.

Any better ideas?

I am full of ideas, but few are good. One more: try installing the NVIDIA Quadro driver that installs with CUDA Toolkit version 10.1.

Another: rearrange the two monitors. Plug one into each card--unless there is a good reason why not.

Dimitri


7,413 post(s)
#02-Apr-19 07:21

The current result is unexplained and strange. Something is wrong,

What is unexplained and strange about the current result? What is "wrong"?

tjhb
10,094 post(s)
#02-Apr-19 07:38

Read the thread. In particular the results.

Or if you are feeling short of time, Adam's summary here.

(Easy to miss actually, because the thread is heavily spliced.)

Dimitri


7,413 post(s)
#02-Apr-19 08:10

But you're the one who says something is unexplained and strange, something is wrong, so it's up to you to state clearly what you think that is. It's not up to me to infer what, exactly, you think is wrong. For example, there are posts here about saw toothed patterns reported by Windows for GPGPU use... is that the "unexplained and strange" to which you refer? Why guess, when you can say clearly what you mean?

We do have Adam's note here, for his take on it, but we don't have a crisp summary of what you think.

Now, why on his system the test completes in 51 minutes whereas on your system it completes in 2.5 hours is a little puzzling, yes.

If the above is the issue, I don't think that's so puzzling because, obviously, your two systems are different and quite possibly, how the systems are being used is different. That's routine. Look at the details of what the differences are in the systems and you'll usually find the issue. The place to start in a data access intensive task (and a relatively lightweight computation task) is data access.

It's often a matter of small details that go unnoticed, like temp space or other resource inadvertently being set to a slow device, some background process like a virus checker or something else looking at files and slowing the works down. Could be version skew in drivers (doubt that, but stranger things have happened). But I agree with Adam it's not likely something having to do with two GPU cards or one, but instead something more likely having to do with speed of data storage and access.

As to whether that's "wrong" or not, that may be taking the, ahem, wrong approach (couldn't resist...). Suppose it is a virus checker slowing the works down? Well, that's perfectly correct operation if somebody has a virus checker installed that examines all files, including temp files. The virus checker is doing its thing correctly. It's just not an optimal configuration if what you want is speed in analytics that move a lot of data around in files.

But there too, the advice not to over-think the GPU end of it is useful: put time into the things recommended in your own post about what you would try.

adamw


10,447 post(s)
#02-Apr-19 11:05

I'd do this:

First, create a subset of the full data to make a larger number of tests feasible - something that completes in 1-2, at most 5 minutes (not in 15 seconds though, that's too short).

Second, on both machines (Tim's on one side and Stan's on the other), set up the same transform, then press Edit Query, turn off GPGPU (add PRAGMA ('gpgpu'='none'); at the top of the query), turn off threads (could be done before Edit Query in the transform pane - or could just remove THREADS ... in the query - or, perhaps better, hard-set it to THREADS 1). Run the transform, measure the result. If Tim's machine is faster, that's on CPU + storage. The bigger the data, the more the contribution from storage.

Then try the same but with the fixed number of threads (THREADS 8).

Then try the same but with automatic threads (THREADS SystemCpuCount()).

Then single-thread but add GPU (THREADS 1, but change the pragma at the top to PRAGMA ('gpgpu'='auto'); - do not just delete it, if you just delete it, the statements will use the value of the pragma from earlier commands, which is 'gpgpu'='none', we have to override it).

Then finally automatic threads and GPU (THREADS SystemCpuCount()).

We can then look at the differences between timings on two different machines and between timings on the same machine.
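
To sketch the query headers for those runs (the transform body is whatever Edit Query generated; only the pragma at the top and the THREADS clause change between runs):

-- Run 1: no GPGPU, one thread - the CPU + storage baseline
PRAGMA ('gpgpu' = 'none');
-- <generated transform statement> ... THREADS 1

-- Run 2: no GPGPU, fixed number of threads
PRAGMA ('gpgpu' = 'none');
-- <generated transform statement> ... THREADS 8

-- Run 3: no GPGPU, automatic threads
PRAGMA ('gpgpu' = 'none');
-- <generated transform statement> ... THREADS SystemCpuCount()

-- Run 4: GPGPU on, one thread - override the earlier pragma explicitly
PRAGMA ('gpgpu' = 'auto');
-- <generated transform statement> ... THREADS 1

-- Run 5: GPGPU on, automatic threads
PRAGMA ('gpgpu' = 'auto');
-- <generated transform statement> ... THREADS SystemCpuCount()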

In general, when I look at your configuration, Stan, I am (a) feeling very warm (what a great monster of a system), and (b) wanting to bump up the wishlist items that propose an option to allow using a lot more RAM - within reason, of course, but it would be nice to be able to go significantly above the default conservative limits when that's desired.

StanNWT
196 post(s)
#02-Apr-19 15:55

I only have whatever is installed with the 419.17 drivers, I've not installed any version of the CUDA Toolkit separately.

StanNWT
196 post(s)
#02-Apr-19 23:38

Will installing the CUDA Toolkit be of any use vs. the NVIDIA Quadro drivers, and if so, how?

tjhb
10,094 post(s)
#02-Apr-19 06:35

What else would I try, if this were my system (and knowing no more than I do)?

First, check where TEMP is. Move it to drive C: if not already there.

Second, uninstall any and all Google apps, especially Chrome. Uninstall or disable Microsoft OneDrive, and any other live backup or mirroring software. Disconnect from the Internet. Remove all USB drives.

Third, remove any custom antivirus or antimalware software, and revert to Windows Defender.

Reboot.

Lastly, perform TRIM, then shrink the SSD to ensure 10-20% overprovisioning, then TRIM again.

tjhb
10,094 post(s)
#02-Apr-19 07:08

[Stan has already said his TEMP is on C:.]

StanNWT
196 post(s)
#02-Apr-19 19:07

Hi Tim,

Generally I've always wanted to put the:

  • pagefile.sys
  • <C:>\Temp
  • <C:>\Windows\Temp
  • <C:>\Windows\Users\**\Temp
all remapped to a separate drive; in the case of a workstation using a 4-6 drive RAID 10 it's not as important, due to fault tolerance. My reasoning is that I like to reduce the continuous wear and tear on the boot drive, in an effort to extend its life and avoid having to restore the operating system from backup onto a new drive, or worst case re-install everything from scratch.

My workstation has full/incremental backups of the <C:> drive, so that if there is a detrimental OS patch, software conflict, or complete corruption, I can revert to an earlier state.

Obviously full/incremental backups apply to my main GIS drive as well.

However, due to the massive performance boost I get from the Samsung 960 Pro 2 TB drive, I don't put those on a separate NVMe drive. The standard NVMe drives that come with the Dell Precision 7820 are Toshiba drives with a read speed of 1400 MB/s and a write speed of 300 MB/s - nowhere near the Samsung (960 Pro, 970 Pro, 970 or 970 Plus) drives. I bought the extra drive and installed it into the spare flexbay for NVMe drives. I do have the 1 TB Toshiba NVMe drive, but its performance is pretty lame.

StanNWT
196 post(s)
#02-Apr-19 15:53

I am in an organisation that doesn't allow me to turn off my anti-virus / security suite. There are also other corporate tools running in the background. They take up hardly any CPU and no GPU performance.

One thing that does happen, and this is a known Windows issue since 7, is that Windows Explorer (the file manager, not to be confused with the Windows 3.0 File Manager that you can now download and use in Windows 10 for some nostalgia!) will occasionally max out the CPUs, but it only lasts for seconds. Usually this is an issue with 'Carbon Black', which used to be called 'Bit9'. I can't just disconnect from the corporate network. My GIS data storage for the moment is all USB 3.0, so I can't disconnect that, and Acronis TrueImage needs to do the scheduled backups. The backups were not running and not using any CPU during the tests.

I'm not going to go into massive detail about every nuance of my setup and workflow, since it's a work environment and I have to keep some level of discretion, especially in an open forum. OneDrive isn't installed; perhaps it will be at some point. If it gets installed I won't have any control over it and it will likely sync as it chooses, though I might be able to restrict syncing to certain folders that aren't GIS data locations. I use Chrome as my web browser 99% of the time and I'm not getting rid of it. Sorry to be so blunt. I'm incredibly grateful for the help, Tim, Adam, Dimitri, but I'm not able to go to a clean install of Windows 10, or a practically clean install, for the testing.

StanNWT
196 post(s)
#02-Apr-19 16:06

As stated here, there is only a compute_0 reference: whether you show the graph on both cards or not, and whether both cards are set to use compute or not, only compute_0 shows any usage. It's just that in the comparison runs I've done, when only one GPU is enabled for compute, the GPU without monitors attached seems to have a longer maxed-out GPU time, and the sawtooth seems to have higher peaks and higher minimums than when both are enabled. I've not tried the run where only the card with monitors attached does compute and graphics and the other card does only graphics, even though it has nothing attached to it. That would be a good test to run, I think.

Dimitri


7,413 post(s)
#30-Mar-19 06:56

OK. Thanks for clarifying. My mistake: I read it that you had linked in the DEM, not linked in a MAP that contained an import of the DEM. It's true there should not be a performance hit.

By the way, if you find yourself scratching your head to analyze results, you're not alone: unfortunately it is very difficult to guess what the innards are doing many levels deep.

GPGPU involves so many intricate issues on so many levels, involving so many different packages and devices, and a mix of them which change depending on what specifically is going on, that it is hard to get beyond the usual broad generalities: use fast disk to avoid being disk bound, have plenty of CPU cores so those can work in auxiliary or primary roles as the optimizers determine, have lots of main memory so the many moving parts that use main memory (be they Windows or Manifold or libraries) don't run into limits, and that, generally, a higher end GPU card will provide better performance (but for most people in most applications there should not be pressure to over-spend). And trust the optimizers: don't turn stuff on/off to try to game the system or give it hints. Turn everything on and let it use everything as it sees fit. :-)

For all that, we can learn useful info about rigs for bigger tasks with trials like these, and it is wonderful fun to try big data with cool GPGPU configurations. I love it, and thanks for spending the time to set up experiments like this!

I also agree that when you see total GPU saturation of a very mighty GPU card like a TITAN, that is truly amazing. That's not something you'd see in years past with 8. A big part of that, I'd guess, besides 9 being what it is, is having a fast SSD data store and enough CPU cores for the system to use in parallel to support GPGPU.

adamw


10,447 post(s)
#01-Apr-19 16:23

Regarding transforms not needing much memory on GPU:

It's true that we currently don't need much memory on GPU. This happens because all of our current GPU-enabled functions allow us to split the work and then reuse memory for different chunks of data very efficiently. As we add more analysis, other functions might not be so lucky, and might need to, say, pre-load a big part of the analysed raster onto GPU - and begrudgingly go to CPU if GPU does not have enough memory.

Regarding switching to 32-bit floating-point math moving the work from GPU to CPU:

This does tend to happen because of two things: (a) the raster itself being not 32-bit floating-point (whatever its type, just that it isn't FLOAT32), and / or (b) one or more of the operations in the optimized expression being not GPU-enabled (this then frequently converts 32-bit floating-point values coming out of GPU to 64-bit floating-point in order to do the non-GPU-enabled operation, then possibly converts the result back to 32-bit floating-point if there's more GPU computations to perform).

If the raster is 32-bit floating-point and all operations are GPU-enabled, then switching to 32-bit floating-point math should have no conversions and should result in better performance, although the performance gains will vary and will frequently not just automatically be 2x or 3x or whatever.

In general, we treat 32-bit floating-point math as a special case for when the user tries to squeeze the absolute max performance possible and is controlling everything trying to get it. The general case is 64-bit floating-point math.

StanNWT
196 post(s)
#01-Apr-19 18:05

Hi Adam,

I'm running the standard test on 9.0.168.11, without modifying the curvature mean transform. I'm attaching a mosaic of all the Task Manager charts with the corresponding Manifold windows, showing the time index and MB/s or KB/s; labeling for each group is provided. I also have the GPU-Z composite screen grab in there. The JPG is 7239 x 2813 pixels. It's very bizarre that when the CPUs and GPUs are maxed out I'm only getting ~600 KB/s scanning records, but when the load drops to lower levels is when I get into the MB/s.

Note that I've moved the data back to the Drobo; however, the reads/s haven't changed much in the initial part of this run. Note also that my <C:> drive, the Samsung 960 Pro 2 TB, likely has OPAL 2.0 encryption turned on, something the IT department sets up without any option. However, when I first got the workstation, which would have been after IT configured it, my ATTO disk bench was getting 3 GB/s read and 2.5 GB/s write. I'm not sure of the current performance. If the times are the same regardless of where the data is stored and accessed, then there's a larger issue afoot. I also have a slew of applications running in the background that I cannot turn off, IT department things I have no control over.

I appreciate how hard Tim has worked on benchmarking his computer with the same data. I'm sure he enjoys the challenge. I'm sure I could give you the Dropbox link, Adam, if you wanted to play with the same data set and perhaps get it into the hands of the development team, so they can play with that large a data set, not that they haven't already played with very large data sets.

When I looked up the likely double-precision FLOPS of my two Xeon Gold 6128 CPUs, they seem to have more FLOPS than the two Quadro cards combined, using the AVX-512 metric? That being a double-precision to double-precision comparison.

Attachments:
Mean_Curvature_Performance_Test_Apr1_2019.jpg

StanNWT
196 post(s)
#01-Apr-19 19:49

It just finished after 2 hrs 35 minutes. There's little difference in the computation time when the data is on the Samsung 960 Pro 2 TB vs. the Drobo 5D. Save times will be vastly different of course. I have a new set of task manager and manifold screen grabs I'm compiling.

StanNWT
196 post(s)
#01-Apr-19 20:38

Here's the last page of test screen grabs as promised. 9663 x 2438 pixels.

StanNWT
196 post(s)
#01-Apr-19 21:57

Can I delete the attachment? The Page2 attachment?

adamw


10,447 post(s)
#02-Apr-19 11:12

I deleted it.

I'll think about what we see on the first screen, it's very useful.

StanNWT
196 post(s)
#02-Apr-19 15:30

Thanks. Same graphic, no names or directory paths.

The "*Test_Apr1_2019" is fine.

To me it seems that, given the same transform and the same radius, setting the graphics card that doesn't have any monitors attached as the only compute card gave it a boost, as reflected in the '*Page2a_Apr1_2019' image.

"If" Windows was always scheduling the card that didn't have any monitors attached to be the predominant compute device, i.e. 'compute_0', and the card that does have monitors attached to be 'compute_1', then it makes sense that I never saw any GPU processing on 'compute_1'. However, if 'compute_0' and 'compute_1' are simply compute functions on all GPUs, then that's different.

The settings for the driver are, as you say, passed on to Windows and then to Manifold, I assume, and Manifold currently only shows one CUDA device. Before, it showed two. However, I'm getting more and higher GPU compute graph results in the task manager now that there's only one CUDA device. This could mean one of two things:

1) The graph is an average of the two cards, each card having 'compute_0' and 'compute_1' components; if Manifold is only using one card but the graph is averaged over two, it appears lower.

2) Setting only one CUDA device, the card without monitors attached, a card that has nothing to do but compute tasks, actually produces higher results?

Attachments:
Mean_Curvature_Performance_Test_Page2a_Apr1_2019.jpg

adamw


10,447 post(s)
#02-Apr-19 16:38

What we mostly see in our experience is that, although that's not guaranteed to happen, each physical card tends to get represented as a separate CUDA device and only as a single device. However, in terms of what gets shown as 'compute_0' and 'compute_1' in Task Manager, I believe we saw cases where even with two separate CUDA devices, 'compute_1' was staying blank and everything was going into 'compute_0'. For one thing, 'compute_1' is absolutely showing as a choice on systems which have a single card and a single CUDA device, so the presence of 'compute_1' isn't an indication that there's anything to show there. So if 'compute_1' shows nothing and there are multiple CUDA devices, I would suspect that the combined load of these multiple CUDA devices might indeed go into 'compute_0'. That would produce the effect you are talking about in 1.

As regards what would give the best performance, without heavy rendering (which we don't have here), probably putting both cards to 'graphics and compute'. Ignoring one of the cards entirely shouldn't be beneficial - if it is, that's likely something to fix for either us or NVIDIA (but I don't think it is). :-)

StanNWT
196 post(s)
#02-Apr-19 18:45

Thanks Adam,

One thing to note is that I'm not likely to want to stick one monitor on each card because of syncing issues. Without SLI I don't think syncing is going to work well across both cards/monitors. Not really looking to have any issues. Considering each card can handle 4 x 4K monitors, it seems like a good idea to keep them on one card?

I can easily turn 'graphics and compute' back on for both cards.

tjhb
10,094 post(s)
#03-Apr-19 01:20

I'm not likely to want to stick one monitor on each card because of syncing issues

You mean for gaming, or does it also matter for extended desktop mode, e.g. using Manifold?

Considering each card can handle 4 x 4K monitors, it seems like a good idea to keep them on one card?

I would expect that it is better to have two hands performing two tasks, than to have one hand juggling two tasks. CUDA tasks aside, one of your cards is currently going to waste.

And there is at least a chance that leaving one card headless may interfere with CUDA balancing/throughput.

It would be good to test with both cards active again, both with one headless, and then with both driving a display.

For a start, how many GPUs does Manifold count in each case? (That would be very interesting, now that Adam has explained how it works.)

Then, what performance do you get on this test in each case, after the first couple of minutes? Does performance always fall off a cliff?

adamw


10,447 post(s)
#02-Apr-19 16:17

OK, regarding the screen.

I think what we are seeing is mostly the effects of the hierarchical memory system that we have in 9, which works similarly to the disk cache: we use a limited amount of memory as disk cache and organize reads / writes around that. In your case the machine has tons of memory, so what we write to the 'disk' does not actually go straight to the disk and instead goes into a second-order cache, but that just muzzles the reaction somewhat without changing its nature. We are still accessing the big image in limited portions, and there are still all effects of hierarchical memory with some accesses being fast and others slow, and these effects are just less pronounced: 'slow' is much faster than it would be on a less bulky machine, but it is still slower than 'fast'.

The numbers reported by the progress dialog are cumulative from the start, they are not moving averages. The difference in KB/s is affected by whether tiles are filled with zeros or not; tiles filled with zeros are smaller in terms of bytes, so big discrepancies in KB/s are perhaps coming from that. Here's my interpretation of the screens for curvature mean, radius 3:

0:09 - cumulative speed so far is 1,168 records/s (I wouldn't dwell too much on 68.6 KB/s, this looks very low but it is low likely because many tiles at the start are coming from the edges and are full of zeros = small in terms of bytes). Since we are just starting, most tiles that we read likely come from 9's 'disk' instead of 9's cache, in the future more tiles should come from cache. The GPU load is good.

1:43 - cumulative speed increased to 1,398 records/s. That likely happened due to more tiles being loaded from 9's cache vs 9's 'disk' (which is still cached by the operating system in memory, but still slower). The GPU load is good. There are some sawtooth-looking artifacts, that's probably the graphics driver bumping into some limits internally and refreshing / restructuring something under load.

2:43 - cumulative speed dropped a bit to 1,322 records/s. The GPU load starts having the big sawtooth pattern. This looks like a saturation of 9's cache. Once saturated, the cache is going to perform at a predictable speed, and that speed isn't terrible, but it is lower than the speed before the cache is saturated obviously. The ratio of the time spent rearranging data increases, the ratio of the time spent doing useful computations decreases. All this is normal and expected. (It's just that since we have so much memory, there's a case for increasing the cache size to have it saturate later and to maintain a better ratio for doing useful computations.)

3:40 - cumulative speed dropped more to 1,045 records/s due to the big sawtooth pattern overtaking all of the GPU - and will likely stay there dropping insignificantly all the time for the duration of the stage of transform that computes tile values to be inserted into the table. 9's cache is saturated, it won't become slower, etc.

It might be that Tim's machine for some reason has better performance for when 9's cache is saturated, and that's what explains the performance difference. I'd still do the tests I recommended above without GPU and without threads if you have the stamina / desire; these might be very informative.

The takeaways so far: (a) we should allow using more memory, (b) we should try to increase the performance base for when the cache gets saturated (it's not a question of if it is going to be saturated, because it will be whatever the amount of memory on the machine - disks and files are going to be larger). Overall, I wouldn't say the big sawtooth looks too bad, but obviously the more area we can capture on that graph, the better.

StanNWT
196 post(s)
#02-Apr-19 18:55

Is there a possibility of using a larger percentage of memory, as a checkbox / fill in a number as a percentage dialog in each transform? For example, a checkbox which, when ticked, enables a box where you type in a number (0 to 100) for the percentage of memory to use.

Perhaps making this something the user has control over is a bad thing, but the end user has more knowledge of their individual system and its available memory, given the other software that's running, than a blanket larger cache allocation can account for. The caveat is that the user would be increasing or decreasing performance directly, but at least they could control some of the performance boost themselves. Sometimes software that runs at scheduled times needs more memory than would be available if Manifold were using it, so to prevent conflicts you would reduce the amount Manifold may use; other times you would increase it.

Keeping this entirely in the background, unavailable to the user, might not be the best approach, but then users can also do things that are unintentionally detrimental to software. Only having an option to apply this in SQL, by editing a transform query, isn't great for those who don't yet do well with SQL. I know that adding radio buttons, checkboxes and fill-in boxes to transform dialogs isn't always desirable, but it gives a more GUI-driven approach for those who like it. It also takes longer to roll out new capabilities if you have to put them in GUI dialogs as well.

Dimitri


7,413 post(s)
#03-Apr-19 06:11

Is there a possibility of using a larger percentage of memory, as a checkbox / fill in a number as a percentage dialog in each transform?

God, I sure hope not. :-) Tech support is already taking hostages to ensure that doesn't happen. Imagine people around the world deciding that 16K of memory is plenty to run a billion object transform... and Tech gets to deal with that... :-)

the end user has more knowledge of their individual system and its available memory, given the other software that's running, than a blanket larger cache allocation can account for

Maybe there's one end user in a million who does, but such knowledge is not a realistic possibility for most, not even for experts. If anything, the more expert someone becomes in such matters, the more they realize automated systems are a better way to approach optimization of activity where wheels-within-wheels intricacy and interactions between many very complex systems (Windows, drivers, Manifold, etc.) change what's best from millisecond to millisecond. The right way to make better use of memory is to improve the algorithms and code that assign and use memory automatically.

In things like cache, trust the software. If you can't trust the software, the solution is not to provide manual settings so you can try to fix what you don't trust. The solution is to fix the automated function of the software so you can trust it.

tjhb
10,094 post(s)
#03-Apr-19 06:34

Re your last para Dimitri:

That's complete and utter bollocks, and you yourself don't believe it.

Your suggestion is exactly the same as "you don't need your own PC, use a dumb terminal / thin client". 100% soviet.

Your thinking is plain wrong here, and I know for a fact that you disagree with it.

Dimitri


7,413 post(s)
#03-Apr-19 07:46

Let's unpack your comment... what could you mean? That...

a) You should not trust your software, especially not in highly intricate matters where no human can see many levels deep on a millisecond basis what is going on. Well, can't be that.

b) If you don't trust your software the solution is not to fix the software so you can trust it. Somehow, that doesn't feel right either.

c) If you don't trust your software, the solution is to provide manual settings so you can work around what you do not trust. Maybe that is what you mean, but if so, I don't buy into that at all.

Look, if you don't trust how the intricacies of automated allocation and use of memory work, I don't think that it is realistic to expect to do any better many levels in complexity and time frame away from where the action happens. Here's a thought experiment using what you already can do, to explore the limits of what you can do manually, if you feel sufficiently confident:

Manifold isn't 100% soviet, it's the opposite. If you like the idea of rolling your own, you are perfectly welcome to do that. Manifold exposes plenty of programming capability, and you can use Visual Studio or whatever other tool you like.

Write your own routines that use CUDA and use memory however you see fit. What you'll learn very quickly is that trying to hard-wire specific use of memory and cache for each and every specific thing you do, in each individual circumstance, does not work anywhere near as well as writing some more general mechanism that uses good algorithms to allocate what is reasonable use on the fly as the algorithms indicate is best.

There should be nothing controversial about the above, as that approach is very routine for pretty darned much any significant package, and has been for many decades, operating systems like Windows and Linux being good examples.

If you don't trust the way Manifold internals use memory in highly complex situations, why do you trust Windows to do that? If you don't trust the internals, you should have good reasons for that lack of trust that are based on facts, not feelings, and thus you should be able to express your reasons for thinking it could be better. That is a good thing, not a bad thing. Such exploration of different circumstances is exactly how algorithms can be tuned for better performance.

I'll make an analogy to neural network based face recognition. When you set up a neural network and train it, not one person in a million can tell you how in any specific case it arrives at recognizing a person's face instead of a monkey's face. But it does, using totally automated, wheels-within-wheels levels of complexity that do, indeed, work with good reliability if correctly set up and trained.

When it doesn't produce results with the reliability you want, you don't say "OK, the solution is to hire thousands of people to manually examine millions of images and to hard-code workarounds for each image where we think our neural network cannot be trusted." The solution is to tune the network, to improve how it works in a general way, to tune how it is trained so that the automatic results work. That's not a soviet approach, that's a modern approach of understanding how sophisticated technology works and using that knowledge to adjust the technology so it does what you want.

As a general rule, I like all sorts of manual settings, where they are reasonable and where they don't cater to ignorance instead of insight, and where a manual approach is realistic. But where the technology requires automatic function, a Luddite approach is not the best.

It's like a fine Swiss watch that you suspect might be running slightly fast or slow. The solution is to have the innards tuned, so it automatically runs on time. The solution is not to package the watch with instructions that tell the user to move his or her arm only in up and down motions if the watch runs fast and only in left and right motions if the watch runs slow. The manual workaround for what you don't trust is exactly the soviet approach, not the modern approach.

StanNWT
196 post(s)
#02-Apr-19 19:15

Has all this testing that Tim and I have been doing been illustrative and useful, not just for us but also for the Manifold developers and gurus, in thinking about how things are actually being used 'in the wild'?

One thing I'm interested in knowing: have you experienced any performance penalties on workstations with dual-socket configurations? Some professional CAD/CAM, animation and other high-end programs or databases are known to take a performance penalty from the traffic between sockets. It's the same rationale that Intel used against AMD's separate chiplets with their Infinity Fabric, although Intel is now starting to build that way too.

Are most Manifold users and/or developers, gurus using single socket multi-core setups or multi-socket/multi-core setups?

I'd love to see Manifold running on a dual socket EPYC Ryzen 2 architecture server, with 8 x Quadro 8000s and 2 TB RAM, but I haven't won the lottery.

Dimitri


7,413 post(s)
#03-Apr-19 06:28

One thing I'm interested in knowing is have you experienced any performance penalties on workstations with dual socket configurations,

Multiple sockets make no difference, since any theoretical performance differences that chip vendors may use to market against competitors are far smaller than real world bottlenecks such as data access, running an antivirus/security/indexing/backup service that slows stuff down, etc.

Are most Manifold users and/or developers, gurus using single socket multi-core setups or multi-socket/multi-core setups?

Same as with almost all software these days: the overwhelming majority of machines are running single-socket, multicore CPUs.

The increasing popularity and decreasing cost of multicore CPUs means it is easy to buy a cost-effective rig with a single multicore CPU socket. But at the same time, chip vendors have not put the same focus on chipsets which would make it as easy for motherboard vendors to introduce cost-effective dual-socket motherboards, so those remain a niche market.

There are also form factor issues that mean multiple socket motherboards remain a "server" market niche product, and those tend to be configured for server farm use that revolves around web serving and not analytics or general parallel software use like Manifold.

Try to find a motherboard that provides, say, four sockets for inexpensive manycore CPUs, plus four full-speed slots for GPU cards, plus lots of on-board memory, plus massive connectivity to big SSDs and there are not so many choices.

It's cool that Manifold will run out of the box on all that, if you configured it, but in the real world such fire-breathing exotica is very rare outside of military and other classified, black-budget users. The Manifold response to that is the background work being done on the big "servers" theme, where big tasks could be automatically distributed across a private cloud configuration on your local network.

There is much to be said for distributing both the function and the data store to many many machines on your organization's local network, which are already paid for and for the most part, with most of their cores just sitting there doing nothing for endless milliseconds, even during the height of a work day. It's not the same tight coupling you get "inside the box," but for many tasks distributing the data store and processing over what now routinely are very, very fast networks can still get super effects.

After all, when you run Hadoopy stuff in clouds, you're running on a mass of very underpowered PCs that are connected via networks. The cloud isn't a single machine with thousands of sockets.

tjhb
10,094 post(s)
#03-Apr-19 01:03

It's hard to know where to put this test, but I'll put it here in reply to Adam's

I'd still do the tests I recommended above without GPU and without threads if you have the stamina / desire, these might be very informative.

I'm currently running a duplicate test, on the same machine as I used above, but with GPGPU off. Still using 6 threads; I'm not sure I have the patience to test on a single thread.

So this is interim:

Previously, with 896 64-bit cores active, I had this sort of result from Manifold's dialog (I have screenshots which I didn't post before but can):

1438 records/s - 1335 records/s [fairly constant]

117.2 KB/s - 25.6 MB/s - 46.7 MB/s [generally increasing]

Now, with no GPGPU usage, I get

77 records/s [rock steady, no change at all for the first 1h 20mn]

8.4 KB/s - 4.5 KB/s - 775 KB/s - 819 KB/s - 1.1 MB/s [generally increasing]

At the current rate, it will take about (2306560 records (tiles) / 77 records/s) ≈ 29955s ≈ 8h 20mn to complete.

This is on exactly the same system as used above for a time of 42mn 47s, except only that GPGPU is disabled with

PRAGMA ('gpgpu' = 'none');

The comparison is rough because we're only a short way in, and it doesn't take account of the slower pyramids phase at the end, but bearing that in mind, it shows roughly a 12x speedup using GPGPU on this system.
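
A rough cross-check of that ratio, using just the figures above (pasteable into the Command Window):

VALUES (2306560.0 / 77);                    -- ≈ 29955 s ≈ 8h 20mn estimated without GPGPU
VALUES ((2306560.0 / 77) / (42 * 60 + 47)); -- ≈ 11.7, vs the 42mn 47s GPGPU run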

(Putting it another way, the previous test would have finished twice over by now, and I still have about 6 hours to go.)

Stan might have a lower overall ratio given his much faster CPUs.

So yes this gives a useful perspective!

tjhb
10,094 post(s)
#03-Apr-19 01:28

Still 77 records/s after 1h 43mn. Now 1.4 MB/s.

CPU has been solid 78-80% throughout, in line with assignment of 6 threads. (6 out of 8 virtual cores fully saturated by Manifold, plus a little bit for Windows.)

SSD2 (containing both .map file and TEMP) mostly 0-3%, occasional burst to 18%.

RAM usage 6.4 of 32 GB, static.

GPGPU at constant 0% of course.

tjhb
10,094 post(s)
#03-Apr-19 04:18

Abandoning now after > 4 hours. I think anything useful has already been shown.

tjhb
10,094 post(s)
#09-Apr-19 01:05

I have finally got around to making this same test on a third machine:

Intel i7-2600, 16 GB RAM, NVIDIA GeForce GTX 1060 6GB, SSD1 120 GB, SSD2 480 GB, HDD1 1 TB

This machine is much slower and less capable in every way than Stan's beast. The CPU is old (and single), there is only 16 GB system RAM, and the graphics card is much less capable than either of Stan's (and he has two).

The time for the same mean curvature test on exactly the same data was 3315.319s, 55mn 15s.

To recap, that is compared with 2h 27mn on Stan's much more powerful machine, or 42mn 47s and 51mn 16s on my GTX TITAN.

I had swapfile, TEMP and Manifold project file all on drive D.

I had ~98% GPU saturation throughout the test (the same pattern as for the GTX TITAN).

This shows clearly and definitively that for Manifold GPGPU processing, either there is something seriously wrong or misconfigured with Stan's system, or else there is something misconfigured in Manifold 9 with hardware like Stan's.

(Could the problem be hard drive encryption? Dual GPUs? Security software? Google Chrome--which I don't have? No idea.)

It also shows that a well-configured system with a relatively small GPU (GTX 1060) is almost as powerful as a system with a relatively expensive mammoth GPU (Kepler TITAN). Transport matters much more than theoretical GPU power. The efficiency and power is in Manifold's scheduler, all going well. Which is free.

Dimitri


7,413 post(s)
#09-Apr-19 09:13

It also shows that a well-configured system with a relatively small GPU (GTX 1060) is almost as powerful as a system with a relatively expensive mammoth GPU (Kepler TITAN).

To avoid an unintended misdirection, I would preface the above comment with...

"In cases of relatively simple calculations on large volumes of data, where data access might be a greater factor than computation, ..."

Mammoth GPUs are more powerful than lesser GPUs in complex calculations where the greater performance of high-end GPUs will show a difference, and where other bottlenecks, such as the need to move lots of data around, do not come to the fore. Such situations are unusual in GIS work, where a more typical situation is the case of relatively simple calculations done on lots of data, like the task in this thread.

That's why the GPGPU advice explicitly discusses such matters and why it advises not to overspend on GPGPU while neglecting the other parts of the system (manycore CPU, memory, fast data store).

---

So... why is there an outlier in terms of performance? It's frustrating not to close the loop on this, because quite often the root cause of such things is a simple thing that has a big effect. Find it, and suddenly things go much faster, saving hours of work.

Given the dominant role data access likely plays in this particular application, the fastest way to discover why there is an outlier is to focus first on the most likely cause: differences in data access performance. Examine all the details of hardware and software that might affect data access. If that doesn't turn up the answer, move on to other possibilities.

My gut feel is that the answer likely would be found by following up all details that come to mind based on this post: http://www.georeference.org/forum/t147125r147464#147394

Key quotes from that post:

I am in an organisation that doesn't allow me to turn off my anti-virus / security suite. Also there are other corporate tools running in the background.

[...]

I can't just disconnect from the corporate network. My GIS data storage for the moment is all USB 3.0 so I can't disconnect that and Acronis TrueImage needs to do the scheduled backups.

OK. The above tells us that the usual suspects in terms of software that might reduce data access throughput are known to be in play. It also tells us the interface to GIS data is through USB 3.0, and that there are imaging packages running which might sync to the corporate network. Any one of those things can impact data access in a big way, which is why they are the usual suspects.

Here's just one possibility that might not be expected: one of those syncing packages running in background might not actually do a sync, but it might reach out across the corporate network every now and then to check a time stamp on an archived cache to see if will need to do a sync when it is time to sync, and it does that in a way, due to corporate network latency or whatever other effect, which holds up the processes generating files or touching data. Turn off the "check sync cache status" and suddenly the big job runs three times faster.

Or, it could be something even simpler like turning off some antivirus or "security" package.

Sure, the organization's IT group might not like that, but it could be when a user shows them some use case where adjusting the default guidelines saves hours of work, well, they might agree that in this case it's OK to turn it off, or they might apply their skills to a new configuration that doesn't impact performance. Might help to get them involved.

adamw


10,447 post(s)
#24-Apr-19 08:09

A belated reply to you and Stan.

This thread has been very useful, yes. I mentioned some take-aways that we made earlier, there were several others. Thanks a lot for that.

We don't yet know why Stan's bigger machine would perform worse than Tim's smaller machine, but we think we know enough to add useful telemetry - measure runtime statistics and report them after the transform - which will likely help. We test on many different configurations, including those with multiple cards, but with the immense range of configuration options available for the PC there are always tons of nuances that we cannot realistically see with our own eyes - measuring runtime statistics will allow us to see them. Lower than expected performance might be related to many different things. Speaking loosely, Stan's machine might be too fast in places where we assume things to be slower; this could produce waits where we don't expect to have them, and our code might be handling those unexpected waits less efficiently than it could.

We will try to add telemetry to a couple of transforms after the current cutting edge build.

We will also try to increase the memory limits, etc, as discussed above.

ColinD

2,081 post(s)
#24-Apr-19 12:20

could produce waits where we don't expect to have them

I have a similar machine to Stan's, dual six-core Xeons but a single Quadro M5000 card. I have suspected waits occurring on account of the number of times I get Not Responding in both M8 and M9. Or is that not related? The process always completes.


Aussie Nature Shots

rk
621 post(s)
#24-Apr-19 13:24

I remember that while M8 was Not Responding because it was busy importing some big file, sometimes other instances of M8 and M9 were also blocked. I have not used M8 lately.

tjhb
10,094 post(s)
#24-Apr-19 13:41

My guess (to Riivo) is that one instance had copied content to the Clipboard. In that case it seems all M8 instances insist on synchronizing their pointers, including with the non-responding instance.

adamw


10,447 post(s)
#24-Apr-19 15:12

"[The process is] not responding" happens when the wait is in the UI. We do 99% of what could possibly block for a long time in background threads, so "not responding" tends to happen when the UI is doing something benign, which is not supposed to take long, but that takes long because Windows is paging heavily. In my post above I was talking about different waits - those that happen in background threads which cooperate with each other to do big jobs. But we can and will try making cases of "not responding" rarer and shorter as well - by making better use of memory, for example.

tjhb
10,094 post(s)
#24-Apr-19 14:01

I know this is a trivial comment, but apart from being useful, and increasing everyone's sanity, adding runtime statistics like this will be fun. (For those able to muddle work and play--I hope that is all of us.)

adamw


10,447 post(s)
#20-Mar-19 09:04

MXB files throw away a lot of data that can be re-created; that's why 7-zipped (or compressed in any other, arbitrarily aggressive, way) MAP files will always have a hard time being smaller than MXB.

Dimitri


7,413 post(s)
#19-Mar-19 05:26

there is 0% GPU usage and only between 6% - 20% CPU usage.

Please post a screenshot of the Task Manager Performance tab.
