Home - General / All posts - Calculating Mean Curvature on very large data sets
 StanNWT140 post(s) #14-Mar-19 21:57 I'm wondering if this is just pushing things too far and it would be better to do it on a smaller DEM. I have a 30m DEM with dimensions X: 327,603 and Y: 115,217, data type float32. That makes a 37,745,434,851-pixel raster at 4 bytes per pixel. If I want to do a mean curvature with a radius of 3, is this just going to blow up my computer? On the rough assumption that each of the 49 pixels in a 7x7 matrix is read per output pixel, that's 1,849,526,307,699 pixels calculated, and at 4 bytes each that would be 7,398,105,230,796 bytes. Am I wrong in assuming it would be calculating 6.729 TB of data? Just for reference, the BigTIFF file of the DEM is around 78 GB, but the ArcGIS arc binary grid is 140.61 GB. My hardware: dual Xeon Gold 6128 (3.4 GHz, 6 cores/CPU), 128 GB RAM, and a 2TB Samsung Pro OS drive holding the page file and temp folders, with roughly 1.4 TB free (the 320 GB page file counts as used space, not free space). I also have dual Quadro P4000 cards. The storage drive holding the (map) file of the data has 12 TB free; it's a direct-attached storage (USB 3.0 RAID 6) box.
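A quick sanity check of the arithmetic above (a Python sketch; all figures are taken straight from the post):

```python
# Sanity check of the arithmetic in the post above: total pixels,
# raw data size, and the naive volume touched by a 7x7 (radius 3) window.
x, y = 327_603, 115_217              # DEM dimensions
pixels = x * y                       # total pixels in the raster
raw_bytes = pixels * 4               # float32 = 4 bytes per pixel
window = 7 * 7                       # radius 3 -> 7x7 neighbourhood
touched_bytes = pixels * window * 4  # bytes read if nothing were cached

print(f"{pixels:,}")                       # 37,745,434,851
print(f"{raw_bytes / 2**30:.2f} GiB")      # ~140.61 GiB (matches the arc grid size)
print(f"{touched_bytes / 2**40:.3f} TiB")  # ~6.729 TiB
```

Note the raw float32 size works out to exactly the 140.61 GB reported for the arc binary grid, and the 6.729 TB figure is the worst case where every window pixel is re-read from storage; in practice a tiled engine caches neighbours, so far less than that actually moves.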
 tjhb8,760 post(s) #15-Mar-19 03:13 You have great hardware! It could be improved by moving OS, software and page file to a smaller, cheaper, slower SSD, then using that fantastic Samsung 2TB SSD/NVMe drive for both TEMP and current working data (including this project), with the USB3 RAID6 left for static user data. But that is not the question. The project is easily doable with Manifold 9; the total amount of data is in a sense immaterial because of the tiled image model and the Radian storage model. The only question is how long it will take. Just for fun I would guess 25 minutes, but I have no idea if that is even within an order of magnitude. (I'm just not so well endowed.) Please let us know the actual code, and the actual time? Try assigning, say, 18 threads.
 StanNWT140 post(s) #15-Mar-19 04:29 Well, my storage will be moving to a QNAP TVS-1283T3, with 8 x 8 TB WD UltraStar HDs, 4 x 4TB Samsung 860 Pro SSDs, and 2 x 1 TB M.2 SATA Samsung 860 Pro cache drives, since the TVS-1283T3 doesn't use NVMe drives. I'll be connecting via Thunderbolt 3 to my workstation. The theoretical max speed of the TVS-1283T3, uncompressed, according to documentation and benchmarks, is 1600 MB/s. As far as code goes, I'm just trying the default transform for "curvature - mean". I was thinking maybe 5 hrs; it was at 1.5 hrs when I left work, but I have no idea if it will work. CPU utilization was fluctuating between 6-20% for Manifold on those 24 cores with default settings, RAM usage was 28 GB total system usage, and GPU usage was near zero. The dialogue box said it was inserting records at about 600/s @ 7 MB/s, though it did start at several thousand per second. My RAID box in ATTO disk bench does give me 250 MB/s write and 225 MB/s read. Disk IOPS is usually up to 1100 but today was only around 110. The RAID has a 512 GB Samsung 840 mSATA drive for its cache. I'll know tomorrow morning if it locked up, crashed, failed, completed or is still running. If you haven't guessed by now, I like testing big data; not that a DEM this size is all that big, but it's bigger than what most people have other than large LiDAR data sets.
 tjhb8,760 post(s) #15-Mar-19 06:35 "Well my storage will be moving to..." Possibly a waste of money. Just make sure your most important disk is as fast as possible. There is currently only one type of drive that meets that requirement, and it's neither external nor RAID. Don't be fooled by alphabet soup. As far as the current task is concerned, clearly something is wrong, so the more you can specify the task, the better we can all find what it is.
 StanNWT140 post(s) #15-Mar-19 14:44 The curvature mean transform finished and the raster layer was created successfully. It took 8839.753 seconds (~2 hrs 27 mins). Saving took 410.774 seconds (~6 mins 51 secs). The transform log entry is below:
2019-03-14 17:16:25 -- Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8839.753 sec)
I just set the radius to 3 when I ran it. BTW, I love that Manifold 9 can run 100% CPU utilization.
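For scale, the logged time works out to a steady per-pixel rate (a rough sketch; the pixel count comes from the opening post):

```python
# Effective throughput of the radius-3 mean curvature run reported above.
pixels = 327_603 * 115_217    # ~37.7 billion pixels (from the first post)
seconds = 8839.753            # transform time from the log
rate = pixels / seconds       # pixels processed per second

h, rem = divmod(int(seconds), 3600)
m, s = divmod(rem, 60)
print(f"{h}h {m}m {s}s")         # 2h 27m 19s
print(f"{rate / 1e6:.2f} Mpix/s")  # ~4.27 million output pixels per second
```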
 Dimitri5,452 post(s) #15-Mar-19 18:42 "GPU usage was near zero" If you are looking at Task Manager performance, don't forget to switch the GPU reporting mode to Compute, or Windows will not report the near-total use of GPU. See the speed demo video with 1280 cores to see how it's typical to get 90%+ saturation of GPU.
 StanNWT140 post(s) #18-Mar-19 19:34 I'm wondering: if the data set is at least an order of magnitude bigger, or just bigger by some factor, than the available VRAM on an Nvidia graphics card, would Manifold 9 choose to do more processing on the CPUs and via system RAM as opposed to the GPU? I'm running the curvature profile transform and there is 0% GPU usage and only between 6%-20% CPU usage. This data set is very large. I have previously sent tjhb a Dropbox download link for the big DEM; it's about 152 GB in Manifold. The transform will likely complete, just not maxing out CPU and GPU at the (inserting records / scanning data) stage so far.
 tjhb8,760 post(s) #18-Mar-19 20:55 I have previously sent tjhb a dropbox download link for the big DEM, it's about 152 GB in Manifold.I'm glad it's this data, cool. But the download never worked for me--we tried several times and gave up.If it's still there on Dropbox then I will try again.
 StanNWT140 post(s) #18-Mar-19 21:19 Yes it's still there. I suppose I could try to upload it again and send you a new link? But try the download of the current file first. Personally I'd like all of Canada and Alaska as a single 30 m DEM data set. It would be interesting whether a 7x7 or 9x9 would run on that. I've yet to run out of RAM so far. It's likely these are less memory intensive than that decompose-to-points exercise I ran over a year ago on the older workstation.
 tjhb8,760 post(s) #18-Mar-19 21:36 I need a new link please. I think I have misfiled the original.
 StanNWT140 post(s) #19-Mar-19 17:27 I've resent the same link Tim.
 tjhb8,760 post(s) #19-Mar-19 22:57 Thanks. I have the data downloaded and open. (By the way, it took well over an hour for the .mxb file to be opened, decompressed and saved in .map format. Maybe plain .map format, compressed with 7-Zip, would be better for distributing enormous files.)
 StanNWT140 post(s) #20-Mar-19 01:45 I myself was surprised to find that 7-Zip files were larger than the .mxb files for this and similar large map files that I've made. I haven't compared the unpacking time versus the unzipping time.
 tjhb8,760 post(s) #20-Mar-19 02:01 Seems worth a test... one day.I'll have fun repeating your curvature test on between one and three machines, with comparative timings, along with screenshots showing GPU activity.
 StanNWT140 post(s) #20-Mar-19 23:05 Any luck using that data at all? I'm not specifically talking about the mean curvature or profile curvature.
 tjhb8,760 post(s) #20-Mar-19 23:59 Agreed. But the important thing is to ensure, and prove, that GPGPU is working for tasks like this. On my machine(s) and yours. And if not, to follow it up carefully.
 StanNWT140 post(s) #21-Mar-19 00:19 You're right. But in any event, being able to preview 37.7 billion 32-bit float pixels with a 7x7 matrix in 2.5 hrs is impressive. And ArcGIS Pro doing 3x3 high pass filters is still single threaded. I watched it running, and there are likely a lot of still-single-threaded tools in ArcGIS Pro at this stage. And forget about using the python package manager in Pro to get CUDA working on many or most python packages.
 StanNWT140 post(s) #21-Mar-19 07:07 I meant to say process not preview 37.7 billion pixels.
 StanNWT140 post(s) #25-Mar-19 20:18 Any luck with trying to run the same operation?
 tjhb8,760 post(s) #26-Mar-19 06:02 Yes. Several tests, with full notes.At default settings, M9 saturates my fastest GPU.Like Dimitri, I suspect you have been measuring GPGPU incorrectly, but that is easily fixed.Details tomorrow.
 StanNWT140 post(s) #27-Mar-19 15:17 Hi Tim, I found the drop-down arrow to switch to "Compute_0" and "Compute_1"; since I have two Quadro P4000s, I assume that's why I have a "_0" and "_1". This morning I tried the "limit low" transform to fill in some gaps in that DEM, which you will notice when zoomed into some of the offshore areas where squared-off bits of the DEM, representing the original 1 degree x 1 degree tile edges, are still present. When I see the preview I see no compute activity, but in the first 10 seconds of the "add component" GPU usage went up to 90%; now I only get about 5% activity every 10 seconds at a completely regular interval. I've attached a screen grab of the GPU usage. I've also attached a screen grab of the DEM that shows the gap I was mentioning offshore, which also happens to show the progress dialog box.
 Dimitri5,452 post(s) #27-Mar-19 17:49 Limit Low is not Curvature, Mean. There's no computation in Limit Low... it's just moving bits between storage. Nothing for a GPU to do there. How does Curvature, Mean work for you and GPU?
 StanNWT140 post(s) #27-Mar-19 18:48 I'm going to reproduce the curvature mean and curvature profile runs to check, now that I have the compute_0 and compute_1 drop-downs set up. However, I did notice brief times when compute_0 was showing 95%+ usage for 30 seconds or so at the start and near the end of the transform processing. It did take about 2 hrs 15 minutes to do the "limit low" on that same DEM. Just an FYI, the log window reports:
2019-03-27 11:21:10 -- Transform (Limit Low): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8180.886 sec)
It seems that, given the same sized DEM, curvature mean, curvature profile and limit low all take 2 hrs 15 mins to 2 hrs 27 minutes.
 tjhb8,760 post(s) #27-Mar-19 20:30 There's no computation in Limit Low... it's just moving bits between storage. Nothing for a GPU to do there.The compute work is trivial, and most of the work is in priming the GPU with data, but it does look as if GPGPU is being used all the same, and from Stan's timings this seems right. (Right in the sense that the same task would take several times longer on CPU only.) I gather that CUDA PTX has built-in Max and Min functions for all supported data types, so a limit/clamp operation doesn't need a conditional.I'm posting my notes on using Curvature, Mean with Stan's data below.
 StanNWT140 post(s) #27-Mar-19 21:48 Hi Tim, Were you going to post times to process on your various computers, or is that still running? I'm going to re-run the mean curvature and profile curvature on that DEM and see if I can see GPU compute_0 running, as well as compute_1. I have a funny feeling that only compute_0 will show any processing; I'm hoping not, though, as I'd want to use both GPUs for processing. Note: I have my Nvidia Quadro settings to use both GPUs for "Use for graphics and compute needs", under "Manage GPU Utilization". It would be awesome if there was a Manifold setting under the Program Settings for Quadro 3D settings. I know Nvidia would charge a fortune to test and verify their drivers against Manifold 9 in general, let alone various builds. However, my default settings are to use CUDA on all GPUs. It is interesting to note that my PhysX settings are set to use the GPU that isn't connected to the displays, while the GPU doing the calculations in Manifold is the GPU that is connected to the displays. I could set PhysX to be on the CPU, but I highly doubt there is any relation between PhysX settings and how Manifold uses CUDA. Just an observation, and maybe something Tim can test as well since he has a Quadro card in a laptop.
 tjhb8,760 post(s) #27-Mar-19 20:50 "I found the drop down arrow to switch to "Compute_0" and "Compute_1", since I have two Quadro P4000s I assume that's why I have a "_0" and "_1"." That's it... with variations. My fastest GPU is the GTX TITAN (more on that below), installed as a single GPU. On that machine, I also have counters for both Compute_0 and Compute_1. Only Compute_0 shows any GPGPU activity (naturally enough). On the other hand, this laptop has a tiny Quadro K1100M GPU in conjunction with Intel Graphics 4600. Here the drop-down list of "engines" (as Microsoft mysteriously calls them) does not include Compute_0, but only Compute_1, which again shows no activity when GPGPU is in use. Instead, GPGPU is shown for the "engine" named 3D. This could be a hybrid graphics issue, or a Windows or driver bug, or who knows what. (For interest, the Intel graphics adapter, listed as GPU 0, has "engines" named 3D and Engine 5, but nothing beginning with Compute.) So we may have to "shop around" amongst the available "engines" named in Task Manager to monitor GPGPU activity correctly.
 StanNWT140 post(s) #27-Mar-19 21:50 I'd love to have two RTX Titans on a Threadripper 2950X with 128 GB RAM, but hey...Maybe you can try those variations of PhysX to see if there's a difference, see my post above. I know it's likely to have nothing to do with it, but who knows.
 dyalsjas102 post(s) #27-Mar-19 22:05 I could do this with a GTX 1080ti.I don't have any Titans.
 StanNWT140 post(s) #28-Mar-19 01:00 Hi Tim, I've been fairly well versed in the differences between single and double precision floating point on GeForce vs. Quadro vs. Tesla cards since I first used Manifold 8 in 2009. My biggest hope for RTX Titan cards is the 24 GB of RAM on them, but given Tim's observation of how little RAM Manifold appears to be using, it might not make a difference. One thing the Quadro P4000 has going for it is that each card is single slot, allowing more flexibility in how and where you install other PCI-Express cards in your computer depending on the motherboard layout. Plus each card only uses 105W, which is about the lowest wattage per CUDA core per slot I think you can get. So for workstation-tested driver optimization for things like ArcGIS, Adobe and many other apps it is, I think, the best card, though maybe not for the money, as an RTX 2080 would be in the same price range but takes 2-3 slots and uses far more power.
 adamw8,579 post(s) #01-Apr-19 15:48 (Finally found the time to read through the thread, it's a great one.)Regarding this:(We can switch to use 32-bit mode by a PRAGMA directive, although in initial testing this currently seems to disable GPGPU entirely. A question for another day.)We should use GPGPU for both 64-bit and 32-bit floating-point math.Is the above suspicion that GPU does not seem to get engaged for 32-bit math based on observations of GPU activity? The activity of GPU with 32-bit math could easily be lower than with 64-bit - exactly because GPU can do way more 32-bit math than 64-bit, so if computation times are dominated by transporting the data around and 64-bit math does not max out the GPU, then 32-bit math won't max it either and will load it noticeably less in relative terms.(In any case, I just ran a test transform in both modes and GPGPU seems to get engaged in both cases, with the activity being lower for the 32-bit math. So, on first sight, this seems to work as expected.)
 StanNWT140 post(s) #01-Apr-19 16:05 Hi Adam, I'm not sure; I've not tried the PRAGMA statement to force single precision on the GPU. However, it seems that I might be the only one of the respondents to this thread that is using two GPUs, let alone two Quadro cards, let alone a dual Xeon setup (Gold 6128). Think there's anything in a system setup like mine that is interfering with the GPU getting maxed out for the entire run? My run time is 2.5 hrs, Tim's is 40 minutes; one would hope a more souped-up system would get the work done comparably fast if not faster, or am I just stewing in my own frustration? Sorry for the food puns.
 adamw8,579 post(s) #01-Apr-19 17:40 Tim's 40 minutes are on the test where he uses his card's rare ability to use a humongous numbers of fp64 cores, we should compare to his other test that uses a more normal number of fp64 cores and completes in 51 minutes. Now, why on his system the test completes in 51 minutes whereas on your system it completes in 2.5 hours is a little puzzling, yes.I doubt it's related to your system having two cards - or, more precisely, it might be related to that, but if it is related to the system having two cards, that's almost certainly on the driver (so, might perhaps be helped by using a later version sometime when they notice the issue).I think the difference is related to the performance of the disk subsystem. You report that after you upgraded it, saves got a significant speed up, but the transform got a much smaller speed up, that's likely related to different access patterns. It might help to run the transform on both your and Tim's systems without GPU, just on CPU, to see how the numbers will compare. I think I said that we might have a couple of things planned for the future which will make better use of system configurations like yours, too - with access patterns straightened to be more like those during save.
 StanNWT140 post(s) #01-Apr-19 18:15 The disk subsystem wasn't upgraded. I simply switched where the (map) files were stored for the test. Normally (map) files are stored on a USB 3.0 Drobo 5D, economical dual-redundancy storage that gets me 250 MB/s write and 225 MB/s read. The other drive is a Samsung 960 Pro 2 TB that, when initially set up, was getting 3000 MB/s read and 2500 MB/s write. Obviously the disk IO is vastly different as well: a max of 1600 IOPS for the Drobo and possibly 90,000+ IOPS for the 960 Pro. I was also getting actual write speeds of 250 MB/s max on the Drobo in Task Manager when saving the file while it was located on the Drobo, and upwards of 2 GB/s when the (map) file was located and saved on the Samsung 960 Pro. Obviously NVMe drives are preferable, but there's no RAID option for me there, and I do prefer fault tolerance for GIS and remote sensing data.
 tjhb8,760 post(s) #28-Mar-19 00:42 Correction: Graphics RAM usage was constant not at 6 GB (that is, all of it) but at 0.6 GB. I misread the graph. Why is so little GPU memory used? Buh. Well, if all CUDA cores were saturated (if this is roughly what Compute_0 98% means) then Manifold clearly knows what it is doing. BTW notice the somewhat unhelpful reading near the bottom of the Task Manager window showing "Utilisation 1%", although Compute_0 is at 98%. Evidently "Utilisation" does not include compute. Misleading by default.
 StanNWT140 post(s) #28-Mar-19 01:12 Hi Tim, I wonder whether having the DEM's data type as float32 performs better than if it were float64, given the performance penalty of 64-bit vs. 32-bit on Nvidia cards, depending on Quadro, GeForce, Tesla or GP/GV editions of those cards. I'd love a Titan V due to its better double precision performance, but it has half the RAM of the Titan RTX. I think the showstopper compared to your runs is the storage medium I'm reading the data from. When it gets moved over to the QNAP RAID it will improve, I'm sure. Disk IO on my current storage usually tops out at 1100 IOPS with 225 MB/s reads and 250 MB/s writes using ATTO disk bench. When reading or writing with Manifold it's usually about 117 IOPS; saving is usually around 10 MB/s, but I've seen 270 MB/s while the transform dialog showed the progress of whatever I'm running. My new storage should be 8x faster for reads and writes; not sure about the disk IO levels. I really appreciate your help testing this Tim. If nothing else you have a very large DEM to test things with.
 tjhb8,760 post(s) #28-Mar-19 02:15 The current test suggests that for GPGPU with Manifold 9, the amount of on-board graphics RAM is of little importance.I see you've already covered that above.Thinking about it a bit more, it's possible that the amount of memory used is not a Manifold choice, but a CUDA driver choice, under the Unified Memory model introduced from Kepler onwards, which migrates data between system and graphics RAM on demand. Manifold might still do it the hard way when it really matters, but I bet they would rely on the built-in memory management when it just works.The evidence that matters is the constant GPGPU saturation. In this test the GPU is getting all the data it can possibly eat.
 StanNWT140 post(s) #28-Mar-19 18:29 Hi Tim, I've already got the QNAP box. I've had it for many months; my normal everyday workload just hasn't allowed me to spend the time transferring 10 TB of data from my two current external RAIDs to the new Thunderbolt 3 connected QNAP box. I'll copy that DEM project file (.map) onto the Samsung 960 Pro 2TB drive; I've got over 1TB free space on it. I'll try to run the curvature mean and curvature profile on it from there and monitor the compute_0 and compute_1 charts to see if they max out, or are at least above marginal usage, and whether either sustains that throughout the transform's progress.
 tjhb8,760 post(s) #28-Mar-19 20:22 Thanks, will keep watching.You could do a manual TRIM before the tests, just to make sure the SSD doesn't do one automatically (slowing processing to a crawl) partway through.
 StanNWT140 post(s) #29-Mar-19 16:05 Morning Tim, I'm running the mean curvature on the big DEM project file. I chose to open a blank Manifold project, using 9.0.168.10, and added the big DEM project file as a data source, not copying it into the new project, just processing it so that the results land in the new project to keep the new project file smaller. The attached screen grab is a mosaic of several separate screen grabs: the read speed from my Drobo 5D RAID as I copy to the Samsung 960 Pro 2TB; the write speed of the Samsung 960 Pro 2TB; the GPU load at the start of the transform processing, with a temporally coincident screen grab of the Manifold window showing the records-per-second read rate and MB/s; a screen grab of the end of the heavy GPU load with a temporally coincident screen grab of the Manifold project window; and finally a temporally coincident drive read/write rate and CPU usage percentage. You can see that the GPU only sustains a load for a minute, then goes to the very spiky saw-toothed utilization illustrated in the screen grab of the heavy GPU utilization. I was glad to see it max out but disappointed that it didn't sustain it. I also attached a screen grab of the Nvidia control panel illustrating that I don't have an option for double precision like you do with your GeForce drivers and your Titan card. I'm assuming most GeForce drivers, at least for Titan cards, would show this if there's double precision compute performance?
 StanNWT140 post(s) #29-Mar-19 16:24 Attached is the screen grab of the Manage GPU utilization section under the workstation settings of the Nvidia Control Panel as well as the real time Nvidia GPU utilization graph overlain on the task manager. You can see that the spikes are coincident.
 StanNWT140 post(s) #29-Mar-19 18:24 OK, it finished, and I've attached the screen grabs taken during processing illustrating the CPU usage, GPU usage, drive performance, the time to create the curvature mean with a radius of 3, and the time to save the file. You can see that near the end the CPU and GPU usage maxed out, but during the midsection of processing the small spikes in GPU performance were reflected in the CPU performance and the read/write speed. Fortunately, the Nvidia GPU monitoring graph that can be activated from the Nvidia Control Panel does seem to be a better indicator of GPU performance than the built-in Windows tracking in the Task Manager's "Compute_0" panel option.
Render: [Map] (1.264 sec)
2019-03-29 11:40:46 -- Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM]::[USGS_NED_Fused_DEM] (8003.733 sec)
Save: C:\Manifold_Temp_Data\USGS_NED_Fused_1_and_2_arc_second_DEM_Mean_Curvature_Profile_Mar29_2019.map (252.091 sec)
My previous save took 410 sec with just a curvature mean data set in the Manifold project with a similar data source for the original DEM. The save is about 1.6x faster. I monitored the write speeds on the 960 Pro 2TB, ranging between a high of 1.9 GB/s and a low of 60 MB/s.
The previous curvature mean took:
Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8839.753 sec)
As a point of reference, the curvature profile with radius 5 took:
Transform (Curvature, Profile): [USGS_NED_Fused_1_and_2_arc_second_DEM_May30_2018]::[USGS_NED_Fused_DEM] (8667.657 sec)
However, once I had both a curvature mean and a curvature profile surface inside the map document, the save took:
USGS_NED_DEM_Curvature_Mean_and_Curvature_Profile_Mar19_2019.map (6364.526 sec)
Those save and processing times were on Drobo 5Ds, not the fastest USB 3.0 RAIDs but all I could afford to provide simple dual redundancy and capacity. Now that I have a QNAP RAID solution I'll have a much higher performing system. However, I'm dismayed that the computational time is only 10% faster on the Samsung 960 Pro 2TB vs a Drobo 5D.
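Putting the logged times from the two runs side by side, the save benefits far more from the NVMe drive than the transform does (a small sketch using the figures above):

```python
# Logged times: Drobo 5D (USB 3.0 RAID) vs Samsung 960 Pro NVMe.
drobo_transform, nvme_transform = 8839.753, 8003.733  # Curvature, Mean (sec)
drobo_save, nvme_save = 410.774, 252.091              # project save (sec)

print(f"transform speedup: {drobo_transform / nvme_transform:.2f}x")  # ~1.10x
print(f"save speedup:      {drobo_save / nvme_save:.2f}x")            # ~1.63x
# The save is mostly streaming writes and scales with the disk; the
# transform's access pattern barely changes, which fits adamw's comments
# in this thread about access patterns.
```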
 adamw8,579 post(s) #01-Apr-19 16:48 The image to this and a couple of other posts not loading seems to be a problem with the forum code (names are too long). We'll get that fixed.Regarding the performance on different hard drives and the saves getting accelerated much better than the transform - this has to deal with different access patterns. We have a couple of wishlist items that will have a positive effect here - an option to use more virtual memory first and foremost.
 Dimitri5,452 post(s) #29-Mar-19 18:22 I added the big DEM project file as a data source. Not copying it into the new project,Why leave the data in a slow format if you want to measure and report performance, or keep GPU fed with data? DEM is not as fast a format as MAP. A key benefit of using MAP is to be able to get at data as quickly as possible without the bottlenecks of slower formats. See the discussion in the Importing and Linking topic.
 StanNWT140 post(s) #29-Mar-19 18:29 The original data source in this context is an existing Manifold project file, the one I also sent to Tim, so there shouldn't be a performance penalty. That's the great thing you, Adam and Tim have professed for ages about the (map) file: just use existing (map) files as sources for other projects, no need to import.
 StanNWT140 post(s) #29-Mar-19 22:08 One thing I realized after I ran the test this morning is that in the past I was running a radius of 2 for mean curvature. This morning I ran it with a radius of 3. So I reran the test with a radius of 2 and got:Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM]::[USGS_NED_Fused_DEM] (8042.007 sec)This is the time with a Radius of 3:Transform (Curvature, Mean): [USGS_NED_Fused_1_and_2_arc_second_DEM]::[USGS_NED_Fused_DEM] (8003.733 sec)The fact that a much more complex mean curvature computation got pretty much equivalent results is interesting.I have two GPUs but not in SLI. None of the applications I use would benefit from SLI.I can look at CUDA-Z.I do have GPU-Z and the CUDA specs state a 1:32 ratio for SP to DP ratio. Processor count of 14, cores per Processor of 128.Perhaps the size of the photos is the problem?
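A quick sketch of why radius 2 and radius 3 can cost nearly the same: the 7x7 window does roughly twice the arithmetic of the 5x5, yet the logged runtimes differ by under half a percent, which points at data movement rather than math as the bottleneck (times taken from the post above):

```python
def window_pixels(radius):
    """Pixels in the square neighbourhood used for a given radius."""
    side = 2 * radius + 1
    return side * side

r2, r3 = window_pixels(2), window_pixels(3)
print(r2, r3)             # 25 vs 49 pixels per window
print(f"{r3 / r2:.2f}x")  # ~1.96x more arithmetic per output pixel

t_r2, t_r3 = 8042.007, 8003.733          # logged times (sec) for radius 2 and 3
print(f"{abs(t_r2 - t_r3) / t_r2:.2%}")  # under 0.5% difference in runtime
```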
 tjhb8,760 post(s) #29-Mar-19 22:41 "The fact that a much more complex mean curvature computation got pretty much equivalent results is interesting." It's not really much more complex to run the calculation with a radius of 3 rather than 2. Once the data is on the GPU, everything is almost equally easy. It's getting the data to and fro, efficiently, that takes the time. Please can you try testing with a single project? I.e. executing (or at least beginning) the transform in the same project that contains the data (without using a child project). Let's rule out whether that makes a difference. I can read your images fine now, thanks! It was probably the long(er) filenames. I wonder what would happen if you disabled (CUDA on) one of your GPUs. Maybe the NVIDIA CUDA driver is trying too hard to share the load (Unified Memory), and this is wasting transport and synchronisation overhead. Just speculation at this stage. [Added.] You can tell Manifold to use only one GPU using the 'gpgpu.device' directive. To do this you would set up the curvature transform, but press the Edit Query button rather than the Add Component button. Then edit the query text to insert
PRAGMA ('gpgpu.device' = '0'); -- or '1'
somewhere near the top. Then run it.
 StanNWT140 post(s) #29-Mar-19 23:01 Will using that PRAGMA statement override the settings for each GPU in the Nvidia Control Panel? In that control panel I can set "graphics only" or "graphics and compute"; currently both are set to graphics and compute. However, I could set the card that doesn't have my two 32" 4K monitors attached as the graphics and compute card, then in Manifold set device 1, the second GPU without monitors attached, as the one for compute? Should that work better, so that Manifold isn't fighting against the driver configuration from the control panel? I will try running mean curvature from just the main project file that has the DEM. I am splitting the results into different files, since having everything in one file makes the (map) file really huge, and creating .mxb files takes extra long when everything is in one file. I'm trying to save disk space in the long run.
 tjhb8,760 post(s) #29-Mar-19 23:28 (1) To your first para I can mainly answer in three words: I don't know. I have never (yet) tried using a system with two CUDA-capable cards attached. (Actually I think I did do some tests with Manifold 8, but not with 9.) And I never use two monitors!But all of that detail seems material. You have two huge monitors attached (which of course consume heaps of graphics RAM), both attached to one GPU, and no monitor attached to the other GPU, making it "headless"--which BTW used to prevent GPGPU work on the GPU in question; we might need to check whether this still creates driver problems for CUDA.With that setup, why not set the first GPU to graphics only, the second to compute only? At least for testing.Now, in that case, would you also need to tell Manifold to use only the second GPU for GPGPU using the PRAGMA directive? It seems at least plausible, possibly likely.I feel like we're starting to get somewhere. Simplifying the GPU setup first seems like the best approach.(2) On your second para, what you say about storage makes perfect sense for the long run. But for testing I think it's important to test the simpler arrangement (a single project). As I've said, you can just cancel when you've seen all you need to see (especially whether performance still drops off a cliff after some minutes or seconds). A question: do both GPUs go to ~100% on Compute_0 at the start of processing? How long do they stay there? I might not be reading your screenshots closely enough.
 StanNWT140 post(s) #29-Mar-19 23:36 The Nvidia control panel doesn't provide me with an option for compute only, just graphics and compute.The only GPU compute that has any response in the Windows task manager is compute_0, however, the Nvidia GPU utilization shows both GPUs having activity, usually both are roughly the same usage. This strange variation in reporting is something to consider.
 StanNWT140 post(s) #30-Mar-19 00:39 I set card 1 as graphics only, since card 2 has the attached monitors. Card 1 is likely compute_0 and card 2 compute_1. I used your pragma statement
PRAGMA ('gpgpu.device' = '0');
to set the compute device to "0" as you suggested. I did see a burst of activity, then it dropped to the saw-tooth utilization as before. In the Nvidia GPU utilization monitor the GPUs have distinct graphs; since they're not both trying to do graphics, this makes sense. I did try CUDA-Z. I got 6000 GFlop/s for single precision and 191 GFlop/s for double precision out of both cards, even when I picked heavy time load. I noticed in the Windows Task Manager that when card 1 was set for compute and graphics, I saw the GPU utilization under compute_0 but not compute_1. When card 2, the one with the monitors attached, was the only card set for graphics and compute, it showed no GPU usage in the Windows Task Manager while CUDA-Z was running. GPU utilization in Manifold for compute_0 fluctuated between 90% and 60% for 1 min 6 seconds, then dropped down to the saw-toothed pattern as before. This time, in the Nvidia GPU utilization monitor, only compute_0, which is card 1 (no monitors), shows the spike in GPU usage, unlike before when both were tasked with graphics and compute; however, after 3 min the sawtooth pattern, maybe 2 percent at the lowest and spiking at 30%, shows up.
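Those CUDA-Z figures line up with the roughly 1:32 FP64 ratio GPU-Z reports for the P4000 (a quick check on the numbers quoted above):

```python
# CUDA-Z throughput reported above, in GFlop/s.
single, double = 6000.0, 191.0
ratio = single / double
print(f"SP:DP ratio ~ {ratio:.1f}:1")  # ~31.4:1, i.e. roughly the 1:32 FP64 rate
```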
 tjhb 8,760 post(s) #30-Mar-19 00:55 CUDA-Z CUDA performance looks great. Wikipedia gives 5300 and 165.6 GFlops for the P4000--your cards are well ahead of that.

"Card 1 is likely compute_0 and card 2 compute_1"

I'm not sure about that. As I read it, both GPU 0 and GPU 1 should have Compute_0 readings*, both of which are relevant; and both should also have Compute_1 readings, neither of which is relevant (always blank). The two separate Compute_0 graphs should correspond, roughly, with the two NVIDIA GPU utilization graphs. [*When they are both enabled for compute.]

Back to the substance: is your conclusion that performance is no better when only one GPU is allowed to do compute work? Compute_0 performance still falls off a cliff after about a minute? Previously for both GPUs, if I am right about the graphs--can you check this, with CUDA enabled on both GPUs again?--now for only one.

If we are lucky, someone else with dual CUDA GPUs will be able to do a comparative test, either with your massive test project (if you can share it again) or with something else. (Or dual GPUs might be completely irrelevant. The blockage could be something else. But in any case, it is pretty striking.)
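Those CUDA-Z numbers fit Pascal's design: on GP104-based cards like the P4000, FP64 runs at 1/32 of the FP32 rate, so ~6000 single-precision GFLOP/s should yield roughly 188 double-precision GFLOP/s, close to the 191 observed. A quick sanity check (the 1:32 ratio is a Pascal architecture fact; 5304 GFLOP/s is the commonly published P4000 single-precision peak, treat it as approximate):

```python
# Sanity check on the CUDA-Z figures quoted above, using the fact that
# Pascal Quadro/GeForce parts execute FP64 at 1/32 the FP32 rate.
sp_measured = 6000.0   # GFLOP/s, single precision (CUDA-Z, from the post above)
dp_measured = 191.0    # GFLOP/s, double precision (CUDA-Z)
sp_spec = 5304.0       # GFLOP/s, published P4000 single-precision peak (approx.)

print(f"measured DP/SP ratio: 1/{sp_measured / dp_measured:.1f}")  # ~1/31.4
print(f"DP implied by measured SP: {sp_measured / 32:.1f} GFLOP/s")  # ~187.5
print(f"spec DP at 1:32: {sp_spec / 32:.1f} GFLOP/s")  # ~165.8, near Wikipedia's 165.6
```

So the measured ratio (about 1:31.4) is almost exactly the architectural 1:32, which suggests both cards are performing as designed.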
 adamw 8,579 post(s) #01-Apr-19 17:03 Regarding setting cards to graphics / graphics and compute in the NVIDIA control panel applet and then using PRAGMA ('gpgpu.device' ...):

The number of cards and their modes are somewhat removed from what the pragma manages. What happens is this: you tell the control panel which cards to use and how, and this info then gets to the graphics driver. The driver looks at the options you set and decides how many CUDA devices it is going to expose. It can expose a single device covering all cards if it wants to, or it can expose multiple devices even if there's only a single card in the system; it's all up to the driver.

When you launch a Manifold query with the PRAGMA for which device to use, that's a CUDA device as exposed by the driver, not a card. You can check how many CUDA devices the driver decided to expose using SystemGpgpuCount() / SystemGpgpus(), see the query builder.

Finally, if you try to make the query run on a CUDA device that does not exist, the query engine will just assume that you made a mistake and ignore the request. As in, if you only have a single device and try PRAGMA ('gpgpu.device' = 18), the query engine will think "well, that's an invalid device number, we don't have that many devices" and use the device that you have.
 StanNWT 140 post(s) #01-Apr-19 18:08 Hi Adam, I can confirm that the number of CUDA devices shown is indeed only 1, now that I have set the Nvidia control panel so that only the card without monitors attached does compute and graphics; the card with the monitors attached is graphics only. So Manifold only sees one GPU now.
 Dimitri 5,452 post(s) #02-Apr-19 05:55 "now that I have set in the Nvidia control panel, to only use the card that doesn't have monitors attached to be the one for compute and graphics"

Why turn off resources artificially and make the situation a) slower and b) more difficult to understand? To repeat:

"And trust the optimizers: don't turn stuff on/off to try to game the system or give it hints. Turn everything on and let it use everything as it sees fit. :-)"

You have expensive and powerful cards: allow Manifold to use them. It will do a better job than humans can, so turning off some cards won't improve performance, but it can lead to confusion.
 tjhb 8,760 post(s) #02-Apr-19 06:13 No one is trying to "game the system". The current result is unexplained and strange. Something is wrong; we don't know what. The natural objective is differential diagnosis.

In light of Adam's helpful insights above (regarding how CUDA devices are really enumerated), it does seem better to turn both GPUs back on (then count how many Manifold sees...), and for both me and Stan to time the transform using CPU only. That might well help. I will also test tomorrow on my smaller Pascal GPU.

Any better ideas? I am full of ideas, but few are good. One more: try installing the NVIDIA Quadro driver that installs with CUDA Toolkit version 10.1. Another: rearrange the two monitors. Plug one into each card--unless there is a good reason why not.
 Dimitri 5,452 post(s) #02-Apr-19 07:21 "The current result is unexplained and strange. Something is wrong,"

What is unexplained and strange about the current result? What is "wrong"?
 tjhb 8,760 post(s) #02-Apr-19 07:38 Read the thread--in particular the results. Or, if you are feeling short of time, Adam's summary here. (Easy to miss, actually, because the thread is heavily spliced.)
 StanNWT140 post(s) #02-Apr-19 15:55 I only have whatever is installed with the 419.17 drivers, I've not installed any version of the CUDA Toolkit separately.
 StanNWT140 post(s) #02-Apr-19 23:38 Will installing the CUDA toolkit be of any use vs. the Nvidia Quadro drivers, if so how?
 tjhb 8,760 post(s) #02-Apr-19 06:35 What else would I try, if this were my system (and knowing no more than I do)?

First, check where TEMP is. Move it to drive C: if not already there.

Second, uninstall any and all Google apps, especially Chrome. Uninstall or disable Microsoft OneDrive, and any other live backup or mirroring software. Disconnect from the Internet. Remove all USB drives.

Third, remove any custom antivirus or antimalware software, and revert to Windows Defender. Reboot.

Finally, perform TRIM, then shrink the SSD to ensure 10-20% overprovisioning, then TRIM again.
 tjhb8,760 post(s) #02-Apr-19 07:08 [Stan has already said his TEMP is on C:.]
 StanNWT 140 post(s) #02-Apr-19 19:07 Hi Tim, Generally I've always wanted to have pagefile.sys, \Temp, \Windows\Temp and \Windows\Users\**\Temp all remapped to a separate drive; in the case of a workstation using a 4-6 drive RAID 10, it's not as important, due to fault tolerance. My reasoning is that I like to reduce the continuous wear and tear on the boot drive, in an effort to extend its life and avoid having to restore the operating system from backup onto a new drive, or, worst case, re-install everything from scratch.

My workstation has full/incremental backups of the drive, so that if there is a detrimental OS patch, software conflict, or complete corruption, I can revert to an earlier state. Obviously full/incremental backups apply to my main GIS drive as well.

However, because of the massive performance boost I get from the Samsung 960 Pro 2 TB drive, I don't put those on a separate NVMe drive. The standard NVMe drives that come with the Dell Precision 7820 are Toshiba drives with reads of 1400 MB/s and writes of 300 MB/s--nowhere near the Samsung (960 Pro, 970 Pro, 970 or 970 Plus) drives. I bought the extra drive and installed it in the extra flexbay for NVMe drives. I do have the 1 TB Toshiba NVMe drive, but its performance is pretty lame.
 StanNWT 140 post(s) #02-Apr-19 15:53 I am in an organisation that doesn't allow me to turn off my anti-virus / security suite. Also there are other corporate tools running in the background. They take up hardly any CPU and no GPU performance.

One thing that does happen, and this is a known Windows issue since 7, is that Windows Explorer (the file manager, not to be confused with the Windows 3.0 File Manager that you can now download and use in Windows 10 for some nostalgia!) will occasionally max out the CPUs, but it only lasts for seconds. Usually this is an issue with 'Carbon Black', which used to be called 'Bit9'. I can't just disconnect from the corporate network. My GIS data storage for the moment is all USB 3.0, so I can't disconnect that, and Acronis TrueImage needs to do the scheduled backups. The backups were not running nor using any CPU during the tests.

I'm not going to go into massive detail about every nuance of my setup and workflow, since it's a work environment and I have to have some level of discretion, especially in an open forum. OneDrive isn't installed; perhaps it will be at some point. If it gets installed I won't have any control over it and it will likely sync as it chooses, though I might be able to restrict syncing to certain folders that aren't GIS data locations. I use Chrome as my web browser 99% of the time and I'm not getting rid of it. Sorry to be so blunt. I'm incredibly grateful for the help Tim, Adam, Dimitri, but I'm not able to go to a clean install of Windows 10, or a practically clean install, for the testing.
 StanNWT 140 post(s) #02-Apr-19 16:06 As stated here, there is only one compute_0 reference: whether you show the graph on both cards or not, and whether both cards are set to use compute or not, only compute_0 shows any usage. It's just that on the comparison runs I've done, when only one GPU--the one without monitors attached--is enabled for compute, it seems to have a longer maxed-out GPU time, and the sawtooth seems to have higher peaks and higher minimums than when both are enabled. I've not tried the run where only the card with monitors attached does compute and graphics, with the other card graphics only even though it has nothing attached to it. That would be a good test to run, I think.
 Dimitri 5,452 post(s) #30-Mar-19 06:56 OK. Thanks for clarifying. My mistake: I read it that you had linked in the DEM, not linked in a MAP that contained an import of the DEM. It's true there should not be a performance hit.

By the way, if you find yourself scratching your head to analyze results, you're not alone: unfortunately it is very difficult to guess what the innards are doing many levels deep. GPGPU involves so many intricate issues on so many levels, involving so many different packages and devices, and a mix of them which changes depending on what specifically is going on, that it is hard to get beyond the usual broad generalities: use fast disk to avoid being disk bound, have plenty of CPU cores so those can work in auxiliary or primary roles as the optimizers determine, have lots of main memory so the many moving parts that use main memory (be they Windows or Manifold or libraries) don't run into limits, and, generally, that a higher-end GPU card will provide better performance (but for most people in most applications there should not be pressure to over-spend). And trust the optimizers: don't turn stuff on/off to try to game the system or give it hints. Turn everything on and let it use everything as it sees fit. :-)

For all that, we can learn useful info about rigs for bigger tasks with trials like these, and it is wonderful fun to try big data with cool GPGPU configurations. I love it, and thanks for spending the time to set up experiments like this! I also agree that when you see total GPU saturation of a very mighty GPU card like a TITAN, that is truly amazing. That's not something you'd see in years past with 8. A big part of that, I'd guess, besides 9 being what it is, is having a fast SSD data store and enough CPU cores for the system to use in parallel to support GPGPU.
 adamw 8,579 post(s) #01-Apr-19 16:23 Regarding transforms not needing much memory on GPU:

It's true that we currently don't need much memory on GPU. This happens because all of our current GPU-enabled functions allow us to split the work and then reuse memory for different chunks of data very efficiently. As we add more analysis, other functions might not be so lucky, and might need to, say, pre-load a big part of the analysed raster onto GPU - and begrudgingly go to CPU if GPU does not have enough memory.

Regarding switching to 32-bit floating-point math moving the work from GPU to CPU:

This does tend to happen because of two things: (a) the raster itself being not 32-bit floating-point (whatever its type, just that it isn't FLOAT32), and / or (b) one or more of the operations in the optimized expression being not GPU-enabled (this then frequently converts 32-bit floating-point values coming out of GPU to 64-bit floating-point in order to do the non-GPU-enabled operation, then possibly converts the result back to 32-bit floating-point if there's more GPU computations to perform).

If the raster is 32-bit floating-point and all operations are GPU-enabled, then switching to 32-bit floating-point math should have no conversions and should result in better performance, although the performance gains will vary and will frequently not just automatically be 2x or 3x or whatever.

In general, we treat 32-bit floating-point math as a special case for when the user tries to squeeze the absolute max performance possible and is controlling everything trying to get it. The general case is 64-bit floating-point math.
 StanNWT 140 post(s) #01-Apr-19 18:05 Hi Adam, I'm running the standard test, without modifying the mean curvature transform, on 9.0.168.11. I'm attaching a mosaic of all the Task Manager charts with the corresponding Manifold windows, where it shows the time index and MB/s or KB/s. Labeling for each group is provided. I also have the GPU-Z composite screen grab in there. The JPG is 7239 x 2813 pixels.

It's very bizarre that when the CPUs and GPUs are maxed out I'm only getting ~600 KB/s scanning records, but when it drops to lower levels, that's when I get into the MB/s. Note I've moved the data back to the Drobo; however, the reads/s haven't changed much in the initial part of this run. Note, my drive, the Samsung 960 Pro 2 TB, likely has OPAL 2.0 encryption turned on, something the IT department sets up without option. However, when I first got the workstation, which would have been after IT configured it, my ATTO disk bench was getting 3 GB/s read and 2.5 GB/s write. Not sure of the current performance. If the times are the same regardless of where the data is stored and accessed, then there's a larger issue afoot. Note, I have a slew of applications running in the background that I cannot turn off, IT department things I have no control over.

I appreciate how hard Tim has worked on benching his computer with the same data. I'm sure he enjoys the challenge. I'm sure I could give you the Dropbox link, Adam, if you wanted to play with the same data set and perhaps get it into the hands of the development team so they can play with that large of a data set--not that they haven't already played with very large data sets.

When I looked up the likely double-precision flops of my two Xeon Gold 6128 CPUs, they seem to have more flops than the two Quadro cards combined, using the AVX-512 metric--that being a DP flops to DP flops comparison.

Attachments: Mean_Curvature_Performance_Test_Apr1_2019.jpg
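That AVX-512 observation checks out on paper. A hedged back-of-envelope (the per-core figures are standard published Skylake-SP numbers, not anything measured in this thread, and real sustained AVX-512 clocks are lower than base):

```python
# Rough theoretical peak DP throughput: 2x Xeon Gold 6128 vs 2x Quadro P4000.
# Assumptions: each Gold 6128 core has two AVX-512 FMA units, i.e.
# 32 DP FLOPs/cycle/core, and (optimistically) sustains the 3.4 GHz base clock.
sockets, cores_per_socket = 2, 6
base_ghz, dp_flops_per_cycle = 3.4, 32
cpu_peak = sockets * cores_per_socket * base_ghz * dp_flops_per_cycle
gpu_peak = 2 * 165.6   # two P4000s at the published ~165.6 GFLOP/s FP64 each
print(f"CPU peak ~{cpu_peak:.0f} GFLOP/s DP vs GPU peak ~{gpu_peak:.0f} GFLOP/s DP")
```

On those assumptions the two CPUs peak around 1300 DP GFLOP/s versus roughly 330 for the pair of P4000s, which is consistent with Stan's point that for 64-bit math these Quadros are not the strong part of the machine.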
 StanNWT140 post(s) #01-Apr-19 19:49 It just finished after 2 hrs 35 minutes. There's little difference in the computation time when the data is on the Samsung 960 Pro 2 TB vs. the Drobo 5D. Save times will be vastly different of course. I have a new set of task manager and manifold screen grabs I'm compiling.
 StanNWT140 post(s) #01-Apr-19 20:38 Here's the last page of test screen grabs as promised. 9663 x 2438 pixels.
 StanNWT140 post(s) #01-Apr-19 21:57 Can I delete the attachment? The Page2 attachment?
 adamw8,579 post(s) #02-Apr-19 11:12 I deleted it.I'll think about what we see on the first screen, it's very useful.
 StanNWT 140 post(s) #02-Apr-19 15:30 Thanks. Same graphic, no names or directory paths. The "*Test_Apr1_2019" is fine.

To me it seems that, given the same transform and the same radius, setting the graphics card that doesn't have any monitors attached to be the only compute card has given it a boost, as reflected in the '*Page2a_Apr1_2019' image.

"If" Windows was always scheduling the card without monitors attached as the predominant compute device, i.e. 'compute_0', and the card that does have monitors attached as 'compute_1', then it makes sense that I never saw any GPU processing on 'compute_1'. However, if 'compute_0' and 'compute_1' are simply compute functions on all GPUs, then that's different.

The settings for the driver are, as you say, passed on to Windows and then to Manifold, I assume, and Manifold currently only shows one CUDA device. Before it showed two. However, I'm getting more and higher GPU compute graph results in the Task Manager now that there's only one CUDA device. This could mean perhaps two things:

1) The graph is an average of the two cards, each card having 'compute_0' and 'compute_1' components; when, perhaps, Manifold is only using one card, the graph averaged over two appears lower.

2) Setting only one CUDA device--the card without monitors attached, a card that has nothing to do but compute tasks--actually produces higher results.
 adamw 8,579 post(s) #02-Apr-19 16:38 What we mostly see in our experience is that, although that's not guaranteed to happen, each physical card tends to get represented as a separate CUDA device, and only as a single device. However, in terms of what gets shown as 'compute_0' and 'compute_1' in Task Manager, I believe we saw cases where even with two separate CUDA devices, 'compute_1' was staying blank and everything was going into 'compute_0'. For one thing, 'compute_1' absolutely shows as a choice on systems which have a single card and a single CUDA device, so the presence of 'compute_1' isn't an indication that there's anything to show there. So if 'compute_1' shows nothing and there are multiple CUDA devices, I would suspect that the combined load of those multiple CUDA devices might indeed go into 'compute_0'. That would produce the effect you are talking about in 1.

As regards what would give the best performance: without heavy rendering (which we don't have here), probably setting both cards to 'graphics and compute'. Ignoring one of the cards entirely shouldn't be beneficial - if it is, that's likely something to fix for either us or NVIDIA (but I don't think it is). :-)
 StanNWT140 post(s) #02-Apr-19 18:45 Thanks Adam,One thing to note is that I'm not likely to want to stick one monitor on each card because of syncing issues. Without SLI I don't think syncing is going to work well across both cards/monitors. Not really looking to have any issues. Considering each card can handle 4 x 4K monitors, it seems like a good idea to keep them on one card?I can easily turn 'graphics and compute' back on for both cards.
 tjhb 8,760 post(s) #03-Apr-19 01:20 "I'm not likely to want to stick one monitor on each card because of syncing issues"

You mean for gaming, or does it also matter for extended desktop mode, e.g. using Manifold?

"Considering each card can handle 4 x 4K monitors, it seems like a good idea to keep them on one card?"

I would expect that it is better to have two hands performing two tasks, than to have one hand juggling two tasks. CUDA tasks aside, one of your cards is currently going to waste. And there is at least a chance that leaving one card headless may interfere with CUDA balancing/throughput.

It would be good to test with both cards active again, both with one headless, and then with both driving a display. For a start, how many GPUs does Manifold count in each case? (That would be very interesting, now that Adam has explained how it works.) Then, what performance do you get on this test in each case, after the first couple of minutes? Does performance always fall off a cliff?
 StanNWT 140 post(s) #02-Apr-19 18:55 Is there a possibility of exposing the amount of memory used as a setting--say a checkbox in each transform dialog which, when ticked, reveals a box where you type a number (0 to 100) for the percentage of memory to use? Perhaps giving the user this control is a bad thing, but the end user has more knowledge of their individual system, its available memory, and the other software that has to run alongside. Sometimes software that a user needs to run at scheduled times requires more memory than would be available if Manifold were using it, so to prevent conflicts you'd reduce the amount available to Manifold; other times you'd increase it. The caveat is that the user would be directly increasing or decreasing performance, but at least they could control some of the performance trade-off. Keeping it in the background, unavailable to the user, might not be the best approach, though admittedly users often do things that are unintentionally detrimental to software. Having the option only in SQL, through editing a transform, isn't necessarily great for those who don't yet do well with SQL. I know that adding dialog elements--radio buttons, checkboxes, fill-in boxes for numbers or text--isn't always desirable, and it takes longer to roll out new capabilities if you have to put them in GUI dialogs as well, but it gives a more GUI-driven approach for those that like it.
 Dimitri 5,452 post(s) #03-Apr-19 06:11 "Is there a possibility of using a larger percentage of memory, as a checkbox / fill in a number as a percentage dialog in each transform?"

God, I sure hope not. :-) Tech support is already taking hostages to ensure that doesn't happen. Imagine people around the world deciding that 16K of memory is plenty to run a billion-object transform... and Tech gets to deal with that... :-)

"the end user has more knowledge of their individual system and its available memory needs"

Maybe there's one end user in a million who does, but such knowledge is not a realistic possibility for most, not even for experts. If anything, the more expert someone becomes in such matters, the more they realize automated systems are a better way to approach optimization of activity where wheels-within-wheels intricacy and interactions between many very complex systems (Windows, drivers, Manifold, etc.) change what's best from millisecond to millisecond. The right way to better use of memory is to improve the algorithms and code that assign and use memory automatically. In things like cache, trust the software. If you can't trust the software, the solution is not to provide manual settings so you can try to fix what you don't trust. The solution is to fix the automated function of the software so you can trust it.
 tjhb8,760 post(s) #03-Apr-19 06:34 Re your last para Dimitri:That's complete and utter bollocks, and you yourself don't believe it.Your suggestion is exactly the same as "you don't need your own PC, use a dumb terminal / thin client". 100% soviet.Your thinking is plain wrong here, and I know for a fact that you disagree with it.
 StanNWT 140 post(s) #02-Apr-19 19:15 Has all this testing that Tim and I have been doing been illustrative and useful, not just for us but for the Manifold developers and gurus, in thinking about how things are actually being used 'in the wild'?

One thing I'm interested in knowing: have you experienced any performance penalties on workstations with dual-socket configurations? Some professional CAD/CAM, animation and other high-end programs or databases have been known to have some performance penalty when dealing with the traffic between sockets. It's the same rationale that Intel used against AMD's separate chiplets with their Infinity Fabric; however, Intel is now starting to build that way too.

Are most Manifold users and/or developers, gurus, using single-socket multi-core setups or multi-socket/multi-core setups?

I'd love to see Manifold running on a dual-socket EPYC (Zen 2) server, with 8 x Quadro 8000s and 2 TB RAM, but I haven't won the lottery.
 Dimitri 5,452 post(s) #03-Apr-19 06:28 "One thing I'm interested in knowing is have you experienced any performance penalties on workstations with dual socket configurations"

Multiple sockets make no difference, since any theoretical performance differences that chip vendors may use to market against competitors are far smaller than real-world bottlenecks such as data access, running an antivirus/security/indexing/backup service that slows stuff down, etc.

"Are most Manifold users and/or developers, gurus using single socket multi-core setups or multi-socket/multi-core setups?"

Same as with almost all software these days: the overwhelming majority of machines are running single-socket, multicore CPUs. The increasing popularity and decreasing cost of multicore CPUs means it is easy to buy a cost-effective rig with a single multicore CPU socket. But at the same time, chip vendors have not put the same focus on chipsets which would make it as easy for motherboard vendors to introduce cost-effective dual-socket motherboards, so those remain a niche market.

There are also form-factor issues that mean multiple-socket motherboards remain a "server" market niche product, and those tend to be configured for server-farm use that revolves around web serving and not analytics or general parallel software use like Manifold. Try to find a motherboard that provides, say, four sockets for inexpensive manycore CPUs, plus four full-speed slots for GPU cards, plus lots of on-board memory, plus massive connectivity to big SSDs, and there are not so many choices.

It's cool that Manifold will run out of the box on all that, if you configured it, but in the real world such fire-breathing exotica is very rare outside of military and other classified, black-budget users. The Manifold response to that is the background work being done on the big "servers" theme, where big tasks could be automatically distributed across a private cloud configuration on your local network.
There is much to be said for distributing both the function and the data store to many many machines on your organization's local network, which are already paid for and for the most part, with most of their cores just sitting there doing nothing for endless milliseconds, even during the height of a work day. It's not the same tight coupling you get "inside the box," but for many tasks distributing the data store and processing over what now routinely are very, very fast networks can still get super effects. After all, when you run Hadoopy stuff in clouds, you're running on a mass of very underpowered PCs that are connected via networks. The cloud isn't a single machine with thousands of sockets.
 tjhb 8,760 post(s) #03-Apr-19 01:03 It's hard to know where to put this test, but I'll put it here in reply to Adam's

"I'd still do the tests I recommended above without GPU and without threads if you have the stamina / desire, these might be very informative."

I'm currently running a duplicate test, on the same machine as I used above, but with GPGPU off. Still using 6 threads; I'm not sure I have the patience to test on a single thread. So this is interim.

Previously, with 896 64-bit cores active, I had this sort of result from Manifold's dialog (I have screenshots which I didn't post before but can):

1438 records/s - 1335 records/s [fairly constant]
117.2 KB/s - 25.6 MB/s - 46.7 MB/s [generally increasing]

Now, with no GPGPU usage, I get

77 records/s [rock steady, no change at all for the first 1h 20mn]
8.4 KB/s - 4.5 KB/s - 775 KB/s - 819 KB/s - 1.1 MB/s [generally increasing]

At the current rate, it will take about (2306560 records (tiles) / 77 records/s) ≈ 29955s ≈ 8h 20mn at the current constant processing rate. This is on exactly the same system as used above for a time of 42mn 47s, except only that GPGPU is disabled with

PRAGMA ('gpgpu' = 'none');

The comparison is rough because we're only a short way in, and it doesn't take account of the slower pyramids phase at the end, but bearing that in mind, it shows roughly a 12x speedup using GPGPU on this system. (Putting it another way, the previous test would have finished twice over by now, and I still have about 6 hours to go.) Stan might have a lower overall ratio given his much faster CPUs.

So yes, this gives a useful perspective!
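The estimate above can be reproduced directly from the reported figures:

```python
# The ETA and speedup arithmetic from the post above, written out.
records = 2306560          # tiles to process
cpu_rate = 77              # records/s with GPGPU disabled
gpu_seconds = 42 * 60 + 47 # 42mn 47s for the full GPGPU run (2567 s)

cpu_eta = records / cpu_rate
print(f"CPU-only ETA: {cpu_eta:.0f} s ~ {cpu_eta / 3600:.1f} h")  # ~29955 s ~ 8.3 h
print(f"rough GPGPU speedup: {cpu_eta / gpu_seconds:.1f}x")       # ~11.7x
```

As noted, this slightly understates the GPGPU advantage because the CPU-only figure ignores the slower pyramids phase at the end.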
 tjhb 8,760 post(s) #03-Apr-19 01:28 Still 77 records/s after 1h 43mn. Now 1.4 MB/s. CPU has been a solid 78-80% throughout, in line with the assignment of 6 threads (6 out of 8 virtual cores fully saturated by Manifold, plus a little bit for Windows). SSD2 (containing both .map file and TEMP) mostly 0-3%, with occasional bursts to 18%. RAM usage 6.4 of 32 GB, static. GPGPU at a constant 0%, of course.
 tjhb8,760 post(s) #03-Apr-19 04:18 Abandoning now after > 4 hours. I think anything useful has already been shown.
 tjhb 8,760 post(s) #09-Apr-19 01:05 I have finally got around to making this same test on a third machine: Intel i7-2600, 16 GB RAM, NVIDIA GeForce GTX 1060 6GB, SSD1 120 GB, SSD2 480 GB, HDD1 1 TB.

This machine is much slower and less capable in every way than Stan's beast. The CPU is old (and single), there is only 16 GB system RAM, and the graphics card is much less capable than either of Stan's (and he has two). The time for the same mean curvature test on exactly the same data was 3315.319s, i.e. 55mn 15s. To recap, that is compared with 2h 27mn on Stan's much more powerful machine, or 42mn 47s and 51mn 16s on my GTX TITAN.

I had swapfile, TEMP and Manifold project file all on drive D. I had ~98% GPU saturation throughout the test (the same pattern as for the GTX TITAN).

This shows clearly and definitively that, for Manifold GPGPU processing, either there is something seriously wrong or misconfigured with Stan's system, or else there is something misconfigured in Manifold 9 on hardware like Stan's. (Could the problem be hard drive encryption? Dual GPUs? Security software? Google Chrome--which I don't have? No idea.)

It also shows that a well-configured system with a relatively small GPU (GTX 1060) is almost as powerful as a system with a relatively expensive mammoth GPU (Kepler TITAN). Transport matters much more than theoretical GPU power. The efficiency and power is in Manifold's scheduler, all going well. Which is free.
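Putting the three machines' reported times side by side makes the anomaly concrete: the modest GTX 1060 lands within roughly 30% of the TITAN, while Stan's far stronger box takes more than three times as long as the fastest run.

```python
# Wall-clock times reported in this thread for the same mean curvature run,
# normalized to the fastest result. Stan's time is the 2h 27mn quoted above.
times_s = {
    "GTX TITAN (Tim)": 42 * 60 + 47,                 # 2567 s
    "GTX 1060 (Tim)": 3315,                          # ~3315 s
    "dual Quadro P4000 (Stan)": 2 * 3600 + 27 * 60,  # 8820 s
}
fastest = min(times_s.values())
for name, t in sorted(times_s.items(), key=lambda kv: kv[1]):
    print(f"{name}: {t} s ({t / fastest:.2f}x the fastest)")
```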
 Dimitri 5,452 post(s) #09-Apr-19 09:13 "It also shows that a well-configured system with a relatively small GPU (GTX 1060) is almost as powerful as a system with a relatively expensive mammoth GPU (Kepler TITAN)."

To avoid an unintended misdirection, I would preface the above comment with... "In cases of relatively simple calculations on large volumes of data, where data access might be a greater factor than computation, ..."

Mammoth GPUs are more powerful than lesser GPUs in complex calculations where the greater performance of high-end GPUs will show a difference, and where other bottlenecks, such as the need to move lots of data around, do not come to the fore. Such situations are unusual in GIS work, where a more typical situation is the case of relatively simple calculations done on lots of data, like the task in this thread. That's why the GPGPU advice explicitly discusses such matters and why it advises not to overspend on GPGPU while neglecting the other parts of the system (manycore CPU, memory, fast data store).

So... why is there an outlier in terms of performance? It's frustrating not to close the loop on this, because quite often the root cause of such things is a simple thing that has a big effect. Find it, and suddenly things go much faster, saving hours of work.

Given the dominant role data access likely plays in this particular application, the fastest way to discover why there is an outlier is to focus first on the most likely cause: differences in data access performance. Examine all the details of hardware and software that might affect data access. If that doesn't turn up the answer, move on to other possibilities.

My gut feel is that the answer likely would be found by following up all details that come to mind based on this post: http://www.georeference.org/forum/t147125r147464#147394 Key quotes from that post:

I am in an organisation that doesn't allow me to turn off my anti-virus / security suite.
Also there are other corporate tools running in the background. [...] I can't just disconnect from the corporate network. My GIS data storage for the moment is all USB 3.0 so I can't disconnect that and Acronis TrueImage needs to do the scheduled backups.

OK. The above tells us that the usual suspects in terms of software that might reduce data access throughput are known to be in play. It also tells us the interface to GIS data is through USB 3.0, and that there are imaging packages running which might sync to the corporate network. Any one of those things can impact data access in a big way, which is why they are the usual suspects.

Here's just one possibility that might not be expected: one of those syncing packages running in the background might not actually do a sync, but it might reach out across the corporate network every now and then to check a time stamp on an archived cache to see if it will need to do a sync when it is time to sync, and it does that in a way, due to corporate network latency or whatever other effect, which holds up the processes generating files or touching data. Turn off the "check sync cache status" and suddenly the big job runs three times faster.

Or, it could be something even simpler, like turning off some antivirus or "security" package. Sure, the organization's IT group might not like that, but it could be that when a user shows them a use case where adjusting the default guidelines saves hours of work, well, they might agree that in this case it's OK to turn it off, or they might apply their skills to a new configuration that doesn't impact performance. Might help to get them involved.
 adamw8,579 post(s) #24-Apr-19 08:09 A belated reply to you and Stan.

This thread has been very useful, yes. I mentioned some take-aways that we made earlier; there were several others. Thanks a lot for that.

We don't yet know why Stan's bigger machine would perform worse than Tim's smaller machine, but we think we know enough to put in useful telemetry - measure runtime statistics and report them after the transform - which will likely help. We test on many different configurations, including those with multiple cards, but with the immense range of configuration options available for the PC, there are always tons of nuances that we cannot realistically see with our own eyes; measuring runtime statistics will allow us to see them. Lower than expected performance might be related to many different things. Speaking loosely, Stan's machine might be too fast in places where we assume things to be slower; this could produce waits where we don't expect to have them, and our code might be handling those unexpected waits less efficiently than it could.

We will try to add telemetry to a couple of transforms after the current cutting edge build. We will also try to increase the memory limits, etc., as discussed above.
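The kind of per-stage runtime statistics described above can be sketched generically. This is an illustrative Python timer with hypothetical stage names ("read", "compute"), not Manifold's actual telemetry:

```python
import time
from contextlib import contextmanager

class Telemetry:
    """Accumulate wall-clock time per named stage; report after the job."""

    def __init__(self):
        self.stats = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stats[name] = self.stats.get(name, 0.0) + elapsed

    def report(self):
        # Return {stage: (seconds, percent of total)} for display after a transform.
        total = sum(self.stats.values()) or 1.0
        return {name: (t, 100.0 * t / total) for name, t in self.stats.items()}

# Hypothetical usage: time the read and compute phases of a transform.
telemetry = Telemetry()
with telemetry.stage("read"):
    data = list(range(100_000))
with telemetry.stage("compute"):
    result = [x * x for x in data]
```

A report like this makes the unexpected waits visible: if "read" dominates on one machine but not another, the bottleneck is data access rather than computation.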
 ColinD1,918 post(s) #24-Apr-19 12:20 could produce waits where we don't expect to have them
I have a similar machine to Stan's, a dual six-core Xeon but with a single Quadro M5000 card. I have suspected waits occurring, given the number of times I get Not Responding in both M8 and M9. Or is that not related? The process always completes. Aussie Nature Shots
 rk308 post(s) #24-Apr-19 13:24 I remember that while M8 was Not Responding because it was busy importing some big file, other instances of M8 and M9 were sometimes also blocked. I have not used M8 lately.
 tjhb8,760 post(s) #24-Apr-19 13:41 My guess (to Riivo) is that one instance had copied content to the Clipboard. In that case it seems all M8 instances insist on synchronizing their pointers, including with the non-responding instance.
 adamw8,579 post(s) #24-Apr-19 15:12 "[The process is] not responding" happens when the wait is in the UI. We do 99% of what could possibly block for a long time in background threads, so "not responding" tends to happen when the UI is doing something benign which is not supposed to take long, but which takes long because Windows is paging heavily.

In my post above I was talking about different waits - those that happen in background threads which cooperate with each other to do big jobs. But we can and will try making cases of "not responding" rarer and shorter as well - by making better use of memory, for example.
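The pattern of keeping blocking work off the UI thread can be sketched generically. This is a hypothetical Python helper, not Manifold's implementation: the worker thread runs the long job while the UI thread polls a queue instead of blocking, so it stays responsive:

```python
import threading
import queue

def run_in_background(job):
    """Run job() on a worker thread; return a queue the UI thread can
    poll with get_nowait() so it never blocks while the job runs."""
    done = queue.Queue(maxsize=1)

    def worker():
        done.put(job())

    threading.Thread(target=worker, daemon=True).start()
    return done

# Hypothetical usage: the UI thread keeps pumping events, checking the
# queue between redraws instead of waiting on the job.
pending = run_in_background(lambda: sum(range(1_000_000)))
```

When a wait does land on the UI thread (here, a blocking `pending.get()`), Windows reports the window as Not Responding even though the job itself is progressing normally, which matches the behavior described above.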
 tjhb8,760 post(s) #24-Apr-19 14:01 I know this is a trivial comment, but apart from being useful, and increasing everyone's sanity, adding runtime statistics like this will be fun. (For those able to muddle work and play--I hope that is all of us.)
 adamw8,579 post(s) #20-Mar-19 09:04 MXB files throw away a lot of data that can be re-created, which is why MAP files compressed with 7-zip (or in any other, arbitrarily aggressive, way) will always have a hard time being smaller than MXB.
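The point generalizes: dropping data that can be recomputed beats compressing it, because even an aggressive compressor cannot shrink incompressible derived data to nothing. An illustrative sketch with zlib, using stand-in byte strings rather than the actual MAP/MXB formats:

```python
import random
import zlib

random.seed(0)
# Stand-in for source pixels: incompressible random bytes.
source = bytes(random.randrange(256) for _ in range(100_000))
# Stand-in for derived data that can be re-created from the source
# (here a trivial transform, XOR with 0xFF).
derived = bytes(b ^ 0xFF for b in source)

# "MAP-style": store everything, then compress as hard as you like.
map_like = zlib.compress(source + derived, 9)
# "MXB-style": drop the re-creatable data, store only the source,
# and rebuild `derived` on load.
mxb_like = zlib.compress(source, 9)
```

By construction the MXB-style archive is roughly half the size, because the derived half carries no information the source doesn't already contain, yet it is incompressible once stored.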
 Dimitri5,452 post(s) #19-Mar-19 05:26 there is 0% CPU usage and only between 6% - 20 % CPU usage. Please post a screenshot of the Task Manager Performance tab.