Unexpected experiment in Big Data
dyalsjas
157 post(s)
#16-Jan-18 23:44

So today I downloaded the North America OSM data from the geofabrik.de Maps and Data page.

I expected a link to a page that would allow me to select which part of North America I wanted.

I got a 12.9 GB bz2 file.

It unzipped to a 189 GB .osm file.

WinRAR did not indicate any extraction errors.

So far, Edge 9.0.164.3 is importing about 10,000,000 features per minute.

We'll see what I end up with.

artlembo


3,400 post(s)
#17-Jan-18 00:45

I got it from here:

http://download.geofabrik.de/north-america.html

It is about 10 GB in size. My computer has around 100 GB of free storage, but I eventually run out of disk space when loading it. I wish there were a way to just write it to my 4 TB drive, but it seems that the C drive gets filled up.

Feel free to contact me off-line, as I'm looking for people to discuss big data analytics and how Manifold can fit in with that.

I'm currently taking a break from this OSM data and am now focusing on things like Tableau and Microsoft Power BI as front ends to the Manifold engine.

adamw


10,447 post(s)
#17-Jan-18 07:30

... but it seems that the C drive gets filled up.

Move the temp folder away from C to where you have space (move both the TEMP and TMP environment variables; we only use TMP if you don't have TEMP or if it points to an invalid location, but third-party modules like database drivers might do the reverse).

We have been planning to allow setting the location of the temp folder explicitly, but this won't solve everything because of those third-party modules that we don't control.
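A quick way to check where temp files will currently land, and how much room that volume has, before kicking off a big import. This is a stdlib-only sketch; the `D:\Temp` path in the comment is a placeholder, not a recommendation:

```python
import os
import shutil
import tempfile

# Where will temp files actually land? Python's tempfile module honours
# the same TEMP/TMP variables on Windows (plus TMPDIR on other systems).
temp_dir = tempfile.gettempdir()
print("Temp directory:", temp_dir)

# Free space on the volume holding the temp directory, in GB.
usage = shutil.disk_usage(temp_dir)
print("Free space: %.1f GB" % (usage.free / 1024**3))

# To redirect temp files for child processes started from this session,
# point BOTH variables at a drive with room (D:\Temp is a placeholder):
# os.environ["TEMP"] = r"D:\Temp"
# os.environ["TMP"] = r"D:\Temp"
```

For a system-wide change, set the same two variables in the Windows environment variable settings rather than per process.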

dyalsjas
157 post(s)
#17-Jan-18 16:39

The initial import failed with many system warnings about low disk space. I've changed my system temp variables to point to an empty 1 TB SSD. I'm trying again with the .map file on the same drive as the temp folder. If that fails, I'll try the .pbf format Dimitri mentioned in a subsequent post.

I would appreciate the option to explicitly define a scratch space for Manifold.

If this import works correctly, I may try loading the data into an RDBMS instance, with a further intent to make the data visible to an ArcSDE connection.

lionel

995 post(s)
#25-Jan-18 06:10

See also Google Data Studio 360 (an online service).


Book about Science , cosmological model , Interstellar travels

Boyle surface fr ,en

rk
621 post(s)
#22-Feb-18 16:35

Can you connect from Tableau to a Manifold 9 .map through ODBC? I defined an ODBC data source for the M9 .map and it works with M8, but not with Tableau Desktop (Pro, trial).

[Microsoft][ODBC Driver Manager] Driver's SQLAllocHandle on SQL_HANDLE_ENV failed

Generic ODBC requires additional configuration. The driver and DSN (data source name) must be installed and configured to match the connection.

Unable to connect using the DSN named "tableau_mfd". Check that the DSN exists and is a valid connection.

adamw


10,447 post(s)
#23-Feb-18 14:34

We'll take a look.

Just in case, could you try connecting from, say, Microsoft Office (or LibreOffice)? Connecting from Manifold 8 uses a specialized code path. If the ODBC data source for 9 works in Microsoft Office, but not in Tableau, that's one issue (application / driver logic), if it doesn't work from anywhere, that's likely a different issue (installation).

rk
621 post(s)
#23-Feb-18 20:22

In MS Access I see the list of tables in .map, but on import/link I get

Reserved error (-7748); there is no message for this error.

ODBC driver version 9.00.165.03

Added:

Access also creates a table named [Name AutoCorrect Save Failures]:

Object Name | Object Type | Failure Reason
mfd_meta    | Table       | Could not open the object
mfd_root    | Table       | Could not open the object

Dimitri


7,413 post(s)
#17-Jan-18 07:58

I got a 12.9 GB bz2 file.

It unzipped to a 189 GB .osm file.

It would be better to download the .pbf version that is the first link on the geofabrik.de web site, which is a mere 8.1 GB. The bz2 file unzips into such a huge .osm file because it unzips into XML, a human-readable format. Using human-readable XML to interchange vast amounts of digital data is extraordinarily inefficient in terms of space.

Sure, .pbf is ponderous, but it beats XML. Manifold reads .pbf so, unless there is some unexpected problem reading that .pbf, there is no need to use the XML they provide for programs that cannot read .pbf.
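The size gap is easy to see on a single node. A rough sketch with illustrative values (real PBF adds delta-encoding and compression on top of raw packing, so it does even better than this):

```python
import struct

# One OSM node as it appears in human-readable .osm XML
# (values are illustrative, not from a real extract).
xml_node = '<node id="240109189" lat="51.5074000" lon="-0.1278000" version="1"/>'

# The same information as raw binary: a 64-bit id plus two 32-bit
# fixed-point (1e-7 degree) coordinates, which is close to how PBF
# stores coordinates before its extra encoding tricks.
packed = struct.pack("<qii", 240109189, int(51.5074 * 1e7), int(-0.1278 * 1e7))

print(len(xml_node.encode()), "bytes as XML")
print(len(packed), "bytes packed")
```

Roughly a 4x difference per node before compression even enters the picture, which is why the same data is 189 GB as .osm and around 8 GB as .pbf.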

dyalsjas
157 post(s)
#17-Jan-18 16:50

I concur. Human readable is horribly inefficient for large data transfer. If my current import attempt fails, I'll try the .pbf format you suggest.

BTW, who could possibly read 189 GB of XML?

dyalsjas
157 post(s)
#20-Jan-18 16:31

A few observations about "Big" OSM data from www.geofabrik.de

As Dimitri observed, .osm "is human readable xml".

This is a horribly inefficient format with respect to large data sets. A 12.9 GB .bz2 compressed file of North America .osm data expanded (after several hours in WinRAR) into a 189 GB .osm file.

Manifold 9 achieved some compression on import of the .osm file, resulting in a 133 GB .map file.

The 8.9 GB .pbf file format was much more efficient, not needing to be extracted before import to Manifold.

Not unexpectedly, the resulting .map file was the same 133 GB file size.

Yes, Manifold is remarkably fast when opening and displaying .map files, but "instant" still takes several seconds when opening a multi-gigabyte file.

Export to a .mxb file reduced the 133 GB .map file to 13.6 GB; a remarkable compression.

I still intend to attempt loading the resulting Manifold tables and drawings into an ArcSDE datastore to see if I can get the layers to display in an Esri enterprise geodatabase; but now I have to update my Esri installation. That will be several hours to accomplish and several days to make sure the update didn't break anything.

artlembo


3,400 post(s)
#23-Jan-18 15:11

How long did it take? I’m at 19 hours, and it is stalled at 97%.

I also notice Task Manager shows only 4 GB of RAM in use (yes, I am using the 64-bit version).

adamw


10,447 post(s)
#23-Jan-18 16:29

That's probably the saving phase. We will add progress tracking to that as well to give a better idea of where the process is.

4 GB used is how it should be (we use the rest of memory indirectly through the file cache, it just does not show up for our process; also, using more memory wouldn't help the saving phase, although it could help the import that had been going on before).

artlembo


3,400 post(s)
#23-Jan-18 16:56

Thanks, Adam. This isn't at the save portion yet; I'm still doing the import. Or do you mean that it is saving the temporary file on the disk?

adamw


10,447 post(s)
#24-Jan-18 07:26

Yes, I meant flushing remaining changes and cleaning up. There is a special phase for that, which is normally short and so it does not do its own progress tracking. Although now that I checked, it happens at 100%, so since you were at 97% it could not have been that and was perhaps just normal copying. I figure the process is finished by now, did it spend a lot of time at 97%? If so, we'll take a closer look.

artlembo


3,400 post(s)
#24-Jan-18 14:37

It completed. I’m saving it now. Not sure how long the 97% went for.

Dimitri


7,413 post(s)
#23-Jan-18 18:04

I downloaded the whole North America psb file and set it to importing on a really old and slow machine. I have to admit I forgot about it until just now. Here is the log file:

2018-01-18 13:20:23 -- Manifold System 9.0.164.2 Beta

2018-01-18 13:20:23 -- Starting up

2018-01-18 13:20:24 -- Startup complete (0.705 sec)

2018-01-18 13:20:24 -- Create: Project1 (0.023 sec)

2018-01-20 07:02:30 -- Import: C:\data\OSM\north-america-latest.osm.pbf (150113.624 sec)

2018-01-20 21:49:06 -- Save: C:\data\OSM\north-america-latest.map (7491.823 sec)

So, that's not quite 42 hours for the import and about two hours for the save. The documentation isn't kidding when it says psb is a sloooow format. :-)

The .map project ends up being 139.9 GB, say, in round numbers, 140 GB. On the plus side, once it is saved as a .map you can open the project instantly.
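Sanity-checking the "not quite 42 hours" figure against the raw second counts in the log is simple arithmetic:

```python
# Convert the raw second counts from the log into hours/minutes/seconds.
def fmt(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return "%dh %02dm %02ds" % (h, m, s)

print("import:", fmt(150113.624))  # 41h 41m 53s
print("save:  ", fmt(7491.823))    # 2h 04m 51s
```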

Dimitri


7,413 post(s)
#23-Jan-18 18:47

Correction. Not psb but pbf file.

Dimitri


7,413 post(s)
#24-Jan-18 15:52

If anybody is curious, this is what 140 GB of vector data looks like....

zoomed in, with point sizes styled way down so the points don't cover up other data.

The amazing thing is that the drawing can be styled and used quite easily, once you zoom down to some reasonable area of interest.

Attachments:
big_osm_1.png
big_osm_2.png

artlembo


3,400 post(s)
#24-Jan-18 19:44

not just 140GB, but 92 million objects.

I've attached a video that shows how fast 9 works with the OSM data. The key is to be zoomed in before trying to draw it. This might be a good case for display scales.

Sorry, I didn't have my microphone.

Attachments:
osm.mp4

artlembo


3,400 post(s)
#24-Jan-18 19:57

One other thing besides display scales: I don't like the order of the layering. It seems like when you have a large area, like a village boundary, you aren't able to ID the streets (Alt-click). It always defaults back to the area feature, even if the street line is on top.

Dimitri, would you mind verifying this?

See this video looking at Salisbury University. Again, I apologize for not having a microphone.

Also, zooms and pans are pretty much instantaneous, as are selections, so long as you stay zoomed in fairly tight.

Attachments:
su.mp4

adamw


10,447 post(s)
#25-Jan-18 08:57

We are planning to extend the Record pane to be able to move between all objects that you clicked into.

dyalsjas
157 post(s)
#24-Jan-18 17:53

I should have kept my program log. Maybe saving the output of the log window as a comment layer in a project could be a setting in the program options (it might also assist when users send examples for technical support).

I have more recent hardware; AMD Ryzen 1700 series, 64 GB fast RAM, and SSD drives to read from and write to. I do not yet have an NVidia graphics card, so I can't say whether that would help the data import process.

My .pbf import was under 4 hours, the file save was less than 1 hour. To clarify a previous post, you are correct, opening the saved 139 GB .map file is instant; rendering the full data set in a map layer is perceptibly slower than other smaller files.

Concur, styling the layer using a reasonable area of interest is straight forward and the performance is excellent.

This level of performance is an example of why I'm consistently enchanted with Manifold; I don't believe I could accomplish this level of data import outside of dedicated software.

I hope the ongoing user experience/interface development continues to match the remarkable Radian engine capability.

rk
621 post(s)
#24-Jan-18 18:00

I should have kept my program log.

Look at C:\Users\User\AppData\Local\Manifold\v9.0

See here.

http://www.manifold.net/doc/mfd9/log_window.htm
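For convenience, a small stdlib-only sketch that lists the files in that folder, newest first (the `%LOCALAPPDATA%\Manifold\v9.0` path is the default location from the post above; adjust if yours differs):

```python
import os

# Default per-user Manifold 9 log location mentioned in the post above.
log_dir = os.path.expandvars(r"%LOCALAPPDATA%\Manifold\v9.0")

def list_logs(folder):
    """Return files in `folder`, newest first; [] if the folder is absent."""
    if not os.path.isdir(folder):
        return []
    paths = [os.path.join(folder, name) for name in os.listdir(folder)]
    paths = [p for p in paths if os.path.isfile(p)]
    return sorted(paths, key=os.path.getmtime, reverse=True)

for path in list_logs(log_dir):
    print(path)
```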

dyalsjas
157 post(s)
#25-Jan-18 00:14

Thank you for the web link.

Found the log files.

Import of the North America .pbf file was 3 hours 28 minutes (12482 sec).

File save of the 139 GB .map file was 26 minutes 15 seconds (1575 sec).

Export of the .map file to a 13.6 GB .mxb file was 1 hour 29 seconds (3629 sec).

Again, I wonder if NVidia CUDA cores would speed the process.

I'll be doing a system hardware refresh later this year...

Threadripper and NVidia. Will revisit my timings then.

Tried to change the log file locations based on the user manual, but haven't had success yet.

artlembo


3,400 post(s)
#25-Jan-18 00:59

3 hours to import – that's crazy! Mine took almost 48 hours, similar to Dimitri's.

My SSD isn’t large enough. I’ve got to buy another one.

lionel

995 post(s)
#25-Jan-18 03:36

3 h compared to 48 h for the import: that's 16x!!

dyalsjas, artlembo... what are the specifications of your hardware (graphics, motherboard, RAM, hard disk) and perhaps OS (presumably Windows 10 and 64-bit Manifold 9 by default)?

Does the path of the .map file and cache play a role? (A short directory path?)

Regards



dyalsjas
157 post(s)
#25-Jan-18 22:59

Lionel,

As I posted above, I'm running an AMD Ryzen 1700 CPU (clocked to 4 GHz) on an ASUS motherboard with 64 GB RAM and an AMD RX 480 graphics card. I'm running Windows 10 and Manifold on a Toshiba RD400 NVMe SSD. My temp files, the North America .pbf file, and the output .map file were all stored on a SanDisk SSD.

Jason

Dimitri


7,413 post(s)
#26-Jan-18 07:52

Again, I wonder if NVidia CUDA cores would speed the process.

No. There's nothing computational for a GPU core to do in that import. It's almost all just a matter of moving data off disk with very, very simple updating of data structures in memory and in other places on disk.

The speed you are getting is the faster speed of SSD, faster memory, and overall better throughput between disk, memory and processor from Manifold's ability to use Ryzen cores. I suppose it helps that parallel access to SSD "disk" does not involve the physical limitations of heads seeking to cylinders on hard disk. Your Ryzen system is probably ten times faster than the old and slow machine I used... it was available and unused because it is so old it is just sitting around on the network used as a data archive.

Keep in mind that data on a hard disk is stored on circular tracks on a stack of magnetic platters. The stack of circular tracks at any given radius is called a "cylinder". The read-write heads physically move from one cylinder to another to grab data. If the heads are positioned on one cylinder, reading data for that circle as the disk spins beneath them, it doesn't help to issue a parallel call for data that is somewhere on other cylinders because the heads cannot be positioned on two cylinders at once, no more than the reading needle on a vinyl audio record can simultaneously be at two different positions on the record at the same time. You can pick up some gains from parallel reads given all the slop and interweaving of how blocks are stored in cylinders, but it is not the free-for-all you get with RAM memory.

In contrast, depending on the technology, SSD can be way faster at semi-random reads from all over the SSD storage. So with SSD there is potential for parallel reads to go faster.
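The seek penalty is easy to demonstrate with a toy benchmark: read the same chunks once in file order and once in shuffled order. On a hard disk the shuffled pass forces head movement; on an SSD the two times converge. This is a sketch only; the OS file cache will flatten the difference unless the file is much larger than RAM or the cache is dropped between runs:

```python
import os
import random
import time

PATH = "seek_test.bin"
CHUNK = 4096
CHUNKS = 2048  # 8 MB total; use far more for a realistic disk test

# Create a scratch file of random bytes.
with open(PATH, "wb") as f:
    f.write(os.urandom(CHUNK * CHUNKS))

def timed_read(offsets):
    """Read CHUNK bytes at each offset; return (bytes_read, seconds)."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        total = 0
        for off in offsets:
            f.seek(off)
            total += len(f.read(CHUNK))
    return total, time.perf_counter() - start

in_order = [i * CHUNK for i in range(CHUNKS)]
shuffled = random.sample(in_order, len(in_order))

seq_bytes, seq_t = timed_read(in_order)
rnd_bytes, rnd_t = timed_read(shuffled)
print("sequential: %.3fs   random: %.3fs" % (seq_t, rnd_t))

os.remove(PATH)
```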

artlembo


3,400 post(s)
#26-Jan-18 17:17

This is a little out of my wheelhouse, so I'll just ask: I can get a 0.5 TB portable SSD for $179 (see here). It looks like internal SSDs are more expensive. How does an external SSD compare to an internal HDD? And how does it compare to an internal SSD?

dyalsjas
157 post(s)
#26-Jan-18 18:15
dyalsjas
157 post(s)
#26-Jan-18 18:54

Art,

The relevant consideration for SSDs is the data bus / interface: currently USB, SATA, or PCIe. SCSI and Serial Attached SCSI (SAS) are more relevant for enterprise-level hardware.

USB 1.0, 2.0, or SATA 1.0 (1.5 gigabits per second, or Gbps) would slow things too much to see a performance benefit, but would offer a data reliability benefit over spinning media.

USB 3.0 or SATA 2.0 would offer a good speed increase (roughly 3-5 Gbps).

SATA 3.0 or PCIe NVMe would offer the highest performance (6 Gbps for SATA 3.0, and well beyond that for NVMe).

For the data import discussion of this topic: my system is configured with an NVMe PCIe M.2 SSD for OS and software. The temp / scratch folder is on a 980 GB Toshiba SSD connected via a 6 Gbps SATA data bus. The source .pbf file and the output .map file are on the same Toshiba drive as well.

Given that SSD storage is getting cheaper and more dense, I expect my next hardware upgrade to support an NVMe PCIe SSD that has sufficient storage to allow me to use it for data imports as well as software and OS.

In addition to the cost-per-core value of AMD processors, AMD Threadripper CPUs support more PCIe lanes than comparable Intel CPUs, allowing more PCIe NVMe drives, GPU cards, etc.
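To put link rates in perspective, here is the theoretical best-case time to move the 189 GB .osm file across each bus. This is my own back-of-envelope arithmetic, using nominal rates of 1.5/3/6 Gbps for SATA 1.0/2.0/3.0 and roughly 32 Gbps for a PCIe 3.0 x4 NVMe link; real throughput is lower, and older links lose up to 20% more to encoding overhead:

```python
FILE_GB = 189  # the uncompressed North America .osm file

# Nominal link rates in gigabits per second.
links_gbps = {
    "SATA 1.0 (1.5 Gbps)": 1.5,
    "SATA 2.0 (3 Gbps)": 3.0,
    "SATA 3.0 (6 Gbps)": 6.0,
    "NVMe PCIe 3.0 x4 (~32 Gbps)": 32.0,
}

def transfer_minutes(size_gb, gbps):
    """Best-case minutes to move size_gb gigabytes over a gbps link."""
    return size_gb / (gbps / 8.0) / 60.0  # 8 bits per byte

for name, gbps in links_gbps.items():
    print("%-28s %6.1f min" % (name, transfer_minutes(FILE_GB, gbps)))
```

Even the best case shows the bus matters far less than the random-access behaviour of the drive behind it; an import spends most of its time on seeks and processing, not raw transfer.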

lionel

995 post(s)
#28-Jan-18 13:38

It is not easy to state a common/general speed for a desktop PC, since many options can be mixed: architecture, chipset and controller, memory. Here are the speed and architecture of my old motherboard (2011) compared to the newest connector specifications (red underline).

The devices (PCIe, memory) that connected to the northbridge in the old days are located inside the CPU today (2018)!!

Over the years, each new version of a connector specification has upgraded to a better speed (each time x2).

Speed values (Gb/s) can change depending on whether you focus on theory or a real context, and some bandwidth is used to manage the protocol (up to 20% on old protocols, less with new ones).

Attachments:
asusH61MK_2Generation.png



lionel

995 post(s)
#28-Jan-18 13:58

NVMe (NVM Express) is based on PCIe (PCI Express)!

NVM Express devices exist both in the form of standard-sized PCI Express expansion cards and as 2.5-inch form-factor devices that provide a four-lane PCI Express interface through the U.2 connector (formerly known as SFF-8639).

PCI Express storage devices can implement both the AHCI logical interface for backward compatibility and the NVM Express logical interface for much faster I/O.

NVMe drives don't use PCIe x1; they start from x4!!

NVIDIA NVLink (introduced with the NVIDIA Pascal GPU in 2016) is a high-bandwidth, energy-efficient interconnect that enables ultra-fast communication between the CPU and GPU, and between GPUs.

1 GB (gigabyte) = 8 Gb (gigabits)



lionel

995 post(s)
#28-Jan-18 14:17

NVLink is not a technology for storage!!!

Two connector form factors and 10x the bandwidth!!



dyalsjas
157 post(s)
#26-Jan-18 18:14
dyalsjas
157 post(s)
#26-Jan-18 18:15
dyalsjas
157 post(s)
#26-Jan-18 18:53
dyalsjas
157 post(s)
#26-Jan-18 18:53

Dimitri,

Thanks for the feedback on GPU processing and IO operations like data import.

I didn't think GPU processing would help a data import much, but so much of what the Manifold team does seems magical that I would not have been surprised.

The Manifold user documentation highlights the value of good multi-core CPUs, abundant RAM, good GPU video cards, and current SSDs.

In all of that, the Manifold website does well selling the key fact that the software stands apart from other GIS software in its multi-threaded / GPU-threaded architecture.

It could also be valuable to give higher emphasis to the benefit of a fast SSD for I/O and data import processes. A 10x speed increase for data imports using a current SSD is a use case many businesses can get behind.

p.s. Not sure what caused the empty posts. Sorry for the clutter.

Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.