Import tab delimited file
artlembo


3,400 post(s)
#10-Apr-18 03:48

I want to automate the import of a tab-delimited data file from EPANET.

Here is the problem: while the file is tab delimited, the first few hundred rows might have only 3 or 4 tabs, while rows further down in the file have 7 or 8 tabs. Manifold, when reading in the file, may only scan the first few rows to estimate the number of columns from the tabs.

Any ideas how to let Manifold know a table has 7 or 8 columns, even if the first few hundred rows only have two columns?

tjhb
10,094 post(s)
#10-Apr-18 04:15

What have you already written to the stupid data provider?

I don’t think it would benefit anyone for Manifold to handle this.

KlausDE

6,410 post(s)
#10-Apr-18 06:39

Manually add a complete row of column names at the top.


Do you really want to ruin economy only to save the planet?

ColinD

2,081 post(s)
#10-Apr-18 08:12

I think what Art is saying is that all the column names are there but many of the fields have no data for many rows. The problem is how to automate this, so my usual fix of manually adding enough dummy data in the first several empty fields isn't going to help. As Tim implies, the data provider should fill the empties with a No Data value.


Aussie Nature Shots

artlembo


3,400 post(s)
#12-Apr-18 17:25

Tim: actually, it is EPANET. Whether it is stupid or not is anyone’s guess. It simply is what it is, and is used by just about everyone doing hydrologic analysis. I’m not actually asking Manifold to handle that. They already allow you to scan X number of rows to determine the data type. I thought maybe there could be an option to scan X number of rows to determine the number of columns.

Klaus: you hit the nail on the head. That is exactly how I have been doing it. It is a little bit of a pain to write a DOS shell command to insert a series of tabs at the beginning, but it does work. However, I wish I could avoid that step.

Colin: in this case, Klaus has the correct interpretation. Thanks for your advice.

tjhb
10,094 post(s)
#12-Apr-18 23:23

Art,

Even large, respected organisations can make really dumb errors. (They are only human too.)

When they do that, it is a mistake not to point the errors out, for all the usual reasons plus one. That is that if no one objects, the mistake can become widely accepted as a new (broken) normal.

You don't need to use the word "stupid", but I think you should use your respected position in the United States GIS community to point out the obvious error the EPA is making in this case. I am sure they would want to fix it.

Especially if many people have to use this data. What a cost!

artlembo


3,400 post(s)
#13-Apr-18 00:03

I agree on many fronts. But, EPANET and its .inp format has been around for 30 years. It is used by thousands of organizations to collectively make billions of dollars in decisions for hydraulic pipe analysis (water, sewer, gas).

I guess I’m thinking with that legacy, there is no fighting city hall. It works just fine for what it is, and they aren’t going to change it. I have a workaround, so I can adapt.

tjhb
10,094 post(s)
#13-Apr-18 07:34

All the same, I wonder if in all those 30 years, anyone with your street cred has bothered to tell the EPA that this data format is broken.

It’s never too late.

(What would John Parr Snyder do?—Let’s not leave that rhetorical. I think he would quietly point out the problem, to exactly the right guy.)

mdsumner


4,260 post(s)
#13-Apr-18 11:17

He'd probably use sed/awk, but now we have Perl, R, and Python for these basic tasks. Automate the fix and then email the source at leisure.


https://github.com/mdsumner
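That automated fix can be sketched in a few lines of Python (a minimal illustration, not anything from the thread; the function name and the assumption that the file is plain tab-delimited text are mine):

```python
# Pad every line of a tab-delimited file so all rows have the same
# number of columns, based on the widest row found anywhere in the file.
def pad_tabs(lines):
    # The widest line decides the column count for the whole table.
    width = max(line.count("\t") for line in lines) + 1
    # Append enough trailing tabs to bring each line up to that width.
    return [line + "\t" * (width - 1 - line.count("\t")) for line in lines]
```

Run over the whole file before handing it to the importer, every row then advertises the full column count.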

Dimitri


7,413 post(s)
#13-Apr-18 11:56

I think Colin hit the nail on the head with this:

I think what Art is saying is that all the column names are there but many of the fields have no data for many rows. The problem is how to automate this, so my usual fix of manually adding enough dummy data in the first several empty fields isn't going to help.

There is no automated solution for such problems: when fields are missing data and are also missing the separators that would give enough implied order, there is nothing left from which to infer the layout.

Suppose you have a file that purports to represent a table of schools, with the name of the school, the first and last name of the principal and the name of the town, all separated by tab characters. The first row has the names of the fields:

"School"<tab>"Fname"<tab>"Lname"<tab>"Town"

That makes sense if all of the fields are in place:

"Wardley"<tab>"Henry"<tab>"Brown"<tab>"Middletown"

It even makes sense if separators are in place to imply NULL values:

"Wardley"<tab><tab>"Brown"<tab>"Middletown"

But what if a line consists of:

"Harrison"

... is that the name of the school, a first name, a last name or a town?

or a line that consists of...

"Winston"<tab>"Rush"<tab>"Reston"

Is that the name of the school followed by a last name followed by a town? Or is it some other combination like the principal Winston Rush of an unnamed school in the town of Reston?
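The ambiguity shows up immediately if you split such a line programmatically. Python's csv module (used here only as an illustration, not anything Manifold runs) simply yields positional fields, with no way to tell which of the four columns went missing:

```python
import csv
import io

# Parse a short tab-delimited line against a four-column schema:
# School, Fname, Lname, Town.
rows = list(csv.reader(io.StringIO("Winston\tRush\tReston\n"), delimiter="\t"))

# Three fields come back, but nothing indicates whether the school,
# the first name, the last name, or the town is the absent one.
print(rows[0])  # ['Winston', 'Rush', 'Reston']
```

Only explicit separators for empty fields (as in the "Wardley"&lt;tab&gt;&lt;tab&gt;"Brown" example above) preserve the positional information.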

artlembo


3,400 post(s)
#13-Apr-18 13:33

it's not actually broken. I'm broken :-)

The EPANET file works just fine for what it is. I, however, am trying to do something different. So, to accomplish that, I read the entire file into a table, and then do any kind of crazy SQL I can think of, whether it be on pipes, junctions, tanks, valves, etc.

The problem is, I can't bring the file into a table because it is tab delimited, and the pipes section of the file is further down - this is the one with many columns. The first few hundred lines only have 1 tab.

I've been doing it the way Klaus mentioned - stuffing a bunch of tabs into the top of the file. But, just like you can scan a file to determine field type, I was hoping to scan the file to find the total number of tabs.

If Manifold allowed us to scan the table to determine the tabs, that would make it easier. However, it isn't too much bother for me to stuff the tabs into the top using sed/awk, or another method.
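Klaus's workaround is itself easy to automate: scan the whole file for the widest row, then stuff a line with that many tabs at the top, which is what the sed/awk step does. A Python sketch of the same idea (function name and usage are hypothetical):

```python
# Scan the entire file for the maximum tab count, then prepend a dummy
# line with that many tabs so a naive importer sees the full column
# count in the very first row it reads.
def stuff_tabs(text):
    max_tabs = max(line.count("\t") for line in text.splitlines())
    return "\t" * max_tabs + "\n" + text

# Usage, with a placeholder filename:
# with open("network.inp") as f:
#     fixed = stuff_tabs(f.read())
```

The dummy row imports as one empty record, which is trivial to delete afterwards with SQL.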

RonHendrickson
283 post(s)
#13-Apr-18 14:50

Art, it might not be simpler than what you are already doing, but the R language has several text handling functions and packages that might make easy work of your problem. I suggest looking into the "stringr" package by Hadley Wickham.

adamw


10,447 post(s)
#20-Apr-18 12:39

We are talking about files that look like this, correct? (I lifted the example from the format description doc. Ignore the initial #, it's there to turn off syntax highlighting.)

#
[TITLE]
EPANET TUTORIAL

[JUNCTIONS]
;ID Elev Demand
;------------------
2 0 0
3 710 650
4 700 150
5 695 200
6 700 150

[RESERVOIRS]
;ID Head
;---------
1 700

...

[OPTIONS]
Units GPM
Headloss H-W
Pattern 1
Quality Chlorine mg/L
Tolerance 0.01

[END]

If so, I think this really calls for a specialized import. This is good material for an add-in.

Not having enough tabs in the first line is not the biggest problem by far. The file contains multiple tables, each with different fields of different types. We can certainly add, say, an option to scan the entire file instead of the first X lines to discover all fields, but I doubt this is the right approach here.

artlembo


3,400 post(s)
#20-Apr-18 15:45

That is exactly right. But, here is the slick thing: with SQL, I can easily create all of the files necessary. This is because I can search for [PIPES], and also for the next occurrence of “[“. That isolates the entire pipe section. It’s actually quite clever, if I do say so myself :-) The ease of this actually surprised me.

But again, my hack is to just insert multiple tabs at the top of the file.
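The section-isolation idea (find [PIPES], stop at the next "[") can be sketched outside SQL as well; this is a hypothetical Python rendering, not Art's actual query:

```python
# Extract the body of one named section from an EPANET-style .inp file:
# everything between the [name] header and the next "[" header.
def extract_section(text, name):
    lines = text.splitlines()
    # Locate the section header, e.g. "[PIPES]".
    start = next(i for i, l in enumerate(lines) if l.strip() == f"[{name}]")
    body = []
    for line in lines[start + 1:]:
        if line.lstrip().startswith("["):  # the next section begins
            break
        body.append(line)
    return body
```

The same two-marker logic works for [JUNCTIONS], [TANKS], [VALVES], and the rest, so one helper isolates any table in the file.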

Dimitri


7,413 post(s)
#20-Apr-18 16:41

It’s actually quite clever, if I do say so myself :-) The ease of this actually surprised me.

I'd love to see that example. :-) I've just tried to figure out how to do that and am having a brain fade...

steveFitz

340 post(s)
#12-Apr-18 23:25

Art,

Have you checked with Notepad++? It may have a plug-in that deals with this situation.

Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.