Parsing address with regular expression and script
Mike Pelletier

2,122 post(s)
#20-Sep-14 15:26

Parsing addresses into individual pieces has come up before on the forum, and I've run across someone who may have created a good solution using regular expressions here. The attached project file has this person's C# code, some sample data, and my lame attempt to make it run in C#. My scripting skills have weakened and I don't know C# at all. Might someone be willing to help me with the code? Thanks.

//C#
using Manifold.Interop.Scripts;
using M = Manifold.Interop;

class Script {
    const string AddressTable = "AddressTable";
    const string preDirections = "S W|SW|S E|SE|N W|NW|N E|NE|N|E|W|S";

    // Suffixes taken from USPS website: http://pe.usps.gov/text/pub28/28apc_002.htm
    const string suffixes = "ALLEY|ALLEE|ALY|ALLEY|ALLY|ALY|ANEX|ANEX|ANX // deleted other suffixes for forum formatting reasons; see map file

    // Unit designators taken from USPS website: http://pe.usps.gov/text/pub28/28c2_003.htm
    const string unitDesignators =
        "APARTMENT|APT|BUILDING|BLDG|FLOOR|FL|SUITE|STE|UNIT|UNIT|ROOM|RM|DEPARTMENT|DEPT|SPC";

    static void Main() {
        M.Application app = Context.Application;
        M.Document doc = (M.Document)app.ActiveDocument;
        M.ComponentSet comps = (M.ComponentSet)doc.ComponentSet;
        M.Table table = (M.Table)comps[AddressTable];

        foreach (M.Record rec in Table.RecordSet)
        {
            var pattern = string.Format(
                @"^((?<StreetNumber>[0-9]*)(?: ))*((?<PreDirection>({0}))(?! ({1}) ($|({2})))(?: ))?(?<StreetName>(.* (({1}) )?(?=({1}))|(PO|P O) (BX|BOX) [0-9]*))(?: )?((?<StreetSuffix>{1})($|(?: )))?((?<PostDirection>{0})(?: ))?((?<UnitDesignator>{2}))?((?: )(?<SecondaryNumber>[0-9]*))?",
                preDirections, suffixes, unitDesignators);

            var match = new Regex(pattern).Match(line);

            if (match.Success)
            {
                var streetNumber = match.Groups["StreetNumber"];
                var preDirection = match.Groups["PreDirection"];
                var streetName = match.Groups["StreetName"];
                var streetSuffix = match.Groups["StreetSuffix"];
                var postDirection = match.Groups["PostDirection"];
                var unitDesignator = match.Groups["UnitDesignator"];
                var secondaryNumber = match.Groups["SecondaryNumber"];
            }
            else
        }
    }
}

Attachments:
Address parser.map

jkelly


1,234 post(s)
#22-Sep-14 02:04

Hi Mike

See attached map for the "working" copy. I say working in quotes though, because when run, the regular expression is not matching any of the addresses.

I'm not good with regular expressions though, and try to avoid them because they are so hard to test and hide so much complexity, so someone more adept may be able to pick apart issues with the pattern vs. the data. Regexes are a nightmare to debug, especially one that long and complex. Great when they work though. My guess is that the web page you got this from has a pasting error, as you would have pasted straight from there and I can't see anything different.

If anyone's keen and not familiar with the C#, see below:

var pattern = string.Format(
    @"^((?<StreetNumber>[0-9]*)(?: ))*((?<PreDirection>({0}))(?! ({1}) ($|({2})))(?: ))?(?<StreetName>(.* (({1}) )?(?=({1}))|(PO|P O) (BX|BOX) [0-9]*))(?: )?((?<StreetSuffix>{1})($|(?: )))?((?<PostDirection>{0})(?: ))?((?<UnitDesignator>{2}))?((?: )(?<SecondaryNumber>[0-9]*))?",
    preDirections, suffixes, unitDesignators);

This is basic string formatting. {0} is replaced with the preDirections string, {1} with the suffixes string, and {2} with the unitDesignators string. I'd print the full pattern, but it's way too long to bother putting here. If you really want to see it, add a log.Log statement and print the pattern string out.
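To make the substitution concrete, here is a minimal standalone sketch of the same string.Format mechanism. The class name, the two-group pattern, and the shortened lists are hypothetical stand-ins chosen for illustration; they are not the script's real pattern or lists.

```csharp
using System;

static class FormatDemo
{
    // Hypothetical, shortened lists; the real ones in the project are much longer.
    const string PreDirections = "N|E|W|S";
    const string Suffixes = "STREET|ST|AVENUE|AVE";

    // {0} and {1} are replaced by the first and second arguments that
    // follow the format string, exactly as in the script's pattern.
    public static string BuildPattern()
    {
        return string.Format(@"^(?<PreDirection>{0}) (?<StreetSuffix>{1})$",
                             PreDirections, Suffixes);
    }

    static void Main()
    {
        // Prints: ^(?<PreDirection>N|E|W|S) (?<StreetSuffix>STREET|ST|AVENUE|AVE)$
        Console.WriteLine(BuildPattern());
    }
}
```

Printing the expanded pattern like this is usually the quickest way to check what the regex engine actually receives.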

Hopefully someone's got some better ideas than me as to why it doesn't match?

Cheers

Attachments:
Address parser.map


James Kelly

http://www.locationsolve.com

tjhb
10,094 post(s)
#22-Sep-14 02:39

James, there is a stray "]" in line 46 of [Address parser].

Plus the pattern assignment should be outside the loop, since it only needs to happen once.
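A minimal sketch of that hoisting, assuming a plain string array in place of the Manifold record set (the class name, the simplified suffix pattern, and the sample lines are all hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

static class HoistDemo
{
    // The Regex object is constructed once, not once per record.
    static readonly Regex SuffixRegex =
        new Regex(@"\b(?<StreetSuffix>STREET|ST|AVENUE|AVE)$");

    // Counts the lines that end in one of the suffixes above.
    public static int CountMatches(string[] lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            // Only the match itself happens inside the loop.
            if (SuffixRegex.IsMatch(line)) count++;
        }
        return count;
    }

    static void Main()
    {
        string[] sample = { "1257 MAIN ST", "12 OAK AVENUE", "PO BOX 7" };
        Console.WriteLine(CountMatches(sample)); // prints 2
    }
}
```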

There are some strange repetitions in the suffixes string. E.g.

...BEND|BEND|BND|BND|BLUFF|BLF|BLF|BLUF|BLUFF|BLUFFS|BLUFFS...

...CORNER|COR|COR|CORNER|CORNERS|CORNERS|CORS|CORS|COURSE|COURSE|CRSE|CRSE|COURT|COURT|CT|CT|COURTS|COURTS|CTS|CTS|COVE|COVE|CV|CV|COVES|COVES...

...CURVE|CURVE|CURV|DALE|DALE|DL|DL|DAM|DAM|DM|DM|DIVIDE|DIV|DV|DIVIDE|DV|DVD|DRIVE|DR|DR|DRIV|DRIVE|DRV|DRIVES|DRIVES...

(None of which addresses the real issue.)
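If it helps, the repeated entries could be stripped programmatically rather than by hand. This is only a sketch: the Dedupe helper is hypothetical and keeps the first occurrence of each entry in its original position.

```csharp
using System;
using System.Collections.Generic;

static class DedupeDemo
{
    // Removes repeated entries from a '|'-separated alternation,
    // keeping the first occurrence of each entry in its original order.
    public static string Dedupe(string alternation)
    {
        var seen = new HashSet<string>();
        var kept = new List<string>();
        foreach (string item in alternation.Split('|'))
        {
            // HashSet.Add returns false when the item is already present.
            if (seen.Add(item)) kept.Add(item);
        }
        return string.Join("|", kept);
    }

    static void Main()
    {
        // Prints: BEND|BND|BLUFF|BLF|BLUF|BLUFFS
        Console.WriteLine(Dedupe("BEND|BEND|BND|BND|BLUFF|BLF|BLF|BLUF|BLUFF|BLUFFS|BLUFFS"));
    }
}
```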

tjhb
10,094 post(s)
#22-Sep-14 03:01

I'm not as at home in C# as you are (all the casting makes me sulk); I can read it OK but not speak it like a native.

I completely agree with you about the joys and pains of Regular Expressions. In my opinion a RegEx search string is only as useful as the verbose plain-English translation, group by group, operator by operator, that accompanies it in comments.

Not just for transfer between one person and another, but also for transfer between the original coder and the same person a day later (even before and after breakfast).

[Added:] But I see from Mike's link that Chris Schiffhauer does exactly that, only more so. So there is every hope of unpacking his expression and adapting it to play nicely with Manifold.
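In .NET, one way to keep that group-by-group translation next to the pattern itself is RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The pattern below is a deliberately tiny, hypothetical one (number, name, suffix), not the full expression from the project. Note that a literal space has to be written as [ ] in this mode.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerboseDemo
{
    // Hypothetical simplified pattern, commented group by group.
    static readonly Regex Simple = new Regex(@"
        ^(?<StreetNumber>[0-9]+)       # leading house number
        [ ]                            # one literal space
        (?<StreetName>.+?)             # street name, as short as possible
        [ ]                            # one literal space
        (?<StreetSuffix>ST|AVE|RD)$    # suffix at the end of the line
        ", RegexOptions.IgnorePatternWhitespace);

    public static Match Parse(string line)
    {
        return Simple.Match(line);
    }

    static void Main()
    {
        Match m = Parse("1257 MAIN ST");
        Console.WriteLine(m.Groups["StreetNumber"].Value); // 1257
        Console.WriteLine(m.Groups["StreetName"].Value);   // MAIN
        Console.WriteLine(m.Groups["StreetSuffix"].Value); // ST
    }
}
```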

jkelly


1,234 post(s)
#22-Sep-14 03:15

Ok, this bothered me, so I kept playing.

The issue is that the suffixes are all capitalised. The Address table had "St" instead of "ST", so it was never going to match! This is also the case with pre-directions and unit designators, but I don't have time to do those as well.

Someone with more regex skills would have changed the pattern, but I don't have them, so I brute-forced it by adding all the variations to the pattern. Not ideal, but it works.

I added in lower case and TitleCased for the suffixes, plus I've removed the duplicates.

All tested, see the below map for the passing solution.

Be careful running this on a large dataset, it's slow as molasses!

Edit: Thanks Tim for picking up my typo and stupid pattern matching error, fixed in the attached map below.

Attachments:
Address parser 3.map
Address parser2.map


James Kelly

http://www.locationsolve.com

jkelly


1,234 post(s)
#22-Sep-14 03:30

Messed that up. Ran out of time to update my mistakes.

Here is the correct one. I set the regex to ignore case, removed all my ugly hacky code, and removed or fixed the other errors that Tim highlighted.
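For anyone reading along without the map file, the ignore-case change can be sketched as follows. The suffix list and the EndsWithSuffix helper are hypothetical and much simpler than the real pattern; only RegexOptions.IgnoreCase is the actual .NET option.

```csharp
using System;
using System.Text.RegularExpressions;

static class IgnoreCaseDemo
{
    // Hypothetical shortened list, upper case only, as in the original script.
    const string Suffixes = "STREET|ST|AVENUE|AVE";

    // True when the line ends in a space followed by one of the suffixes.
    public static bool EndsWithSuffix(string line, RegexOptions options)
    {
        return Regex.IsMatch(line, " (" + Suffixes + ")$", options);
    }

    static void Main()
    {
        // Without the option, "St" does not match the upper-case "ST".
        Console.WriteLine(EndsWithSuffix("1257 Main St", RegexOptions.None));       // False
        // With the option, no lower-case or title-case variants are needed.
        Console.WriteLine(EndsWithSuffix("1257 Main St", RegexOptions.IgnoreCase)); // True
    }
}
```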

Have fun!

Attachments:
Address parser 3.map


James Kelly

http://www.locationsolve.com

mdsumner


4,260 post(s)
#22-Sep-14 11:35

What joy, but parsing addresses is so 2002! Doesn't everybody just use the google bot now?

1257 Main St Unit 1

(Now you have to parse the XML instead).


https://github.com/mdsumner

Mike Pelletier

2,122 post(s)
#22-Sep-14 17:52

Thanks Mike for more joy on using Google to get lat/long for your addresses. This would be quite useful. I found this link where someone has made this fairly automatic. I'll report back after I get a chance to try it.

Mike Pelletier

2,122 post(s)
#24-Sep-14 23:55

So the link I referred to above gives directions for creating an add-in within Excel using a VB script from the author. It does indeed provide Google's lat/long for your addresses, although you have to go slowly to avoid exceeding Google's usage restrictions. That's fair enough. Pretty handy tool for sure.

Mike Pelletier

2,122 post(s)
#22-Sep-14 15:20

Awesome, James and Tim. Thanks for making this work. It might be a slow script, but it's way faster than me parsing manually (that's actually slower than molasses) :-) It does a decent job too, leaving me with a much friendlier amount of cleanup work. It's also customizable to the extent I want to wrestle with regular expressions. Cheers!

Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.