Parsing Free-Text Addresses and a UK Postcode Regular Expression Pattern

We’ll be attempting to replicate the functionality of Google Maps using nothing but freely-available tools and data – SQL Server Express, OS Open Data, and a dash of Silverlight.

One of the features I’ll be demonstrating is a basic geocoding function – i.e. given an address, placename, or landmark, how do you look up and return the coordinates representing the location so that the map can centre on that place? This is not really a spatial question at all – it’s a question of parsing a free-text user input and using that as the basis of a text search of the database.

The simplest way of doing this is to force your users to enter Street Number, Street Name, Town, and Postcode in separate input elements that match the fields in your database. In this case, your query becomes straightforward:

SELECT X, Y FROM AddressDatabase WHERE StreetNumber = '10' AND StreetName = 'Downing Street' AND Town = 'London';

Most databases don’t contain the location of every individual address. If there is no exact matching StreetNumber record, then you typically find the closest matching properties on the same road and interpolate between them (it seems reasonable to assume that Number 10 Downing Street will be somewhere between Number 9 and Number 11).
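To make the interpolation idea concrete, here is a minimal Python sketch. The function name, the dictionary of known properties, and the coordinates in the usage note are all hypothetical, made up for illustration. It assumes house numbers run evenly along a straight stretch of road, and it ignores the common convention of odd and even numbers sitting on opposite sides of the street.

```python
from typing import Dict, Tuple

Coord = Tuple[float, float]


def interpolate_house(number: int, known: Dict[int, Coord]) -> Coord:
    """Estimate coordinates for a house number from known properties on the same road.

    `known` maps house numbers to (x, y) coordinates. This is a hypothetical
    structure for illustration, not the schema of any real dataset.
    """
    if not known:
        raise ValueError("no known properties on this road")
    if number in known:
        return known[number]  # exact match, nothing to interpolate

    lower = [n for n in known if n < number]
    higher = [n for n in known if n > number]

    # Outside the known range: fall back to the nearest known property.
    if not lower:
        return known[min(higher)]
    if not higher:
        return known[max(lower)]

    # Linear interpolation between the closest neighbours on either side.
    lo, hi = max(lower), min(higher)
    t = (number - lo) / (hi - lo)
    (x0, y0), (x1, y1) = known[lo], known[hi]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
```

With placeholder coordinates of (100.0, 200.0) for Number 9 and (120.0, 210.0) for Number 11, calling interpolate_house(10, ...) returns the midpoint (110.0, 205.0).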

Forcing users to enter each element of the address separately doesn’t necessarily create the most attractive UI, however. What’s more common is to use a single free-text search box into which users can type whatever they’re searching for – a placename, address, landmark, postcode etc. Nice UI, but horrible to make sense of the input. In these cases, the user might supply:

“10 Downing Street, London”

“Downing Street, St James’, LONDON”

“10, Downing St. SW1A 2AA”

…not to mention “10 Downig Street. London”, and any number of other misspellings or alternative formats.

One approach you might want to take in these cases is to use a RegEx pattern matcher to determine if any part of the string supplied is a postcode. The UK postcode format is defined by British Standard BS7666, and can be described using the following regular expression pattern:

(GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})

Matching the supplied address string against this RegEx doesn’t prove that a valid postcode was supplied, but just that some part of the user input matched the format for a postcode. The matching substring can then be looked up (say, against the CodePoint Open dataset) to confirm that it is real.
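As a rough sketch of how the pattern might be applied, the following Python uses the standard re module to search a free-text input for a postcode-shaped substring. The function name find_postcode is my own, and upper-casing the input and collapsing whitespace before matching is an assumption on my part: the pattern as written only matches upper-case letters separated by a single space.

```python
import re
from typing import Optional

# The pattern from the text, split across two string literals for readability.
POSTCODE_PATTERN = re.compile(
    r"(GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?)"
    r" [0-9][ABD-HJLNP-UW-Z]{2})"
)


def find_postcode(text: str) -> Optional[str]:
    """Return the first postcode-shaped substring of `text`, or None."""
    # Assumption: upper-case the input and collapse runs of whitespace,
    # so that "sw1a  2aa" is treated the same as "SW1A 2AA".
    normalised = " ".join(text.upper().split())
    match = POSTCODE_PATTERN.search(normalised)
    return match.group(1) if match else None


print(find_postcode("10, Downing St. SW1A 2AA"))   # SW1A 2AA
print(find_postcode("10 Downing Street, London"))  # None
```

Note that a postcode typed without its space (“SW1A2AA”) would not match this pattern, so you may want to handle that case separately.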

Once you’ve identified the postcode, you can then run a query to retrieve a list of roadnames that lie in that postcode, from something like the OSLocator dataset, and scan the remainder of the input to see if it contains any of those names. You can also scan for any numeric characters in the first part of the text input, which might represent a house number. If you find a matching property, with the same road name and valid postcode, you can be pretty sure you’ve found a match.
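The steps above could be strung together along the following lines. This is only a sketch under stated assumptions: parse_address and the roads_by_postcode dictionary are hypothetical names, and the dictionary stands in for whatever query you would actually run against a road-name dataset. It also does a plain substring comparison, so abbreviations such as “St.” would need extra normalisation before they matched “Street”.

```python
import re
from typing import Dict, List, Optional

POSTCODE_PATTERN = re.compile(
    r"(GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?)"
    r" [0-9][ABD-HJLNP-UW-Z]{2})"
)


def parse_address(
    text: str, roads_by_postcode: Dict[str, List[str]]
) -> Dict[str, Optional[str]]:
    """Pull a postcode, road name, and house number out of free text.

    `roads_by_postcode` is a hypothetical in-memory stand-in for a
    database lookup of the road names lying in each postcode.
    """
    result: Dict[str, Optional[str]] = {"postcode": None, "road": None, "number": None}
    normalised = " ".join(text.upper().split())

    # Step 1: look for a postcode-shaped substring.
    match = POSTCODE_PATTERN.search(normalised)
    if match is None:
        return result
    result["postcode"] = match.group(1)

    # Step 2: scan the rest of the input for a road name in that postcode.
    remainder = normalised[: match.start()] + " " + normalised[match.end():]
    for road in roads_by_postcode.get(result["postcode"], []):
        if road.upper() in remainder:
            result["road"] = road
            break

    # Step 3: treat leading digits as a possible house number.
    number = re.match(r"\s*(\d+)", remainder)
    if number:
        result["number"] = number.group(1)
    return result


roads = {"SW1A 2AA": ["Downing Street"]}  # placeholder lookup data
print(parse_address("10 Downing Street, SW1A 2AA", roads))
```

If all three fields come back populated, you have a candidate worth looking up; if only the postcode is found, you can still centre the map on the postcode itself.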

If you find more than one valid match, or only several partial matches, then you can of course present a disambiguation dialogue box – “Is this the 10 Downing Street you meant?”. For example, there are many “10 Downing Street”s in the UK – from Liverpool to Llanelli and Farnham to Fishwick. Without knowing either the town or the postcode, the input could have referred to any of the following:

[Image: “10 Downing Street” matches found across the UK]