Spatial Coordinate Calculation in SQL Server

Let’s discuss a little about the precision with which SQL Server stores coordinates in the geometry and geography datatype. To summarise:

  • Well-Known Text, the format from which spatial data is most commonly created in SQL Server, is a text string in which coordinates are stated as decimal values.
  • However, when you pass a WKT representation to a static method, such as STGeomFromText(), the resulting geometry or geography instance serializes each coordinate into an 8-byte binary value. In essence, each coordinate is CAST from nvarchar to binary(8).
  • When you retrieve the WKT of a geometry or geography instance, using STAsText() or ToString(), the stored binary coordinate value is implicitly CAST back to a decimal text string again.

Just as when converting any other data between datatypes, there is therefore the potential for truncation or rounding errors introduced when you store and retrieve spatial data in SQL Server. In practice, the significance of these errors when you’re just storing and retrieving values is pretty small – an 8-byte binary value is capable of representing most decimal values with a precision equivalent to around 15 decimal places. If your coordinates are stated with less precision than this, you can fairly confident that SQL Server will store and retrieve them with full precision.

In this post I’m going to investigate a related, but perhaps more important topic, which is the precision with which SQL Server performs calculations on geometry and geography data. This is something you are likely to notice in any spatial application, however precise your original supplied coordinates are, and explains why sometimes you may get “incorrect” answers – LineStrings apparently self-intersecting when they shouldn’t, Points that lie outside of Polygons when they don’t, etc.

Before I go any further, I should perhaps issue a disclaimer:

  • Spatial calculations involve some pretty complex maths (which, in SQL Server, are provided courtesy of Michael Kallay). I am not a mathematician (although my mum is).
  • AFAIK, there is no public documentation on exactly how SQL Server performs spatial calculations. The content of this post is therefore based solely on scraps of information, background reading, and my own empirical research.

It is quite likely that sections of this post (if not this post in its entirety) are either slightly or wholly factually incorrect. Basically, I wouldn’t rely on it. However, I also know that “Spatial” Ed Katibah has frequented this blog in the past and, if I’m lucky, he might read it and add some comments to correct me where I’ve got it wrong…

Demonstrating The Problem

Surely, the whole point of using SQL Server’s methods rather than creating your own spatial library is that you shouldn’t have to worry about the maths behind how STIntersects() works – you just use it and you get the right result, right?

Well, not always… let’s look at example. Consider the following two lines:

[php]DECLARE @line1 geometry = ‘LINESTRING(0 11, 430 310)’;
DECLARE @line2 geometry = ‘LINESTRING(0 500, 650 0)’;[/php]

You can view them in SSMS Spatial Results tab as follows:

[php]SELECT @line1
UNION ALL SELECT @line2;[/php]

image

The two lines clearly cross each other, and SQL Server will tell us the point at which they cross using the STIntersection() method:

[php]SELECT
@line1.STIntersection(@line2).ToString()[/php]

The result is POINT (333.88420666910952 243.16599486991572)

So far, nothing unusual, but what about if we run this query instead:

SELECT
@line1.STIntersection(@line2).STIntersects(@line1),   --0
@line1.STIntersection(@line2).STIntersects(@line2)   --0

The STIntersects() method returns 0 in both cases. When expressed in words, the result of this query suggests “The point at which line1 intersects line2 doesn’t intersect line1, or line2”. Huh?

Let’s look at another example, this time a Point-in-Polygon query:

[php]DECLARE @square geometry = geometry::STPolyFromText(‘POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))’, 0);
DECLARE @point geometry = geometry::STPointFromText(‘POINT(5.02 4.98)’, 0);
SELECT
@point.STIntersects(@square),
@point.STIntersection(@square).ToString();[/php]

Simple square polygon, ten units high by ten units wide, with a point roughly in the middle of it. Unsurprisingly, the STIntersects() method correctly reports that the two geometries intersect, but the point at which they intersect, given by STIntersection(), is stated as:

POINT (5.0199999809265137 4.9800000190734863)

We’d expect the point at which a Point intersects a Polygon to be exactly equal to the Point itself, yet there is a fractional difference in the coordinate values returned by STIntersection() than the original coordinates, 5.02 4.98, supplied. What’s interesting (as well as perhaps giving away a clue as to what is happening here) is that, as you increase the size of the square without changing the location of the point, the calculated point of intersection returned by STIntersection() gets less and less precise.

The following shows the point of intersection between the Point (5.02 4.98) and increasing-sized Polygons as returned by the STIntersection() method:

[php]POLYGON((0 0, 1000 0, 1000 1000, 0 1000, 0 0)) –> POINT (5.0200042724609375 4.9799957275390625)

POLYGON((0 0, 100000 0, 100000 100000, 0 100000, 0 0)) –> POINT (5.01953125 4.98046875)

POLYGON((0 0, 1E6 0, 1E6 1E6, 0 1E6, 0 0)) –> POINT (5.015625 4.984375)

POLYGON((0 0, 1E7 0, 1E7 1E7, 0 1E7, 0 0)) –> POINT (5 5)

POLYGON((0 0, 1E8 0, 1E8 1E8, 0 1E8, 0 0)) –> POINT (6 4)[/php]

The Explanation

So what’s going on here? It’s not the rounding issues caused by conversion from binary <-> decimal WKT explained earlier – that only occurs when new geometry data is instantiated from, or retrieved as, WKT. However, it is another type of conversion that happens internally…

Although SQL Server stores coordinates as floating point binary values, it performs spatial calculations using integer arithmetic. Floating point coordinates are, by the very nature of a floating point system, approximate, and floating point arithmetic is not “robust” – calculating the difference between two similar floating point values, for example, can create results containing errors with large orders of magnitude. Integer arithmetic, in contrast, is always exact and reliable, whatever the integer values supplied.

Therefore, in order to perform calculations on coordinates in a robust fashion, SQL Server first snaps those coordinates to an integer grid. Note that coordinate values are not simply converted to integer (which would lose any fractional precision), but they are snapped to the closest integer value on a dynamically-scaled integer grid. An analogy is to think back to mathematics lessons at school, in the days before MS Excel, when you had to draw graphs by hand on a piece of graph paper. The skill in drawing such graphs was to choose an appropriate scale – one in which each value in the underlying data could ideally be placed on a grid cell boundary, but subject to the fact that you had to get the whole range of data to fit on a fixed size piece of paper.

This is what SQL Server does with spatial calculations – it’s got an integer grid of fixed size (equivalent to the fixed size sheet of graph paper). In SQL Server 2008, this is a 27-bit grid. It then dynamically scales the coordinates involved in any particular operation to fit on this grid (the equivalent of choosing the scale for your graph), and each coordinate is snapped to the closest cell on the grid. Some results will lie exactly on a grid cell boundary, in which case the results returned by SQL Server will be exact but, in all other cases, the result will be subject to some error. The result of the integer calculation is then converted back into floating-point coordinates again, in the same scale as originally supplied.

SQL Server Integer Grid Snapping

The amount of error is determined by the overall scale (i.e. the size) of the geometries on which you are operating. The greater the range of coordinate values that must be covered by the grid, then the more coarse the grid resolution becomes, with each cell in the integer grid representing a larger value. This explains the effect demonstrated previously, whereby when the Polygon was of size POLYGON((0 0, 1E7 0, 1E7 1E7, 0 1E7, 0 0)) the coordinate values were snapped to the closest 0.25 unit. With a Polygon of size POLYGON((0 0, 1E8 0, 1E8 1E8, 0 1E8, 0 0)), coordinates were instead snapped to the closest 2 units.

Using a 27bit integer grid as in SQL Server 2008, on a line between the equator and the North Pole (10,000km), points will be snapped by no more than 10cm, which is pretty good accuracy for most applications.1 However, it means you should be very careful with interpreting results, especially if you are trying to test whether two instances are exactly equal (as in the STEquals() method). SQL Server Denali uses a 48bit grid, which effectively increases the grid resolution, meaning points will be snapped by a smaller amount, but still uses the same underlying method.

The Warm, Fuzzy Factor

Consider the following example, which creates a large triangular Polygon and a Point close to, but just underneath that triangle:

[php]DECLARE @triangle geometry = geometry::Parse(‘POLYGON((0 2.2, 40000000 20000002.2, 0 20000002.2, 0 2.2))’);
DECLARE @point geometry = geometry::Parse(‘POINT(5 4.6)’);[/php]

The formula for the line of the hypoteneuse of the triangle is y = 2.2 + (x/2). The POINT(5 4.6) lies below this line so does not intersect the triangle.

image

However, if you ask SQL Server, you’ll get a different result. SQL Server reports that the Point does intersect the triangle:

[php]SELECT
@point.STIntersects(@triangle) AS IsThePointIntheTriangle — 1[/php]

Now, you may think that the reason for this incorrect result is because the effect of snapping to the integer grid has caused the point to snap “upwards” into the triangle. However, this is not actually the case – the scale of the Polygon in this example means that the integer grid snaps coordinates to 0.5 unit. The y coordinate of the Point is actually snapped downwards from 4.6 to 4.5 – moving it further away from the triangle.

[php]SELECT
@point.STIntersection(@triangle).ToString() AS SnappedPoint; — POINT (5 4.5)[/php]

The reason for the incorrect result in this case is because, in order to account for possible errors caused by the integer-snapping process, SQL Server adds a “fuzziness”-factor to certain operations. In a nutshell, when working only with the snapped Point value of POINT (5 4.5), SQL Server can’t be certain whether the original supplied Point intersects the triangle or not.

In this example, the supplied Point was POINT(5 4.6), which lies outside the triangle. However, the original point could have been POINT(5 4.72), which lies inside the triangle but, when snapped to an integer grid to the closest 0.5 unit, would also have been snapped to the same point POINT(5 4.5). Seeing as SQL Server can’t be certain whether the point lies inside or outside, it seems to apply a “fuzzy” tolerance equal to half the snapping-distance of the integer grid. If the snapped point lies within the tolerance of the other geometry, SQL Server plays on the safe side and assumes that it intersects it.

This fuzziness can produce false positive results, with geometries deemed to intersect each other (or, indeed, themselves) when in fact they don’t. This can be one reason why large, very detailed polygons can be rejected by SQL Server – their size forces a coarse integer grid to be used, but the level of detail then creates self-intersection to occur as points coincide when snapped to the grid.

Convert, Reproject, and Load Spatial Data to SQL Server with OGR2OGR

I used OGR2OGR to join a set of shapefiles together prior to loading them into SQL Server using Shape2SQL. But OGR2OGR can do much more than simply appending shapefiles – it can convert data into different formats, transform coordinates into different spatial references systems, read and write a whole variety of spatial datasources, and (with a bit of fiddling) load them into SQL Server 2008.

To demonstrate, I thought I’d repeat the objective of my previous post, but instead of simply appending the Ordnance Survey open data shapefiles together, I would reproject them into a different SRID, and import them into SQL Server too using OGR2OGR alone, without using Shape2SQL or any other tools.

Use OGR2OGR to create a CSV file containing WKT

OGR2OGR can’t insert spatial data directly into SQL Server 2008, but it can create CSV files that can be read into SQL Server. To create an output file in CSV format, set the –f CSV output option. You can also manually set layer creation options to dictate the field delimiter and line return character of the CSV file, using the LINEFORMAT and SEPARATOR layer creation options.

By default, the CSV file created only contains columns containing the attributes associated with each shape (i.e the data in the .dbf file associated with a shapefile), but it doesn’t include the shape itself. To include the shape of the geometry in Well-Known Text format, you also need to state the –lco “GEOMETRY=AS_WKT” layer creation option.

But that’s not all – OGR2OGR can also reproject data between coordinate systems as it converts it. To convert to another spatial reference system, include the -t_srs parameter, together with either the full WKT definition (hard to escape properly,i.e. GEOGCS(ELLIPSOID(“WGS 84”, ……) or the EPSG code (much easier, i.e. EPSG:4326 for WGS84).

Putting this all together, you can use the following command to load a shapefile (in this case, the Ordnance Survey OS Vector Map settlement_area polygon shapefile), reproject the data into SRID 4326, and then save the result as a CSV file containing a column with the WKT of each resulting polygon:
[php]ogr2ogr -f "CSV" "Settlement_Area" "Settlement_Area.shp" -t_srs "EPSG:4326" -lco "GEOMETRY=AS_WKT" -lco "LINEFORMAT=CRLF" -lco "SEPARATOR=SEMICOLON" [/php]
NOTE – I’ve noticed some problems with the quote/speechmark characters in this blog being changed to “smartquotes”, even when I use the code-formatting option. OGR2OGR isn’t happy about this, so I recommend that you don’t try and copy and paste the line above into a CMD window – if you get an error about “Couldn’t fetch request layer uT” or something similar then be sure to retype using normal speech marks.

More information on CSV options for OGR2OGR can be found here: http://www.gdal.org/ogr/drv_csv.html

Create a Format File

To import the CSV into SQL Server, we have a range of options – you could use SSIS, or you could use the Tasks –> Import Data wizard. Or you could use BCP or BULK INSERT. However, I’m going to use OPENROWSET. This will allow me to write a single query to access the CSV file directly from T-SQL as if it were an existing table.

In order to do this, we first need to define a format file which specifies the datatypes of each of the columns in the input csv file, and how they should be mapped to datatypes of columns in a sql table. More information on format files can be found here.

The following shows the XML format file required for the OS VectorMaps Settlement Area shapefile just converted:

[php]<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="0" xsi:type="CharTerm" TERMINATOR="""/>
<FIELD ID="1" xsi:type="CharTerm" MAX_LENGTH="2147483647" TERMINATOR="";"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR=";"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="WKT" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="FEATCODE" xsi:type="SQLDECIMAL"/>
<COLUMN SOURCE="3" NAME="FEATDESC" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
[/php]
Query the CSV from SQL Server using the Format File

Finally, you can write a SELECT query in SQL Server to query the CSV file using  OPENROWSET in conjunction with the format file above. Since the CSV contains column headers, I’ve included the FIRSTROW=2 parameter to skip the header row. I’ve also used the geometry::Parse() method in the SELECT to dynamically create the geometry data from the Well-Known Text representation contained in the WKT column of the supplied CSV file.
[php]SELECT
FEATCODE,
FEATDESC,
geography::Parse(WKT)
FROM OPENROWSET(
BULK ‘C:Settlement_Area.csv’,
FORMATFILE=’C:Settlement_Area_format_file.xml’,
FIRSTROW = 2
) AS CSV;[/php]
And here’s the results showing my Ordnance Survey data, originally provided as a shapefile using projected coordinates in SRID 27700, now loaded as latitude/longitude coordinates (EPSG:4326) using the geography datatype:

image

Ordnance Survey Open Data into SQL Server 2008

One of the things that I didn’t realise until I started blogging was how interesting I would find the data collected on who actually reads my blog, where they’ve come from, and what they were looking for. As I expected based on the content I’ve published so far, many of my visitors have been looking for information on the Bing Maps tile system, on WMS servers, SQL Server spatial reference systems, and on terrain and elevation information. I hope that some of them have found some helpful information… However, last month, for example, I was also fascinated to find a number of people had found my site having searched for “Alastair Aitchison professional CV” – who are you? what were you looking for?!

Anyway, reading through my search engine referrals, another big area that people seem to be looking for information on is about using Ordnance Survey data in SQL Server. I have posted about the Ordnance Survey before, and about spatial data in SQL Server, but not specifically about the two together, so those people have probably been left disappointed. This post is an attempt to rectify the situation…

The Ordnance Survey, if you don’t know, is the national mapping agency of Great Britain, and it is the source of most of the high-quality mapping information in the country. For a long time, however, their data was tightly-controlled, and not affordable to use unless you were a large organisation capable of paying a corporate licence. This was hugely frustrating for small or charitable organisations trying to make use of spatial information.

Now, ex-Prime Minister Gordon Brown didn’t do much for the UK, but it is largely thanks to him that in the last year the situation has changed. The story goes that, at some formal dinner or other, Gordon Brown ended up having a conversation with Sir Tim Berners-Lee. The prime minister asked what the government should do to make better use of the internet, to which Berners-Lee replied “Put all the government’s data online”. It’s not happened overnight, but there have been some great steps taken towards increasing public access to UK government data, including crime information, census and demographic information, and spatial information from the Ordnance Survey. So here’s a step-by-step guide as to how to load that data.

Get Some OS Data

First, acquire the data from the Ordnance Survey Open Data website. There are several “products” to choose from. The ones I find most interesting are:

    1. StrategiMeridian2OS VectorMap District – vector polygon, polyline, and point features of e.g. administrative boundaries, developed land use areas, woodland, roads, rivers and coastline, at small- medium- and high- scale respectively.
    2. Code-Point Open – locates every postcode unit, and maps to corresponding coordinates / health authority / county etc.
    3. 1:50,000 Scale Gazetteer – as the name says, a gazetteer of placenames and locations

Note that the Ordnance Survey have not made all their data freely-available – if you want the really high-quality datasets (such as OS Mastermap), you’ll still have to pay for it, but OpenData gives you a good start. Even though the OpenData products are of limited resolution (equivalent to, say, the 1:50,000 Landranger series of maps), they are of good quality – better quality data than, say the equivalent  TIGER data in the U.S – and they are much better than having no free data at all, which was the case 12 months ago…

Prepare the Data

Most OS datasets are provided in ESRI shapefile format. If you’re using one of the smaller scale datasets (e.g. Strategi or Meridian2), each feature layer (i.e. roads, rivers, railways) will come in its own shapefile. You’ll probably want to keep each layer in its table in the database, so this is fine.

However, if you want to load the larger scale OS VectorMap data you’ll find that it is split up into smaller files, each one labelled with the appropriate 100km x 100km grid square reference (e.g. TQ, TL, NT…). Then, within the download of each of these grid squares,  there is a subdirectory for each 10km x 10km subsquare. So within the TQ directory there will be folders labelled TQ00, TQ01, TQ02, … , TQ99. Each of these subfolders follows exactly the same structure – containing one shapefile for each of the feature layers within that square. So, if you want to load the features of a complete 100km x 100km square of data in one go, you first need to merge the shapefiles in all of the subdirectories together. You can do so using OGR2OGR from the command line prompt as in the following script:

for /R %f in (*.shp) do (
if not exist %~nf.shp (
ogr2ogr -f “ESRI Shapefile” “%~nf.shp” “%f”
)
else (
ogr2ogr -f “ESRI Shapefile” -update -append “%~nf.shp” “%f”
)
)

This will look for all shapefiles in all subdirectories of the directory in which the script is run – if a file of the same name (e.g. AdministrativeBoundary.shp) has been found in another subdirectory already, they will be merged together, if not, a new file will be created. Resulting merged files will be created in the current directory. (This was adapted from a script by Darren Cope)

Load the Data

Having prepared the shapefiles, you can load them into SQL Server using a tool of your choice…. if you don’t want to splash out a commercial licence for Safe FME then I recommend Shape2SQL – you can use it to load shapefiles to SQL Server either via the GUI interface or via the commandline as described here.

Ordnance Survey data is defined using the National Grid of Great Britain, so always make sure that you load your data into a column of the geometry datatype, using SRID 27700.

And that’s all there is to it! Here’s the data loaded from combined table of the OS Vector Map settlement_area and road_line shapefiles, showing the buildings and roads of the city centre of Norwich (still beautiful even when viewed in the pastel shades of SQL Server Management Studio):