The question “How does Google create and maintain/add location records for their “Google Places” database?” was asked recently on Quora. I am reproducing my answer here so that readers who are new to the blog can get some high-level background on Google Places:
Google obtains records for their business listings from:
- the major list dealers like InfoUSA,
- feeds from trusted sources like CityGrid,
- scraping trusted structured websites like Superpages or BBB,
- scraping less trusted and less structured directories,
- user input via their MapMaker product and community edits of unclaimed listings in Maps,
- crawling the web in general,
- and business-claimed records via the Places Dashboard.
This data is essentially triangulated to create the Places search result.
Every time they identify a unique phone/business name/address combination, they create what is known as a cluster, into which all known structured and unstructured data about that business is placed.
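For readers who think in code, here is a rough sketch of the clustering idea. Google's actual implementation is not public, so everything here is my own assumption for illustration: the `cluster_key` and `build_clusters` helpers, the record fields, and the very simple normalization rules are all hypothetical.

```python
import re
from collections import defaultdict


def cluster_key(name, phone, address):
    """Build a hypothetical cluster key from normalized phone, name, and address."""
    # Keep only the digits of the phone number.
    digits = re.sub(r"\D", "", phone)

    def normalize(text):
        # Lowercase, drop punctuation, and collapse runs of whitespace.
        text = re.sub(r"[^a-z0-9 ]", "", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    return (digits, normalize(name), normalize(address))


def build_clusters(records):
    """Group raw records from any source under their shared cluster key."""
    clusters = defaultdict(list)
    for rec in records:
        key = cluster_key(rec["name"], rec["phone"], rec["address"])
        clusters[key].append(rec)
    return clusters


# Made-up records from two different sources that describe the same business.
records = [
    {"name": "Joe's Pizza", "phone": "(555) 123-4567",
     "address": "12 Main St.", "source": "trusted_feed"},
    {"name": "JOES PIZZA", "phone": "555.123.4567",
     "address": "12  main st", "source": "directory_scrape"},
]
clusters = build_clusters(records)
```

Both sample records normalize to the same key, so they land in a single cluster despite the formatting differences between the sources.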
The data that can be normalized is normalized and matched against the same field from all the other sources. If there are discrepancies, Google decides which value is accurate by picking the data from the most trusted and most recent sources. The strongest trust is given to data from their own claiming process, which requires direct postcard verification. If a listing is unclaimed, preference is given to verified lists like those of InfoUSA, then to trusted feeds, and so on down the chain of trust.
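The "most trusted, then most recent" rule can be sketched in a few lines. This is only an illustration under my own assumptions: the `TRUST_RANK` table, the source type names, and the `resolve_field` helper are hypothetical, and the real weighting is certainly more nuanced than a strict ranking.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical trust ranking (0 = most trusted), following the order described above.
TRUST_RANK = {
    "owner_claimed": 0,       # postcard-verified claim in the Places Dashboard
    "verified_list": 1,       # e.g. a major list dealer
    "trusted_feed": 2,
    "structured_scrape": 3,
    "unstructured_scrape": 4,
}


@dataclass
class Observation:
    source_type: str  # one of the TRUST_RANK keys
    seen_on: date     # when this value was last observed
    value: str        # the normalized field value, e.g. a phone number


def resolve_field(observations):
    """Pick the value from the most trusted source; break ties by recency."""
    best = min(
        observations,
        key=lambda o: (TRUST_RANK[o.source_type], -o.seen_on.toordinal()),
    )
    return best.value


# Made-up conflicting phone numbers for an unclaimed listing.
unclaimed = [
    Observation("structured_scrape", date(2011, 6, 1), "555-0000"),
    Observation("verified_list", date(2010, 1, 1), "555-1111"),
    Observation("verified_list", date(2011, 1, 1), "555-2222"),
]
unclaimed_phone = resolve_field(unclaimed)

# Once the owner claims the listing, their value wins even if it is older.
claimed = unclaimed + [Observation("owner_claimed", date(2009, 5, 1), "555-3333")]
claimed_phone = resolve_field(claimed)
```

In this toy version an older owner-claimed value always beats a newer scraped one, which matches the strong preference for claimed data described above.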
They often end up with two clusters that are essentially identical or differ only in small details. They run a merge/purge process that merges these two clusters into one. This system uses not just geographic signals but language similarities as well to decide whether two listings should be merged. Errors and lack of granularity in this function can lead to the merging of two unrelated businesses that are located physically close to one another and happen to be in the same line of work. Conversely, if the system fails to merge two records for the same business, the listing might lose rank because the cluster data is split between two records. At this point in time there is NO formal mechanism to unmerge two merged listings, although there are some off-Google techniques that might accomplish it. A bad merge of this kind is known among the cognoscenti as a “Cluster-F**k”.
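Here is a toy version of a merge decision that combines a geographic signal with a language signal. It is a sketch under my own assumptions, not Google's method: the `should_merge` function, the 50 meter radius, and the 0.8 name similarity threshold are all made up, and I am using Python's standard `difflib.SequenceMatcher` as a stand-in for whatever text matching Google really uses.

```python
import math
from difflib import SequenceMatcher


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points (haversine formula)."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Similarity ratio between 0 and 1 for two business names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def should_merge(a, b, max_meters=50.0, min_similarity=0.8):
    """Hypothetical rule: merge only if the listings are close AND similarly named."""
    close = distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_meters
    similar = name_similarity(a["name"], b["name"]) >= min_similarity
    return close and similar


# Made-up listings: the same pizzeria seen in two sources.
joe_a = {"name": "Joe's Pizza", "lat": 40.0, "lon": -75.0}
joe_b = {"name": "Joes Pizza", "lat": 40.0, "lon": -75.0}

# Two different practices a few doors apart, in the same line of work.
dental = {"name": "Main Street Dental", "lat": 40.0, "lon": -75.0}
dentistry = {"name": "Main Street Dentistry", "lat": 40.0001, "lon": -75.0}

same_business = should_merge(joe_a, joe_b)
neighbors = should_merge(dental, dentistry)
```

With these thresholds the two dental practices are merged as well, even though they are separate businesses. That is the failure mode described above: two nearby businesses in the same line of work look like one listing to a rule that is too coarse.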
Here is an article that summarizes their clustering technology