utf | Mapsys.info

Google Places: Google Removes Spam

It appears that Google has removed most but not all review spam from the Moishe’s Moving System’s Places page and from many of the other Places pages affected by this scam. On Moishe’s Places page, the spam that remains (besides their response spam) was posted between July 1 and July 3 and seems to still affect 35 or so other moving companies nationwide. Whether Google just removed the spam affecting the most companies or it is still a work in progress is not yet clear. Kudos to Google for moving on this problem.

Here are a few samples of the spam that remains and is still affecting moving companies nationwide:

Another interesting sidelight is that Google is not alone in having been hit with this spam. According to Google’s index, Superpages has been seeing this stuff since February 2010. It is also present on Rateitall.com, Judy’s Book, Yellowbot, InsiderPages, MyMovingReviews and Kudzu, starting last fall and continuing into early this year. While this dreck is visible on all of these sites, it is much less pervasive than at Google. Whether it was already taken down elsewhere or the extortionists are just ramping up their game is not yet clear.

Fake reviews are a problem whether perpetrated by the businesses themselves or by others attempting to gain advantage at the expense of the business. The answer to the problem is not totally clear but a solution probably will need a number of components:

More FTC enforcement and education
Better filtering algorithms on the part of the search engines
Improved and more viable business complaint options, dispute resolution and removal mechanisms.

Google Places is not the only environment in which this abuse is taking place. But Google can and should provide a lead in developing an exemplary review environment that is fair to the public and fair to the businesses being reviewed.

PHP and UTF-8 BOM

I recently wrote some PHP for the first time in ages, and noticed some of my pages were appearing on one development machine, in some browsers, preceded by the characters ï»¿. These characters didn’t show up when editing the pages, and they didn’t show up at all when served from a different server or when viewed in some other browsers.

Initially, I thought it had something to do with not having set the correct character set in the response header (generally the main cause of garbled characters in webpages). But when I checked the response header it seemed OK; I was outputting UTF-8 as desired:

[php]header('Content-Type: text/html; charset=UTF-8');[/php]

And browsers viewing the page were correctly auto-detecting the character encoding as UTF-8.

Then I checked the configuration of the server, which was also set up correctly with Unicode support. And then I checked the encoding of the PHP scripts themselves, which were all encoded as UTF-8 (Windows code page 65001). So far, everything seemed consistent, so where were those garbled characters coming from?

UTF-8 with or without signature – your choice. (Or not).

The reason, as I found out, was that one of my development environments (Visual Studio, from which I’d made the most recent edits to the affected pages) was configured to save UTF-8 encoded files with signature. Visual Studio’s options for Unicode character encoding list UTF-8 both with and without signature (notice that they’re both the same code page, 65001).

There seems to be very little convention or standardisation as to the use of this “signature”. I hadn’t really come across this problem before because I generally use Eclipse for PHP development, which has its own list of encoding options.

Notice that, although there are several flavours of UTF-16 available in Eclipse, there is only one version of UTF-8, which is equivalent to Visual Studio’s without signature.

Then again, consider the options in Windows Notepad (yes, I use that sometimes as well). As in Eclipse, there is only one choice of UTF-8, but this time the sole option available provides the opposite behaviour, always saving UTF-8 with signature.

BOM BOM!

The optional “signature” in question is the byte order mark, or BOM. A BOM is needed for encodings whose code units span more than one byte, such as UTF-16, to indicate big-endianness or little-endianness: the order in which the bytes of each unit are arranged. All of the save dialogs above let you specify the byte order for UTF-16, since there the byte order matters. UTF-8, however, is built from single-byte code units (that’s what the “8” stands for: 8 bits = 1 byte). A character may take between one and four of those bytes, but they always appear in the same order, so a BOM is not required and doesn’t really make sense.
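To make the byte-order point concrete, here is a minimal sketch in Python (used only as a neutral scripting tool; it is not part of the setup described in this post). It encodes one character as UTF-16 in both byte orders and as UTF-8, and shows the marks that Python’s standard `codecs` module defines for UTF-16. The variable names are just for illustration.

```python
import codecs

text = "A"  # U+0041

# UTF-16 stores each code unit in two bytes, and those two bytes can be
# written in either order.
utf16_be = text.encode("utf-16-be")  # big-endian:    00 41
utf16_le = text.encode("utf-16-le")  # little-endian: 41 00

# The byte order mark is U+FEFF written at the start of the data; a reader
# can tell from its byte order which variant follows.
bom_be = codecs.BOM_UTF16_BE  # FE FF
bom_le = codecs.BOM_UTF16_LE  # FF FE

# UTF-8 is a plain sequence of single bytes, so there is no order to choose.
utf8 = text.encode("utf-8")  # 41

print(utf16_be.hex(), utf16_le.hex(), utf8.hex())
print(bom_be.hex(), bom_le.hex())
```

The two UTF-16 results contain the same two bytes in opposite order, which is exactly the ambiguity a BOM resolves.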

Even though UTF-8 always uses the same byte order, a UTF-8 encoded file can begin with the bytes EF BB BF, which merely signify that it is in UTF-8 format. These bytes are the UTF-8 encoding of the BOM character (U+FEFF), but here they carry no byte-order information, which is why Visual Studio calls them a “signature”. The problem is that some clients don’t expect UTF-8 to have a BOM and, as it turns out, the PHP engine is one of them. At least, some builds of the PHP engine. One of my PHP servers, running on a Linux machine, interpreted the UTF-8 file with signature fine, whereas another, running under Windows, tried to display the leading bytes as content on the page, which is how you end up with ï»¿ (the three bytes read as ISO-8859-1 or Windows-1252 characters).
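The following sketch, again in Python and purely illustrative, shows where the three stray characters come from: the bytes EF BB BF are decoded once as Windows-1252 and once with Python’s `utf-8-sig` codec, which recognises and drops the signature. The `render_as` helper and the sample page content are hypothetical names made up for this example.

```python
import codecs

signature = codecs.BOM_UTF8  # the three bytes EF BB BF


def render_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes the way a client assuming `encoding` would."""
    return raw.decode(encoding)


# A page whose source file was saved as UTF-8 with signature.
page = signature + "<html>".encode("utf-8")

# A client that treats the bytes as Windows-1252 shows three extra characters.
print(render_as(page, "cp1252"))     # ï»¿<html>

# A decoder that knows about the signature removes it.
print(render_as(page, "utf-8-sig"))  # <html>
```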

Different default encoding behaviours across editors, combined with different server and browser behaviours when interpreting UTF-8 files with a BOM, mean that this problem can be a little tricky to diagnose.
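One way to take the guesswork out of the diagnosis is to check the files directly rather than trusting what an editor reports. The sketch below is a hedged example in Python, not something from my original troubleshooting: it walks a directory tree and lists the files that start with EF BB BF. The function names and the `*.php` default pattern are assumptions chosen for this illustration.

```python
import codecs
from pathlib import Path
from typing import List


def has_utf8_signature(path: Path) -> bool:
    """Return True if the file starts with the bytes EF BB BF."""
    with path.open("rb") as handle:
        return handle.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def find_signed_files(root: Path, pattern: str = "*.php") -> List[Path]:
    """List files under `root` matching `pattern` that carry the signature."""
    return sorted(
        path
        for path in root.rglob(pattern)
        if path.is_file() and has_utf8_signature(path)
    )
```

Calling `find_signed_files(Path("."))` from the project root would return the scripts most likely to be responsible for the stray characters.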

This is reported as a PHP bug at http://bugs.php.net/bug.php?id=22108, but the workarounds are actually quite straightforward (once you know what the problem is!):

If you’re using Visual Studio, make sure you save your PHP files as UTF-8 without signature. If you’re using Eclipse, this is the default anyway.
Compile your PHP with the --enable-zend-multibyte option, which will correctly parse the BOM at the start of the file.
If you don’t need Unicode at all, you could use ISO-8859-1, or another non-UTF-8 encoding.
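If re-saving each file by hand is impractical, the signature can also be removed with a small script. This is a minimal sketch in Python, offered as one possible approach rather than a recommended tool; the function name is hypothetical, and it rewrites the file in place, so try it on a copy first.

```python
import codecs
from pathlib import Path


def strip_utf8_signature(path: Path) -> bool:
    """Remove a leading EF BB BF from the file, if present.

    Returns True when the file was rewritten, False when it had no signature.
    """
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the three signature bytes, unchanged.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Because it works on raw bytes and never decodes the content, the rest of the file is left exactly as it was.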
