I recently wrote some PHP for the first time in ages, and noticed some of my pages were appearing on one development machine, in some browsers, preceded by the characters . These characters didn’t show up when editing the pages, and they didn’t show up at all when served from a different server or when viewed in some other browsers.
Initially, I thought that it was something to do with not having configured the correct character set in the response header (which is generally the main cause of garbled characters appearing in webpages), but, checking the response header it seemed ok – I was outputting UTF-8 as desired:
[php]header(‘Content-type: text/html; charset=UTF-8’) ;[/php]
And browsers viewing the page were correctly auto-detecting the character encoding as UTF-8:
Then I checked the configuration of the server, which was also set up with Unicode support correctly. And then I checked the encoding of the PHP scripts themselves, which were all encoded using Unicode UTF-8 – (Windows Codepage 65001). So far, everything seemed consistent, so where were those garbled characters coming from?
UTF-8 with or without signature – your choice. (Or not).
The reason, as I found out, was that one of my development environments (Visual Studio – from which I’d made the most recent edits to the affected pages) was configured to save UTF-8 encoded files with signature. Here’s the options for Unicode character encoding in Visual Studio, showing UTF-8 both with and without signature (notice that they’re both the same codepage – 65001):
There seems to be very little convention or standardisation as to the use of this “signature”. I hadn’t really come across this problem before because I generally use Eclipse for PHP development. The encoding options there are shown below:
Notice that, although there are several flavours of UTF-16 available in Eclipse, there is only version of UTF-8, which is equivalent to Visual Studio’s without signature.
Then again, here are the options in Windows Notepad (yes, I use that sometimes as well). As in Eclipse, there is only one choice of UTF-8, but this time the sole option available provides the opposite behaviour – always saving UTF-8 with signature:
BOM BOM!
The optional “signature” in question is the Byte-Order Marker, or BOM. A byte-order marker is required for multibyte encoded data, including UTF-16, to indicate big-endianness or little-endianness – the order in which bytes are arranged. All of the save dialogs above give you the choice for specifying the byte order for Unicode UTF-16, since in a multibyte format the byte order matters. However, for UTF-8, which uses only a single byte for each character (that’s what the “8” stands for – 8 bits = 1 byte) a BOM is not required and doesn’t really make sense.
Even though UTF-8 always uses the same byte-order, a UTF-8 encoded file can begin with the bytes EF BB BF, which merely signifies that it is in UTF-8 format. It’s not really a BOM, hence why Visual Studio calls it a “signature”. The problem is that some clients don’t expect UTF-8 to have a BOM and, as it turns out, the PHP engine is one of them. At least,some builds of the PHP engine. One of my PHP servers, running on a linux machine, interpreted the UTF-8 file with signature fine, whereas another, running under Windows, tried to display the leading bytes as content on the page, which is how you end up with .
The combination of different default encoding behaviours across different editors combined with different server/browser behaviours when interpreting UTF-8 files with BOM means that this problem can be a little tricky to diagnose.
This is reported as a PHP bug at http://bugs.php.net/bug.php?id=22108, but the workarounds are actually quite straightforward (once you know what the problem is!):
- If you’re using Visual Studio, make sure you save your PHP files as UTF-8 withoutsignature. If you’re using Eclipse, this is the default anyway.
- Compile your PHP with the –enable-zend-multibyte option, which will correctly parse the BOM at the start of the file
- If you don’t need unicode at all, you could use ISO-8566-1, or another non-UTF-8 encoding