PHP and UTF-8 BOM

I recently wrote some PHP for the first time in ages, and noticed that some of my pages, when served from one development machine and viewed in some browsers, were preceded by the characters ï»¿. These characters didn’t show up when editing the pages, and they didn’t appear at all when the pages were served from a different server or viewed in some other browsers.

Initially, I thought it had something to do with not having configured the correct character set in the response header (which is generally the main cause of garbled characters appearing in webpages). But when I checked the response header it seemed OK: I was outputting UTF-8 as desired.

[php]header('Content-Type: text/html; charset=UTF-8');[/php]

And browsers viewing the page were correctly auto-detecting the character encoding as UTF-8:

image

Then I checked the configuration of the server, which was also set up correctly with Unicode support. And then I checked the encoding of the PHP scripts themselves, which were all saved as Unicode UTF-8 (Windows codepage 65001). So far, everything seemed consistent, so where were those garbled characters coming from?

UTF-8 with or without signature – your choice. (Or not).

The reason, as I found out, was that one of my development environments (Visual Studio, from which I’d made the most recent edits to the affected pages) was configured to save UTF-8 encoded files with signature. Here are the options for Unicode character encoding in Visual Studio, showing UTF-8 both with and without signature (notice that they’re both the same codepage, 65001):

image

There seems to be very little convention or standardisation as to the use of this “signature”. I hadn’t really come across this problem before because I generally use Eclipse for PHP development. The encoding options there are shown below:

image

Notice that, although there are several flavours of UTF-16 available in Eclipse, there is only one version of UTF-8, which is equivalent to Visual Studio’s UTF-8 without signature.

Then again, consider the options in Windows Notepad (yes, I use that sometimes as well). As in Eclipse, there is only one choice of UTF-8, but this time the sole option available provides the opposite behaviour: it always saves UTF-8 with signature.

BOM BOM!

The optional “signature” in question is the byte order mark, or BOM. A BOM is used with encodings whose code units span more than one byte, such as UTF-16, to indicate big-endianness or little-endianness – the order in which the bytes of each unit are arranged. All of the save dialogs above give you the choice of byte order for UTF-16, since with 16-bit units the byte order matters. UTF-8 is different: although a single character can take anywhere from one to four bytes, the encoding is defined as a sequence of individual 8-bit units (that’s what the “8” stands for), so there is only one possible byte order, and a BOM is not required and doesn’t really make sense.
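A quick way to see byte order in action is to dump the raw bytes of a few characters. The snippet below is only an illustrative sketch: it assumes the mbstring extension is loaded, and the variable names are my own. It encodes the letter A as UTF-16 in both byte orders, then looks at how many bytes some characters take in UTF-8.

```php
<?php
// Assumption: the mbstring extension is available for mb_convert_encoding().

// The letter "A" (U+0041) as UTF-16, in each byte order.
$be = bin2hex(mb_convert_encoding('A', 'UTF-16BE', 'UTF-8')); // big-endian
$le = bin2hex(mb_convert_encoding('A', 'UTF-16LE', 'UTF-8')); // little-endian
echo "UTF-16BE: $be\n"; // 0041
echo "UTF-16LE: $le\n"; // 4100

// UTF-8 is a stream of single bytes, so there is no order to get wrong,
// but one character may still occupy more than one byte.
// Byte escapes are used so the result does not depend on how this file is saved.
$eAcute = "\xC3\xA9";     // U+00E9, e with acute accent
$euro   = "\xE2\x82\xAC"; // U+20AC, euro sign
echo strlen('A'), "\n";     // 1
echo strlen($eAcute), "\n"; // 2
echo strlen($euro), "\n";   // 3
```

The two UTF-16 results contain the same two bytes in opposite order, which is exactly the ambiguity a BOM resolves. The UTF-8 bytes only ever have one possible order.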

Even though UTF-8 always uses the same byte order, a UTF-8 encoded file can begin with the bytes EF BB BF. These bytes are the UTF-8 encoding of the BOM character U+FEFF, but here they convey no byte order and merely signal that the file is in UTF-8 format, which is presumably why Visual Studio calls it a “signature”. The problem is that some clients don’t expect UTF-8 to have a BOM and, as it turns out, the PHP engine is one of them. At least, some builds of the PHP engine. One of my PHP servers, running on a Linux machine, handled the UTF-8 file with signature fine, whereas another, running under Windows, passed the leading bytes through as content on the page, which is how you end up with ï»¿ (the three bytes displayed as single-byte characters).
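To see where those three characters come from, you can take the signature bytes and deliberately misread them as ISO-8859-1. This is a rough sketch rather than anything definitive: it assumes PHP 7 or later with the mbstring extension, and `startsWithUtf8Bom()` is a hypothetical helper name of my own.

```php
<?php
// Assumption: PHP 7+ with the mbstring extension loaded.

// The three bytes of the UTF-8 signature.
$bom = "\xEF\xBB\xBF";

// Treat each byte as one ISO-8859-1 character, then re-encode the result
// as UTF-8 so it can be printed on a UTF-8 terminal or page.
$misread = mb_convert_encoding($bom, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";          // ï»¿
echo bin2hex($misread), "\n"; // c3afc2bbc2bf

// Hypothetical helper: does this string begin with the UTF-8 signature?
function startsWithUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

var_dump(startsWithUtf8Bom($bom . '<?php echo 1;')); // bool(true)
var_dump(startsWithUtf8Bom('<?php echo 1;'));        // bool(false)
```

Passing the first few bytes of a suspect script (read with `file_get_contents()`, for example) to a check like this tells you straight away whether your editor saved it with a signature.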

Different editors have different default encoding behaviours, and different servers and browsers interpret UTF-8 files with a BOM differently, so this problem can be a little tricky to diagnose.

This is reported as a PHP bug at http://bugs.php.net/bug.php?id=22108, but the workarounds are actually quite straightforward (once you know what the problem is!):

  • If you’re using Visual Studio, make sure you save your PHP files as UTF-8 without signature. If you’re using Eclipse, this is the default anyway.
  • Compile your PHP with the --enable-zend-multibyte option, which will correctly parse the BOM at the start of the file. (In PHP 5.4 and later this configure option was replaced by the zend.multibyte runtime setting.)
  • If you don’t need Unicode at all, you could use ISO-8859-1, or another non-UTF-8 encoding.
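If files with a signature have already crept into a project, the first workaround can also be applied after the fact by stripping the three bytes from the affected files. The following is a minimal sketch under my own assumptions (PHP 7 or later; the function names and the `UTF8_BOM` constant are hypothetical, not part of PHP), and it rewrites files in place, so try it on a copy first.

```php
<?php
// Assumption: PHP 7+ (scalar type hints and return types).

// The UTF-8 signature bytes. The constant name is illustrative only.
const UTF8_BOM = "\xEF\xBB\xBF";

// Return $contents without a leading UTF-8 signature, if it has one.
function stripUtf8Bom(string $contents): string
{
    if (strncmp($contents, UTF8_BOM, 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

// Rewrite the file at $path without its signature.
// Returns true only if the file was actually changed.
function stripUtf8BomFromFile(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        return false; // unreadable file: leave it alone
    }
    $clean = stripUtf8Bom($contents);
    if ($clean === $contents) {
        return false; // no signature present, nothing to do
    }
    return file_put_contents($path, $clean) !== false;
}
```

A typical use would be to loop over the scripts in a directory, for example `foreach (glob('*.php') as $file) { stripUtf8BomFromFile($file); }`, and then fix the editor setting so the signature doesn’t come back on the next save.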

Android: New Eclipse plug-in

At the Google I/O conference a month ago, we demonstrated the next version of the Android Development Tools (ADT) plugin. Today we’re happy to announce that version 11 is done and available for download!

ADT 11 focuses on editor improvements. First, it offers several new visual refactoring operations, such as “Extract Include” and “Extract Style,” which help automatically extract duplicated layout fragments and style attributes into reusable layouts, styles, and themes.

Second, the visual layout editor now supports fragments, palette configurations, and improved support for custom views.

Last, XML editing has been improved with new quick fixes, code completion in more file types and many “go to declaration” enhancements.
 
ADT 11 packs a long list of new features and enhancements. Please visit our ADT page for more details. For an in-depth demo, check out the video of our Android Development Tools session at Google I/O, below.

Please note that the visual layout editor depends on a layout rendering library that ships with each version of the platform component in the SDK. We are currently working on a number of improvements to this library as well, which we plan to release soon for all platform versions. When we release the updates, some new features in ADT 11 will be “unlocked” – such as support for ListView previewing – so keep an eye on this blog for further announcements.