PHP and UTF-8 BOM

I recently wrote some PHP for the first time in ages, and noticed that some of my pages, when served from one development machine and viewed in some browsers, were preceded by the characters ï»¿. These characters didn’t show up when editing the pages, and they didn’t show up at all when the pages were served from a different server or viewed in other browsers.

Initially, I thought it had something to do with not having set the correct character set in the response header (generally the main cause of garbled characters appearing in webpages), but when I checked the response header it seemed OK. I was outputting UTF-8 as desired:

[php]header('Content-Type: text/html; charset=UTF-8');[/php]

And browsers viewing the page were correctly auto-detecting the character encoding as UTF-8:

image

Then I checked the configuration of the server, which was also set up correctly for Unicode. And then I checked the encoding of the PHP scripts themselves, which were all saved as Unicode UTF-8 (Windows codepage 65001). So far, everything seemed consistent, so where were those garbled characters coming from?

UTF-8 with or without signature – your choice. (Or not).

The reason, as I found out, was that one of my development environments (Visual Studio, from which I’d made the most recent edits to the affected pages) was configured to save UTF-8 encoded files with signature. Here are the options for Unicode character encoding in Visual Studio, showing UTF-8 both with and without signature (notice that they’re both the same codepage, 65001):

image

There seems to be very little convention or standardisation as to the use of this “signature”. I hadn’t really come across this problem before because I generally use Eclipse for PHP development. The encoding options there are shown below:

image

Notice that, although there are several flavours of UTF-16 available in Eclipse, there is only one version of UTF-8, which is equivalent to Visual Studio’s UTF-8 without signature.

Then again, here are the options in Windows Notepad (yes, I use that sometimes as well). As in Eclipse, there is only one choice of UTF-8, but this time the sole option available provides the opposite behaviour, always saving UTF-8 with signature:

BOM BOM!

The optional “signature” in question is the byte order mark, or BOM: the character U+FEFF placed at the very start of a file. In encodings built from multi-byte code units, such as UTF-16, the BOM indicates big-endianness or little-endianness, the order in which the bytes of each unit are arranged. All of the save dialogs above let you specify the byte order for UTF-16, since with 16-bit units the byte order matters. UTF-8, however, is built from single-byte (8-bit) units, which is what the “8” stands for. A character may take between one and four of those bytes, but they always appear in the same order, so a BOM is not required to indicate byte order.
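As a rough illustration of why byte order matters for UTF-16 but not for UTF-8, the sketch below prints the bytes that U+FEFF is stored as in each case. The `$bom` array and its keys are names chosen for this example only; `pack()` and `bin2hex()` are standard PHP functions.

```php
<?php
// U+FEFF serialized in the three forms discussed above.
$bom = [
    'UTF-16BE' => pack('n', 0xFEFF), // 16-bit unit, big-endian:    FE FF
    'UTF-16LE' => pack('v', 0xFEFF), // 16-bit unit, little-endian: FF FE
    'UTF-8'    => "\xEF\xBB\xBF",    // fixed sequence, no byte-order variants
];

foreach ($bom as $encoding => $bytes) {
    printf("%-8s %s\n", $encoding, strtoupper(bin2hex($bytes)));
}
```

The two UTF-16 forms are the same two bytes in opposite order, which is what lets a reader work out the endianness of the rest of the file. The UTF-8 form has only one possible ordering.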

Even though UTF-8 always uses the same byte order, a UTF-8 encoded file can begin with the bytes EF BB BF, which merely signify that it is in UTF-8 format. These bytes are the UTF-8 encoding of the BOM character, but since they say nothing about byte order, Visual Studio calls them a “signature”. The problem is that some clients don’t expect UTF-8 to have a BOM and, as it turns out, the PHP engine is one of them. At least, some builds of the PHP engine. One of my PHP servers, running on a Linux machine, handled the UTF-8 file with signature fine, whereas another, running under Windows, output the leading bytes as content on the page, which is how you end up with ï»¿.
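To make this concrete, here is a minimal sketch that checks whether some bytes begin with the UTF-8 signature, and shows what those three bytes look like when each one is read as a separate ISO-8859-1 character. The constant `UTF8_BOM` and the functions `startsWithUtf8Bom()` and `latin1ToUtf8()` are hypothetical helpers written for this example, not part of PHP.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// True if the given bytes start with the UTF-8 signature.
function startsWithUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Re-encode the input as UTF-8, treating every byte as one ISO-8859-1
// character. This mimics a client that reads the signature as Latin-1 text.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= $code < 0x80
            ? $byte
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

// A script saved "with signature": the BOM sits before the opening tag,
// so PHP sees it as ordinary page content.
$script = UTF8_BOM . "<?php echo 'Hello';";

var_dump(startsWithUtf8Bom($script)); // bool(true)
echo latin1ToUtf8(UTF8_BOM), "\n";    // ï»¿
```

Running a check like `startsWithUtf8Bom(file_get_contents($path))` over the affected scripts is one way to confirm which files an editor has saved with the signature.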

Different default encoding behaviours across editors, combined with different server and browser behaviours when interpreting UTF-8 files with a BOM, mean that this problem can be a little tricky to diagnose.

This is reported as a PHP bug at http://bugs.php.net/bug.php?id=22108, but the workarounds are actually quite straightforward (once you know what the problem is!):

  • If you’re using Visual Studio, make sure you save your PHP files as UTF-8 without signature. If you’re using Eclipse, this is the default anyway.
  • Compile your PHP with the --enable-zend-multibyte option, which will correctly parse the BOM at the start of the file.
  • If you don’t need Unicode at all, you could use ISO-8859-1, or another non-UTF-8 encoding.
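If files have already been saved with the signature, another option is to strip it after the fact. The following is only a sketch under that assumption: `stripUtf8Bom()` and `removeBomFromFile()` are hypothetical helper names, and the file is rewritten in place, so try it on a copy first.

```php
<?php
// Return $contents without a leading UTF-8 signature (EF BB BF).
// Input that does not start with the signature is returned unchanged.
function stripUtf8Bom(string $contents): string
{
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

// Rewrite the file only if it starts with the signature.
// Returns true if the file was changed, false if it was already clean.
function removeBomFromFile(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $stripped = stripUtf8Bom($contents);
    if ($stripped === $contents) {
        return false;
    }
    if (file_put_contents($path, $stripped) === false) {
        throw new RuntimeException("Cannot write $path");
    }
    return true;
}
```

Only a signature at the very start of the file is removed, so the rest of the UTF-8 content is left untouched.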

File Geodatabase API Is Now Available on Linux

Well, ask and you shall receive.

We’ve added a Linux version of the File GDB API.

It is available from the same download page as the Windows version.

Now developers can develop file geodatabase code on a Linux machine.

Happy Feet!

Now I gotta get off my duff and do something with it. Good to see some Linux support here; it feels like Esri is saying Use Me.

Googlers Down Under


Despite the recent flooding in Brisbane, Australia, linux.conf.au (LCA) will proceed from January 24th to 29th, and Googlers from across the company will be there. LCA is a community-run technical conference for free and open source software enthusiasts, featuring but not limited to Linux. In addition to the many Googlers who will be attending, several Googlers will also be presenting at the conference.

The conference starts on Monday the 24th with a day of miniconfs, and Nóirín Shirley from Google’s Zurich office will be presenting “Open Source: Saving the World” as part of the Haecksen track.

Google’s Chief Internet Evangelist Vint Cerf will start the day on Tuesday the 25th with his keynote presentation, and later that morning he will present “In Search of Transmission Capacity – a Multicore Dilemma.” On Tuesday afternoon, Google Summer of Code Administrator Carol Smith will give a “Google Summer of Code Update” at the FOSS in Research and Student Innovation Miniconf.

On Wednesday January 26th, Google staff engineer and Linux kernel committer Ted Ts’o will explain “Making file systems scale: A case study using ext4.”

Andrew Gerrand and Nigel Tao of the Go team will give attendees “A Tour of Go” on Thursday the 27th, and Nóirín will present “Baby Steps into Open Source – Incubation and Mentoring at Apache,” which is based on her experience at the Apache Software Foundation.

On Friday the 28th, Carol will present her talk, “The 7 Habits of Highly Ineffective Project Managers,” in the morning. A little later in the day, Daniel Bentley and Daniel Nadasi of the open source and Geo teams respectively will talk about “Opening a Closed World,” followed by Marc MERLIN, who works on infrastructure at Google. Marc will discuss “Saving Money with Misterhouse: Running Your Lights and HVAC System. Scaring your cat off the kitchen counter is just a bonus :)”

LCA always closes with Open Day, a free day-long event where the general public can learn about open source, open data and all things “open”. The Open Day is on Saturday the 29th, and Cat Allman of the Open Source Programs Office will be presenting her talk, “What is Open Source?” there.

Come learn more about the latest happenings in open source, and join us in showing support for Brisbane’s recovery. We hope to see you there!