I’m a bare-bones kinda guy. No, not physically. Okay, maybe physically as well, but that’s beside the point. What I mean is that when I work on my Disney website, I do all my HTML coding in NotePad, saving the pages as HTML files and going from there. For me, it works. I’ve also been using the W3C validator to check the validity of my coding. And usually all is well.
But there are a few pages where I’ve been getting, not an error, but a warning what sez:
"Byte-Order Mark found in UTF-8 File.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. "
I’ve been searching around to figure out what this is but all I get is a lot of technical gobbledegook what this uneducated boob can’t figger out.
So … anybody … in plain English … what is this monster? How do I fix it? Or is it even anything I need to worry about?
A byte-order mark is a short sequence of bytes at the start of a file that indicates the endianness of an encoding (i.e., for characters encoded in more than one byte, whether the byte with the highest or the lowest value comes first - something that different operating systems handle differently).
Notepad prefaces UTF-8 encoded text with a byte-order mark, which isn’t strictly necessary (in UTF-8) and might irritate some web browsers (though I haven’t seen any that actually choke on it).
It’s much more of an issue in script files (Perl, PHP, etc.), where a Unix-based web server may actually throw an error.
For HTML files it’s not serious IMO but I recommend using Notepad++ where you have more control over your encoding.
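To make that concrete, here’s a minimal Python sketch (the HTML content is made up) showing what the BOM actually is at the byte level, and how easily it can be stripped:

```python
# The UTF-8 BOM is just three bytes, EF BB BF, at the very start of the file.
BOM = b"\xef\xbb\xbf"

# Simulate a page saved by Notepad "with BOM" (the HTML itself is made up):
data = BOM + b"<html><head><title>My Disney Page</title></head></html>"

# Stripping it is a one-liner once you know it's there:
if data.startswith(BOM):
    data = data[len(BOM):]

print(data[:6])  # b'<html>'
```

Incidentally, Python calls BOM-prefixed UTF-8 `utf-8-sig`, which is a handy way to see the same three bytes produced programmatically.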
Hokay. Basically it’s a non-displaying character in the Unicode spec that tells software reading a file in what order the bytes are written. Because pretty much everything in the world still deals in 8-bit bytes, if you’re using UTF-16 (a 16-bit encoding) each character will require two bytes. Therefore you need to tell software reading the file which byte of each character is the big end and which is the small end, so it can put the two bytes together in the right order to get the character code. The Unicode spec says you do this by sticking a byte-order mark right at the beginning of the file.
Because you’re using UTF-8 encoding, however, the byte-order mark is pointless because there’s only one byte per character. It’s in the spec, but some programs will be confused by it. Some editors still put it in the files, though, Notepad being one of them. I have no idea why it’s only doing it to some pages - it could be that it’s not saving all of your files as UTF-8, but only ones it thinks need it. Notepad is notoriously poor at figuring out which encodings to use (as even its developers admit). You could try changing to a different text editor, but it seems like quite a few Windows editors do this, so I’m not sure this would be ideal. I use UltraEdit, which allows you to choose, but it’s not free.
It really oughtn’t be a problem in any case, IMO - as far as I can tell it mostly causes issues for people writing PHP scripts and the like, which you aren’t. The worst that will happen for your site is that people with browsers that don’t understand the mark may see a little “ï»¿” thingie (or something similar) in the top left of their page. I can’t see it appearing in Opera 9.27, Firefox 2.something or IE7, so (assuming you’ve tested it in IE6) I think you’re probably okay for the majority of users.
Edit: pipped! Shouldn’t have used 16-bit letters, they got sent slower. Or something.
Nope – UTF-8 characters can be up to four bytes. The first 128 codepoints happen to have the exact same encoding as ASCII, so if you stick to those characters exclusively then byte endianness won’t matter.
Got it! I didn’t think it was really a big deal as I’ve tested the site on a number of browsers and quite a few different computers and I’ve never seen any issue with it. So, I’m just not going to worry about it. Thanks!
Yeah, I oversimplified a bit. Although there still won’t be endianness issues per se, even when you go outside the ASCII characters; UTF-8 defines its own byte ordering, unrelated to machine endianness. I s’pose you could say UTF-8 is a kind of specialist endianness in itself, designed to ensure that (as you say) the ASCII codes map directly onto single bytes. Still no need for a BOM, though, since the byte order is fixed even for multi-byte characters.
No, wrong: Endianness never matters in UTF-8. That was one of the major design goals. That’s the reason everyone should use it for all text documents. Putting a byte-order mark in UTF-8 is utterly pointless.
It’s possible he confused it with UCS-2, which is a fixed-width 16-bit encoding that therefore can only represent a subset of the total Unicode scheme. (This is acceptable sometimes because Unicode is arranged such that the characters used most often are in the lower codepoints UCS-2 can represent. It’s still nowhere near as good as UTF-8 for multiple reasons, and you do need a byte order mark with UCS-2.)
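Here’s a quick Python sketch of that limitation (UTF-16, UCS-2’s successor, is used below, since Python exposes no UCS-2 codec):

```python
ch = "😀"
cp = ord(ch)
print(hex(cp))      # 0x1f600 -- beyond what a single 16-bit unit can hold
print(cp > 0xFFFF)  # True: fixed-width UCS-2 simply cannot represent it

# UTF-16 works around this with a "surrogate pair": two 16-bit units.
print(ch.encode("utf-16-be").hex(" "))  # d8 3d de 00 -- and here byte order
                                        # matters, which is why UTF-16 wants a BOM
```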
Hmm, recently I was getting some e-mails from a student, portions of which were showing up in Chinese characters (a language neither of us knows, so they weren’t intentional). I eventually discovered (via running it through od) that each character in the e-mail was two bytes, and that for the normal characters, the first byte was always 0, but for the mangled characters, the second byte was always 0 (and the first byte was the intended character). Could this problem have been caused by errant byte order marks somewhere along the line?
If the zero bytes are sometimes appearing first in each pair, and sometimes second, I suspect that a byte got dropped out of the stream somewhere in the middle, throwing everything out of kilter after a certain point.
Or perhaps multiple chunks of text were created with different orderings and without order marks, then concatenated together as one raw byte stream before being sent — but that doesn’t seem a likely thing to happen accidentally.
In any case, the byte ordering is supposed to be constant over the entire length of a given file or string. It’s not really allowed to fluctuate willy-nilly along the way. A well-formed UTF-16 or UCS-2 file should have a single BOM at the very beginning, and nowhere else.
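The dropped-byte theory is easy to reproduce in Python, and it yields exactly the sort of CJK mojibake described (the sample text is made up):

```python
text = "hello"
raw = text.encode("utf-16-le")            # 68 00 65 00 6c 00 6c 00 6f 00

broken = raw[1:]                          # lose one byte early in the stream
broken = broken[: len(broken) // 2 * 2]   # keep an even length so it still decodes

garbled = broken.decode("utf-16-le")      # every pair is now off by one byte
print(garbled)                            # four characters from the CJK ideograph block
print(all(0x4E00 <= ord(c) <= 0x9FFF for c in garbled))  # True
```

Every 16-bit unit after the drop pairs a stray zero byte with the next character’s code byte, which lands the result squarely in the CJK Unified Ideographs range.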
Actually somewhat plausible, in this case. Portions of the e-mail would have been composed in one program, then cut-and-pasted into either a text editor or the mail program’s composer, then mailed to someone else, who possibly saved it in another file, then forwarded it to me. So there were plenty of opportunities for output from multiple programs on multiple platforms to get concatenated together.