I’m a bare-bones kinda guy. No, not physically. Okay, maybe physically as well, but that’s beside the point. What I mean is that when I work on my Disney website, I do all my HTML coding in NotePad, saving the pages as HTML files and going from there. For me, it works. I’ve also been using the W3C validator to check the validity of my coding. And usually all is well.
But there are a few pages where I’ve been getting, not an error, but a warning what sez:
"Byte-Order Mark found in UTF-8 File.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. "
I’ve been searching around to figure out what this is but all I get is a lot of technical gobbledegook what this uneducated boob can’t figger out.
So … anybody … in plain English … what is this monster? How do I fix it? Or is it even anything I need to worry about?
A byte-order mark is a short sequence of bytes at the start of a file that indicates the endianness of an encoding (i.e., for characters encoded in more than one byte, whether the byte with the highest or the lowest value comes first - something that different operating systems handle differently).
Notepad prefaces UTF-8 encoded text with a byte-order mark, which isn’t strictly necessary (in UTF-8) and might irritate some web browsers (though I haven’t seen any that actually choke on it).
It’s much more of an issue in script files (Perl, PHP, etc.), where a Unix-based web server may actually throw an error.
For HTML files it’s not serious IMO but I recommend using Notepad++ where you have more control over your encoding.
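To make that concrete, here’s a minimal Python sketch (the HTML content is made up) showing what the BOM actually is at the byte level, and how easily it can be stripped:

```python
# The UTF-8 BOM is just three bytes, EF BB BF, at the very start of the file.
BOM = b"\xef\xbb\xbf"

# Simulate a page saved by Notepad "with BOM" (the HTML itself is made up):
data = BOM + b"<html><head><title>My Disney Page</title></head></html>"

# Stripping it is a one-liner once you know it's there:
if data.startswith(BOM):
    data = data[len(BOM):]

print(data[:6])  # b'<html>'
```

Incidentally, Python calls BOM-prefixed UTF-8 `utf-8-sig`, which is a handy way to see the same three bytes produced programmatically.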
Hokay. Basically it’s a non-displaying character in the Unicode spec that tells software reading a file in what order the bytes are written. Because pretty much everything in the world still deals in 8-bit bytes, if you’re using UTF-16 (a 16-bit encoding) each character will require two bytes. Therefore you need to tell software reading the file which byte of each character is the big end and which is the small end, so it can put the two bytes together in the right order to get the character code. The Unicode spec says you do this by sticking a byte-order mark right at the beginning of the file.
Because you’re using UTF-8 encoding, however, the byte-order mark is pointless because there’s only one byte per character. It’s in the spec, but some programs will be confused by it. Some editors still put it in the files, though, Notepad being one of them. I have no idea why it’s only doing it to some pages - it could be that it’s not saving all of your files as UTF-8, but only ones it thinks need it. Notepad is notoriously poor at figuring out which encodings to use (as even its developers admit). You could try changing to a different text editor, but it seems like quite a few Windows editors do this, so I’m not sure this would be ideal. I use UltraEdit, which allows you to choose, but it’s not free.
It really oughtn’t be a problem in any case, IMO - as far as I can tell it mostly causes issues for people writing PHP scripts and the like, which you aren’t. The worst that will happen for your site is that people with browsers that don’t understand the mark may see a little “ï»¿” thingie (or something similar) in the top left of their page. I can’t see it appearing in Opera 9.27, Firefox 2.something or IE7, so (assuming you’ve tested it in IE6) I think you’re probably okay for the majority of users.
Edit: pipped! Shouldn’t have used 16-bit letters, they got sent slower. Or something.
Nope – UTF-8 characters can be up to four bytes. The first 128 codepoints happen to have the exact same encoding as ASCII, so if you stick to those characters exclusively then byte endianness won’t matter.
Got it! I didn’t think it was really a big deal as I’ve tested the site on a number of browsers and quite a few different computers and I’ve never seen any issue with it. So, I’m just not going to worry about it. Thanks!
Yeah, I oversimplified a bit. Although there still won’t be endianness issues per se, even when you go outside the ASCII characters; UTF-8 defines its own byte ordering, unrelated to machine endianness. I s’pose you could say UTF-8 is a kind of specialist endianness in itself, designed to ensure that (as you say) the ASCII codes map directly onto single bytes. Still no need for a BOM, though, since the byte order is fixed even for multi-byte characters.
No, wrong: Endianness never matters in UTF-8. That was one of the major design goals. That’s the reason everyone should use it for all text documents. Putting a byte-order mark in UTF-8 is utterly pointless.
It’s possible he confused it with UCS-2, which is a fixed-width 16-bit encoding that therefore can only represent a subset of the total Unicode scheme. (This is acceptable sometimes because Unicode is arranged such that the characters used most often are in the lower codepoints UCS-2 can represent. It’s still nowhere near as good as UTF-8 for multiple reasons, and you do need a byte order mark with UCS-2.)
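Here’s a quick Python sketch of that limitation (UTF-16, UCS-2’s successor, is used below, since Python exposes no UCS-2 codec):

```python
ch = "😀"
cp = ord(ch)
print(hex(cp))      # 0x1f600 -- beyond what a single 16-bit unit can hold
print(cp > 0xFFFF)  # True: fixed-width UCS-2 simply cannot represent it

# UTF-16 works around this with a "surrogate pair": two 16-bit units.
print(ch.encode("utf-16-be").hex(" "))  # d8 3d de 00 -- and here byte order
                                        # matters, which is why UTF-16 wants a BOM
```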
Hmm, recently I was getting some e-mails from a student, portions of which were showing up in Chinese characters (a language neither of us knows, so they weren’t intentional). I eventually discovered (via running it through od) that each character in the e-mail was two bytes, and that for the normal characters, the first byte was always 0, but for the mangled characters, the second byte was always 0 (and the first byte was the intended character). Could this problem have been caused by errant byte order marks somewhere along the line?
If the zero bytes are sometimes appearing first in each pair, and sometimes second, I suspect that a byte got dropped out of the stream somewhere in the middle, throwing everything out of kilter after a certain point.
Or perhaps multiple chunks of text were created with different orderings and without order marks, then concatenated together as one raw byte stream before being sent — but that doesn’t seem a likely thing to happen accidentally.
In any case, the byte ordering is supposed to be constant over the entire length of a given file or string. It’s not really allowed to fluctuate willy-nilly along the way. A well-formed UTF-16 or UCS-2 file should have a single BOM at the very beginning, and nowhere else.
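The dropped-byte theory is easy to reproduce in Python, and it yields exactly the sort of CJK mojibake described (the sample text is made up):

```python
text = "hello"
raw = text.encode("utf-16-le")            # 68 00 65 00 6c 00 6c 00 6f 00

broken = raw[1:]                          # lose one byte early in the stream
broken = broken[: len(broken) // 2 * 2]   # keep an even length so it still decodes

garbled = broken.decode("utf-16-le")      # every pair is now off by one byte
print(garbled)                            # four characters from the CJK ideograph block
print(all(0x4E00 <= ord(c) <= 0x9FFF for c in garbled))  # True
```

Every 16-bit unit after the drop pairs a stray zero byte with the next character’s code byte, which lands the result squarely in the CJK Unified Ideographs range.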
Actually somewhat plausible, in this case. Portions of the e-mail would have been composed in one program, then cut-and-pasted into either a text editor or the mail program’s composer, then mailed to someone else, who possibly saved it in another file, then forwarded it to me. So there were plenty of opportunities for output from multiple programs on multiple platforms to get concatenated together.