Exploring HTML Entities Again

by | June 2nd, 2009

Right now I am knee deep in HTML 5, CSS 3 and all related topics to support the new edition of HTML: The Complete Reference. Usually I don’t share too much about the sausage making that goes on during one of these massive efforts, but after hearing a few comments from students I figure maybe I should share a bit more as I go along this time. So every once in a while I’ll post a sneak peek, and here is the first one: entities.

HTML Entities

So if you have written any significant amount of (X)HTML you likely have had occasion to insert a special character using a named entity like &nbsp; or &copy;, or maybe a numeric entity like &#160; or &#169;, which are equivalent to the named entities presented. Dealing with these entities isn’t fun, but they are relatively predictable, and looking them up in a book or online with a chart or even a nicer tool isn’t that hard. However, there is a bit more to know than you might think. You see, writing a “Complete Reference” I get to look very closely at things and uncover lots of interesting little oddities.
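
If you want to check that equivalence outside a browser, here is a minimal sketch in Python. It assumes Python 3 and its standard html.unescape function, which decodes references using the HTML 5 tables; the decode helper is just a name made up for this example, and none of this says anything about what a particular browser does.

```python
import html


def decode(ref):
    """Decode a single character reference using Python's HTML 5 tables."""
    return html.unescape(ref)


# Named and numeric spellings of the same two characters.
pairs = [("&nbsp;", "&#160;"), ("&copy;", "&#169;")]

for named, numeric in pairs:
    same = decode(named) == decode(numeric)
    # Show the code point both spellings resolve to.
    print(named, numeric, same, hex(ord(decode(named))))
```

The numeric form is simply the decimal code point of the character, which is why the two spellings land on the same result.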

Entity Case Sensitivity?

Question: Are character entities case sensitive?  Answer – mostly, er…kind of?

Don’t you hate such nebulous answers in something that should be well defined? Test &copy; and &COPY; in most browsers and they will likely render exactly the same. However, case insensitivity isn’t always the case. Consider &Agrave; (À) and &agrave; (à), which really are two different things. Even in cases where there isn’t a similar counterpart, case will matter. For example, &pound; will render as £ while &POUND; will render as, well, &POUND;

Case sensitive!
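
The same experiment can be sketched without a browser. This assumes Python 3's html.unescape, which follows the HTML 5 entity list rather than the behavior of any 2009 browser, so treat it as an illustration of the rule, not a reproduction of the screenshot.

```python
import html

# The three pairs discussed above, in both spellings.
samples = ["&copy;", "&COPY;", "&Agrave;", "&agrave;", "&pound;", "&POUND;"]

# Map each reference to whatever it decodes to.
decoded = {ref: html.unescape(ref) for ref in samples}

for ref, text in decoded.items():
    # An unknown name comes back unchanged, i.e. it "spells out".
    status = "left as typed" if text == ref else "decoded"
    print(f"{ref:10} {status:14} {text}")
```

Under those assumptions the two copyright spellings decode to the same character, the two A-grave spellings decode to different characters, and &POUND; is left as typed.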

So we should assume case sensitivity, but what is going on? Which entities are case sensitive and which aren’t? Well, roughly, the pre-HTML 4 entities are likely not case sensitive while the post-HTML 4 entities are. HTML 5 tries to formalize this whole mess and documents that &amp; &copy; &lt; &gt; &quot; &reg; and &trade; can be written either way (all lowercase or all uppercase) but nothing else. Save for the trademark, these are, as I said, roughly the pre-HTML 4 named entities, and even that exception could be argued, since the trademark symbol used to sit inappropriately in a charset no-man’s land in ASCII 127-159.
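
One way to double-check that list is to scan an entity table for names whose all-caps spelling maps to the same character. The sketch below assumes Python 3's html.entities.html5 dictionary as a stand-in for the HTML 5 list; the function name upper_lower_twins is made up for this example.

```python
from html.entities import html5


def upper_lower_twins():
    """Return lowercase entity names whose ALL-CAPS spelling is the same character."""
    twins = []
    for name, char in html5.items():
        # Only look at the semicolon-terminated, all-lowercase spellings.
        if not name.endswith(";") or not name.islower():
            continue
        # Keep the name only if the uppercase spelling exists and agrees.
        if html5.get(name.upper()) == char:
            twins.append(name.rstrip(";"))
    return sorted(twins)


print(upper_lower_twins())
```

If the table agrees with the list above, the output should include amp, copy, gt, lt, quot, reg and trade, and should not include names like pound or agrave.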

Victory: a small detail knocked down for HTML 5, and an important syntax point that, judging by a quick perusal of HTML books (mine included), has gone completely unnoticed for a decade.

Entity Parse Problems

So when we mess up markup, like forgetting a tag or not closing a quote, we see the browser parser “fixing” things for us, albeit sometimes wrongly. Well, it turns out that this also holds for entities. For example, given this &QUOTE; entity, which is clearly a typo for &quot;, it will render in one browser as it gets fixed and not in most others. No bonus points for guessing: yes, IE blows it.

IE fixing entities for me?

Interestingly, they may actually be trying to be somewhat correct in their automatic insertion of the trailing ; in the entity. If you read the specification there is some suggestion that problems be rectified. However, how to fix an error in a predictable and consistent manner, well, that isn’t agreed upon. [Take a peek at the HTML 5 spec chaos for more info.] On that note, a very big change with HTML 5 will actually be to indicate what should happen in case of syntax errors. I guess if you can’t beat syntax into the heads of the masses you might as well codify how their nasty “tag soup” should taste.
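
To get a feel for what codified error recovery looks like, here is a small sketch using Python 3's html.unescape, which is documented as applying the HTML 5 rules to both valid and invalid references. It is an assumption that this matches any given browser, and it is certainly not IE's 2009 logic; the sample strings are just illustrations.

```python
import html

# Three malformed references: a misspelled name, a missing
# semicolon, and a wrongly uppercased name.
malformed = ["&QUOTE;", "&copy 2009", "&POUND;"]

recovered = {ref: html.unescape(ref) for ref in malformed}

for ref, text in recovered.items():
    print(repr(ref), "->", repr(text))
```

Under these rules the longest known legacy name at the front of the reference wins: &QUOTE; is read as &QUOT followed by the leftover text E; and &copy without its semicolon still becomes ©. &POUND; has no known prefix at all, so it is left alone.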

Explorer Fully Gets Entities-Finally!

A little-known detail that has eluded many Web developers is that up until IE8 many entities actually didn’t work. We learned this in the first edition of the HTML book over 10 years ago, when we actually bothered to test the entities rather than just copying and pasting the chart onto our site or into our book. Every edition was the same: a few changes here and there, but a bunch of nasty boxes in place of the appropriate symbols under the most popular browser. Well, it is time to report that things are a bit better now.
Finally Internet Explorer supports all the common entities
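
If you would rather test the entities yourself than trust a copied chart, a test page is easy to generate. The following is only a sketch, not the test we actually used: it assumes Python 3's html.entities.name2codepoint table (the HTML 4.01 entity names), and the helper names chart_rows and chart_page are made up for this example.

```python
import html
from html.entities import name2codepoint


def chart_rows(names=None):
    """Build one table row per entity: literal name, named form, numeric form."""
    names = sorted(name2codepoint) if names is None else names
    rows = []
    for name in names:
        code = name2codepoint[name]
        # Escape the ampersand so the first cell shows the name literally.
        label = html.escape(f"&{name};")
        rows.append(
            f"<tr><td>{label}</td><td>&{name};</td><td>&#{code};</td></tr>"
        )
    return rows


def chart_page():
    """Wrap all rows in a bare table, ready to save and open in a browser."""
    return "<table>\n" + "\n".join(chart_rows()) + "\n</table>"


print(chart_rows(["copy"])[0])
```

Open the resulting page in each browser and compare the second and third columns; a box or a spelled-out name in either one points to an entity that browser does not handle.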

It Ain’t Over Yet

Year after year I seem to meet Web developers and students who think that just over the hill the green grass and blue skies of standards land await. Well, kind readers, such optimistic thinking may help you sleep at night, but after doing this for quite some time it is clear to me that even with specs there are simply implementation mistakes and, well, market forces that will make this unlikely any time soon, if ever. Need some proof? Well, even in entities, which are much improved in browsers, lots of little quirks exist, especially when we consider that Unicode is and has been here.

How about a Unicode entity? Do those work? Maybe sort of, well, not really. You gotta go numeric, otherwise the entity name just gets spelled out on the page.

The fun continues with entities in Unicode!

Even then, once you insert the entities, you might wonder what they are going to look like. Consider the friendly snowman dingbat here in a variety of browsers.
The Unicode snowman revealing the differences in browser entities
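
Going numeric for a character like this can be sketched as follows. The sketch assumes the dingbat is U+2603 SNOWMAN (the post does not name the code point), uses Python 3's html and unicodedata modules, and the numeric_refs helper and the &snowman; name are made up for this example.

```python
import html
import unicodedata


def numeric_refs(char):
    """Return the decimal and hexadecimal numeric references for one character."""
    code = ord(char)
    return f"&#{code};", f"&#x{code:X};"


snowman = "\u2603"  # assumption: the dingbat in question is U+2603 SNOWMAN
decimal_ref, hex_ref = numeric_refs(snowman)

print(unicodedata.name(snowman), decimal_ref, hex_ref)

# A made-up named form is not in the HTML 5 table, so it just spells out.
print(html.unescape("&snowman;"))
```

Both numeric forms decode back to the same character, but how that character is drawn still depends on the font each browser picks, which is where the variety in the screenshot comes from.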

Happy or sad, with or without a hat, buttons, or snow, our little dingbat shows that making little details the same across all browsers is about as likely as building a June snowman in San Diego. Not to spoil the ending of the book, but every chapter and appendix keeps showing that the more things change, the more they really do stay the same.


Thomas Powell is a long-time web industry veteran, as well as the founder and CEO of PINT.

  • OK, most of this is over my head, but entertaining.

    Give us more Thomas, please.