XML Encoding


XML documents can contain international characters, like Norwegian æøå, or French êèé.

To avoid errors, you should specify the encoding used, or save your XML files as UTF-8.


Character Encoding

Character encoding defines a unique binary code for each different character used in a document.

In computing, a character encoding is also called a character set, character map, code set, or code page.
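
As a small illustration of what "a binary code for each character" means, the Python sketch below encodes the same character under three encodings. The helper name `encode_char` and the choice of encodings are assumptions made for this example only.

```python
# Hypothetical helper: show the bytes one character gets under several encodings.
def encode_char(ch):
    names = ("utf-8", "utf-16-be", "iso-8859-1")
    return {name: ch.encode(name) for name in names}

codes = encode_char("æ")
for name, raw in codes.items():
    # bytes.hex(" ") prints the bytes as space-separated hex pairs
    print(name, raw.hex(" "))

# utf-8       c3 a6
# utf-16-be   00 e6
# iso-8859-1  e6
```

The character is the same in all three cases, but the stored bytes differ, which is why software that reads a file must know which encoding was used to write it.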


Unicode

Unicode is an industry standard for character encoding of text documents. It assigns (nearly) every possible international character a name and a number, called its code point.
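
The name and number that Unicode gives a character can be looked up with Python's standard `unicodedata` module. This is only a sketch; the helper name `describe` is an assumption for the example.

```python
import unicodedata

# Hypothetical helper: return the Unicode number (as U+XXXX) and name of a character.
def describe(ch):
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

for ch in "æøå":
    number, name = describe(ch)
    print(ch, number, name)

# æ U+00E6 LATIN SMALL LETTER AE
# ø U+00F8 LATIN SMALL LETTER O WITH STROKE
# å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
```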

Unicode text can be stored in several encoding forms. The two that matter most for XML are UTF-8 and UTF-16.

UTF = Universal character set Transformation Format.

UTF-8 uses 1 byte (8 bits) to represent basic Latin (ASCII) characters, and two, three, or four bytes for the rest.

UTF-16 uses 2 bytes (16 bits) for most characters, and four bytes for the rest.
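
The byte counts described above can be checked with a short Python sketch. The helper name `byte_lengths` and the sample characters are assumptions chosen for illustration; `utf-16-le` is used so that no byte order mark is added to the count.

```python
# Hypothetical helper: number of bytes a character needs in UTF-8 and UTF-16.
def byte_lengths(ch):
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# A basic Latin letter, a Norwegian letter, the euro sign, and an emoji (U+1F600).
samples = ["A", "æ", "€", "\U0001F600"]
for ch in samples:
    u8, u16 = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")

# U+0041: UTF-8 = 1 bytes, UTF-16 = 2 bytes
# U+00E6: UTF-8 = 2 bytes, UTF-16 = 2 bytes
# U+20AC: UTF-8 = 3 bytes, UTF-16 = 2 bytes
# U+1F600: UTF-8 = 4 bytes, UTF-16 = 4 bytes
```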


UTF-8 - The Web Standard

UTF-8 is the standard character encoding on the web.

UTF-8 is the default character encoding for HTML5, CSS, and XML, and it is widely used with JavaScript, PHP, and SQL databases.


XML Encoding

An XML document can begin with an XML declaration, often simply called the prolog:

<?xml version="1.0"?>

The prolog is optional. If it is present, it must come first in the document and must contain the XML version number.

It can also contain information about the encoding used in the document. This prolog specifies UTF-8 encoding:

<?xml version="1.0" encoding="UTF-8"?>

The XML standard states that all XML software must understand both UTF-8 and UTF-16.

Documents without encoding information are read as UTF-8, unless they start with a byte order mark that identifies UTF-16.

In addition, most XML software systems understand encodings like ISO-8859-1, Windows-1252, and ASCII.
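
To see how a parser uses this information, the sketch below feeds raw bytes to Python's standard `xml.etree.ElementTree` parser, once with an ISO-8859-1 declaration and once as UTF-8 with no declaration at all. The `<note>` element and its content are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Case 1: bytes saved as ISO-8859-1, and the prolog says so.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><note>æøå</note>'
latin1_text = ET.fromstring(latin1_doc.encode("iso-8859-1")).text

# Case 2: bytes saved as UTF-8 with no prolog; the parser falls back to UTF-8.
utf8_doc = "<note>æøå</note>"
utf8_text = ET.fromstring(utf8_doc.encode("utf-8")).text

print(latin1_text, utf8_text)  # æøå æøå
```

In both cases the bytes on disk and the encoding the parser assumes agree, so the text comes back unchanged.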


XML Errors

Most often, XML documents are created on one computer, uploaded to a server on a second computer, and displayed by a browser on a third computer.

If the encoding is not interpreted correctly by all three computers, the browser might display meaningless text, or you might get an error message.

Look at these two XML files: Note saved with right encoding and Note saved with wrong encoding.
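
Both failure modes, an error message and meaningless text, can be simulated with Python's standard `xml.etree.ElementTree` parser. This is a sketch with a made-up `<note>` document, not a reproduction of the two files mentioned above.

```python
import xml.etree.ElementTree as ET

# Failure 1: saved as ISO-8859-1, but nothing is declared.
# The parser assumes UTF-8, finds invalid byte sequences, and reports an error.
undeclared = "<note>æøå</note>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    parse_failed = False
except ET.ParseError as err:
    parse_failed = True
    print("Parse error:", err)

# Failure 2: saved as UTF-8, but the prolog claims ISO-8859-1.
# The parser accepts the bytes, but decodes them into the wrong characters.
mislabeled = '<?xml version="1.0" encoding="ISO-8859-1"?><note>æøå</note>'
garbled = ET.fromstring(mislabeled.encode("utf-8")).text
print(garbled)  # Ã¦Ã¸Ã¥
```

The second case is the harder one to notice, because no error is raised: the document parses, and only the displayed text is wrong.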

For high-quality XML documents, UTF-8 is the best encoding to use. It covers international characters, and it is also the default if no encoding is declared.


Conclusion

When you write an XML document:

  • Use an XML editor that supports encoding
  • Make sure you know what encoding the editor uses
  • Describe the encoding in the encoding attribute
  • UTF-8 is the safest encoding to use
  • UTF-8 is the web standard
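
When an XML file is produced by a program instead of an editor, the same checklist applies. The sketch below uses Python's standard `xml.etree.ElementTree` to save a document as UTF-8 with a matching encoding declaration. The function name `save_note`, the file name `note.xml`, and the `<note>` element are assumptions for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical helper: write a one-element document as UTF-8 with a prolog.
def save_note(path, body):
    root = ET.Element("note")
    root.text = body
    # encoding controls the bytes written; xml_declaration adds the prolog
    ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)

# Write to a temporary directory, then read the raw bytes back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.xml")
    save_note(path, "æøå")
    with open(path, "rb") as fh:
        saved = fh.read()

print(saved.decode("utf-8"))
```

Because the encoding is set in one place and the declaration is generated from it, the bytes in the file and the declared encoding cannot drift apart.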

