Sashwat Gupta's blog: Normalization and Canonicalization of XML

Normalized XML is XML in which whitespace in values has been replaced or collapsed according to fixed rules, rather than stripped entirely.
Normalization can be requested by using the following XML Schema types:-

xsd:normalizedString (http://www.w3.org/TR/xmlschema11-2/#normalizedString)
xsd:token (http://www.w3.org/TR/xmlschema11-2/#token)

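These types do not forbid whitespace in a value; rather, they instruct the processor to normalize it according to their respective rules. xsd:normalizedString replaces each tab, line feed, and carriage return with a space. xsd:token does the same, then collapses each run of spaces into a single space and strips leading and trailing spaces.
For example, for an element defined in the XSD as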

<xs:element name="tkn" type="xs:token"/>

the value can be provided as:-

<tkn>toks        en     </tkn>

This will not result in a schema validation error, but the processor should treat it as a string with the following value:-

<tkn>toks en</tkn>
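
The two whitespace rules can be sketched in a few lines of Python. This is only an illustration of the rules as described in the XML Schema datatypes specification, not a schema validator; the function names replace_whitespace and collapse_whitespace are made up for this example.

```python
import re


def replace_whitespace(value: str) -> str:
    """xsd:normalizedString rule: turn each tab, LF and CR into a space."""
    return re.sub(r"[\t\n\r]", " ", value)


def collapse_whitespace(value: str) -> str:
    """xsd:token rule: replace, then collapse runs of spaces and trim both ends."""
    replaced = replace_whitespace(value)
    return re.sub(r" +", " ", replaced).strip(" ")


# The <tkn> value from the example above.
print(repr(collapse_whitespace("toks        en     ")))  # 'toks en'
# normalizedString keeps the number of characters; only their kind changes.
print(repr(replace_whitespace("a\tb\nc")))  # 'a b c'
```

Note that the normalizedString rule never shortens the value, while the token rule can.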

Canonical form of an XML
The canonical form of an XML document is a physical representation of the document produced by the following method (as summarized in the W3C Canonical XML specification):-

The document is encoded in UTF-8
Line breaks normalized to #xA on input, before parsing
Attribute values are normalized, as if by a validating processor
Character and parsed entity references are replaced
CDATA sections are replaced with their character content
The XML declaration and document type declaration (DTD) are removed
Empty elements are converted to start-end tag pairs
Whitespace outside of the document element and within start and end tags is normalized
All whitespace in character content is retained (excluding characters removed during line feed normalization)
Attribute value delimiters are set to quotation marks (double quotes)
Special characters in attribute values and character content are replaced by character references
Superfluous namespace declarations are removed from each element
Default attributes are added to each element
Lexicographic order is imposed on the namespace declarations and attributes of each element
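
Several of the steps above can be observed with the Python standard library. This is a minimal sketch, assuming Python 3.8 or later; xml.etree.ElementTree.canonicalize implements Canonical XML 2.0 rather than the older version the list describes, and it does not read a DTD, so default attributes are not added. For a small document like the one below the visible effects are the same. The sample document is invented for illustration.

```python
from xml.etree.ElementTree import canonicalize

# Single-quoted and unordered attributes, an empty-element tag,
# a CDATA section and an XML declaration.
source = (
    '<?xml version="1.0"?>\n'
    "<doc b='2' a=\"1\"><empty/><![CDATA[x < y]]></doc>"
)

canonical = canonicalize(source)
print(canonical)
# <doc a="1" b="2"><empty></empty>x &lt; y</doc>
```

The declaration is gone, the attributes are sorted and double-quoted, the empty element has become a start-end tag pair, and the CDATA section has been replaced by its escaped character content.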

The rules for the canonical form of XML are very detailed, but they do not cover whitespace normalization of element content: all whitespace in character content is retained. The two forms therefore supplement each other.

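Canonical form is very useful when generating a hash of an XML document, because logically equivalent documents then produce the same digest. It is used in WS-Security, where XML Signature canonicalizes the signed content before digesting it (for example, in messages signed with a key carried in a BinarySecurityToken).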
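
A short sketch of why canonicalization matters for hashing, again assuming Python 3.8+ and its Canonical XML 2.0 implementation. The helper name canonical_digest, the choice of SHA-256, and the sample documents are assumptions made for this example; a real WS-Security stack would use the algorithms named in the signature itself.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(xml_text: str) -> str:
    """Hash the canonical form of a document instead of its raw bytes."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two serializations of the same logical document.
a = '<order id="7" currency="EUR"><item/></order>'
b = "<?xml version='1.0'?><order currency='EUR' id='7'><item></item></order>"

# The raw bytes differ, so their hashes differ ...
raw_equal = hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest()
# ... but the canonical forms, and therefore their digests, match.
canonical_equal = canonical_digest(a) == canonical_digest(b)
print(raw_equal, canonical_equal)  # False True
```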

Wednesday, July 7, 2010