XML Notes

Trang

Intro

Trang is a Java-based program used for inferring an XML Schema (XSD) from a source XML document. It attempts to guess which data types should be used for the various elements in the XML file.

Get Trang

Download Trang from the following location:
» Trang download...

Use Trang

To infer a schema for an XML document, do the following:

  1. Download Trang (see above)
  2. Make sure the filename is trang.jar, and do not uncompress it
  3. Put Trang and the XML file you want to infer from into the same folder (if they are not already)
  4. Use this command to generate an XSD schema:
    java -jar trang.jar xmlfile.xml schema.xsd
     where 'xmlfile.xml' is your XML file's filename, and 'schema.xsd' is the filename for the new schema.

XML Schema

This example will be referred to in the following two sections on types.

Example: XML

<?xml version="1.0" encoding="utf-8"?>
<root>
	<id>45</id>
	<details>
		<firstname>John</firstname>
		<surname>Doe</surname>
		<address>
			<housenumber>76</housenumber>
			<roadname>Pine Road</roadname>
			<area>Winton</area>
			<city>Bournemouth</city>
			<county>Dorset</county>
			<postcode>BH9 1AB</postcode>
		</address>
	</details>
	<medical>
		<allergies>
			<allergy>Cat hair</allergy>
			<allergy>Pollen</allergy>
		</allergies>
		<notes>Occasionally suffers from bouts of acute death.</notes>
	</medical>
	<salary>22000</salary>
</root>

Example: Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<xs:element name="root">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="id"/>
				<xs:element ref="details"/>
				<xs:element ref="medical"/>
				<xs:element ref="salary"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="id" type="xs:integer"/>
	<xs:element name="details">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="firstname"/>
				<xs:element ref="surname"/>
				<xs:element ref="address"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="firstname" type="xs:NCName"/>
	<xs:element name="surname" type="xs:NCName"/>
	<xs:element name="address">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="housenumber"/>
				<xs:element ref="roadname"/>
				<xs:element ref="area"/>
				<xs:element ref="city"/>
				<xs:element ref="county"/>
				<xs:element ref="postcode"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="housenumber" type="xs:integer"/>
	<xs:element name="roadname" type="xs:string"/>
	<xs:element name="area" type="xs:NCName"/>
	<xs:element name="city" type="xs:NCName"/>
	<xs:element name="county" type="xs:NCName"/>
	<xs:element name="postcode" type="xs:string"/>
	<xs:element name="medical">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="allergies"/>
				<xs:element ref="notes"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="allergies">
		<xs:complexType>
			<xs:sequence>
				<xs:element maxOccurs="unbounded" ref="allergy"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="allergy" type="xs:string"/>
	<xs:element name="notes" type="xs:string"/>
	<xs:element name="salary" type="xs:integer"/>
</xs:schema>

Explanation

The schema example above was inferred from the example XML file, using Trang.
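Because Trang only sees the sample values, some inferred types can be narrower than intended. For instance, firstname and surname were inferred as xs:NCName, which does not permit spaces or apostrophes, so a surname such as O'Brien would be rejected. One possible hand edit is to relax them to xs:string. This is a sketch of a manual adjustment, assuming free-text names are wanted; it is not Trang output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<!-- Hand-edited: xs:NCName relaxed to xs:string so names may contain spaces or apostrophes -->
	<xs:element name="firstname" type="xs:string"/>
	<xs:element name="surname" type="xs:string"/>
</xs:schema>
```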

Before moving on, it is important to note that Trang does not structure a schema the way a person writing one by hand would. A human would most likely define an XML schema in one of two ways:

Trang uses a method a bit like the latter. However, instead of starting with all the simple types and attributes (more on this later), then moving to the complex types, and finally defining the outer (root) element, Trang works backwards: it starts at the outermost element and follows the document down into its 'branches', defining their types as it goes. This can sometimes make the inferred schema a bit hard to follow, but once you see what is going on, it's not too difficult.
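To illustrate the hand-written ordering described above (simple elements first, outer element last), here is a minimal sketch for a cut-down version of the example. Reducing the document to just id and salary is an assumption made for brevity; this is not Trang output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<!-- 1. Simple elements are declared first -->
	<xs:element name="id" type="xs:integer"/>
	<xs:element name="salary" type="xs:integer"/>
	<!-- 2. The outer (root) element comes last and refers back to them -->
	<xs:element name="root">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="id"/>
				<xs:element ref="salary"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>
```

Trang's output contains the same declarations; only their order in the file differs, and the order of top-level declarations does not affect validation.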

Detail

The first section inside the schema's root element describes the outer element.

<xs:element name="root">
	<xs:complexType>
		<xs:sequence>
			<xs:element ref="id"/>
			<xs:element ref="details"/>
			<xs:element ref="medical"/>
			<xs:element ref="salary"/>
		</xs:sequence>
	</xs:complexType>
</xs:element>
...

This describes an element called root. In our example, this is (as the name suggests) the root element of the document. Since this element contains other elements, it is defined as a complex type (xs:complexType). Inside this complex type, we can see that it contains four more elements: id, details, medical and salary. These are not described as either a simple type or a complex type, nor do they have a name or any other properties defined, as you might expect. Instead, they have the ref attribute, which means each one refers to an element declared elsewhere in the schema under the given name. However (and this is what I mean by backwards), that element has not been declared yet at this point. The declarations follow afterwards, for example:

...
<xs:element name="id" type="xs:integer"/>
...

This defines the id element which we have referred to above. It is a simple element of type integer, which is denoted by xs:integer in the type attribute.
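For comparison, the child elements could instead be declared in place, with name and type given directly rather than through ref. The following is a sketch of that alternative hand-written layout, cut down to id and salary for brevity; it is not what Trang produced for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<xs:element name="root">
		<xs:complexType>
			<xs:sequence>
				<!-- Declared in place (name + type) instead of ref="id" -->
				<xs:element name="id" type="xs:integer"/>
				<xs:element name="salary" type="xs:integer"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>
```

With ref, a declaration such as id can be reused from several places; declared inline, it exists only inside root.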

The rest of the schema document follows pretty much the same format. However, several different data types are available, and there are also various attributes which can be applied to change the behaviour, for example allowing multiple occurrences, or only one occurrence.
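As a sketch of the occurrence attributes: minOccurs and maxOccurs both default to 1, so an element reference carrying neither attribute must appear exactly once. The inferred schema sets maxOccurs="unbounded" on allergy, which means one or more. The variation below adds minOccurs="0"; this is a hypothetical hand edit, not Trang output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<xs:element name="allergies">
		<xs:complexType>
			<xs:sequence>
				<!-- Zero or more allergy elements: an empty <allergies/> is now valid -->
				<xs:element ref="allergy" minOccurs="0" maxOccurs="unbounded"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="allergy" type="xs:string"/>
</xs:schema>
```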

XSD Data Types

Here are some links to the W3Schools pages on XSD data types.

Restrictions/Facets

XSD allows fine control over what values for elements are allowed, through use of restrictions, also known as facets. This can include things such as minimum and maximum values for numeric elements, and pattern matching and other restrictions for string types.

Some restriction examples (with brief explanations) are given below. Also, at the end of this section are some links to the W3Schools site to find out more.

<xs:element name="age">
	<xs:simpleType>
		<xs:restriction base="xs:integer">
			<xs:minInclusive value="0"/>
			<xs:maxInclusive value="120"/>
		</xs:restriction>
	</xs:simpleType>
</xs:element>

This example defines an element called age, which is of the integer type. Restrictions have been used to only allow values between 0 and 120 (inclusive), by using the xs:minInclusive and xs:maxInclusive elements, with appropriate values, inside the xs:restriction element. The base attribute of this element defines what type this element is derived from, or based on. This is all enclosed within an xs:simpleType element, as this is a simple type, rather than a complex type.

<xs:element name="car">
	<xs:simpleType>
		<xs:restriction base="xs:string">
			<xs:enumeration value="Audi"/>
			<xs:enumeration value="Volkswagen"/>
			<xs:enumeration value="BMW"/>
		</xs:restriction>
	</xs:simpleType>
</xs:element>

This example defines an element called car, based on a string, which only allows one of a given set of values: Audi, Volkswagen or BMW.

<xs:element name="initials">
	<xs:simpleType>
		<xs:restriction base="xs:string">
			<xs:pattern value="[a-zA-Z][a-zA-Z][a-zA-Z]"/>
		</xs:restriction>
	</xs:simpleType>
</xs:element>

This example defines an element named initials which is again based on a string. However, this time the value must match the given pattern. In this case, it must contain exactly 3 characters, each of which must be an uppercase or lowercase letter (a to z).
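The same constraint can be written more compactly with a quantifier, since XSD patterns support the {n} repetition syntax. This is an equivalent sketch of the initials declaration above, wrapped in a schema element so that it stands alone.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<xs:element name="initials">
		<xs:simpleType>
			<xs:restriction base="xs:string">
				<!-- {3} repeats the character class exactly three times -->
				<xs:pattern value="[a-zA-Z]{3}"/>
			</xs:restriction>
		</xs:simpleType>
	</xs:element>
</xs:schema>
```

Note that an XSD pattern always has to match the whole value, so no start or end anchors are needed.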

There are many other ways in which values can be restricted, and various restrictions can be combined in order to refine the schema. To see more examples like these, visit the W3Schools Restrictions/Facets page.
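As a sketch of combining restrictions, several facets can be placed inside one xs:restriction, and a value must then satisfy all of them. The username element and its limits below are hypothetical, chosen only for illustration; they do not come from the example document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
	<!-- Hypothetical element: 3 to 12 characters, lowercase letters and digits only -->
	<xs:element name="username">
		<xs:simpleType>
			<xs:restriction base="xs:string">
				<xs:minLength value="3"/>
				<xs:maxLength value="12"/>
				<xs:pattern value="[a-z0-9]+"/>
			</xs:restriction>
		</xs:simpleType>
	</xs:element>
</xs:schema>
```

Under this declaration a value such as jdoe45 would be accepted, while jd (too short) and John Doe (uppercase letters and a space) would be rejected.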