XML Schema Data Types & Restrictions

  1. Basics
  2. Enumeration - specifying a list of values
  3. Strings, tokens and normalized strings
  4. Whitespace
  5. Patterns
  6. Length
  7. Restricting numbers
  8. Restricting dates and times
Setting Restrictions - Basics

One of the big limits of DTDs is that the only thing you can say about element data is that it must be there. You can't limit the type of data, or say that the data must be in a certain range or be certain values. You can restrict the values for attributes in a DTD, but not for an element.

With schemas you can spell out exactly what the element data should be. In addition to specifying the type of data, we can also place restrictions on the values. This is done by changing the element declaration so that it uses an open and close tag, such as:

  <xsd:element name="elementName">
  </xsd:element>

and by adding the xsd:simpleType and xsd:restriction elements. The type of the data must be moved from the element start tag to the xsd:restriction start tag, where it's now called the base. For example:

  <xsd:element name="weight" type="xsd:integer" />

would be changed to:

  <xsd:element name="weight">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer">
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

So far we've said that the element is going to have restrictions on its data, but we haven't said what those restrictions are. This is done by using the elements shown in the following table. We'll go through examples for all of these restrictions, but take note that you may need to use more than one restriction or use the same restriction element more than once.

enumeration     Any string that is a valid value. If you have more than one valid value use multiple enumeration elements.
minInclusive    For numbers (and dates and times). All values must be greater than or equal to the specified value.
minExclusive    For numbers (and dates and times). All values must be greater than the specified value.
maxInclusive    For numbers (and dates and times). All values must be less than or equal to the specified value.
maxExclusive    For numbers (and dates and times). All values must be less than the specified value.
totalDigits     For numbers. The maximum total number of digits, before and after the decimal, in the number. Does not include the decimal.
fractionDigits  For numbers. The maximum number of digits after the decimal in the number.
length          The exact number of characters in the value.
minLength       The minimum number of characters in the value.
maxLength       The maximum number of characters in the value.
pattern         A pattern defined by a regular expression.
whiteSpace      Instructs how to handle whitespace characters. Valid values are preserve, replace and collapse. Note the capital S in the element name.

If you want more detail about the restrictions, see the W3C document XML Schema Part 2: Datatypes. The section you want to read is 4.3 Constraining Facets.
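
To show where these pieces live, here is a minimal sketch of a complete schema that uses two restriction elements together. The limits of 1 and 500 are made up for illustration, and it assumes the xsd prefix is bound to the standard XML Schema namespace as shown:

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="weight">
      <xsd:simpleType>
        <xsd:restriction base="xsd:integer">
          <!-- two restrictions used together: 1 <= weight <= 500 -->
          <xsd:minInclusive value="1" />
          <xsd:maxInclusive value="500" />
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:element>
  </xsd:schema>

With this schema <weight>250</weight> should validate, while <weight>0</weight> and <weight>12.5</weight> should not.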


Enumeration - a list of valid values

The enumeration element forces an element's value to be one that you specify. You can specify as many values as you want, but you have to use one enumeration element for each value. For example, if you had a color element that can only have the value red, green or blue you would use:

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="red" />
        <xsd:enumeration value="green" />
        <xsd:enumeration value="blue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

While this is pretty straightforward, there are a couple of issues you need to be aware of:

  1. The values in an enumeration are case sensitive, so Red and red are different, and would both need an enumeration element if you wanted to use both. (Or you might consider using a pattern.)
  2. Each value can be made up of multiple words. For example value="golden retriever" is valid. But this requires that the data be exactly the same two words. So "golden retriever" would be valid, but "golden" or "retriever" would not.
  3. If the restriction uses a base of xsd:string then every character, including all of the whitespace characters like tabs, carriage returns, etc., is checked. You may want to use xsd:token or xsd:normalizedString instead (see below).
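
As a rough sketch of the pattern alternative mentioned in item 1 (patterns are covered in the Patterns section), the following accepts each color with either an upper or lower case first letter. The choice of which spellings to allow is an assumption made for this example:

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <!-- accepts Red, red, Green, green, Blue or blue -->
        <xsd:pattern value="[Rr]ed|[Gg]reen|[Bb]lue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

This replaces six enumeration elements with one pattern, at the cost of being a little harder to read.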

Strings, tokens and normalizedStrings

One of the problems with enumerating strings is dealing with white space characters. For example, assume you have the following restriction which limits the color element to values of red, green or blue:

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="red" />
        <xsd:enumeration value="green" />
        <xsd:enumeration value="blue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

You would think that the following XML data would be valid:

  <color>
    red
  </color>

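But surprisingly this is not valid. The reason is that the XML data has a line feed (#xA) and carriage return (#xD) before and after the red, plus a couple of space characters (#x20). Whether the carriage return is actually there depends on how the file was saved, and XML parsers normally hand each line end to the validator as a single #xA, but either way the extra characters are the problem. In other words the validator sees this as: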

  <color>#xA #xD #x20 #x20red#xA #xD</color>

when it's looking for:

  <color>red</color>

The obvious solution for this is to write everything on the same line. For example:

  <color>red</color>

But you can also change the base data type to token or normalizedString. A normalizedString is like a string, but before validating the data all carriage returns, line feeds and tabs are replaced with spaces. A token is the same as a normalizedString except it also removes any leading or trailing spaces, and compresses multiple adjacent space characters into a single space character. As an example, the following restriction uses base="xsd:token":

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:enumeration value="red" />
        <xsd:enumeration value="green" />
        <xsd:enumeration value="blue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

So the following data would now be valid:

  <color>
    red
  </color>
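
The same collapsing applies to whitespace between words. As a sketch, assuming a hypothetical breed element with a two-word enumeration value:

  <xsd:element name="breed">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:enumeration value="golden retriever" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

the following data should be valid, because the line break and the spaces between the two words are collapsed into a single space before the value is compared:

  <breed>
    golden
    retriever
  </breed>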

When you're learning about schemas, the easy thing to do is always use base="xsd:token", which is forgiving about almost any whitespace in the XML data. But in practice you must take into consideration how the data is being used and how it is being stored, and then choose the best data type.


Whitespace

The whiteSpace restriction is a little weird. It's different from the other restrictions because it doesn't limit the data; instead it provides instructions for how to deal with whitespace characters such as tabs, carriage returns etc. before applying other restrictions. I'm probably missing something, but it's actually doing the same thing as changing from base="xsd:string" to base="xsd:token" or base="xsd:normalizedString".

Using the restriction <xsd:whiteSpace value="preserve" /> is just like using base="xsd:string"; the data remains untouched. Using the restriction <xsd:whiteSpace value="replace" /> is just like using base="xsd:normalizedString". It replaces all carriage returns, line feeds and tabs with spaces. Using the restriction <xsd:whiteSpace value="collapse" /> is just like using base="xsd:token". It replaces all carriage returns, line feeds and tabs with spaces; it also removes leading and trailing spaces, and collapses multiple spaces into a single space.

In other words, the following two sets of rules are equivalent.

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:enumeration value="red" />
        <xsd:enumeration value="green" />
        <xsd:enumeration value="blue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:whiteSpace value="collapse" />
        <xsd:enumeration value="red" />
        <xsd:enumeration value="green" />
        <xsd:enumeration value="blue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Personally, I stick with changing the base data type. If you decide to use the whiteSpace restriction, be aware that some validators won't allow you to use something like:

      <xsd:restriction base="xsd:token">
        <xsd:whiteSpace value="preserve" />

This is because token strips characters out, but preserve says to leave them in. You can use the whiteSpace restriction to remove more characters, but it can't be looser than what you set with the base type. In other words, base="xsd:string" can have any whiteSpace value, base="xsd:normalizedString" can have whiteSpace values of replace or collapse, and base="xsd:token" can only have a whiteSpace value of collapse.
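
Going in the allowed direction looks like the following sketch, which tightens a normalizedString from its built-in replace behavior to collapse. The color values are reused from the earlier examples:

  <xsd:element name="color">
    <xsd:simpleType>
      <xsd:restriction base="xsd:normalizedString">
        <!-- collapse is stricter than replace, so this is allowed -->
        <xsd:whiteSpace value="collapse" />
        <xsd:enumeration value="red" />
        <xsd:enumeration value="green" />
        <xsd:enumeration value="blue" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>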

Patterns

One of the problems with enumerated types is that there can be so many valid values that listing them all is impossible, or at least unrealistically difficult. For example, if the data is supposed to be an Internet domain name, it would take years to list out even a fraction of the possibilities. The pattern restriction was designed to handle these types of situations. Like the name suggests, the pattern element allows you to define a pattern that can be used to check the data.

Defining the patterns is done with something called regular expressions, which have been used forever in the UNIX world. You will learn the basics of regular expressions here, but if you want to do anything even halfway complicated it's much easier to do an Internet search and use a pre-built expression.

Patterns are built by combining plain text with wildcard or variable sections. The following table shows some of the basic regular expression wildcards. Note that each of these wildcards will match exactly one character. Also note that in a schema the pattern has to match the entire value, not just some part of it.

Regular Expression  Matches
abcd                The text "abcd"
.                   Any single occurrence of any character (except newline)
[abc]               A single occurrence of either the character "a", the character "b" or the character "c"
[a-z]               A single occurrence of any lower case letter
[A-Z]               A single occurrence of any upper case letter
[a-zA-Z]            A single occurrence of any letter, either upper or lower case
[012]               A single occurrence of either the character "0", the character "1" or the character "2"
[0-9]               Any single numeric character
[0-9A-Za-z]         Any numeral or any upper or lower case letter
[^a]                Any character that is not the character "a" (The ^ means not)
[^a-z]              Any character that is not a lower case letter
[^1]                Any character that is not the character "1"
[^0-9]              Any character that is not a number
[-a-z]              Any lower case letter or a "-"
[0-9-]              Any number or a "-"
[^-a-z]             Any character except a lower case letter or a "-"
[]0-9]              Any number or a "]" (UNIX style; in a schema the bracket should be escaped, as in [0-9\]])
]                   The character "]" (in a schema this should be written \])
[0-9]]              Any number followed by the character "]" (in a schema this should be written [0-9]\])
[0-9\]]             Any number or the character "]"
[0\-9]              The character "0" or the character "-" or the character "9"
[\^1]               The character "^" or the character "1"
\d                  Any single number, the same as [0-9] (in a schema it also matches digits from other scripts)
\D                  Any character that is not a number, the same as [^0-9]
\w                  Any single alphanumeric character, or the same as [a-zA-Z0-9] (In UNIX this also includes "_", or the same as [a-zA-Z0-9_])
\W                  Any non-word character, that is, any character \w does not match
\s                  Any whitespace character (space, tab, newline, carriage return)
\S                  Any non-whitespace character
\n                  The newline character
\r                  The carriage return character
\t                  The tab character
\nnn                The ASCII character with the octal value nnn (UNIX regular expressions only; not part of the schema pattern syntax)
\xnn                The ASCII character with the hex value nn (UNIX regular expressions only; in a schema use an XML character reference such as &#xnn; instead)

Examples

[0-9][0-9]            Any number 00-99
Room [1-2][0-9][0-9]  "Room" followed by a space and 100 through 299
Title: ....[0-9]      "Title:" followed by a space, any 4 characters, and then a digit 0-9
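
To use one of these in a schema, the pattern goes in the value attribute of an xsd:pattern element. The following is a sketch only; the location element name is made up for this example:

  <xsd:element name="location">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <!-- "Room 100" through "Room 299"; the whole value must match -->
        <xsd:pattern value="Room [1-2][0-9][0-9]" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

With this rule <location>Room 215</location> should be valid, but <location>Room 2150</location> and <location>In Room 215</location> should not, because the pattern has to cover the entire value.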

Modifiers

The preceding table showed the single character wildcards. While they are helpful, they can result in some pretty long patterns. For example, the pattern for a four digit number would be: [0-9][0-9][0-9][0-9] or \d\d\d\d. Not horrible, but the patterns can get long.

To improve this, modifiers can be used to say how many times a character may appear. To use a modifier, put it in the pattern after a character or wildcard. For example * says there may be zero or more occurrences of the previous character. So a pattern of A* would match 0 or more occurrences of the letter A.

The following table lists the modifiers that can be used with XML patterns.

Modifier  Meaning
*         zero or more occurrences of the preceding
?         zero or one occurrence of the preceding
+         one or more occurrences of the preceding
{x}       exactly x occurrences of the preceding
{x,y}     x to y occurrences of the preceding
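
As a small sketch of a modifier in a schema, the four digit number from the start of this section can be written with {4}. The pin element name is an assumption for illustration:

  <xsd:element name="pin">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <!-- same as [0-9][0-9][0-9][0-9] -->
        <xsd:pattern value="[0-9]{4}" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Changing the pattern to [0-9]{4,6} would allow four, five or six digits.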

Or and grouping with parentheses

The last thing to discuss before we look at some examples is choosing between two patterns, and grouping characters. You can choose between 2 characters by placing the pipe "|" character between them in the pattern. For example: R|r would match the letter R or the letter r. For this reason, the "|" is referred to as "or" in patterns.

If you've been paying attention, you're probably thinking that choosing between R and r could also be done with the pattern [Rr], and you would be exactly correct. Where the "or" comes in useful is when we choose between entire patterns instead of between individual characters. For example, I may want to choose between "Anthony" and "Tony". To do this, the parentheses are used to group characters, just like they are in math. So to choose between "Anthony" and "Tony" the pattern would be: (Anthony)|(Tony). Don't put spaces around the "|", because a space in a pattern is treated as a character that must appear in the data.

Examples

Pattern  Matches
[Rr]ed   Red or red
[a-z]    One occurrence of any one of the lower case letters a through z
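
Here is a sketch of the "or" with grouping inside a schema rule. The firstName element is hypothetical:

  <xsd:element name="firstName">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <!-- no spaces around the |: a space in a pattern is a literal space -->
        <xsd:pattern value="(Anthony)|(Tony)" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

This should accept <firstName>Anthony</firstName> and <firstName>Tony</firstName> and reject anything else, including "anthony" and "Ton".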

String Length

There are occasions when you want to restrict the number of characters of the data. For example, you might want to limit movie titles to 30 characters, or you may want to make sure that state abbreviations are exactly 2 characters in length. You could write a pattern to do this, but there's a simpler way. The length, minLength and maxLength restrictions do exactly what they say.

Say for example that you have a carModel element, and you want the data to be at least 2 characters but no longer than 20 characters. This is done with the following set of rules:

  <xsd:element name="carModel">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:minLength value="2" />
        <xsd:maxLength value="20" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

The length restriction forces the data to be exactly a certain number of characters. For example, the following rules force the data for the modelID to be exactly 9 characters:

  <xsd:element name="modelID">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:length value="9" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
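
The two cases mentioned at the start of this section could be sketched the same way. The element names movieTitle and state are assumptions for this example:

  <!-- movie titles of at most 30 characters -->
  <xsd:element name="movieTitle">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:maxLength value="30" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- state abbreviations of exactly 2 characters -->
  <xsd:element name="state">
    <xsd:simpleType>
      <xsd:restriction base="xsd:token">
        <xsd:length value="2" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Note that length only counts characters. It would accept "CA" but also "9!", so a pattern is still needed if the characters themselves matter.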

Restricting numbers

There are several different ways to restrict numeric data, including setting the minimum and maximum values, or specifying the number of digits in the number.

The first method we'll look at is specifying the minimum and/or maximum values. The minimum value can be set using either the minInclusive element or the minExclusive element. The minInclusive element includes values that are greater than or equal to the supplied value, while minExclusive only includes values that are greater than the supplied value. The maxInclusive element and the maxExclusive element work the same way, but they obviously set the maximum instead of the minimum. As an example, the following rules restrict the values for the weight element to those between 6 and 18, but also including 6 and 18.

  <xsd:element name="weight">
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
        <xsd:minInclusive value="6" />
        <xsd:maxInclusive value="18" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

The following example uses minExclusive and maxExclusive, so the valid values are between 6 and 18, but 6 and 18 are not valid.

  <xsd:element name="weight">
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
        <xsd:minExclusive value="6" />
        <xsd:maxExclusive value="18" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
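
The inclusive and exclusive elements can also be mixed. As a sketch, assuming a hypothetical percent element that must be greater than 0 but may be exactly 100:

  <xsd:element name="percent">
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
        <!-- 0 is not valid, 100 is valid -->
        <xsd:minExclusive value="0" />
        <xsd:maxInclusive value="100" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>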

Another method for restricting values is to specify how many digits the value can contain. The totalDigits element specifies the maximum number of digits, which includes both before and after the decimal, but doesn't count the decimal itself. For example, 1234.5678 has a digit count of 8. The fractionDigits element is used to limit the maximum number of digits after the decimal.

The totalDigits element and the fractionDigits element can both be set, as shown in the following example. The entire number is limited to 8 digits, and there can only be 2 after the decimal.

  <xsd:element name="weight">
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
        <xsd:totalDigits value="8" />
        <xsd:fractionDigits value="2" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Note - For some reason XMLPad never likes your data if you use totalDigits and/or fractionDigits.

It's a little hard to use totalDigits to define the range of valid values. But it can be used in combination with minInclusive and maxInclusive.
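
A combination might look like the following sketch. The price element and its limits are made up for illustration:

  <xsd:element name="price">
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
        <!-- at most 6 digits in total, at most 2 after the decimal -->
        <xsd:totalDigits value="6" />
        <xsd:fractionDigits value="2" />
        <!-- and the value must be from 0 to 9999.99 -->
        <xsd:minInclusive value="0" />
        <xsd:maxInclusive value="9999.99" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Here the digit restrictions control how the number is written, while minInclusive and maxInclusive control the range.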

The last, and my least favorite, way to restrict numeric values, is to use the different base data types. If you look at the table you'll see that some of the types have a built in limit. I'd rather just use minInclusive and maxInclusive because I can't remember what the various data type limits are.

decimal        no limit
float          32-bit number. Values of the form m x 2^e, where m is a whole number smaller than 2^24 in size and e is from -149 to 104
double         64-bit number. Values of the form m x 2^e, where m is a whole number smaller than 2^53 in size and e is from -1075 to 970
long           9223372036854775807 to -9223372036854775808
int            2147483647 to -2147483648
short          32767 to -32768
byte           127 to -128
unsignedLong   0 to 18446744073709551615
unsignedInt    0 to 4294967295
unsignedShort  0 to 65535
unsignedByte   0 to 255

In practice, the base data types are used quite a bit to optimize data storage, not to restrict the range of values. That isn't to say they won't restrict the values, they will; but that's not why they're used. In huge database tables like those used by Amazon.com or eBay, which may hold millions of records, it's crucial to optimize the type of data stored in each field, and better to just use a byte of storage if that's all that's needed. So, if a data field is supposed to hold a byte of data, use the byte base data type. If you use a data type that requires more storage, the extra space will be needed in every one of the millions of records and will significantly increase the storage requirements for the database.
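
As a sketch of the difference, both of the following declarations should accept whole numbers from 0 to 255. The first gets the limit from the built-in type, the second spells it out. The element names are hypothetical:

  <!-- the range comes with the data type -->
  <xsd:element name="quantity" type="xsd:unsignedByte" />

  <!-- the same range written out by hand -->
  <xsd:element name="quantityChecked">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="0" />
        <xsd:maxInclusive value="255" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>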


Restricting Dates and Times

Restricting dates and times is done with the same minInclusive, minExclusive, maxInclusive and maxExclusive that we used to restrict numbers. For example, the following would be used to restrict dates to those between January 1, 2005 and December 31, 2010:

  <xsd:element name="permitDate">
    <xsd:simpleType>
      <xsd:restriction base="xsd:date">
        <xsd:minInclusive value="2005-01-01" />
        <xsd:maxInclusive value="2010-12-31" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
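
Times work the same way. The following is a sketch that assumes a hypothetical startTime element limited to 8:00 AM or later and before 5:00 PM, with times written without a time zone:

  <xsd:element name="startTime">
    <xsd:simpleType>
      <xsd:restriction base="xsd:time">
        <!-- 08:00:00 is valid, 17:00:00 is not -->
        <xsd:minInclusive value="08:00:00" />
        <xsd:maxExclusive value="17:00:00" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>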