XML DTD and Schemas

A DTD or schema acts as a type system for XML documents: it enforces constraints on the data.

Document Type Definitions (DTDs)

Example: An Address Book

Specifying the Structure

Regular expression syntax (inspired by UNIX regular expressions)

expression        denotes
name              a name element
greet?            an optional (0 or 1) greet element
name, greet?      a name followed by an optional greet
addr*             0 or more address lines
tel | fax         a tel or a fax element
(tel | fax)*      0 or more repeats of tel or fax
email*            0 or more email elements

So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*
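
Because a content model is a regular expression over child element names, the person model above can be mimicked with an ordinary regex engine. The sketch below is only an illustration, not how a validating parser works internally: the helper matches_person and the trick of encoding each child name as a token followed by one space are assumptions introduced here.

```python
import re

# The person content model  name, greet?, addr*, (tel | fax)*, email*
# rewritten as a Python regex. Each child element name is encoded as the
# name followed by a single space, so "name greet " stands for <name/><greet/>.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def matches_person(children):
    """Return True if the list of child element names fits the model."""
    encoded = "".join(name + " " for name in children)
    return PERSON_MODEL.fullmatch(encoded) is not None


print(matches_person(["name", "greet", "email"]))     # True
print(matches_person(["name", "tel", "fax", "tel"]))  # True
print(matches_person(["greet", "name"]))              # False: name must come first
print(matches_person(["name", "email", "addr"]))      # False: addr after email
```

The order of the sub-expressions matters: an addr that appears after an email is rejected even though both are individually allowed.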

Element Type Definition

For each element type E, a declaration of the form:

<!ELEMENT   E   content-model>

where the content-model is an expression:

content-model ::= EMPTY
                | ANY
                | #PCDATA
                | P1, P2
                | P1 | P2
                | P1?
                | P1+
                | P1*
                | (P)

expression   denotes
P1, P2       concatenation
P1 | P2      disjunction
P?           optional
P+           one or more occurrences
P*           the Kleene closure
(P)          grouping

The definition of an element consists of exactly one of the following: EMPTY, ANY, mixed content, or a regular expression over element names (element content).

Mixed content is described by a repeatable OR group

 (#PCDATA | element-name | …)*

Inside the group, regular expressions are not allowed, only element names: #PCDATA must come first, followed by 0 or more element names separated by |. The group as a whole can be repeated 0 or more times.
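
A mixed-content model therefore constrains only which child elements may occur, not their order or number, and text may appear anywhere between them. A minimal sketch of that check follows; the element names para, em and code, and the helper fits_mixed_model, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration being mimicked:
#   <!ELEMENT para (#PCDATA | em | code)*>
ALLOWED_CHILDREN = {"em", "code"}


def fits_mixed_model(element):
    """Text may appear anywhere; every direct child must be a listed name."""
    return all(child.tag in ALLOWED_CHILDREN for child in element)


ok = ET.fromstring(
    "<para>Use <code>greet?</code> for an <em>optional</em> element.</para>"
)
bad = ET.fromstring("<para>Text with a <table/> inside.</para>")

print(fits_mixed_model(ok))   # True
print(fits_mixed_model(bad))  # False: table is not in the OR group
```

Only direct children are inspected here; the content of em and code would be governed by their own declarations.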

Address Book Document with an Internal DTD

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
   <!ELEMENT addressbook (person*)>
   <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
   <!ELEMENT name    (#PCDATA)>
   <!ELEMENT greet   (#PCDATA)>
   <!ELEMENT address (#PCDATA)>
   <!ELEMENT tel     (#PCDATA)>
   <!ELEMENT fax     (#PCDATA)>
   <!ELEMENT email   (#PCDATA)>
]>
<addressbook>
  <person>
    <name>Jeff Cohen</name>
    <greet>Dr. Cohen</greet>
    <email>jc@penny.com</email>
  </person>
</addressbook>

Some Difficult Structures

Each employee element should contain name, age and ssn elements in some order

<!ELEMENT employee
    ((name, age, ssn) | 
     (age, ssn, name) |
     (ssn, name, age) | 
      ...
      ...
    )>

Too many permutations! Three elements already need 3! = 6 alternatives, and n elements need n!.
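
To see how quickly the number of alternatives grows, the orderings can be enumerated mechanically. This is a small sketch: the helper orderings is introduced here, and the extra element names dept and email in the second call are hypothetical.

```python
from itertools import permutations


def orderings(names):
    """List every sequence group a DTD would need to allow any order."""
    return ["(" + ", ".join(p) + ")" for p in permutations(names)]


alternatives = orderings(["name", "age", "ssn"])
for alt in alternatives:
    print(alt)

print(len(alternatives))                                           # 6
print(len(orderings(["name", "age", "ssn", "dept", "email"])))     # 120
```

XML 1.0 also expects content models to be deterministic, so a working DTD could not simply join these groups with |; alternatives that start with the same element would have to be factored, for example (name, ((age, ssn) | (ssn, age))).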

Attribute Specification in DTDs

<!ELEMENT height (#PCDATA)>
<!ATTLIST height 
      dimension CDATA #REQUIRED
      accuracy  CDATA #IMPLIED >

The format of an Attribute Definition

<!ATTLIST element-name attr-name attr-type attr-default>

The default value is given inside quotes
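
The effect of the three kinds of attribute default used so far (#REQUIRED, #IMPLIED, and a quoted literal) can be sketched as follows. Python's standard-library parser does not read ATTLIST declarations for validation, so the table HEIGHT_ATTLIST is written out by hand; the attribute method with default "manual", the sample values, and the helper effective_attributes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Mirrors the height declaration above, plus one hypothetical attribute
# ("method") that carries a quoted default value:
#   <!ATTLIST height method CDATA "manual">
HEIGHT_ATTLIST = {
    "dimension": "#REQUIRED",
    "accuracy": "#IMPLIED",
    "method": '"manual"',
}


def effective_attributes(element, attlist):
    """Return the element's attributes after applying ATTLIST defaults."""
    result = dict(element.attrib)
    for name, default in attlist.items():
        if name in result:
            continue                      # explicitly given: keep as is
        if default == "#REQUIRED":
            raise ValueError(f"missing required attribute: {name}")
        if default != "#IMPLIED":
            result[name] = default.strip('"')   # quoted literal default
    return result


height = ET.fromstring('<height dimension="cm">180</height>')
print(effective_attributes(height, HEIGHT_ATTLIST))
# {'dimension': 'cm', 'method': 'manual'}
```

An #IMPLIED attribute that is absent simply stays absent, whereas a missing #REQUIRED attribute makes the document invalid.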

Attribute types:

ID, IDREF, IDREFS are used for references

Attribute Default

Recursive DTDs

<!DOCTYPE genealogy [
    <!-- first person child: mother; second person child: father -->
    <!ELEMENT   genealogy (person*)>
    <!ELEMENT   person (
        name,
        dateOfBirth,
        person,
        person   ) >
]>

Problem with this DTD: every person must contain two more “person” sub-elements, each of which again requires two, so the nesting never ends and no finite document can be valid.

<!DOCTYPE genealogy [
    <!-- first person child: mother; second person child: father -->
    <!ELEMENT   genealogy (person*)>
    <!ELEMENT   person (
        name,
        dateOfBirth,
        person?,
        person?  ) >
    ...
]>

The problem with this DTD is that if only one “person” sub-element is present, we cannot tell whether that person is the father or the mother.

Using ID and IDREF Attributes

  <!DOCTYPE family [
   <!ELEMENT family   (person)* >
   <!ELEMENT person  (name) >
   <!ELEMENT name    (#PCDATA) >
   <!ATTLIST  person 
        id ID #REQUIRED
        mother IDREF #IMPLIED
        father IDREF #IMPLIED
        children IDREFS #IMPLIED >
 ]>

IDs and IDREFs

<person id="p898" father="p332" mother="p336" children="p982 p984 p986">

ID values must be XML names, so they cannot start with a digit.

Some Conforming Data


<family>
    <person  id="lisa"  mother="marge" father="homer"> 
      <name> Lisa Simpson </name> 
    </person>
    <person  id="bart"  mother="marge" father="homer"> 
      <name> Bart Simpson </name> 
    </person>
    <person id="marge" children="bart lisa"> 
        <name> Marge Simpson </name>
    </person> 
    <person id="homer" children="bart lisa">
        <name> Homer Simpson </name>
    </person>
</family>
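
A validating parser checks that every ID value is unique and that every IDREF or IDREFS token names an existing ID. Python's standard-library parsers are non-validating, so the sketch below performs that check by hand on the data above; the helper reference_errors and its error-message format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

FAMILY = """
<family>
  <person id="lisa" mother="marge" father="homer"><name>Lisa Simpson</name></person>
  <person id="bart" mother="marge" father="homer"><name>Bart Simpson</name></person>
  <person id="marge" children="bart lisa"><name>Marge Simpson</name></person>
  <person id="homer" children="bart lisa"><name>Homer Simpson</name></person>
</family>
"""


def reference_errors(root):
    """Report duplicate ids and IDREF/IDREFS tokens that name no existing id."""
    errors = []
    ids = set()
    for person in root.iter("person"):
        pid = person.get("id")
        if pid in ids:
            errors.append(f"duplicate id: {pid}")
        ids.add(pid)
    for person in root.iter("person"):
        for attr in ("mother", "father", "children"):
            # An IDREFS value is a whitespace-separated list of ids.
            for ref in person.get(attr, "").split():
                if ref not in ids:
                    errors.append(f"{person.get('id')}/{attr} -> unknown id {ref}")
    return errors


print(reference_errors(ET.fromstring(FAMILY)))  # []
```

The ids are collected in a first pass because a reference may point forward: lisa refers to marge before marge has been declared.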

Limitations of ID References

An Alternative Specification

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE family [
    <!ELEMENT family (person)* >
    <!ELEMENT person (name, mother?, father?, children?) >
       <!ATTLIST person id ID #REQUIRED >
    <!ELEMENT name (#PCDATA) >
    <!ELEMENT mother EMPTY >
       <!ATTLIST mother idref IDREF #REQUIRED >
    <!ELEMENT father EMPTY >
       <!ATTLIST father idref IDREF #REQUIRED >
    <!ELEMENT children EMPTY >
       <!ATTLIST children idrefs IDREFS #REQUIRED >
]>

Empty sub-elements instead of attributes

The Revised Data

<family>
  <person id="marge">
    <name>Marge Simpson</name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="homer">
    <name>Homer Simpson</name>
    <children idrefs="bart lisa" />
  </person>
  <person id="bart">
    <name>Bart Simpson</name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  <person id="lisa">
    <name>Lisa Simpson</name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
</family>

Consistency of ID and IDREF Attribute Values

Adding a DTD to the Document

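A DTD can be internal (embedded in the document's DOCTYPE declaration) or external (stored in a separate file that the DOCTYPE declaration references).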

Connecting a Document with its DTD

An internal DTD

<?xml version="1.0"?>
<!DOCTYPE db [<!ELEMENT ...> … ]>
<db> ... </db>

A DTD from the local file system:

<!DOCTYPE db SYSTEM "schema.dtd">

A DTD from a remote location, referenced by URL:

<!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
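
The link between a document and its DTD can be inspected programmatically. A brief sketch with Python's xml.dom.minidom follows; the one-element document is a stand-in for the db example above. Note that minidom is a non-validating parser: it records the DOCTYPE declaration but does not fetch schema.dtd or check the document against it.

```python
from xml.dom import minidom

# A stand-in document that points at an external DTD on the local file system.
SOURCE = '<?xml version="1.0"?>\n<!DOCTYPE db SYSTEM "schema.dtd">\n<db/>'

doc = minidom.parseString(SOURCE)

print(doc.doctype.name)      # db
print(doc.doctype.systemId)  # schema.dtd
```

The name after DOCTYPE must match the root element of the document, which is why both are db here.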

Well-Formed XML Documents

An XML document (with or without a DTD) is well-formed if

Valid Documents

A well-formed XML document is valid if it conforms to its DTD, that is,