Type system to enforce data constraints.
Specifying the Structure
Regular expression syntax (inspired from UNIX regular expressions)
expression | denotes |
---|---|
name |
a name element |
greet? |
an optional (0 or 1) greet elements |
name, greet? |
a name followed by an optional greet |
addr* |
0 or more address lines |
tel | fax |
a tel or a fax element |
(tel | fax)* |
0 or more repeats of tel or fax |
email* |
0 or more email elements |
So the whole structure of a person entry is specified by
name, greet?, addr*, (tel | fax)*, email*
Element Type Definition
For each element type E
, a declaration of the form:
<!ELEMENT E content-model>
where the content-model
is an expression:
content-model ::=
EMPTY | ANY | #PCDATA | P1, P2 | P1 | P2 | P1? | P1+ | P1* | (P)
expression | denotes |
---|---|
P1 , P2 |
concatenation |
P1 | P2 |
disjunction |
P? |
optional |
P+ |
one or more occurrences |
P* |
the Kleene closure |
(P) |
grouping |
The definition of an element consists of exactly one of the following:
Mixed content is described by a repeatable OR group
(#PCDATA | element-name | …)*
Inside the group, no regular expressions – just element names; i.e. #PCDATA
must be first followed by 0 or more element names, separated by |
; The group can be repeated 0 or more times
Address Book Document with an Internal DTD
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT greet (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
<addressbook>
<person>
<name>Jeff Cohen</name>
<greet>Dr. Cohen</greet>
<email>jc@penny.com</email>
</person>
</addressbook>
Some Difficult Structures
Each employee element should contain name, age and ssn elements in some order
<!ELEMENT employee
((name, age, ssn) |
(age, ssn, name) |
(ssn, name, age) |
...
...
)>
Too many permutations!
Attribute Specification in DTDs
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
dimension CDATA #REQUIRED
accuracy CDATA #IMPLIED >
The format of an Attribute Definition
<!ATTLIST element-name attr-name attr-type attr-default>
The default value is given inside quotes
Attribute types:
ID, IDREF, IDREFS are used for references
Attribute Default
Recursive DTDs
<DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
name,
dateOfBirth,
person, -- mother
person ) > -- father
]>
Problem with this DTD: Parser does not see the recursive structure and looks for “person” sub-element indefinitely!
<DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
name,
dateOfBirth,
person?, -- mother
person? ) > -- father
...
]>
The problem with this DTD is if only one “person” sub-element is present, we would not know if that person is the father or the mother.
Using ID and IDREF Attributes
<!DOCTYPE family [
<!ELEMENT family (person)* >
<!ELEMENT person (name) >
<!ELEMENT name (#PCDATA) >
<!ATTLIST person
id ID #REQUIRED
mother IDREF #IMPLIED
father IDREF #IMPLIED
children IDREFS #IMPLIED >
]>
IDs and IDREFs
<person id="898" father="332" mother="336" children="982 984 986">
Some Conforming Data
<family>
<person id="lisa" mother="marge" father="homer">
<name> Lisa Simpson </name>
</person>
<person id="bart" mother="marge" father="homer">
<name> Bart Simpson </name>
</person>
<person id="marge" children="bart lisa">
<name> Marge Simpson </name>
</person>
<person id="homer" children="bart lisa">
<name> Homer Simpson </name>
</person>
</family>
Limitations of ID References
An Alternative Specification
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE family [
<!ELEMENT family (person)* >
<!ELEMENT person (name, mother?, father?, children?) >
<!ATTLIST person id ID #REQUIRED >
<!ELEMENT name (#PCDATA) >
<!ELEMENT mother EMPTY >
<!ATTLIST mother idref IDREF #REQUIRED >
<!ELEMENT father EMPTY >
<!ATTLIST father idref IDREF #REQUIRED >
<!ELEMENT children EMPTY >
<!ATTLIST children idrefs IDREFS #REQUIRED >
]>
Empty sub-elements instead of attributes
The Revised Data
<family>
<person id="marge">
<name>Marge Simpson</name>
<children idrefs="bart lisa"/>
</person>
<person id="homer">
<name>Homer Simpson</name>
<children idrefs="bart lisa" />
</person>
<person id="bart">
<name>Bart Simpson</name>
<mother idref="marge"/>
<father idref="homer"/>
</person>
<person id="lisa">
<name>Lisa Simpson</name>
<mother idref="marge"/>
<father idref="homer"/>
</person>
</family>
Consistency of ID and IDREF Attribute Values
Adding a DTD to the Document
A DTD can be
Connecting a Document with its DTD
An internal DTD
<?xml version="1.0"?>
<!DOCTYPE db [<!ELEMENT ...> … ]>
<db> ... </db>
A DTD from the local file system:
<!DOCTYPE db SYSTEM "schema.dtd">
A DTD from a remote file system:
<!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
An XML document (with or without a DTD) is well-formed if
A well-formed XML document is valid if it conforms to its DTD, that is,