## XML DTD and Schemas

A type system to enforce data constraints.

### Document Type Definitions (DTDs)

* A DTD is a way to specify the structure of XML documents.
* A DTD adds syntactic requirements on top of the well-formedness requirement.
* DTDs help in
  - Eliminating errors when creating or editing XML documents.
  - Clarifying the intended semantics.
  - Simplifying the processing of XML documents.
* DTDs
  - Use a regular-expression-like syntax to specify a grammar for the XML document.
  - Have limitations such as weak data types, inability to specify complex constraints, no support for schema evolution, etc.

#### Example: An Address Book

**Specifying the Structure**

Regular expression syntax (inspired by UNIX regular expressions)

| expression        | denotes                              |
|-------------------|--------------------------------------|
| ``name``          | a name element                       |
| ``greet?``        | an optional (0 or 1) greet element   |
| ``name, greet?``  | a name followed by an optional greet |
| ``addr*``         | 0 or more address lines              |
| ``tel \| fax``    | a tel or a fax element               |
| ``(tel \| fax)*`` | 0 or more repeats of tel or fax      |
| ``email*``        | 0 or more email elements             |

So the whole structure of a person entry is specified by

``name, greet?, addr*, (tel | fax)*, email*``

* Each element type of the XML document is described by an expression.
* The leaf-level element types are described by the data type ``#PCDATA`` (parsed character data).
* Each attribute of an element type is also described in the DTD by enumerating some of its properties (OPTIONAL, etc.)

**Element Type Definition**

For each element type ``E``, a declaration of the form:

```
<!ELEMENT E content-model>
```

where the ``content-model`` is an expression:

```
content-model ::= EMPTY | ANY | #PCDATA
                | P1, P2 | P1 | P2
                | P1? | P1+ | P1* | (P)
```

| expression    | denotes                 |
|---------------|-------------------------|
| ``P1 , P2``   | concatenation           |
| ``P1 \| P2``  | disjunction             |
| ``P?``        | optional                |
| ``P+``        | one or more occurrences |
| ``P*``        | the Kleene closure      |
| ``(P)``       | grouping                |

The definition of an element consists of exactly one of the following:

* \#PCDATA
* A regular expression (as defined earlier)
* EMPTY: the element has no content
* ANY: the content can be any mixture of PCDATA and elements defined in the DTD

**Mixed content** is described by a repeatable OR group

```
(#PCDATA | element-name | …)*
```

Inside the group no regular expressions are allowed, only element names: ``#PCDATA`` must come first, followed by 0 or more element names separated by ``|``. The group can be repeated 0 or more times.

**Address Book Document with an Internal DTD**

```
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, addr*, (tel | fax)*, email*)>
  <!ELEMENT name  (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT addr  (#PCDATA)>
  <!ELEMENT tel   (#PCDATA)>
  <!ELEMENT fax   (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
<addressbook>
  <person>
    <name>Jeff Cohen</name>
    <greet>Dr. Cohen</greet>
    <email>jc@penny.com</email>
  </person>
</addressbook>
```

**Some Difficult Structures**

Each employee element should contain name, age and ssn elements in some order

```
<!ELEMENT employee (
    (name, age, ssn) | (age, ssn, name) |
    (ssn, name, age) | ... )>
```

Too many permutations! Every allowed ordering has to be spelled out explicitly.

**Attribute Specification in DTDs**

```
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
    dimension CDATA #REQUIRED
    accuracy  CDATA #IMPLIED>
```

* The dimension attribute is required
* The accuracy attribute is optional
* CDATA is the "type" of the attribute: character data

The format of an Attribute Definition

```
<!ATTLIST element-name attr-name attr-type default-value>
```

The default value is given inside quotes

Attribute types:

* CDATA
* ID, IDREF, IDREFS

ID, IDREF and IDREFS are used for references

Attribute Default

* \#REQUIRED: the attribute must be explicitly provided
* \#IMPLIED: the attribute is optional, no default is provided
* "value": if not explicitly provided, this value is inserted by default
* \#FIXED "value": as above, but only this value is allowed

**Recursive DTDs**

```
<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (
      name,
      dateOfBirth,
      person,      -- mother
      person )>    -- father
]>
```

Problem with this DTD: every ``person`` element is required to contain two further ``person`` sub-elements, so the nesting never terminates. A parser keeps looking for “person” sub-elements indefinitely, and no finite document can be valid.

```
<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (
      name,
      dateOfBirth,
      person?,     -- mother
      person? )>   -- father
  ...
]>
```

The problem with this DTD is that if only one “person” sub-element is present, we would not know if that person is the father or the mother.
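The content models in this section can be read as ordinary regular expressions over the sequence of child element names. The sketch below is one way to make that reading concrete in Python. The helper names `model_to_regex` and `conforms` are hypothetical, the sketch only handles models built from element names with `,`, `|`, `?`, `*`, `+` and parentheses (not `#PCDATA`, `EMPTY` or `ANY`), and it is not meant to describe how a real validating parser works.

```python
import re


def model_to_regex(model: str) -> re.Pattern:
    """Turn a DTD-style content model into a regex over child element names.

    A child sequence such as ["name", "greet"] is encoded as "name greet ",
    so every element name in the model becomes a group matching "<name> ".
    """
    compact = re.sub(r"\s+", "", model)  # drop whitespace in the model
    pattern = re.sub(
        r"[A-Za-z_][\w.\-]*",            # each element name ...
        lambda m: "(?:" + re.escape(m.group(0)) + " )",  # ... matches "name "
        compact,
    )
    # ',' means "followed by", which is plain concatenation in a regex.
    # '|', '?', '*', '+' and parentheses already mean the same thing.
    return re.compile(pattern.replace(",", ""))


def conforms(model: str, children: list[str]) -> bool:
    """Check whether a list of child element names matches the content model."""
    encoded = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(encoded) is not None


PERSON = "name, greet?, addr*, (tel | fax)*, email*"

print(conforms(PERSON, ["name", "greet", "email"]))       # True
print(conforms(PERSON, ["name", "tel", "fax", "email"]))  # True
print(conforms(PERSON, ["greet", "name"]))                # False: name must come first
print(conforms(PERSON, ["name", "email", "tel"]))         # False: tel must precede email

# Assumed simplified recursive model with two optional parents.
print(conforms("name, person?, person?", ["name", "person"]))  # True
```

The last call illustrates the ambiguity noted above: a single `person` child is accepted, but the match carries no information about whether it stands for the mother or the father.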
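Using ID and IDREF Attributes

```
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
      id       ID     #REQUIRED
      mother   IDREF  #IMPLIED
      father   IDREF  #IMPLIED
      children IDREFS #IMPLIED>
]>
```

**IDs and IDREFs**

* ID attribute: its value is unique within the entire document.
  - An element can have at most one ID attribute.
  - No default value (and no fixed default) is allowed, so the declaration uses one of
    * \#REQUIRED: a value must be provided
    * \#IMPLIED: a value is optional
* IDREF attribute: its value must be the ID value of some element in the document.
* IDREFS attribute: its value is a set, and each member of the set must be the ID value of some element in the document.

Some Conforming Data

```
<family>
  <person id="lisa" mother="marge" father="homer">
    <name>Lisa Simpson</name>
  </person>
  <person id="bart" mother="marge" father="homer">
    <name>Bart Simpson</name>
  </person>
  <person id="marge" children="bart lisa">
    <name>Marge Simpson</name>
  </person>
  <person id="homer" children="bart lisa">
    <name>Homer Simpson</name>
  </person>
</family>
```

**Limitations of ID References**

* The attributes mother and father are references to IDs of other elements.
* However, those are not necessarily person elements!
* The mother attribute is not necessarily a reference to a female person.

**An Alternative Specification**

```
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name, mother?, father?, children?)>
  <!ATTLIST person id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT mother EMPTY>
  <!ATTLIST mother idref IDREF #REQUIRED>
  <!ELEMENT father EMPTY>
  <!ATTLIST father idref IDREF #REQUIRED>
  <!ELEMENT children EMPTY>
  <!ATTLIST children idrefs IDREFS #REQUIRED>
]>
```

Empty sub-elements are used instead of attributes.

The Revised Data

```
<family>
  <person id="marge">
    <name>Marge Simpson</name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="homer">
    <name>Homer Simpson</name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="bart">
    <name>Bart Simpson</name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  <person id="lisa">
    <name>Lisa Simpson</name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
</family>
```

**Consistency of ID and IDREF Attribute Values**

* If an attribute is declared as ID
  - The associated value must be distinct, i.e., different elements (in the given document) must have different values for the ID attribute.
  - This holds even if the two elements have different element names.
* If an attribute is declared as IDREF
  - The associated value must exist as the value of some ID attribute (no dangling “pointers”).
* The same holds for every value of an IDREFS attribute.
* ID, IDREF and IDREFS attributes are not typed: a reference cannot be restricted to a particular element type.

**Adding a DTD to the Document**

A DTD can be

* _internal_: The DTD is part of the document file
* _external_: The DTD and the document are in separate files

An external DTD may reside

* In the local file system (where the document is)
* In a remote file system

**Connecting a Document with its DTD**

An internal DTD

```
<!DOCTYPE root [
  …
]>
<root> ... </root>
```

A DTD from the local file system:

```
<!DOCTYPE root SYSTEM "schema.dtd">
```

A DTD from a remote file system:

```
<!DOCTYPE root SYSTEM "http://example.com/schema.dtd">
```

#### Well-Formed XML Documents

An XML document (with or without a DTD) is **well-formed** if

* Tags are syntactically correct
* Every start tag has a matching end tag
* Tags are properly nested
* There is a single root element
* A start tag does not have two occurrences of the same attribute

#### Valid Documents

A well-formed XML document is **valid** if it conforms to its DTD, that is,

* The document conforms to the regular-expression grammar
* The attribute types are correct, and
* The constraints on references are satisfied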
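The difference between the two notions can be sketched in Python. The standard-library `xml.etree.ElementTree` parser checks well-formedness only and does not validate against a DTD, so the reference constraints are checked by hand here. The function names, and the assumption that `id` plays the role of the ID attribute while `mother` and `father` play the role of IDREF attributes, are illustrative choices for this sketch rather than something read from a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def reference_errors(text, id_attr="id", idref_attrs=("mother", "father")):
    """Check the ID/IDREF constraints by hand.

    Assumption: the attribute names are passed in by the caller, because
    ElementTree does not read attribute types from a DTD.
    """
    root = ET.fromstring(text)
    errors = []

    # ID values must be unique across the whole document,
    # regardless of the element name they appear on.
    seen = set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if value in seen:
            errors.append(f"duplicate ID {value!r}")
        seen.add(value)

    # Every IDREF value must match some ID value (no dangling pointers).
    for elem in root.iter():
        for attr in idref_attrs:
            ref = elem.get(attr)
            if ref is not None and ref not in seen:
                errors.append(f"dangling IDREF {attr}={ref!r}")
    return errors


# Hypothetical sample data: well-formed, but "homer" is never declared.
FAMILY = """
<family>
  <person id="lisa" mother="marge" father="homer"><name>Lisa Simpson</name></person>
  <person id="marge"><name>Marge Simpson</name></person>
</family>
"""

print(is_well_formed("<a><b></a></b>"))    # False: tags are not properly nested
print(is_well_formed('<a x="1" x="2"/>'))  # False: attribute x occurs twice
print(is_well_formed(FAMILY))              # True
print(reference_errors(FAMILY))            # ["dangling IDREF father='homer'"]
```

The sample document passes the well-formedness check but would not be valid under a DTD that declares `father` as IDREF, since no element carries the ID `homer`.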