## XML Basics

### I. Introduction

* XML: A W3C standard to complement HTML
* Two facets of XML: document-centric and data-centric
* Motivation
  - HTML describes presentation
  - XML describes content
* User-defined tags to mark up “content”
* Text-based format
* Ideal as a “data interchange” format
* Ideal for “distributed” applications (client-server)
* All major database products have been retrofitted with facilities to store and construct XML documents.
* XML is closely related to object-oriented and so-called semistructured data.

### II. Semistructured Data

An HTML document (student list) to be displayed on the Web:

```

John Doe

Id: s111111111

Address:

Number: 123
Street: Main

```

To make the previous student list suitable for machine consumption on the Web, it should have the following characteristics:

* Be **object-like**
* Be **schemaless** (not guaranteed to conform exactly to any schema, but different objects have some commonality among themselves)
* Be **self-describing** (some schema-like information, like attribute names, is part of the data itself)

Data with these characteristics are referred to as **semistructured**.

Semistructured data can be viewed as a set of label-value pairs:

```
{ name: "Alan", tel: 2157786, email: "a@abc.com" }
```

Graph model: nodes represent objects, connected by labeled edges to values.

The values themselves may be structures:

```
{ name: {first: "Alan", last: "Black"}, tel: 2157786, email: "a@abc.com" }
```
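As a rough illustration of the graph model, a nested object such as the one above can be flattened into labeled edges. The sketch below is only one possible encoding: the list-of-pairs representation, the `edges` helper, and the `&n1`-style node ids are assumptions made for this example, not part of the notation itself.

```python
from itertools import count

# An ssd object is modeled as a list of (label, value) pairs; a value is
# either an atomic Python value or another list of pairs. Lists are used
# instead of dicts so that repeated labels stay representable.
# (This encoding is an assumption made for illustration.)
alan = [
    ("name", [("first", "Alan"), ("last", "Black")]),
    ("tel", 2157786),
    ("email", "a@abc.com"),
]

def edges(obj):
    """Return (source_node, label, target) triples for the graph model."""
    ids = count(1)  # fresh node numbers: 1, 2, 3, ...
    result = []

    def visit(pairs):
        node = f"&n{next(ids)}"  # hypothetical generated node id
        for label, value in pairs:
            if isinstance(value, list):
                # Complex value: the edge points at a new object node.
                result.append((node, label, visit(value)))
            else:
                # Atomic value: the edge points directly at the value.
                result.append((node, label, value))
        return node

    visit(obj)
    return result

graph = edges(alan)
for edge in graph:
    print(edge)
```

Each triple reads as "from this node, follow this label, reach this node or atomic value"; the root object becomes `&n1` and the nested name structure becomes `&n2`.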

Duplicate labels are allowed:

```
{ name: "Alan", tel: 2157786, tel: 2498762 }
```

The syntax is easily generalized to describe sets of objects:

```
{ person: { name: "Alan", tel: 2157786, email: "a@abc.com" },
  person: { name: "Sara", tel: 2136877, email: "sara@abc.com" },
  person: { name: "Fred", tel: 7786312, email: "fred@abc.com" }
}
```

Objects within a set need not all have the same structure:

```
{ person: { name: "Alan", tel: 2157786, email: "a@abc.com" },
  person: { name: {first: "Sara", last: "Black"}, email: "s@abc.com" },
  person: { name: "Fred", tel: 7786312, height: 168 }
}
```

Relational data is easily represented:

```
{ r1: { row: {a: a1, b: b1, c: c1},
        row: {a: a2, b: b2, c: c2} },
  r2: { row: {c: c2, d: d2},
        row: {c: c3, d: d3},
        row: {c: c4, d: d4} }
}
```

Object-oriented data is also naturally represented (each node has a unique object id, either explicitly mentioned or system generated):

```
{ person: &o1{ name: "Mary", age: 45, child: &o2, child: &o3 },
  person: &o2{ name: "John", age: 17, relatives: { mother: &o1, sister: &o3 } },
  person: &o3{ name: "Jane", country: "Canada", mother: &o1 }
}
```

#### Semistructured Data Model

Formal syntax for the semistructured data model:

```
<ssd-expr>     ::= <value> | oid <value> | oid
<value>        ::= atomicvalue | <complexvalue>
<complexvalue> ::= { label: <ssd-expr>, ..., label: <ssd-expr> }
```

* An oid value is said to be DEFINED if it appears before a value; otherwise it is said to be USED
* An ssd-expression is CONSISTENT if
  - an oid is defined at most once, and
  - if an oid is used, it must also be defined.

This is a flexible and powerful data model, capable of representing data that does not have to follow the strict rules of databases.

#### What is Self-describing Data?
**Non-self-describing** (relational, object-oriented):

```
Data part:
(#12345,
  ["Students",
    { ["John Doe", s111111111, [123, "Main St"]],
      ["Joe Public", s222222222, [321, "Pine St"]] }
  ]
)

Schema part:
PersonList[ ListName: String,
            Contents: [ Name: String,
                        Id: String,
                        Address: [Number: Integer, Street: String] ]
          ]
```

**Self-describing:** Attribute names are embedded in the data itself, but are distinguished from values. A schema is not needed to figure out what is what (but a schema might be useful nonetheless).

```
(#12345,
  [ ListName: "Students",
    Contents: {
      [ Name: "John Doe", Id: "s111111111",
        Address: [ Number: 123, Street: "Main St" ] ],
      [ Name: "Joe Public", Id: "s222222222",
        Address: [ Number: 321, Street: "Pine St" ] ]
    }
  ]
)
```

### III. XML: eXtensible Markup Language

* Suitable for semistructured data and has become a standard
* Used to describe content rather than presentation
* Differs from HTML in the following ways:
  - New tags may be defined at will by the author of the document (extensible)
  - No semantics behind tags. For instance, HTML’s ``<table>…</table>`` means: render contents as a table; in XML it doesn’t mean anything special.
  - Structures may be **nested arbitrarily**
  - An XML document may contain an optional schema that describes its structure
  - Intolerant of errors: browsers will render buggy HTML pages, but XML processors will reject ill-formed XML documents.

**XML Elements**

An element is a piece of text bounded by user-defined matching tags:

```
<person>
  <name>Alan</name>
  <age>42</age>
  <email>agb@abc.com</email>
</person>
```

Note:

* An element includes the start and end tag
* No quotation marks around strings; XML treats all data as text. This is referred to as PCDATA (Parsed Character Data).
* Empty elements: ``<married></married>`` can be abbreviated to ``<married/>``

Collections are expressed using repeated structures. Ex. The collection of all persons on the 4th floor:

```
<table>
  <description>People on the 4th floor</description>
  <people>
    <person><name>Alan</name><age>42</age><email>agb@abc.com</email></person>
    <person><name>Patsy</name><age>36</age><email>ptn@abc.com</email></person>
    <person><name>Ryan</name><age>58</age><email>rgz@abc.com</email></person>
  </people>
</table>
```

**XML Attributes**

Attributes define some properties of elements. They are expressed as name-value pairs:

```
<product>
  <name language="French">trompette six trous</name>
  <price currency="Euro">420.12</price>
  <address format="XLB56" language="French">
    <street>31 rue Croix-Bosset</street>
    <zip>92310</zip>
    <city>Sevres</city>
    <country>France</country>
  </address>
</product>
```

As with tags, the user may define any number of attributes. Attribute values must be enclosed within quotation marks.

**Attributes vs Elements**

* A given attribute can occur only once within a tag; its value is always a string
* On the other hand, tags defining elements/sub-elements can repeat any number of times, and their values may be string data or sub-elements
* The same data may be encoded using attributes or elements or a combination of the two

```
<person name="Alan" age="42">
  <email>agb@abc.com</email>
</person>
```

or

```
<person name="Alan">
  <age>42</age>
  <email>agb@abc.com</email>
</person>
```

**XML References**

* Use the ``id`` attribute to define a reference (similar to oids)
* Use the ``idref`` attribute (possibly within an empty element) to refer to a previously defined reference
* Use the ``idrefs`` attribute to refer to a set of references

```
<state id="s2">              -- defines an id or a reference
  <scode>NE</scode>
  <sname>Nevada</sname>
</state>
<city id="c2">
  <ccode>CCN</ccode>
  <cname>Carson City</cname>
  <state-of idref="s2"/>     -- refers to object called s2;
</city>
```

**Mixing Elements and Text**

XML allows us to mix PCDATA and sub-elements within an element.

```
<person>
  This is my best friend
  <name>Alan</name>
  <age>42</age>
  I am not sure of the following email address
  <email>agb@abc.com</email>
</person>
```

This seems unnatural from a database perspective, but from a document perspective it is quite natural!

**Order**

The semistructured data model is based on unordered collections, whereas XML is ordered. The following two pieces of semistructured data are equivalent:

```
person: {fname: "John", lname: "Smith"}
person: {lname: "Smith", fname: "John"}
```

but the following two pieces of XML data are not:

```
<person><fname>John</fname><lname>Smith</lname></person>
<person><lname>Smith</lname><fname>John</fname></person>
```

To make matters worse (-: attributes are NOT ordered in XML; the following two are equivalent:

```
<person fname="John" lname="Smith"/>
<person lname="Smith" fname="John"/>
```

**Other XML Constructs**

_Comments:_

```
<!-- this is a comment -->
```

_Processing Instruction (PI):_

```
<?target instruction?>
```

Such instructions are passed on to applications that process XML files.

_CDATA (Character Data):_ used to write escape blocks containing text that otherwise would be considered markup:

```
<![CDATA[<start>this is not an element</start>]]>
```

_Entities:_ ``&lt;`` stands for ``<``

**Well-Formed XML Documents**

An XML document is **well-formed** if

* Tags are syntactically correct
* Every start tag has a matching end tag
* Tags are properly nested
* There is a root tag
* A start tag does not have two occurrences of the same attribute

An XML document must be well-formed before it can be processed. A well-formed XML document will parse into a node-labeled tree.

#### Terminology

* Elements are nested
* Root element contains all others
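One way to see the well-formedness rules and the root/nesting terminology in action is to hand a few small documents to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample documents and the `is_well_formed` helper are made up for illustration and are not part of the original notes.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents, one per rule being exercised.
good = "<person><name>Alan</name><age>42</age></person>"
bad_nesting = "<person><name>Alan</person></name>"   # tags overlap
no_root = "<name>Alan</name><age>42</age>"           # two top-level elements
dup_attr = '<person id="a" id="b"/>'                 # attribute repeated

for doc in (good, bad_nesting, no_root, dup_attr):
    print(is_well_formed(doc), doc)

# For the well-formed document, the root element contains all others.
root = ET.fromstring(good)
print(root.tag, [child.tag for child in root])
```

Only the first document is accepted; the other three each break one of the rules (proper nesting, a single root, no repeated attribute) and are rejected outright rather than repaired.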

**XML Data Model (DOM Tree)**

* Document Object Model (DOM): the document is represented as a DOM tree
* Leaves are either empty or contain PCDATA
* Unlike the ssd tree model, where labels sit on the edges, nodes are labeled with tags.
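To make the node-labeled tree concrete, the following sketch parses a small document with Python's standard `xml.dom.minidom` and prints an indented outline of the resulting DOM tree. The sample document and the `outline` helper are assumptions chosen for this example.

```python
from xml.dom import minidom

# Hypothetical sample document: two elements with PCDATA and one empty element.
doc = minidom.parseString(
    "<person><name>Alan</name><age>42</age><married/></person>"
)

def outline(node, depth=0):
    """Return indented lines describing the DOM subtree rooted at node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        # Element nodes are labeled with their tag name.
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE:
        # Text nodes are leaves that hold PCDATA.
        lines.append("  " * depth + repr(node.data))
    return lines

tree = outline(doc.documentElement)
print("\n".join(tree))
```

In the printed outline, `name` and `age` each have a single text leaf, while `married` is an empty leaf with no children, matching the two kinds of leaves described above.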