Learn Java Programming
10. Java and XML
The following are the objectives of this module:
- Why do we need XML
- What is XML DTD – and XML Schema
- What are elements and attributes
- Difference between SAX and DOM
- Understanding the JDOM library
What is XML
There are many definitions of XML. Here are a few of the popular ones:
- XML is a way of transmitting data in a platform-independent manner (independent of hardware [storage] and software [representation])
- XML describes data and focuses on what the data is.
- A set of recommendations published by the World Wide Web Consortium (W3C)
- XML bridges the gap between application data-types and storage mechanisms across hardware platforms and programming languages.
Whichever definition appeals to you, XML is really a data transfer and encoding mechanism that lets information pass between highly disparate technologies running in completely different operating environments.
Why is XML so popular
With the advent of the internet, more and more computers are connected to each other. As more systems interconnect, the biggest problem they face is data exchange.
Different platforms store data in different formats, and each programming language has its own representation of the basic data types.
For example, the binary layout of an int value (its size in bytes and its byte order) can differ from one hardware platform or language to another, so applications could not reliably read binary data produced on a different platform.
To solve this problem, we had to settle on one common factor across the diverse hardware and software platforms. The one format supported by all hardware platforms and programming languages was plain text.
Since all operating systems and platforms support text, it was natural to use a text-based medium for exchanging messages and data.
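One concrete way in which binary representations can differ is byte order. The following is a minimal sketch (the class and method names are made up for this illustration) that encodes the same int in two byte orders and as text. The two binary encodings differ, while the text form is the same sequence of characters everywhere.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteOrderDemo {

    // Encode an int as 4 raw bytes in the given byte order.
    static byte[] encodeBinary(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Encode the same int as decimal text in UTF-8.
    static byte[] encodeText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int value = 258; // 0x00000102
        // prints [0, 0, 1, 2]
        System.out.println(Arrays.toString(encodeBinary(value, ByteOrder.BIG_ENDIAN)));
        // prints [2, 1, 0, 0]
        System.out.println(Arrays.toString(encodeBinary(value, ByteOrder.LITTLE_ENDIAN)));
        // prints 258
        System.out.println(new String(encodeText(value), StandardCharsets.UTF_8));
    }
}
```

A receiver of the raw bytes has to know which byte order the sender used; a receiver of the text only has to know the character encoding.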
XML is a mechanism for describing data. Using XML we both carry the data and describe its structure. The receiving system can then map this data to its own implementation.
This data exchange is made possible by a common, agreed format laid down by an international body, the W3C. Almost all programming languages provide libraries that support this format.
This explains why XML became such a popular format for data exchange.
DTD – Document Type Definition
DTD stands for Document Type Definition. A DTD defines the legal elements of an XML document. An XML document that follows the syntax rules of XML is called well-formed; a well-formed document that also adheres to its DTD is called valid.
Remember that the purpose of XML is to facilitate data exchange. Let us say two systems agree to exchange data and use XML as the means to do it. To interpret the data meaningfully, the systems that are exchanging data need to define a proper structure for the XML.
The purpose of a DTD is to define the legal building blocks of an XML document. It defines the document structure with a list of legal elements. As an example, consider the following DTD:
<?xml version="1.0" encoding="UTF-8" ?>
<!ELEMENT students (student)+ >
<!ELEMENT student (name, doj, course) >
<!ELEMENT name (firstName, lastName) >
<!ELEMENT firstName (#PCDATA) >
<!ELEMENT lastName (#PCDATA)>
<!ELEMENT doj (#PCDATA)>
<!ELEMENT course (#PCDATA)>
<!ATTLIST student
yearofjoining CDATA #IMPLIED>
<!ATTLIST doj
day CDATA #REQUIRED
month CDATA #REQUIRED
year CDATA #REQUIRED>
<!ATTLIST course
semester CDATA #REQUIRED>
This DTD defines the root element as “students”. The “students” element is composed of one or more “student” elements. The + denotes one or more, so both of the following are allowed:
<students>
<student>
. . .
</student>
</students>
<students>
<student>
. . .
</student>
<student>
. . .
</student>
</students>
The “student” element is composed of the sub-elements “name”, “doj” and “course”, in the given order. The “name” element is composed of firstName and lastName. Element names are case-sensitive, so the names in the DTD must match the names in the XML exactly.
<!ELEMENT student (name, doj, course) >
<!ELEMENT name (firstName, lastName) >
The “firstName” element consists of text data:
<!ELEMENT firstName (#PCDATA) >
PCDATA denotes Parsed Character Data: text that the XML parser examines for markup, so entity references such as &lt; and &amp; are expanded. An element declared as (#PCDATA) can contain text but no child elements.
CDATA, as used in the ATTLIST declarations above, is the attribute type for plain character data: the attribute value is an ordinary string. XML also has CDATA sections, written <![CDATA[ . . . ]]>, whose content is passed through as is and is not parsed for markup.
Continuing with our example, the “lastName” element is also Parsed Character Data.
The “student” element has an attribute called “yearofjoining”. The value #IMPLIED means that this attribute is optional. A value of #REQUIRED would denote that the attribute is mandatory.
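The difference between parsed character data and a CDATA section can be observed with the XML parser that ships with the JDK. This is a small sketch, not part of the student example; the class name, method name and sample text are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharacterDataDemo {

    // Parse an XML string and return the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // In ordinary element text the parser resolves entity references.
        // prints 1 < 2
        System.out.println(rootText("<note>1 &lt; 2</note>"));
        // Inside a CDATA section the characters are taken literally.
        // prints 1 &lt; 2
        System.out.println(rootText("<note><![CDATA[1 &lt; 2]]></note>"));
    }
}
```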
An XML document that is well-formed and valid against the above DTD is given below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE students SYSTEM "student.dtd">
<students>
<student yearofjoining="2010">
<name>
<firstName>Humpty</firstName>
<lastName>Dumpty</lastName>
</name>
<doj day="31" month="12" year="2010" />
<course semester="2">Java Programming</course>
</student>
</students>
Please note the DOCTYPE declaration in the XML:
<!DOCTYPE students SYSTEM "student.dtd">
This specifies that the root element of the document is students and that the document adheres to the DTD given in student.dtd. The SYSTEM keyword introduces a system identifier, a URI that locates the DTD. A relative name such as student.dtd is resolved relative to the XML document, so the file generally sits in the same folder.
Alternatively, if the DTD is hosted on a public URL, then the DOCTYPE declaration is as given below:
<!DOCTYPE web-app PUBLIC '-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN'
'http://java.sun.com/dtd/web-app_2_3.dtd'>
The above DOCTYPE declaration specifies that the root element of this XML document is web-app. The PUBLIC keyword is followed by a public identifier that names the DTD, and then by the URL from which the DTD can be retrieved, http://java.sun.com/dtd/web-app_2_3.dtd
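Whether a document is valid against a DTD can be checked in code. The sketch below uses the parser bundled with the JDK rather than a third-party library, and it embeds a cut-down version of the student DTD as an internal subset so that the example needs no files. The class name, the shortened DTD and the sample documents are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {

    // A shortened student DTD, written inside the DOCTYPE (internal subset).
    static final String DOCTYPE =
            "<!DOCTYPE students [\n"
          + "  <!ELEMENT students (student)+>\n"
          + "  <!ELEMENT student (name, course)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT course (#PCDATA)>\n"
          + "  <!ATTLIST student yearofjoining CDATA #IMPLIED>\n"
          + "  <!ATTLIST course semester CDATA #REQUIRED>\n"
          + "]>\n";

    // Returns the validation messages; an empty list means the document is valid.
    static List<String> validate(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask the parser to check the DTD rules
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            errors.add(e.getMessage()); // not well-formed, or the parser could not be set up
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<students><student yearofjoining=\"2010\"><name>Humpty</name>"
                + "<course semester=\"2\">Java Programming</course></student></students>";
        // The required semester attribute is missing here.
        String invalid = "<students><student><name>Humpty</name>"
                + "<course>Java Programming</course></student></students>";
        System.out.println(validate(valid));   // prints []
        System.out.println(validate(invalid)); // prints the parser's error message(s)
    }
}
```

Validity errors are reported through the error() callback instead of stopping the parse, which is why the sketch collects them in a list.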
XML Schemas
XML Schemas are an alternative and now preferred way to describe the XML data. XML Schemas define a vocabulary that can be used to describe XML content. An XML Schema helps to define the relationship between the data.
XML Schema documents are themselves XML documents:
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://itseasytolearn.com/javaprogramming/schema1"
version="1.0">
</xsd:schema>
To define an element in an XML Schema, we use an element declaration. Attributes of the element are declared with attribute declarations inside the element's complex type, and a mandatory attribute is marked with use="required":
<element name="Customer">
<complexType>
<attribute name="admissionNumber" type="string" use="required" />
. . .
</complexType>
</element>
Simple and Complex Type Definitions:
XML Schema allows us to define both simple and complex types.
Simple type definitions are based on primitive data types. They apply to attribute values and to elements that have no child elements.
<simpleType name="partNumber">
<restriction base="string">
<length value="12" fixed="true" />
</restriction>
</simpleType>
Here partNumber has the data type “string”, restricted to exactly 12 characters.
Complex type definitions are composed of one or more element definitions.
<complexType name="elephant">
<sequence>
<element name="head" type="string" minOccurs="1" />
<element name="tail" type="string" minOccurs="0" />
<element name="leg" type="string" minOccurs="4" maxOccurs="unbounded" />
</sequence>
</complexType>
The above example describes the data for an elephant. An elephant has exactly one head. The tail element may or may not be present, and an elephant has a minimum of four legs.
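A document can be checked against an XML Schema with the javax.xml.validation package of the JDK. The sketch below assumes a root element named elephant whose content follows the complex type described above; the class name, the schema string and the sample documents are written for this illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaValidationDemo {

    // One head, an optional tail, and at least four legs, in that order.
    static final String SCHEMA =
            "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xsd:element name=\"elephant\">"
          + "<xsd:complexType><xsd:sequence>"
          + "<xsd:element name=\"head\" type=\"xsd:string\"/>"
          + "<xsd:element name=\"tail\" type=\"xsd:string\" minOccurs=\"0\"/>"
          + "<xsd:element name=\"leg\" type=\"xsd:string\" minOccurs=\"4\" maxOccurs=\"unbounded\"/>"
          + "</xsd:sequence></xsd:complexType>"
          + "</xsd:element>"
          + "</xsd:schema>";

    // Returns true when the XML is well-formed and satisfies the schema.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(SCHEMA)));
            Validator validator = schema.newValidator();
            try {
                validator.validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException e) {
                return false; // the document violates the schema or is not well-formed
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Could not load the schema or read the input", e);
        }
    }

    public static void main(String[] args) {
        String fourLegs = "<elephant><head>h</head>"
                + "<leg>1</leg><leg>2</leg><leg>3</leg><leg>4</leg></elephant>";
        String threeLegs = "<elephant><head>h</head>"
                + "<leg>1</leg><leg>2</leg><leg>3</leg></elephant>";
        System.out.println(isValid(fourLegs));  // prints true
        System.out.println(isValid(threeLegs)); // prints false
    }
}
```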
Working With XML
When working with XML in our programs, we generally have to either produce XML or parse XML data.
Since producing and parsing XML is generally boiler-plate code, we shall make use of open-source libraries for both tasks. It is not worthwhile to write our own parser. We will not “re-invent the wheel” here, but make use of some great open-source libraries that have been developed by the Java community.
Basically there are two different styles of XML processing:
- SAX
- DOM
SAX – Simple API for XML
SAX-based XML parsers are event-based parsers. They read the XML data sequentially and recognise its markup. For example “<” denotes the beginning of a tag and “>” denotes the end of a tag.
As a SAX parser scans the input stream, it reports each start tag, end tag and run of text to the application through callback methods, and the application builds whatever objects it needs from these events. Because the parser does not keep the whole document in memory, SAX is the right choice in a limited-memory environment, such as programming on mobile devices.
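The event-based style can be sketched with the SAX parser that is included in the JDK. The handler below simply records the events it receives; the class name, method name and sample XML are assumptions for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {

    // Parse the XML and return a log of the events the parser reported.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                log.add("start " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<name><firstName>Humpty</firstName><lastName>Dumpty</lastName></name>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

Nothing is stored unless the handler stores it, which is what keeps the memory footprint small.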
DOM – Document Object Model
The other popular style of XML processing is the Document Object Model. In DOM, the whole XML document is held in memory as a tree structure. Each element is a node; nodes are connected to their parents and children, and the branches end in leaf nodes.
Many XML libraries combine both approaches: they read the data with a SAX parser and build a tree representation from the events.
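To make the tree structure concrete, here is a small sketch that uses the W3C DOM API bundled with the JDK (not JDOM) to parse a document and walk the tree from the root down. The class name, method names and sample XML are assumptions for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeDemo {

    // Walk the tree depth-first and record each element's path from the root.
    static void collect(Element element, String parentPath, List<String> out) {
        String path = parentPath + "/" + element.getTagName();
        out.add(path);
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, path, out);
            }
        }
    }

    // Parse the XML into an in-memory tree and list the paths of all elements.
    static List<String> elementPaths(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<students><student><name><firstName>Humpty</firstName></name></student></students>";
        // prints [/students, /students/student, /students/student/name, /students/student/name/firstName]
        System.out.println(elementPaths(xml));
    }
}
```

Because the whole tree is in memory, the program can visit nodes in any order and as often as it likes, at the cost of holding the entire document.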
Let us look at one popular open-source XML library called JDOM.
To use JDOM we need to download the JAR file from http://jdom.org/
Once we have downloaded the binary JAR file, we need to add it to the program classpath. Then we are set to use JDOM.
Constructing XML data using JDOM
JDOM provides a number of convenient classes to construct XML data. Let us say we wanted to construct the following XML snippet in our program.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE students SYSTEM "student.dtd">
Here is the JDOM code that will achieve it using Element and Document objects.
We first define the root element and then create a Document using the root element. The DocType is set to describe “students” as given in the local DTD “student.dtd”
Element root = new Element("students");
Document doc = new Document(root);
DocType docType = new DocType("students", "student.dtd");
doc.setDocType(docType);
To add content to the XML, we use the methods on the Element object, basically the setText() and the setAttribute() methods. Since the firstName and lastName elements belong to the name element
<name>
<firstName> ... </firstName>
<lastName> ... </lastName>
</name>
we use the addContent() method to compose the name element as consisting of firstName and lastName elements.
Element student = new Element("student");
student.setAttribute("yearofjoining", "2010");
Element name = new Element("name");
student.addContent(name);
Element firstName = new Element("firstName");
firstName.setText("Tom");
name.addContent(firstName);
Element lastName = new Element("lastName");
lastName.setText("Cruise");
name.addContent(lastName);
root.addContent(student);
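Element doj = new Element("doj");
doj.setAttribute("day", "31");
doj.setAttribute("month", "12");
doj.setAttribute("year", "2010");
// Attach doj to the student; without this call it never appears in the document.
student.addContent(doj);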
Writing the XML to a file
The JDOM library has a number of methods for formatting the XML data into a human-readable format. In the example below, we construct a Document “doc” and write it to an XML file called “xmldata.xml”:
. . .
Element root = new Element("students");
Document doc = new Document(root);
DocType docType = new DocType("students", "student.dtd");
doc.setDocType(docType);
. . .
Format format = Format.getPrettyFormat();
format.setIndent(" ");
format.setLineSeparator(System.getProperty("line.separator"));
XMLOutputter outputter = new XMLOutputter(format);
System.out.println(outputter.outputString(doc));
// try-with-resources closes (and flushes) the writer, so the data actually reaches the file.
try (FileWriter fw = new FileWriter("xmldata.xml")) {
outputter.output(doc, fw);
}
Parsing XML
JDOM provides a lot of convenient APIs to parse the XML data. Assuming the XML data is present in the String variable called “xmlString” here is how we parse the data:
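SAXBuilder saxBuilder = new SAXBuilder();
// Do not validate against the DTD; only well-formedness is checked.
saxBuilder.setValidation(false);
// build() throws JDOMException or IOException, which the caller must handle.
Document document = saxBuilder.build(new StringReader(xmlString));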
Element root = document.getRootElement();
List childrenList = root.getChildren();
Note that we have disabled validation on the SAXBuilder to be lenient: the document only needs to be well-formed. You could choose to be stricter and set validation to true, in which case the document is also checked against its DTD.
The SAXBuilder will parse the xmlString and construct a JDOM “Document” object, a tree representation of the XML. From this object, we call getRootElement() to get the root element. From the root element, we can get the child elements using the getChildren() method.
Accessing the parsed data in the XML
Once the data is parsed into the tree representation, we need to know how the data is structured in order to access the individual elements, their order, and their attributes and values.
This is where knowledge of the DTD becomes very important. With access to the DTD, the parsing program knows which elements and attributes to fetch through the JDOM library.
Iterator iter = childrenList.iterator();
while (iter.hasNext()) {
Element studentElement = (Element) iter.next();
String yearOfJoining =
studentElement.getAttributeValue("yearofjoining");
System.out.println(yearOfJoining);
Element name = studentElement.getChild("name");
Element firstNameElement = name.getChild("firstName");
Element lastNameElement = name.getChild("lastName");
String firstName = firstNameElement.getText();
String lastName = lastNameElement.getText();
}
XML usage in the real world
In the real world, whenever data needs to be exchanged between a server and a client, the server generally publishes a DTD or an XML Schema in a known location.
Clients who need to consume the service can send a request in the manner expected by the server, as specified by the DTD or the XML Schema document.
This removes the barrier: a client written in any programming language (PHP, Perl, Ruby, .NET or Java), running on any hardware (PowerPC, Intel or AMD based) and any operating system (Linux, Solaris, Mac OS or Windows) can talk to the server.
In this way, systems on disparate hardware and operating systems can communicate with each other.
HTTP, the HyperText Transfer Protocol, is supported by all of these platforms and programming languages. If we use this ubiquitous protocol to transfer XML documents, then we have a true web service.
In fact, XML over HTTP is the essence of a web service.
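As a rough sketch of XML over HTTP, the example below starts a throwaway server on a free local port, posts an XML document to it, and reads back an XML reply. It assumes Java 11 or later and uses only classes that ship with the JDK. The /students path, the reply format and the class name are invented for this illustration; a real service would define its messages in a published DTD or XML Schema.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class XmlOverHttpDemo {

    // Starts a local server, POSTs the XML to it, and returns the XML reply.
    static String roundTrip(String requestXml) {
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress(0), 0); // port 0 = any free port
            server.createContext("/students", exchange -> {
                // Server side: read the XML request body and answer with a small XML document.
                String received = new String(
                        exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                byte[] reply = ("<received chars=\"" + received.length() + "\"/>")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/xml; charset=UTF-8");
                exchange.sendResponseHeaders(200, reply.length);
                exchange.getResponseBody().write(reply);
                exchange.close();
            });
            server.start();

            // Client side: send the XML as the body of an HTTP POST.
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/students");
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .header("Content-Type", "application/xml; charset=UTF-8")
                    .POST(HttpRequest.BodyPublishers.ofString(requestXml))
                    .build();
            return HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString())
                    .body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        // prints <received chars="11"/>
        System.out.println(roundTrip("<students/>"));
    }
}
```

The client here happens to be Java, but any program that can send an HTTP POST with an XML body could take its place.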
Summary:
XML is a widely adopted standard for information interchange. This quick tour of XML helped us to understand the following:
- Why we need XML
- XML DTD and XML Schema
- Differences between SAX and DOM parsers
- Using the JDOM library to construct and parse XML data