Intro to XML and JSON #3: XML Items & Keys
03 Apr 2019
We’ve seen XML from a “30,000-foot view.”
We understand what kind of data XML can help us with.
Let’s learn how to write XML!
This gets a little detailed, so I’m breaking it into 3 posts published on the same day to give you a coffee break.
In this first of three posts on XML, we’ll cover three topics:
- The punctuation you use to indicate an “item”
- The punctuation you use to “nest” items inside each other
- The 2 punctuation options XML gives you to add “key-value pairs” to an item
(Click here if you need a refresher on what I mean by items, keys, and values.)
Posts In This Series
- Part 1 - Intro to XML and JSON
- Part 2 - Intro to XML and JSON #2: Data's Shape
- Part 3 - This Article
- Part 4 - Intro to XML and JSON #4: XML Values
- Part 5 - Intro to XML and JSON #5: XML/CSV Conversions
- Part 6 - Intro to XML and JSON #6: JSON
- Part 7 - Intro to XML and JSON #7: Recap & Real World Use
- Part 8 - Intro to XML, JSON, & YAML: the book
Viewing “Pretty” XML & JSON
We’ll have a lot of examples in this series. I recommend that you edit them and play with viewing them in a “pretty” format! For XML or JSON, paste your sample into a beautifier & click “Tree View”.
Warning: only put sample data into the “beautifier” links above. Never put your company’s confidential data into a stranger’s web site.
Items in XML: the “Element” / “Tagset”
We talked about data having “items,” “keys,” & “values” in the last post.
In XML, the “item” is called an “element.”
The punctuation that XML uses to define the beginning and end of an “element” is a “tagset.”
A “tagset” looks like this:
<Person></Person>
The format is:
- A word appears twice.
- Each occurrence of the word is surrounded by less-than and greater-than signs, forming a “tag.”
- Hence, together, they are called a “tagset.”
- The second “tag” is distinguished from the first by having a forward-slash (“/”) immediately after its less-than sign.
- The first “tag” is meant to indicate the beginning of the element.
- The second “tag” is meant to indicate the end of the element.
The fact that the tagset exists in your text file means that it exists as a conceptual “item” in your data.
Empty Elements
It doesn’t matter whether or not there’s anything typed between the tags (after the first greater-than, before the last less-than).
The “element” is still a conceptual item that “exists” in your data, simply because you bothered to create a “tagset” to represent it.
If it doesn’t have anything between the tags, you can think of it a little like a row full of nothing but commas in a CSV file.
It’s still there!
It’s just blank.
In fact, there’s even a fancy shortcut for typing “empty” tagsets.
This single tag is equivalent to a tagset with nothing in the middle like the one above.
Note the forward-slash before its greater-than sign:
<Person/>
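If you’d like to check that equivalence yourself, here’s a minimal sketch using Python’s built-in xml.etree.ElementTree module. Python and the helper name describe are my own choices for illustration; XML itself doesn’t care what tool you use to read it.

```python
import xml.etree.ElementTree as ET

def describe(element):
    """Summarize an element: its name, text, attributes, and child count."""
    return (element.tag, element.text, dict(element.attrib), len(element))

long_form = ET.fromstring("<Person></Person>")   # full tagset, nothing inside
short_form = ET.fromstring("<Person/>")          # shortcut notation

print(describe(long_form))
print(describe(short_form))
print(describe(long_form) == describe(short_form))
```

Both forms should come back as an element named “Person” with no attributes and no children, which is why the last line compares the two summaries.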
Nesting Elements
I compared our “blank” element (or “item”) to an empty CSV file row, but there’s a pretty big difference between the two: the element has a name!
Elements having names is a key aspect of XML.
The fact that elements each have a name also enables XML to have two approaches to indicating an element’s “keys+values,” whereas JSON only has one approach.
That can give XML flexibility and also make it confusing.
Worse yet is that people don’t always choose between the two approaches for consistent reasons.
- Sometimes they choose in ways that help them accurately represent their data. We’ll talk about that a lot in the text that follows.
- Other times, it seems like they simply choose “approach #1” because it takes up vertical space and avoid “approach #2” because it takes up horizontal space when you include line breaks and whitespace in XML for human readability.
  - Humans don’t typically like to scroll in both directions when reading large data files. 🤷🏿‍♂️
Here’s one W3Schools teacher’s opinion about best practice.
Knowing that everything we’re about to learn might be completely ignored by someone else, let’s dive in anyway and take a detailed look at both of XML’s approaches to defining elements’ “keys” and “values.”
“Keys” & their “Values” - Approach 1 (“more tagsets”)
The first approach is to nest an element inside an element.
Consequently, the inner element’s name isn’t just functioning as its name.
By virtue of being nested, the inner element’s name is also functioning as a “key” indicating some aspect of the outer element.
Here’s an example of “nested” elements in XML:
<Shirt>
  <Color></Color>
  <Fabric></Fabric>
</Shirt>
There are 3 conceptual “items,” or “elements,” in this data, each of which has a name.
Even though two elements are nested inside the third, all 3 can stand alone as “elements” in the grammar of XML.
- Q: Don’t believe me?
- A: You can do the same thing with English grammar! Look:
You can write an English sentence that has multiple complete sentences inside of it; to write a sentence with multiple complete sentences inside, simply separate them with semicolons.
Because the elements named “Color” and “Fabric” are nested between the tags of the element named “Shirt,” they also indicate that this particular shirt has keys named “Color” and “Fabric” (the values of both are currently blank).
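To see the nested names acting as “keys,” here’s a small sketch that reads the shirt example with Python’s built-in xml.etree.ElementTree module. The variable names are mine, and this is just one possible way to look inside an element.

```python
import xml.etree.ElementTree as ET

shirt_xml = """<Shirt>
  <Color></Color>
  <Fabric></Fabric>
</Shirt>"""

shirt = ET.fromstring(shirt_xml)

# Iterating over an element yields the elements nested directly inside it.
keys = [child.tag for child in shirt]

print(shirt.tag)
print(keys)
```

The outer element’s name is “Shirt,” and the names of the two inner elements are what a program would treat as the shirt’s keys.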
“Keys” & their “Values” - Approach 2 (“attributes”)
The second approach is to define a key as an “attribute” of an “element” (“item”).
- An “attribute” is typed inside the opening tag of a tagset, after the name of the element, separated from it or from other attributes by whitespace.
- An “attribute” consists of a key and a value, separated by an equals sign (=).
  - The key goes to the left of the = and does not need any quotes around it.
  - The value goes to the right of the = in single or double quotes.
Here’s some advice about writing XML element attributes from Microsoft.
This time using approach #2, let’s indicate again that we have a shirt with a “color” and a “fabric,” but that the values for “color” & “fabric” are blank.
<Shirt Color="" Fabric=""></Shirt>
Or, in shortcut notation, since there’s now nothing inside the “Shirt” tagset:
<Shirt Color="" Fabric=""/>
The main thing to remember about the second approach is that “Color” and “Fabric” are not standalone elements.
They are “attributes” of the element named “Shirt”.
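Here’s a hedged sketch of how that difference shows up when a program reads the data, again using Python’s built-in xml.etree.ElementTree module (my choice of tool, with variable names invented for this example).

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<Shirt Color="" Fabric=""/>')
single_quoted = ET.fromstring("<Shirt Color='' Fabric=''/>")

# Attributes arrive as a dictionary attached to the element itself.
print(double_quoted.attrib)

# Either quote style produces the same keys and values.
print(double_quoted.attrib == single_quoted.attrib)

# There are no nested elements, so searching for a child named "Color" finds nothing.
print(len(double_quoted), double_quoted.find("Color"))
```

With this tool, the keys live in the element’s attrib dictionary rather than in its list of children, so code written for approach #1 won’t find them.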
“Attributes”: no further nesting allowed
You can’t put more standalone “elements” between the quotes after the “=” of an “attribute.” You’re done. Only a plain-text value can go there. (In fact, a literal less-than sign isn’t allowed inside an attribute’s value at all.)
You can’t do this:
<Shirt Color="" Fabric="<Washable></Washable>"></Shirt>
Whereas if you were using approach #1 to nesting, you could do this:
<Shirt>
  <Color></Color>
  <Fabric>
    <Washable></Washable>
  </Fabric>
</Shirt>
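You don’t have to take my word for it. The sketch below hands both versions to Python’s built-in xml.etree.ElementTree parser. The helper is_well_formed is a name I made up for this example; it simply reports whether the parser accepted the text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

inside_attribute = '<Shirt Color="" Fabric="<Washable></Washable>"></Shirt>'
inside_element = "<Shirt><Color></Color><Fabric><Washable></Washable></Fabric></Shirt>"

print(is_well_formed(inside_attribute))
print(is_well_formed(inside_element))

# With approach #1, a program can walk down through the layers of nesting.
shirt = ET.fromstring(inside_element)
print(shirt.find("Fabric/Washable").tag)
```

The parser should refuse the first string and accept the second, and the path “Fabric/Washable” reaches the innermost element.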
“Attributes”: no duplicate keys allowed
You also can’t give any element more than one “attribute” of the same name.
You can’t do this:
<Shirt Color="" Color="" Fabric=""></Shirt>
Whereas if you were using approach #1 to nesting, you could do this:
<Shirt>
  <Color></Color>
  <Color></Color>
  <Fabric></Fabric>
</Shirt>
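As a quick illustration, here’s what happens when Python’s built-in xml.etree.ElementTree parser meets each version. The variable names are mine, and the exact wording of the error message may differ between tools.

```python
import xml.etree.ElementTree as ET

duplicate_attributes = '<Shirt Color="" Color="" Fabric=""></Shirt>'
duplicate_elements = "<Shirt><Color></Color><Color></Color><Fabric></Fabric></Shirt>"

# Two attributes with the same name on one element: the parser refuses the text.
try:
    ET.fromstring(duplicate_attributes)
    attributes_accepted = True
except ET.ParseError as error:
    attributes_accepted = False
    print("Rejected:", error)

# Two nested elements with the same name: perfectly fine, and both are kept.
shirt = ET.fromstring(duplicate_elements)
colors = shirt.findall("Color")
print(attributes_accepted, len(colors))
```

This is why repeated keys (a shirt with two colors, say) push you toward approach #1.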
Beware “true emptiness” versus “text with 0 characters”
Note that “color” and “fabric” are truly empty in approach #1, but “blank text” ("") in approach #2.
If you’re using a computer to process an existing XML file, these concepts might be interpreted differently.
One might argue that:
- there’s more of a “nothingness” in the nested-tags example (it truly doesn’t have a color), whereas:
- there’s more of a “value without any letters in it”-ness in the attributes example.
If you’re processing XML-formatted data with a tool that makes such a distinction, you may have to explicitly instruct your tool to look for either “truly blank” or “empty text” whenever you, as a business rule, believe they should be treated as equivalent to each other.
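Python’s built-in xml.etree.ElementTree module happens to be one tool that makes this distinction, so it works as a small demonstration. The function is_blank is a hypothetical business rule I wrote for this sketch, not a feature of the library.

```python
import xml.etree.ElementTree as ET

nested = ET.fromstring("<Shirt><Color></Color></Shirt>")
with_attribute = ET.fromstring('<Shirt Color=""/>')

nested_color = nested.find("Color").text       # no text at all
attribute_color = with_attribute.get("Color")  # text with 0 characters

def is_blank(value):
    """Hypothetical business rule: treat 'nothing' and 'empty text' alike."""
    return value is None or value.strip() == ""

print(repr(nested_color), repr(attribute_color))
print(nested_color == attribute_color)
print(is_blank(nested_color), is_blank(attribute_color))
```

The empty element’s text comes back as None while the attribute’s value comes back as an empty string, so a plain equality check treats them as different. A rule like is_blank is one way to tell your code they mean the same thing to you.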
Don’t mix styles confusingly
A word of warning about mixing the two approaches – this is valid XML, but hard to make sense of.
<Shirt Color="" Fabric="">
  <Color></Color>
  <Color></Color>
  <Fabric>
    <Washable></Washable>
  </Fabric>
</Shirt>
A human might look at the shirt above and think it has:
- 3 colors
- 2 fabrics
However, a computer trying to read the XML would disagree with the human.
So please don’t write XML like the example above, because it’s confusing.
If you have the misfortune of needing to read XML written in this style by someone else, it’s a good idea to get in the habit of reading it the way the computer thinks of it.
In other words, think of the shirt as having:
- 1 color attribute
- 1 fabric attribute
- 2 full-on elements nested within it, each named “Color”
- 1 full-on element nested within it named “Fabric”
That way, if you ever need to write or understand code that helps read the XML file, you’ll be able to design or follow the code’s logic more effectively.
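Here’s a short sketch of “the way the computer thinks of it,” using Python’s built-in xml.etree.ElementTree module on the mixed-style shirt. The variable names are my own; the point is only that attributes and nested elements land in two separate places.

```python
import xml.etree.ElementTree as ET

mixed_xml = """<Shirt Color="" Fabric="">
  <Color></Color>
  <Color></Color>
  <Fabric>
    <Washable></Washable>
  </Fabric>
</Shirt>"""

shirt = ET.fromstring(mixed_xml)

# Attributes typed inside the opening tag.
attribute_names = sorted(shirt.attrib)

# Elements nested directly inside the Shirt tagset.
child_names = [child.tag for child in shirt]

print(attribute_names)
print(child_names)
```

The first list holds one “Color” and one “Fabric” attribute; the second holds two “Color” elements and one “Fabric” element. Code that only looks in one of those places will miss whatever is stored in the other.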
Takeaways
Items are defined by giving them names and putting them inside punctuation called “tagsets.”
Once you’ve defined an item, it’s known as an “element.”
To define key-value pairs for a given “element,” XML lets you choose one of two approaches:
- Nest an element inside of an element. The inner element’s name serves as a “key” for its parent element, and the contents of the inner element serve as the “value” for that key.
  - The more “Russian dolls”-like or “siblings”-like your data looks, the more likely you’ll need this approach.
- Give the opening tag of an element “attributes.”
  - Concise when your data lets you use it, but no “nested nesting” or “duplicate keys” allowed with this approach.
For any given type of key (such as “color” of a shirt), it’s best to pick one of the two approaches and stick to it throughout the entire piece of XML you’re writing. That will help avoid confusing people.