CSV, XML, and JSON: A Total Beginner's Primer
06 Nov 2018
CSV, XML, and JSON. All 3 are incredibly common file formats for transferring text data between computers, particularly over the internet. But what’s the difference, and when should you use each one?
(Programmers and Salesforce business analysts: you’ll want to know this!)
XML & JSON are for lists of lists. CSVs are for tables.
That’s a painful oversimplification, but I think it’s a good start if you’re completely new to the concepts.
Keep an eye out for a separate blog post teaching you the basics of reading and writing XML and JSON files (CSV is easy).
Birthday List (XML or JSON)
Let’s say you maintain a list of your best friends to facilitate buying them birthday presents (that’s nice of you).
You will want to write down a few facts about each friend that help you remember what they like.
The “facts” that influence your birthday presents for each friend will be different, because every friend’s life is different.
For example, you might write:
- Uwe
- Bday: Nov. 8
- Kids:
- Johnny (4)
- Matilda (2)
- Collection: cat memes
- Dan
- Bday: Jan. 27
- Food:
- wine
- pickles
- Ridhi
- Bday: Sep. 16
- Collection: frogs
- Hani
- Bday: Apr. 9
- Job: nurse
- Gift: help / babysitter
- Kids: Lee (8)
Uwe and Hani have kids, but Dan and Ridhi don’t. Uwe and Ridhi collect things, but the Dan and Hani don’t. Dan’s a foodie. Hani is busy with her job and aging parents and kid and doesn’t want things; she wants free time. But … if you’re busy, you might be able to get away with a coffee mug with a cute slogan about nurses on it.
Although you have some overlap, it’s very normal for everyone on your list to have different characteristics from everyone else.
Look below at how much space it would take to store your “friends list” as a table and how many cells end up blank. It just doesn’t make sense to store your friends list as a table! You would never write it this way on paper:
Name | Bday | Collection | Kid 1 | Kid 2 | Food 1 | Food 2 | Job | Gift |
---|---|---|---|---|---|---|---|---|
Uwe | Nov. 8 | Johnny (4) | Matilda (2) | |||||
Dan | Jan. 27 | wine | pickles | |||||
Ridhi | Sep. 16 | frogs | ||||||
Hani | Apr. 9 | Lee (8) | nurse | Help / Babysitter |
(FYI, the technical term for a mostly-blank table is a “sparse” table.)
When you have data shaped like this “friends list,” XML & JSON are excellent standardized formats in which you can type it.
XML and JSON are simply standards for plain-text files where careful punctuation:
- delineates one item from the next
- labels items’ characteristics with “keys”
- indicates “values” (which may be more items) for those “keys”
XML and JSON are great standards for storing data when:
- Each item represented in the data might have “characteristics” that have nothing to do with those of other items represented in the data
- Each item represented in the data might have other items nested inside of it
They’re extremely similar to each other, although a few differences are:
- XML has 2 ways of specifying “keys and their values” to describe the characteristics of an item. JSON only has 1.
- XML calls the chunk of text representing an item an “element.” JSON calls it an “object.”
- JSON is a bit more concise to type when the “values” of items’ characteristics are not nested sub-items.
For example, Dan’s food preferences might be written:- XML:
<food>wine</food><food>pickles</food>
- JSON:
"food" : ["wine","pickles"]
- XML:
I’ll dedicate a separate blog post to teaching you how to write an XML file or a JSON file.
Use Case: computer configuration files
Believe it or not, “settings for how computer programs should run” are a lot like your list of friends and their personality traits.
All of the data is sufficiently “related” and “used for the same general purpose” that it belongs in one file.
And yet, different parts of the data might be used in different ways.
Take a look at this “configuration file” for an imaginary Salesforce database (yes, actual Salesforce XML configuration files are a lot more complex):
- Authorized Users:
- Sysadmins:
- Hani
- Dan
- Normal:
- Uwe
- Sysadmins:
- Permissions:
- Can Create Tables:
- Sysadmins: Yes
- Normal: No
- Can Delete Tables:
- Sysadmins: Yes
- Normal: No
- Can Create/Edit/Delete Data In Tables:
- Sysadmins: Yes
- Normal: Yes
- Can Create Tables:
You and a computer have a lot in common, don’t you?!
Bookkeeper’s Spreadsheet (CSV)
Now let’s say you’re a bookkeeper. It’s your job to log the date, vendor, and price of every purchase your company makes.
Because every record in your data set will have the same types of characteristics (that is, the same “keys”) as every other record, it makes a lot of sense to store this data as a table:
Purchase ID | Date | Vendor | Tax Category | Price |
---|---|---|---|---|
000001 | 01/01/2010 | Apple | Office Supplies | $60,000.00 |
000002 | 01/01/2010 | Applebee’s | Meals | $50.87 |
000003 | 01/02/2010 | Apple | Repairs | $928.20 |
000004 | 01/10/2010 | Steelcase | Office Supplies | $20,289.98 |
CSV stands for “comma separated values.” It means what it sounds like.
Although Excel and other spreadsheet editors can display “CSV” files nicely on a screen, being plain text is the key that distinguishes CSV files from native Excel files.
Files in the “csv” format are plain-text files that meet the following standards:
- Line breaks delimit one item from another.
- (As in Excel spreadsheets, whether the first line should be interpreted as a header row is up to you.)
- Commas delimit one column from another.
- Quotes can be used to help clarify which commas are “separating” columns and which commas are part of the data (as in the “price” column in our example).
The CSV format is far more concise to type than XML or JSON when:
- Each item represented in the data has roughly the same “characteristics” as every other item represented in the data
- Your data has no “nesting” of items inside each other
Furthermore, the CSV format is natively understood by Microsoft Excel and other spreadsheet-editing programs. It’s a win-win:
- Business users like it because they can manipulate it with Excel.
- Computer code likes it because it’s highly structured (making it understandable to a computer).
If your data works well as a CSV file, and you don’t have a good reason to choose another format, choose CSV.
JSON as a CSV replacement: a special case
You might have noticed: there’s trouble in CSV paradise.
What if we have data that meets these criteria, but some of the values themselves include line breaks?
- What if we also had a
'Notes'
column on our bookkeeping log? - What if someone had a legitimate reason to type a note that included a line break?
There are special codes that we can type that mean:
“computer – when you display the file on the screen, replace this code with a line break”
One common one is \r\n
.
In practice, though, it’s hard to force humans to remember to type those kinds of codes.
CSV files typically get hair-pulling messy as soon as your data includes line breaks.
Although there’s more overhead to typing “plain old table”-style data as an XML or a JSON file than there is to typing it as a CSV file, none of the punctuation used to structure XML or JSON files involves a line break, so using XML or JSON can slightly improve your confidence that your data will be correctly interpreted by whatever computer will be opening the file and displaying it on the screen for you.
Typically, it’s awkward and it’s overkill, the way same storing our “friends list” as a table was awkward.
Here’s what storing our “bookkeeping log” the way we wrote down our “friends list” data might look like – normally, you probably wouldn’t write it this way on paper:
- 000001
- Date: 01/01/2010
- Vendor: Apple
- Tax Category: Office Supplies
- Price: $60,000.00
- 000002
- Date: 01/01/2010
- Vendor: Applebee's
- Tax Category: Meals
- Price: $50.87
- 000003
- Date: 01/02/2010
- Vendor: Apple
- Tax Category: Repairs
- Price: $928.20
- 000004
- Date: 01/10/2010
- Vendor: Steelcase
- Tax Category: Office Supplies
- Price: $20,289.98
But clearly you can.
As I showed with <food>wine</food><food>pickles</food>
(XML) vs. "food" : ["wine","pickles"]
(JSON), JSON is a bit more concise than XML for table-like data where there is no “sub-item” nesting.
And simple “table” data appropriate for CSV files doesn’t have “sub-item” nesting. That’s part of what makes it a simple table.
Therefore, if you have to choose JSON or XML to represent data that could have easily been stored in the CSV format, it’s more common to choose JSON.
Use Case: one standard to rule them all
There’s another reason, besides line-break-proofing, that is actually even more common for “why you would use XML or JSON instead of a CSV file” to represent simple table-style data: sticking to a single standard.
Think of it like buying a sturdy hiking backpack, even if you only use it for carrying your lunch to work on the city bus.
Should you ever need to go hiking … you’re set!
This is why Salesforce’s REST API sends you the data for SOQL queries in the JSON format, not in the CSV format.
- If you used it to fetch data from a single table, a CSV file would come out neat and nice.
- (Indeed, Salesforce’s Data Loader tool transforms the JSON Salesforce sends it into CSV format so that you never know Salesforce didn’t send a CSV file.)
- But as soon as you do a query that involves multiple tables with parent-child relationships, you’re out of the realm of “simple table” data and into the realm of “items nested inside items.”
- (Ever noticed your data can get a bit awkward with parent-child queries in Data Loader? Data Loader does the best it can for you, but entire sets of “child” records stuffed into a single cell of a CSV file aren’t always the easiest thing in the world to read.)
Salesforce didn’t want to have to program two different “APIs”–one that dealt with CSV files and one that dealt with JSON files. So they simply said, “all the data will be transmitted as JSON files.”
Sometimes, the decision to choose a certain file format is as simple as “only wanting to buy one backpack.”
Questions? Let me know!