ETL QA is interpersonal communication
23 Aug 2021
Table of Contents
I’ve spent a week writing e-mails and calling people after enabling a new primary-database-to-secondary-database feed.
I think one of the hardest parts about troubleshooting whether unexpected issues are “garbage in, garbage out” data-quality issues or whether they’re “I wrote bad code” issues is getting clarity about the input data quality from people who do the data entry without making them feel like I’m accusing them of failing to do their jobs.
I just wrote a script that queries our main enterprise resource planning (“ERP”) database for highly sensitive information and loads it into a 3rd-party cloud-hosted database every morning for the convenience of the “Department A” staff who use the 3rd-party database.
One of the things I put into my script was a filter that kicks off the process by requiring the 3rd-party database to send my code a list of the IDs of the data it expects to receive more details about. I then filter my “sensitive details” queries by that list.
At the moment, our ERP has sensitive data about 61 people. Although the sensitive data entry is done by Department B, it’s all done on behalf of Department A, so I’d expect it to be pretty much a 100% overlap with the “people important to Department A” dataset that my extract, transform, load (“ETL”) process uses as a filter.
And yet, only about 30 of the people who’ve had sensitive information attached to them in our ERP are actually part of the “list of expected IDs” coming from the 3rd-party software.
So I wrote an e-mail:
Hi all,
I just compared the 30 ERP IDs who were in ERP-table-top-secret but didn’t data-load against ERP-table-XYZ _(the ERP table into which we redundantly do “expected IDs important to Department A” data entry) this morning.
Only 3 of those 30 IDs showed up in ERP-table-XYZ, which matches Jane’s assertion that 3 more people were recently added to lists of expected IDs in the last few days.
So … I guess we just have a fair number of people (30) whom Department B has done sensitive data entry on behalf of Department A for, but … who are just not on Department A’s list of important people yet?
John & Mark, does that sound accurate to you? If so, I won’t worry about the quality of my data feed and will feel that I can close this ticket.
ETL quality assurance often feels like a tightrope:
- Don’t get technical enough and specific enough with your “is this really right?” questions, and people will just nod along like, “you’re the expert, thanks for helping us” and not even realize you’re asking them a question, rather than musing aloud rhetorically.
- Ask a question in a way that gets an answer you need, but then have to backpedal and reassure people that they didn’t do anything wrong in their jobs. (Because a lot of the time, if you’re good at requirements analysis and coding, final QA does end up being a “garbage in, garbage out” issue.)
It often mystifies people who don’t do professional ETL, or don’t do professional data entry & reporting, that I don’t actually need their data to be “good” to get my code “good.”
I can trust that “good” will arrive later as long as the expectations vs. reality numbers for “bad” match each other.
Haven’t done 90% the work you’ve been meaning to do this week? Not judging; couldn’t care less! Except for being really pleased that “90%” is the number you gave me, since “90%” is the amount of expected data I’m not seeing in my feed, which means I wrote my feed correctly.
But if that’s not a concept you work in day-to-day like I do? That concept sounds like nonsense.
It sounds like I’m passively-aggressively nagging you about your backlog of work.
Which I always feel terrible about.
Takeaways
- Be ready for a lot of tough communication across different areas of expertise (yay, practicing soft skills!). Imagine yourself in their shoes (a big reason people who learned tech mid-career are awesome – they literally can relate to more than one “blind men and the elephant” position at a time) and communicate appropriately.
- Be ready to write a lot of “proofreading queries” on the fly that help you be able to send people numbers they can relate to like “30 out of 61” and maybe even a supporting list of IDs behind those numbers, even if those queries have nothing at all to do with the queries in your ETL automation. Recognizeable numbers (“31! That’s exactly how many people I haven’t done data entry on yet!”) and data-points they can look up individually “(I looked up the first the IDs in your list, and now I see what’s going on – those are people we’re all waiting on to sign a form)” can give business users a lot of “aha!” moments when they’d otherwise feel like you’re speaking Greek to them. This morning’s query literally looked like this:
select count(distinct id_key)
from table_xyz
where id_key in ('id_1', 'id_2', 'id_3', '...') -- the IDs from ERP-table-top-secret that didn't load
I got that list itself from a flat-file out of the data-load, so there was a good bit of Python & Pandas to generate the ('id_1', 'id_2', 'id_3', '...')
string, too, since it’s way faster to process data in plaintext files with Python than with SQL.
In other words, follow the “don’t make me think” principle. YOU, the data expert, are the person who have the power to slice & dice data, not them … so when you ask your question, pre-query some supporting details to go with your question.
For a real-world analogy, don’t ask your roommate, “my water bottle is missing, can you help me find it?” Ask: “my water bottle is missing and I last saw it at home after we got back from the park 3 days ago, and we haven’t gone anywhere since so it’s probably in the house or the car; can you help me find it?”