Cleaning bad Pardot data with Python
15 Feb 2019
Table of Contents
- Background
- Password Precautions
- Code Explanation
- Modules needed
- Logging Python into Pardot
- Establish the filepath you want to download your Pardot prospects to
- Download your Pardot prospects as a plaintext file full of JSON-formatted text
- Comment out that last block of code
- Read the file into memory and “whittle it down”
- Upload your changes
- Verify that the changes actually went through
- Why didn’t we use Pandas?
- Full code
- Further Resources
This is an intermediate-level “Python for Salesforce” post – but still “admineloper”-level (e.g. you can code a basic trigger comfortably). Today we’ll put Python to a somewhat different use than the CSV/XLSX-modifying we’ve been doing!
Background
Pardot doesn’t have a “Data Loader” tool like Salesforce does, nor am I aware of third-party graphical user interfaces to selectively mass-update its data like DemandTools’s MassImpact module can do to Salesforce data.
Nevertheless, I found myself needing to clear out a value from a field on thousands of Pardot records. Pardot has an API for loading and querying, so Python to the rescue – I built my own single-purpose “Data Loader” script!
(Click here for more about what an API is – Pardot’s falls under “category #2,” “general-purpose API.”)
Thanks to sloppy trigger programming on my part, I had introduced the word "null" into the MailingStreet field of about 5,000 Salesforce Contact records that should have simply had a blank value in that field. 😳
The Pardot connector helpfully “synced” these "null" values to the corresponding Prospects’ MailingStreet records.
It’s short work, particularly with DemandTools, to query any Salesforce Contacts WHERE MailingStreet="null" and update them to an actual NULL (rather than the word "null").
However, even with one-way sync turned on from Salesforce to Pardot, Pardot has a nasty habit of trying to “help” populate fields on Salesforce contacts when they’re blank.
So we get stuck – we literally can’t set any Pardot-synced-field value to NULL in Salesforce, because within minutes, Pardot will “help” by putting it back.
I needed code I could run against Pardot while DemandTools was cleaning up Salesforce, so that the two databases would get cleaned at the same time and stop trying to “fix” each other’s NULL field values.
(And that I could easily re-run a few more times over the next hour, cleaning up any “Oh, hi! I’m Pardot Sync and I’m here to help!” stragglers from Salesforce and Pardot cleanup not running perfectly simultaneously.)
Here’s my code.
Password Precautions
To get this code to work, you have to type your Pardot password and “user key” into your code. That’s not a good thing.
- Only run this code on a computer you trust.
- Only run this code over a network connection you trust.
- If you save this Python script in a file, type “CONTAINS PASSWORD” at the beginning of the filename so you’ll remember it’s a file in which you have to clear out passwords before you close it.
- If you’re working in an IDE that treats subsequent “run” button clicks like one big session (and caches the values of variables from “run” to “run”), leverage that to help you remember to clear out your password from your file. Run a few lines of code, with the rest commented out, to fetch an “API key” and save it into a variable called api_key.
Then comment out the “fetching” code, backspace over your password, and type the word “PASSWORDGOESHERE,” and save your script.
Do the same with your e-mail address and user key.
Ta da! Now you can stop worrying about whether you left a password in plain text in the contents of your script.
- You’re going to have to get expert advice if you need to store this script and run it in any sort of automated fashion. Typing your password into the body of the script itself and leaving it there is simply not an option.
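(By the way, if you’d rather never type the password into the file at all, Python’s built-in getpass module is one alternative – this is just a sketch of the idea, not part of the script below:)
import getpass
password = getpass.getpass('Pardot password: ')  # prompts at the console; the password never touches the script file
You could then use the password variable in place of 'PUTPASSWORDHERE' in the login code below.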
Code Explanation
Modules needed
First things first … let’s add some code to our program to “import” some modules (extensions to the Python language) that will make our job easier.
import requests
import json
import urllib.parse
import datetime
Logging Python into Pardot
Here’s the code; I learned the necessary URLs to build from Pardot’s API documentation (click here for more about what an API is – Pardot’s falls under “category #2,” “general-purpose API”).
ukey = 'PUTUSERKEYHERE'
api_key = requests.post(
'https://pi.pardot.com/api/login/version/3'
, params='email='+urllib.parse.quote('PUTEMAILADDRESSHERE')
+'&password='+'PUTPASSWORDHERE'
+'&user_key='+ukey
+'&format=json'
).json().get('api_key')
#print(api_key)
This api_key = ... is the code I was saying you should leave as “PUTEMAILADDRESSHERE,” “PUTPASSWORDHERE,” & “PUTUSERKEYHERE,” and comment out (put a # at the beginning of it) when not in use, so that you don’t accidentally leave your credentials in your code.
You can un-comment print(api_key) for a moment and run that line of code if you want to verify that logging in went well.
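(If api_key comes back as None, one debugging option – a sketch, not part of the original walkthrough, and login_resp is a name I made up – is to capture the whole login response first so you can read whatever error message Pardot sends back:)
login_resp = requests.post(
    'https://pi.pardot.com/api/login/version/3'
    , params='email='+urllib.parse.quote('PUTEMAILADDRESSHERE')
    +'&password='+'PUTPASSWORDHERE'
    +'&user_key='+ukey
    +'&format=json'
).json()
print(login_resp)  # on a failed login, the response typically carries an 'err' message instead of an 'api_key'
api_key = login_resp.get('api_key')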
Appropriate users
If you normally use Salesforce credentials to log into Pardot, you’ll have to add a new Pardot user with a username that doesn’t exist in your Salesforce org and use that user’s credentials to log into Pardot. “Salesforce-connected” Pardot usernames can’t log into Pardot with code.
Click here for instructions on setting up such a user in Pardot.
Establish the filepath you want to download your Pardot prospects to
Here, we’ll save our output to c:\example\pardotprospectsYYYYMMDDHHMMSS.txt (with “YY…SS” being replaced by the actual timestamp when we start our code).
pqueryresultsfilepath = 'c:\\example\\pardotprospects'+datetime.datetime.now().strftime('%Y%m%d%H%M%S')+'.txt'
Comment out that last line of code if running in chunks
If you’re planning to run your code in chunks (commenting out parts you aren’t currently running), rather than as one long script, you’ll want to put a # before the word pqueryresultsfilepath in the previous line of code as soon as you’ve run it once.
Why?
Well … because you’re going to use that filename twice, in separate chunks of code.
If you run those chunks separately, you don’t want to try to write to a file with “0738” in its filename at 7:38AM, then read from a file with “0741” in its filename at 7:41AM. 😆
Alternatively, you could:
- Use today’s date without a time ('%Y%m%d' instead of '%Y%m%d%H%M%S') – unless, of course, you’re working around midnight.
- Leave out the timestamp from the filename altogether and simply come up with a file path that you know isn’t already on your computer (to avoid overwriting something you cared about).
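(If you go the no-timestamp route, a quick existence check – a sketch with a hypothetical filename, not part of the script below – can guard against clobbering a file you cared about:)
import os

pqueryresultsfilepath = 'c:\\example\\pardotprospects.txt'  # hypothetical fixed filename
if os.path.exists(pqueryresultsfilepath):
    raise FileExistsError(pqueryresultsfilepath + ' already exists; pick another filename')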
Download your Pardot prospects as a plaintext file full of JSON-formatted text
Presuming you have a lot of Pardot prospects, this is going to take a while.
Grab some water and stretch while you wait. (You could just sit and watch the progress messages scroll by, but stretching and water are probably better for you.)
Because this block of code takes so long to run, I recommend you “comment it out” so that it doesn’t execute when not in use.
You can do that by turning it into one big “piece of plaintext” by putting triple quotes on their own lines above and below it as follows:
'''
Code I don't want to execute
More code I don't want to execute
'''
When you use the “query” function of Pardot’s API, you have a little bit of “WHERE”-type filtering power on certain fields.
Unfortunately, it’s not a lot. For example, address_one (the back-end name Pardot gives MailingStreet through its API) isn’t a field on which you can filter a query.
So I’ll have to download all 100,000 Pardot Prospect records.
Bummer.
That’s a lot of data, and takes about 10-30 minutes (depending on whether I’m limiting the columns I download about those prospects or not), so I write the data to my hard drive (hence choosing a good filename above).
In this case, I know I just want the id and address_one columns from any Prospect record I download.
Interestingly, if address_one is blank for a Prospect, Pardot won’t include address_one at all in that prospect’s portion of the data dump it sends in response to my query.
But Pardot will send me every prospect, because they all, of course, have an id.
Once it’s all downloaded, the .txt file on my disk will look like this:
[{"id": 01010101, "address_one": "123 Sunny St"}, {"id": 04040404, "address_one": "null"}, {... and so on and so forth, up to 200 of these per line ...}]
[{"id": 05050505}, {"id": 07070707, "address_one": "null"}, {... and so on and so forth, up to 200 of these per line ...}]
[{"id": 08080808, "address_one": "456 Cloudy"}, {"id": 02020202}, {... and so on and so forth, up to 200 of these per line ...}]
[...and so on and so forth, until no data left...]
In the code I’m writing about today, I know exactly which fields I want: address_one and id.
Other times, I’ve wanted to do a full “tell me everything you know about every Prospect” dump from Pardot – in that case, when building the contents of “pquerydata,” I’ve left out the piece of code that said +'&fields='+','.join(fieldsdesired) and have left out the line of code where I set aside words into a Python list called fieldsdesired.
Even when downloading “everything,” though, I haven’t always cared about certain overwhelmingly detailed fields that come back with an “all columns” data dump.
In my opinion, they get in the way of visually exploring the .txt file I just wrote to disk when I haven’t actually decided what code I need to write next, and am inspecting my data to figure it out.
Typically, when I’m doing such a dump, instead of defining a list called fieldsdesired, I’ve defined one called uselessfields.
Right between the pquerydata = line and the pquerycounter = line is a nice place, if you ask me.
For example:
uselessfields = ['assigned_to','visitor_activities','last_activity','lists','visitors','profile','campaign']
I’m not able to tell Pardot that I don’t care about the fields I find “useless,” but I am able to avoid writing any data stored under those keywords to disk.
In the code I’m about to show you, there are two lines of code that start with pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict ...
I simply stick an extra space and the phrase if k not in uselessfields right after subdict.items() and before the closing curly-brace } that follows it.
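(Concretely – assuming uselessfields has been defined as above – the modified comprehension would look like this:)
# Copy only the fields we haven't flagged as useless before writing to disk
pquerycurrentresultslist = [{k:v for k,v in subdict.items() if k not in uselessfields} for subdict in pqueryinitresp['result']['prospect']]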
Anyway, for today, as mentioned, I explicitly told Pardot which fields I wanted it to include in the data dump.
Without further ado, here’s the code to create my .txt file:
fieldsdesired = ['id','address_one']
prospectqueryURL = 'https://pi.pardot.com/api/prospect/version/3/do/query'
pquerydata = 'user_key='+urllib.parse.quote(ukey)+'&api_key='+api_key+'&fields='+','.join(fieldsdesired)+'&format=json'
pquerycounter = 0
pqueryresultcount = 0
pquerycurrentresultslist = []
pqueryinitresp = requests.post(prospectqueryURL, params=pquerydata).json()
if pqueryinitresp['@attributes']['stat'] == 'ok':
    pqueryresultcount = pqueryinitresp['result']['total_results']
    print('Expecting to dump ' + str(pqueryresultcount) + ' prospects to hard drive.')
    pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict in pqueryinitresp['result']['prospect']]
    with open(pqueryresultsfilepath, 'w') as f:
        f.write(json.dumps(pquerycurrentresultslist)+'\n')
    pquerycounter = pquerycounter + len(pqueryinitresp['result']['prospect'])
    print('Should have now dumped ' + str(pquerycounter) + ' prospects to hard drive.')
else:
    print('Could not even get an initial response out of Pardot about attempted prospect dump.')
with open(pqueryresultsfilepath, 'a') as f:
    while pquerycounter < pqueryresultcount:
        pqueryextraresp = requests.post(prospectqueryURL, params=pquerydata+'&offset='+str(pquerycounter)).json()
        if pqueryextraresp['@attributes']['stat'] == 'ok':
            pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict in pqueryextraresp['result']['prospect']]
            f.write(json.dumps(pquerycurrentresultslist)+'\n')
            pquerycounter = pquerycounter + len(pqueryextraresp['result']['prospect'])
            print('Should have now dumped ' + str(pquerycounter) + ' prospects to hard drive.')
        else:
            print('Trying again for offset ' + str(pquerycounter))
print('Done dumping prospects to hard drive.')
There’s a lot of code there, and today I’m not going to go into all the details of how each bit works.
The gist of what it does is as follows:
- It talks to a computer waiting for requests at https://pi.pardot.com/api/prospect/version/3/do/query and says, “Hey, ukey would like some data – here’s api_key as proof it’s really ukey!”
- It stores the response from that remote computer into pqueryinitresp.
- It examines the contents of pqueryinitresp to see if they’re an “Okay, here you go!” (pqueryinitresp['@attributes']['stat'] == 'ok') with a bunch of data.
- If so, it checks to see if that’s all the data or if there was a note saying, “That was just part of it – I have more” (while pquerycounter < pqueryresultcount).
- If there’s more, it knocks back on the door at https://pi.pardot.com/api/prospect/version/3/do/query and keeps fetching packages until there are no more “but wait, there’s more!” notes.
- It displays a note to the person running the code saying, “All done! You have (#) prospects in Pardot. They’re all on your hard drive now. Have fun.”
Comment out that last block of code
Did you see my note earlier about this taking 10-30 minutes for a large dataset?
You’ve got what you need on your hard drive now.
Comment out that last block of code so you don’t unnecessarily re-download your data from Pardot.
Read the file into memory and “whittle it down”
with open(pqueryresultsfilepath, 'r') as f:
    pqueryjson = [item for sublist in f.readlines() for item in json.loads(sublist)]
print('There are ' + str(len(pqueryjson)) + ' prospect records in the file.\n' + str(pqueryresultcount) + ' were expected.')
for item in pqueryjson:
    topop = []
    for k,v in item.items():
        if k not in ['id','address_one']:
            topop.append(k)
        elif k == 'address_one':
            if v == 'null':
                item[k] = ''
            else:
                topop.append(k)
    if len(topop) > 0:
        [item.pop(popme) for popme in topop]
pqueryjson = [item for item in pqueryjson if len([x for x in item.keys() if x=='address_one']) > 0] # Strip out any prospects that NOW don't have an "address_one" value
print('There are ' + str(len(pqueryjson)) + ' prospect records to fix')
print(pqueryjson[0]) # Examine the first record
print([list(x) for x in set(tuple(x) for x in [sorted(x.keys()) for x in pqueryjson])]) # What "field sets" are we looking at in the records to fix?
print(len(pqueryjson)/50) # How many API POST calls will we be doing, if 50 at a time?
Here’s what the code above does:
- Populate a list-typed variable named “pqueryjson” with data from the .txt file we saved to our hard drive, 1 list item per Pardot prospect (with open... through ...were expected.')).
Printed out as text, it would look a lot like the .txt file, only one big […] list instead of max-200-length lists on separate lines.
If it were printed out as “JSON”-formatted text, it would look more like: [{"id": 01010101, "address_one": "123 Sunny St"}, ... {"id": 02020202}, ... thousands of Pardot prospects, one {} per prospect]
- Loop over every item in the list “pqueryjson” (for item in pqueryjson).
(Note that the word “item” is nothing special – I could have called it “x” or “lovely_customer.”)
- For any given item in the list that I’m looping over:
  - Set up an empty list-typed variable named “topop.” We’ll store “fields we don’t care about” here.
  - Loop over each fieldname-and-its-value subcomponent of “item.”
  If it isn’t the “id” field, and it isn’t the “address_one” field, add the field name to the “fields we don’t care about” list “topop.”
  Obviously, that’s a bit redundant, since we already told Pardot not to give us any fields but those two, but I like leaving it in for easier copy-paste to scripts against files where I downloaded all columns.
  If you left it out, you’d delete the two lines starting with if k not in... and change the elif to an if in elif k=='address_one'.
  Even if the field is address_one, if its value isn’t the word "null", we don’t really care about it for this particular Pardot prospect. Therefore, for this prospect only, we’ll add the field address_one to “topop.”
  On the other hand, if it is a "null" address_one, we’ll alter the value of address_one for the current prospect to be blank text ('').
  - After working all the way through this “inner loop” over field-value pairs for a given prospect, check if we actually put any field names into “topop” for that prospect.
  If so, eliminate all such field-value pairs from the contents of the particular “Pardot prospect” record currently stored in “item” – they’re just not that important. ([item.pop(popme) for popme in topop])
- Now that we’ve looped over every item (every Pardot prospect) in “pqueryjson,” we’ll replace the contents of “pqueryjson” with a smaller list of only the Pardot prospects that still have an address_one field in them. (pqueryjson = [item ... > 0])
- We have a couple of “print(...)” statements to give us some stats about the “new & improved” contents of “pqueryjson.”
Whereas before, pqueryjson might have looked like:
[{"id": 01010101, "address_one": "123 Sunny St"}, {"id": 04040404, "address_one": "null"}, {"id": 05050505}, {"id": 07070707, "address_one": "null"},{"id": 08080808, "address_one": "456 Cloudy"}, {"id": 02020202}]
After running this code, pqueryjson looks like:
[{"id": 04040404, "address_one": ""}, {"id": 07070707, "address_one": ""}]
- We altogether got rid of every prospect except “#04040404” & “#07070707” because these two were the only prospects that had anything wrong with their data.
- We whittled down “prospect #04040404” & “prospect #07070707” to “just the fields that had problems, and with their ‘new & improved’ values.”
As with other long-running chunks of code, you might want to surround this entire section of code with '''…''' when it’s not in use.
That said, it’s not particularly long-running. This “file analysis” piece of code only took my computer a second or two to chew through 100,000 two-column records in my .txt file.
Upload your changes
print('Beginning update of ' + str(len(pqueryjson)) + ' prospect records, 50 at a time')
for chunk in [pqueryjson[i:i + 50] for i in range(0, len(pqueryjson), 50)]:
    ptoupdatejson = json.dumps({'prospects':{str(item['id']):{k:v for k,v in item.items() if k != 'id'} for item in chunk}})
    pbatchupdurl = 'https://pi.pardot.com/api/prospect/version/3/do/batchUpdate'
    pbatchupddata = 'user_key='+ukey+'&api_key='+api_key+'&format=json'+'&prospects='+ptoupdatejson
    pbatchupdresult = requests.post(pbatchupdurl, params=pbatchupddata).json()
    print(pbatchupdresult)
print('Done updating prospect records (there were ' + str(len(pqueryjson)) + ' to do)')
Note: As with other long-running chunks of code (just assume all code that includes the phrase “requests.…” is long-running), you might want to surround this entire section of code with '''…''' when it’s not in use.
In this code, we use the “prospect” function of Pardot’s API, which allows you to send it data to write to Pardot prospects through its batchUpdate call.
There’s a limit to batchUpdate of 50 prospects at a time, so we’ll have to break up “pqueryjson” into chunks of 50 (or fewer, perhaps, in the final chunk).
We’ll loop over each chunk, invoking the Pardot API once per “chunk.”
The key to breaking up pqueryjson into chunks of 50 starts with range(0, len(pqueryjson), 50), a pretty cool little piece of code.
Let’s say we had 374 prospects in need of fixing.
range(0, len(pqueryjson), 50) essentially produces the following list:
[0, 50, 100, 150, 200, 250, 300, 350]
The other secret of our success is this magic formula, which produces a “list of sublists” from SOME_BIG_LIST:
[SOME_BIG_LIST[i:i + 50] for i in LIST_OF_MULTIPLES_OF_50]
The “for chunk in ... :” bit just makes sure that we loop over those sublists. So with this…
for chunk in [pqueryjson[i:i + 50] for i in range(0, len(pqueryjson), 50)]:
…we’re essentially saying that we want to loop over “sub-lists” of the list that is stored in “pqueryjson.”
- First over “items 0-49” (which is what “pqueryjson[0:50]” actually produces)
- Then over “items 50-99” (which is what “pqueryjson[50:100]” actually produces)
- Etc.
Technically, our last slice will try to grab “items 350-399” even though we only have 374 prospects in “pqueryjson.”
Luckily, the SOME_LIST[FROM_HERE:TO_HERE] syntax in Python is smart enough to ignore the excess if TO_HERE is too big a number for the list size. Python just produces a “sub-list” that goes “to the end of the list.”
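(Here’s a tiny self-contained demo of that slicing behavior you can paste into a Python console – some_big_list is a made-up stand-in for pqueryjson:)
some_big_list = list(range(374))  # stand-in for a 374-record pqueryjson
chunks = [some_big_list[i:i + 50] for i in range(0, len(some_big_list), 50)]
print(len(chunks))      # 8 chunks
print(len(chunks[-1]))  # the final chunk has only 24 items -- no IndexError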
All right … so … lovely. We’re looping over “pqueryjson,” 50 records at a time.
- Q: What are we doing with those 50-record chunks?
- A: Most of the “woah!” is in this line of code:
ptoupdatejson = json.dumps({'prospects':{str(item['id']):{k:v for k,v in item.items() if k != 'id'} for item in chunk}})
Let’s break it down, from the inside to the outside.
- Given data representing a particular “Pardot prospect” item stored under the variable name “item” (note: as in earlier code, I could have called it “awesome_customer” instead of “item”), the code {k:v for k,v in item.items() if k != 'id'} produces a copy of the prospect’s data that is missing its “id” field-value pair.
- What’s that “ID-less” copy good for? Glad you asked. {str(item['id']):PROSPECT_MINUS_ID for item in chunk} loops over every “item” representing a “Pardot prospect” in our 50-record “chunk,” and builds an ID-indexed (“str(item['id']) : ...”) lookup table (“{LOOKUPKEY:LOOKUPVALUE for ...}”).
Blegh. That’s a mouthful. Let’s look at what it does to data.
It transforms our data from this shape:
[{'id': '04040404', 'address_one': ''}, {'id': '07070707', 'address_one': ''}]
into this shape:
{'04040404':{'address_one':''}, '07070707':{'address_one':''}}
If we were dealing with other non-ID fields besides address_one, those fields and the “values we want to set them to for a given prospect” would be listed in the inner curly braces as KEY:VALUE colon-delimited pairs, separated from other field-value pairs by commas.
For example, maybe in a more complicated data set we’d have:
{'04040404':{'favorite_food':'Hamburger', 'favorite_animal':'Cat'}, '07070707':{'address_one':'', 'favorite_animal':'Dog'}}
- Next, this code: “{'prospects':OUR_NEW_DATA_SHAPE}” puts a big “prospects:” annotation around the data I just described, as a formality, so that it now looks like this:
{'prospects':{'04040404':{'address_one':''}, '07070707':{'address_one':''}}}
Why? Because that’s how the Pardot API insists data be transmitted to it.
- Finally, all of these curly braces and colons and such actually still mean special things in Python (it’s “dict”-typed data).
To send our data over the internet to Pardot, we need to get just a plain old piece of text full of curly braces, colons, etc.
So we pass the whole thing into json.dumps(), the output of which is plaintext that looks just like the plain text I’ve been typing in showing you sample data.
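(To see the whole transformation end to end, here’s a quick demo using the sample records from above:)
import json

chunk = [{'id': '04040404', 'address_one': ''}, {'id': '07070707', 'address_one': ''}]
ptoupdatejson = json.dumps({'prospects':{str(item['id']):{k:v for k,v in item.items() if k != 'id'} for item in chunk}})
print(ptoupdatejson)
# {"prospects": {"04040404": {"address_one": ""}, "07070707": {"address_one": ""}}}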
From there on out, it’s pretty smooth sailing.
A couple of notes, though:
- Did you notice that we’re only transmitting the actual fields we want to change to Pardot, not every field a prospect might have had data in?
  - For example, that’s why we added “addresses that were already fine” to topop much earlier in our code.
  - This is particularly important when purposely changing cell values in a table to be blank. It’s really nice to be surgically precise about data you’re erasing! 😆
- If you’re confused by “writing code with curly braces, square brackets, commas, and colons in Python” vs. “reading stuff that looks just like it in a .txt file” / “sending stuff that looks just like it over the internet, but allegedly now it’s ‘plain text,’” you’re not alone!
It’s deliberate that Python made the rules for writing its code look like “the rules for writing plaintext representations of data in the JSON format.”
You’ll just have to trust me that in Python, there’s a difference between these two concepts (see the tiny demo after this list):
  - “text full of {’s and }’s” (e.g. “JSON-formatted” plaintext)
  - “Python data defined by typing {’s and }’s” (a Python “dict”)
All right, so we wrote some pretty complex code to prepare our data for transmission, but now we have all our data in a standard that Pardot’s API will be happy with!
pbatchupdurl = ... and pbatchupddata = ... gather up a URL for contacting Pardot, our user key, our API key, and our “ptoupdatejson” plaintext into variables. These variables enhance the readability of the code that actually talks to Pardot’s API using Python’s “module” that communicates over the web, “requests.”
While we’re still in a “chunk of 50” piece of our big loop, we do a requests.post(...) operation to talk to Pardot, and then we store a “JSON”-formatted copy of whatever Pardot sends back as a response into a variable called “pbatchupdresult.” We display this output to the screen just so we can watch to make sure everything goes all right.
Verify that the changes actually went through
There are 2 ways to do this:
- Spot-check the “MailingStreet” field for a few records out of “pqueryjson” at https://pi.pardot.com/prospect/read/id/SOMEIDNUMBER.
For example, you can display a sampling of “pqueryjson” like:
import random
for x in random.sample(range(0, len(pqueryjson)), 4):
    print(pqueryjson[x])
- Re-run the “Download your Pardot prospects as a plaintext file full of JSON-formatted text” & “Read the file into memory and ‘whittle it down’” sections of this tutorial.
Yeah, it’ll take a while to re-download all your Pardot prospects.
(Stretch & water break!)
But at least you can feel confident if your output explicitly says:
There are 0 prospect records to fix
Actually, what you’d probably notice, rather than “There are 0 prospect records to fix,” is this error at the end of your output:
...(a bunch of junk called a "stack trace")...
print(pqueryjson[0]) # Examine the first record
IndexError: list index out of range
Oops. I didn’t fool-proof my code for examining the “first record in need of fixing” when there was no “first record” remaining in pqueryjson after taking away everything that was “already just fine.”
If you’d like to fix that, look up “if” statements and try using len(pqueryjson) to only show you the first record in pqueryjson if there are any records in pqueryjson.
It’s a nice little homework exercise!
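(If you’d like to check your work afterward, one minimal version of that guard might look like this:)
if len(pqueryjson) > 0:
    print(pqueryjson[0]) # Examine the first record
else:
    print('No records left to fix -- nothing to examine.')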
Why didn’t we use Pandas?
I tried this with Pandas. But I quickly realized that, when it came time to push data back into Pardot, I couldn’t tell, in table format, whether a field was empty because I’d personally set it to be empty or because it was already empty in the first place.
When setting cell values in a table to NULL, I wanted to be over-cautious and ensure I was only dealing with data that I personally had decided needed to be changed.
It just didn’t help me solve my business problem to think of my data as a table, even though technically, “Prospects” most definitely is a “table” in Pardot.
(If you want to get really geeky with computer science, one could say that the data I cared about, arranged in table format, was a sort of “sparse matrix” in the context of “data I didn’t care about.” “Sparse matrices” are often easier to process when thinking of each row as an independent “dict” of fields & values than as a full-on rows-and-columns table. See Dan McCreary’s “Making Sense of NoSQL” if you want to dive deeper into that concept.)
“Not needing to think of my data as a table” was all the more true because Pardot’s API didn’t even send me a CSV file the way Salesforce Data Loader does … it sent me JSON-formatted text.
I could have used pandas.read_json(...) to turn my data into a table, but it just didn’t help solve my problem.
It turned out that native Python commands for removing items from, and modifying the contents of, “dicts” were a lot better suited to my business needs than turning all this data into a table-like shape would have been.
Full code
Impatient to try this and just want the script? (Be sure to read about password precautions back at the top!)
import requests
import json
import urllib.parse
import datetime
import random
ukey = 'PUTUSERKEYHERE'
api_key = requests.post(
'https://pi.pardot.com/api/login/version/3'
, params='email='+urllib.parse.quote('PUTEMAILADDRESSHERE')
+'&password='+'PUTPASSWORDHERE'
+'&user_key='+ukey
+'&format=json'
).json().get('api_key')
print(api_key)
pqueryresultsfilepath = 'c:\\example\\pardotprospects'+datetime.datetime.now().strftime('%Y%m%d%H%M%S')+'.txt'
fieldsdesired = ['id','address_one']
prospectqueryURL = 'https://pi.pardot.com/api/prospect/version/3/do/query'
pquerydata = 'user_key='+urllib.parse.quote(ukey)+'&api_key='+api_key+'&fields='+','.join(fieldsdesired)+'&format=json'
pquerycounter = 0
pqueryresultcount = 0
pquerycurrentresultslist = []
pqueryinitresp = requests.post(prospectqueryURL, params=pquerydata).json()
if pqueryinitresp['@attributes']['stat'] == 'ok':
    pqueryresultcount = pqueryinitresp['result']['total_results']
    print('Expecting to dump ' + str(pqueryresultcount) + ' prospects to hard drive.')
    pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict in pqueryinitresp['result']['prospect']]
    with open(pqueryresultsfilepath, 'w') as f:
        f.write(json.dumps(pquerycurrentresultslist)+'\n')
    pquerycounter = pquerycounter + len(pqueryinitresp['result']['prospect'])
    print('Should have now dumped ' + str(pquerycounter) + ' prospects to hard drive.')
else:
    print('Could not even get an initial response out of Pardot about attempted prospect dump.')
with open(pqueryresultsfilepath, 'a') as f:
    while pquerycounter < pqueryresultcount:
        pqueryextraresp = requests.post(prospectqueryURL, params=pquerydata+'&offset='+str(pquerycounter)).json()
        if pqueryextraresp['@attributes']['stat'] == 'ok':
            pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict in pqueryextraresp['result']['prospect']]
            f.write(json.dumps(pquerycurrentresultslist)+'\n')
            pquerycounter = pquerycounter + len(pqueryextraresp['result']['prospect'])
            print('Should have now dumped ' + str(pquerycounter) + ' prospects to hard drive.')
        else:
            print('Trying again for offset ' + str(pquerycounter))
print('Done dumping prospects to hard drive.')
with open(pqueryresultsfilepath, 'r') as f:
    pqueryjson = [item for sublist in f.readlines() for item in json.loads(sublist)]
print('There are ' + str(len(pqueryjson)) + ' prospect records in the file.\n' + str(pqueryresultcount) + ' were expected.')
for item in pqueryjson:
    topop = []
    for k,v in item.items():
        if k not in ['id','address_one']:
            topop.append(k)
        elif k == 'address_one':
            if v == 'null':
                item[k] = ''
            else:
                topop.append(k)
    if len(topop) > 0:
        [item.pop(popme) for popme in topop]
pqueryjson = [item for item in pqueryjson if len([x for x in item.keys() if x=='address_one']) > 0] # Strip out any prospects that NOW don't have an "address_one" value
print('There are ' + str(len(pqueryjson)) + ' prospect records to fix')
print(pqueryjson[0]) # Examine the first record
print([list(x) for x in set(tuple(x) for x in [sorted(x.keys()) for x in pqueryjson])]) # What "field sets" are we looking at in the records to fix?
print(len(pqueryjson)/50) # How many API POST calls will we be doing, if 50 at a time?
print('Beginning update of ' + str(len(pqueryjson)) + ' prospect records, 50 at a time')
for chunk in [pqueryjson[i:i + 50] for i in range(0, len(pqueryjson), 50)]:
    ptoupdatejson = json.dumps({'prospects':{str(item['id']):{k:v for k,v in item.items() if k != 'id'} for item in chunk}})
    pbatchupdurl = 'https://pi.pardot.com/api/prospect/version/3/do/batchUpdate'
    pbatchupddata = 'user_key='+ukey+'&api_key='+api_key+'&format=json'+'&prospects='+ptoupdatejson
    pbatchupdresult = requests.post(pbatchupdurl, params=pbatchupddata).json()
    print(pbatchupdresult)
print('Done updating prospect records (there were ' + str(len(pqueryjson)) + ' to do)')
for x in random.sample(range(0, len(pqueryjson)), 4):
    print(pqueryjson[x])
Note to self: fix bug
Note to self: fix this code for when there’s 1 more prospect than a multiple of 200 (e.g. 100,601).
(Sorry, readers, have to get to work now! But perhaps it’s another good homework problem you might beat me to.)
Problem: [{k:v for k,v in subdict.items()} for subdict in pqueryextraresp['result']['prospect']] fails to write the last prospect to disk.
Cause: in the case of a “response of size 1,” pqueryextraresp['result']['prospect'] contains a single dict {'id': 111, 'address_one': '123 Sunny'} instead of a list of dicts [{...}, {...}].
Workaround until code fixed
- Run print(pqueryextraresp['result']['prospect']) to see the missing record.
- Copy-paste its output, with a good text editor, surrounded by square brackets, into a new line at the last line of your .txt file (or don’t, if you can see it’s already good data that doesn’t need fixing).
- Comment out the “login,” “filepath-defining,” and “file-downloading” parts of this script and re-run the rest of the script (the “file-analysis” & “Pardot-update” parts), since they didn’t run when the program threw the following error:
pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict in pqueryextraresp['result']['prospect']]
AttributeError: 'str' object has no attribute 'items'
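(For what it’s worth, one way to fix it – an untested sketch, so you might still beat me to a better one, and prospectdata is a name I made up – is to normalize the response into a list before the comprehension runs:)
prospectdata = pqueryextraresp['result']['prospect']
if isinstance(prospectdata, dict):
    prospectdata = [prospectdata]  # a single-record response arrives as a bare dict, not a 1-item list
pquerycurrentresultslist = [{k:v for k,v in subdict.items()} for subdict in prospectdata]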
Further Resources
- If you want to feel comfortable modifying this code, look up tutorials about “list comprehensions” & “dict comprehensions” on Google until you feel like you get them.
They’re glorified “for-loops” for building “lists” & “dicts.”
(Google those, come to think of it, if that doesn’t make sense.)
They’re super common in Python but not necessarily in other programming languages you might have worked with.
  - Here’s a short reference
  - Here’s a full-on tutorial by Katie Silverio
  - (I’m still on the hunt for the perfect “total beginner’s tutorial” about Python comprehensions. Let me know if you’ve found one!)
- More about working with files on your hard drive in Python, from Real Python
- The official “requests” documentation
- Real Python’s tutorial about “requests”