Dedupe Salesforce NPSP Emails with Python
23 Nov 2018
Table of Contents
- Data & Expectations
- Running The Code Yourself
- The Code
- Stop reading here if you don’t want to learn Python
- Code Explanations
  - “Output1” explanation
    - Import “time,” start the clock
    - Import “pandas,” set display options
    - Specify “email address” column names
    - Map out “alternate spellings” of email address fields
    - Specify where “contacts.csv” is
    - Specify output files
    - Read “contacts.csv” into pandas and call the table df
    - Add a ChangedAnything column to df
    - Grab df rows containing valid PreferredEmail values
    - Add a PrefEmAddr column to df
    - “Loop” over the email addresses
    - Only show changed lines
    - Delete the PrefEmAddr & ChangedAnything columns
    - Save “output1.csv”
    - Dummy-checking the time taken
  - “Output2” explanation
    - Import “time,” start the clock
    - Import “OrderedDict”
    - Import “pandas,” set display options
    - Specify “email address” column names
    - Map out “alternate spellings” of email address fields
    - Specify where “contacts.csv” is
    - Specify output files
    - Read “contacts.csv” into pandas and call the table df
    - Sort df on Human-Readable ID
    - Re-set “row IDs” to start at 0
    - Move all data except email addresses into the “row IDs”
    - “Stack” the email addresses
    - Group allEmailsSeries by the email addresses it contains
    - Filter groupedEmailSeriesGroupItems into emsAndTheirCellsFound
    - Create empty “dumping spots” output2aList and output2bList
    - “Loop” over the email addresses
    - Save “output2a.csv” and “output2b.csv”
    - Dummy-checking the time taken
Salesforce #AwesomeAdmin Matthew (@matthewsalibi) asked how he could use Python to deduplicate Salesforce NPSP email addresses across several fields. Today I’ll provide you a script you can copy, paste, and try on your own data. And for the geeks, I’ll break it down and explain how every line works.
Data & Expectations
Input Data
As a recap, Matthew proposed that we start with a CSV-formatted export of his “Contacts” table, “contacts.csv,” that looked something like this (I consolidated his examples and added a bit of complexity):
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | [email protected] |
003…02 | DDDD | Eleanor | Shellstrop | Personal | [email protected] | ||
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | [email protected] | [email protected] |
003…04 | FFFF | Eleanor | Shellstrop | Alternate | [email protected] | [email protected] | |
003…05 | GGGG | Janet | Della-Denunzio | Alternate | [email protected] | [email protected] | |
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | [email protected] |
003…07 | IIII | Chidi | Anagonye | Work | [email protected] | ||
003…08 | JJJJ | Mindy | St. Claire | Personal | [email protected] | ||
003…09 | KKKK | Kamilah | Al-Jamil | Personal | [email protected] | ||
003…10 | AAAA | Eleanor | Shellstrop | Alternate | [email protected] | ||
003…11 | CCCC | Ellie | Shellstrop | Work | [email protected] | ||
003…12 | LLLL | Mindy | St. Claire | Personal | [email protected] | [email protected] | [email protected] |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | [email protected] | [email protected] |
003…14 | NNNN | Tahani | Al-Jamil | Alternate | [email protected] |
Output File #1
Matthew wanted to see an “output1.csv” that would identify duplicates within rows and provide a “cleanup” file, ready for Data Loader (D.L. docs), that looks like this:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | |
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | ||
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] |
Output File #2A
To identify entire rows suspected of being redundant Contacts, Matthew proposed a “ready for DemandTools” file “output2a.csv” that would look like the data below.
This data is compatible with the “Select Master / Non-Master records from file” functionality of DemandTools’s “Single Table Dedupe” module.
Each row indicates a “same email address, same first name” pairing that was found in multiple Contact records (where the “email address” could have appeared in any of the “Email” fields that come with the NPSP).
Each column is the Salesforce ID of a Contact in which the email address was found.
Master | Duplicate_1 | Duplicate_2 |
---|---|---|
003…01 | 003…07 | |
003…10 | 003…02 | 003…04 |
003…06 | 003…14 |
Output File #2B
For sets of Contacts that shared an email address but not a first name, Matthew wanted to see a bit more detail in the output file “output2b.csv”, to facilitate human review. He suggested it look like the output below.
emAddr | Master | Master_FN | Master_LN | Duplicate_1 | Duplicate_1_FN | Duplicate_1_LN |
---|---|---|---|---|---|---|
[email protected] | 003…10 | Eleanor | Shellstrop | 003…11 | Ellie | Shellstrop |
[email protected] | 003…03 | Jason | Mendoza | 003…05 | Janet | Della-Denunzio |
Running The Code Yourself
Unfortunately, I haven’t yet made a blog post about running Python code on your own computer, so if you’re totally new to Python, you’re a bit stuck – but show this code to a colleague who does know how!
Or, if you’re adventurous, download an “Anaconda suite” installer for your computer’s operating system and give it a try.
- Don’t bother installing VSCode when it asks if you want to.
- Once it’s done, bring up the “Anaconda Navigator” and open “Spyder.”
- Type print('hello') in the code-editing panel at the left, hit “run” near the top (if a box pops up, hit “cancel” and do it again), and then check to see if hello is printed at the right-hand side of your screen in an output console (probably bottom right).
- If so, congrats – you just wrote a program! Backspace out print('hello') and copy/paste my code in instead, then start tweaking it.
And stay tuned for my “office hours” and “hands-on trainings,” where I teach total non-programmers to run code like this.
The Code
There’s no shame in simply copying this code, pasting it into your Python “IDE,” and making a few tweaks!
You will, of course, have to do a bit of “Find & Replace” to fix up column-name references (which, by the way, are case-sensitive to match their capitalization in your input CSV file), if your columns aren’t precisely spelled “PersonalEmail,” “WorkEmail,” “Human-Readable ID,” etc.
Code to produce “Output1” only
import time
start_time = time.time()
import pandas
pandas.set_option('expand_frame_repr', False)
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
df = pandas.read_csv(inputFilePath, dtype=object)
df['ChangedAnything'] = False
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))
for fieldName in emFieldNames:
needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
df.loc[needsClearingTFSer,fieldName] = None
df.loc[needsClearingTFSer,'ChangedAnything'] = True
output1df = df[df['ChangedAnything']]
output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])
if len(output1df) > 0:
output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
with open(output1NoneFoundFilePath, 'w') as f:
f.writelines('No records found')
print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
Code to produce “Output2A” and “Output2B” only
import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running
df = pandas.read_csv(inputFilePath, dtype=object)
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
allEmailsSeries = df[emFieldNames].stack()
groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}
output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
seenFirstNamesAndTheirContacts = {}
numCellsFound = len(foundCells)
output2bFact = OrderedDict()
if numCellsFound > 1:
for cellLoopCounter, cell in enumerate(foundCells):
if cell[2] not in seenFirstNamesAndTheirContacts:
seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
if cellLoopCounter == 0:
output2bFact['emAddr'] = emAddr
output2bFact['Master'] = cell[1]
output2bFact['Master_FN'] = cell[2]
output2bFact['Master_LN'] = cell[3]
else:
uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
if cellLoopCounter == numCellsFound-1:
if 'Duplicate_1' in output2bFact:
output2bList.append(output2bFact)
for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
output2aFact = OrderedDict()
if len(contactsForFN) > 1:
for contactLoopCounter, contact in enumerate(contactsForFN):
if contactLoopCounter == 0:
output2aFact['Master'] = contact[1]
else:
output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
if 'Duplicate_1' in output2aFact:
output2aList.append(output2aFact)
if len(output2aList) > 0:
output2adf = pandas.DataFrame(output2aList)
output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
with open(output2aNoneFoundFilePath, 'w') as f:
f.writelines('No records found')
if len(output2bList) > 0:
output2bdf = pandas.DataFrame(output2bList)
output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
with open(output2bNoneFoundFilePath, 'w') as f:
f.writelines('No records found')
print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
Code to produce all 3 output files at once
import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running
df = pandas.read_csv(inputFilePath, dtype=object)
# ### Output1 portion starts here ###
df['ChangedAnything'] = False
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))
for fieldName in emFieldNames:
needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
df.loc[needsClearingTFSer,fieldName] = None
df.loc[needsClearingTFSer,'ChangedAnything'] = True
output1df = df[df['ChangedAnything']]
output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])
if len(output1df) > 0:
output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
with open(output1NoneFoundFilePath, 'w') as f:
f.writelines('No records found')
# ### Output1 portion ends here ###
# ### Output2 portion starts here ###
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
allEmailsSeries = df[emFieldNames].stack()
groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}
output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
seenFirstNamesAndTheirContacts = {}
numCellsFound = len(foundCells)
output2bFact = OrderedDict()
if numCellsFound > 1:
for cellLoopCounter, cell in enumerate(foundCells):
if cell[2] not in seenFirstNamesAndTheirContacts:
seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
if cellLoopCounter == 0:
output2bFact['emAddr'] = emAddr
output2bFact['Master'] = cell[1]
output2bFact['Master_FN'] = cell[2]
output2bFact['Master_LN'] = cell[3]
else:
uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
if cellLoopCounter == numCellsFound-1:
if 'Duplicate_1' in output2bFact:
output2bList.append(output2bFact)
for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
output2aFact = OrderedDict()
if len(contactsForFN) > 1:
for contactLoopCounter, contact in enumerate(contactsForFN):
if contactLoopCounter == 0:
output2aFact['Master'] = contact[1]
else:
output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
if 'Duplicate_1' in output2aFact:
output2aList.append(output2aFact)
if len(output2aList) > 0:
output2adf = pandas.DataFrame(output2aList)
output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
with open(output2aNoneFoundFilePath, 'w') as f:
f.writelines('No records found')
if len(output2bList) > 0:
output2bdf = pandas.DataFrame(output2bList)
output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
with open(output2bNoneFoundFilePath, 'w') as f:
f.writelines('No records found')
# ### Output2 portion ends here ###
print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
Stop reading here if you don’t want to learn Python
You don’t have to thoroughly understand the code above to make use of it.
As long as your “output” filenames are different from your “input” filename, you can’t do too much damage.
Worst-case scenario, you’ll freeze your computer running the program against too large of a file – but several million rows is nothing (a minute or few), so get up and stretch. It’s good for you!
For the curious, the rest of this post is total geekery peeking under the hood.
Code Explanations
“Output1” explanation
Import “time,” start the clock
First we import the time
extension to Python and start a clock running (so we can tell, at the end, how long our script took to run).
import time
start_time = time.time()
Import “pandas,” set display options
Next we import the pandas
(spreadsheet-processing) extension to Python and say that, should we display any of its data on our screen with a print(...)
command (although in this script we don’t), we want to make use of our full screen width before trying to “wrap” excessive columns onto overflow lines.
import pandas
pandas.set_option('expand_frame_repr', False)
Specify “email address” column names
We’ll create a “variable” (nickname for use later in our code) called emFieldNames
, and in it, we’ll store a list of the names of columns where email addresses can be found.
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
Map out “alternate spellings” of email address fields
In a variable called pickToField
, we’ll create a “dictionary,” or “dict,” that lists out the values we might find in our PreferredEmail
column, and clarifies what “email address” column headers they each correspond to.
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
Specify where “contacts.csv” is
In the variable inputFilePath
, we’ll specify where on our hard drive our file “contacts.csv” can be found.
The letter r
before the open-quote tells Python that the \
characters in our code are meant to be interpreted literally, not as any sort of “special character.” (Usually, they’re special characters within Python text inside of quotes.)
inputFilePath = r'c:\example\contacts.csv'
If you’re on a Mac, the file path might look more like r'/Users/your_username/example/contacts.csv'
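A quick demonstration of the difference (the path here is invented just for illustration):

```python
plain = 'c:\temp\new.txt'   # \t becomes a tab and \n a newline: 13 characters
raw = r'c:\temp\new.txt'    # backslashes kept exactly as typed: 15 characters

print(raw)
print(len(plain), len(raw))  # 13 15
```

Two characters get “eaten” by the escape sequences in the plain string, which is why Windows paths in Python almost always get the r prefix.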
Specify output files
In the variable output1FilePath
, we’ll specify where to save our output file (if there is data), and in output1NoneFoundFilePath
we’ll specify where to leave us a note saying there was no data.
output1FilePath = r'c:\example\output1.csv'
output1NoneFoundFilePath = r'c:\example\nonefound1.txt'
Read “contacts.csv” into pandas
and call the table df
Now we’ll use pandas
to read “contacts.csv” into Python.
We’ll call Python’s copy of the table df
because it’s short for the jargon pandas
uses to describe a table, which is “DataFrame.”
df = pandas.read_csv(inputFilePath, dtype=object)
Add a ChangedAnything
column to df
Right away, we’ll add a column to our table df
called ChangedAnything
, with values of False
all the way down (since we haven’t yet changed anything).
df['ChangedAnything'] = False
df
looks like this now:
SalesforceID | Human-Readable ID | FirstName | … | ChangedAnything |
---|---|---|---|---|
003…01 | BBBB | Chidi | … | False |
003…02 | DDDD | Eleanor | … | False |
003…03 | EEEE | Jason | … | False |
003…04 | FFFF | Eleanor | … | False |
003…05 | GGGG | Janet | … | False |
003…06 | HHHH | Tahani | … | False |
003…07 | IIII | Chidi | … | False |
003…08 | JJJJ | Mindy | … | False |
003…09 | KKKK | Kamilah | … | False |
003…10 | AAAA | Eleanor | … | False |
003…11 | CCCC | Ellie | … | False |
003…12 | LLLL | Mindy | … | False |
003…13 | MMMM | Mindy | … | False |
003…14 | NNNN | Tahani | … | False |
Grab df
rows containing valid PreferredEmail
values
Next, we’ll build a special type of list specific to pandas, called a “Series,” that will indicate, for each record, whether its PreferredEmail value maps (via pickToField) to a legitimate column name in df (always good to check for typos).
We’ll set that aside in a variable named validPrefEmailIndicatorTFSeries
.
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
validPrefEmailIndicatorTFSeries
looks like this:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
Name: PreferredEmail, dtype: bool
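To see what .map(pickToField) and .isin(...) each do on their own, here’s a toy run on a three-row Series (the values, including the deliberate typo, are invented for illustration):

```python
import pandas

picks = pandas.Series(['Work', 'Personal', 'Wrok'])  # note the typo in row 2
pickToField = {'Personal': 'PersonalEmail', 'Work': 'WorkEmail', 'Alternate': 'AlternateEmail'}

mapped = picks.map(pickToField)  # 'Wrok' has no dict entry, so it maps to NaN
print(mapped.tolist())           # ['WorkEmail', 'PersonalEmail', nan]

valid = mapped.isin(['PersonalEmail', 'WorkEmail', 'AlternateEmail'])
print(valid.tolist())            # [True, True, False]
```

Rows with typos (or blanks) in PreferredEmail simply come out False, so the script quietly skips them rather than crashing.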
Add a PrefEmAddr
column to df
This next line is … a bit of “Python magic” … but what it does is add a new PrefEmAddr column to our df table that contains the actual preferred email address of each Contact record.
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))
df
now looks like this:
SalesforceID | Human-Readable ID | FirstName | … | ChangedAnything | PrefEmAddr |
---|---|---|---|---|---|
003…01 | BBBB | Chidi | … | False | [email protected] |
003…02 | DDDD | Eleanor | … | False | [email protected] |
003…03 | EEEE | Jason | … | False | [email protected] |
003…04 | FFFF | Eleanor | … | False | [email protected] |
003…05 | GGGG | Janet | … | False | [email protected] |
003…06 | HHHH | Tahani | … | False | [email protected] |
003…07 | IIII | Chidi | … | False | [email protected] |
003…08 | JJJJ | Mindy | … | False | [email protected] |
003…09 | KKKK | Kamilah | … | False | [email protected] |
003…10 | AAAA | Eleanor | … | False | [email protected] |
003…11 | CCCC | Ellie | … | False | [email protected] |
003…12 | LLLL | Mindy | … | False | [email protected] |
003…13 | MMMM | Mindy | … | False | [email protected] |
003…14 | NNNN | Tahani | … | False | [email protected] |
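One caution: df.lookup worked in the pandas versions current when this post was written, but it was deprecated in pandas 1.2 and removed in pandas 2.0. If that line errors for you, a slower but equivalent row-by-row sketch (toy data invented here, but using this post’s column names) looks like this:

```python
import pandas

# Toy stand-in for df; the real script reads contacts.csv instead.
df = pandas.DataFrame({
    'PreferredEmail': ['Work', 'Personal'],
    'PersonalEmail': ['[email protected]', '[email protected]'],
    'WorkEmail': ['[email protected]', None],
})
pickToField = {'Personal': 'PersonalEmail', 'Work': 'WorkEmail', 'Alternate': 'AlternateEmail'}

# For each row, read the cell in the column that the row's PreferredEmail pick maps to.
df['PrefEmAddr'] = df.apply(
    lambda row: row.get(pickToField.get(row['PreferredEmail'], ''), None),
    axis=1,
)
print(df['PrefEmAddr'].tolist())  # ['[email protected]', '[email protected]']
```

The .apply(..., axis=1) approach visits one row at a time, so it scales worse than lookup did, but for files of a few hundred thousand rows you’re unlikely to notice.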
“Loop” over the email addresses
Now we’ll “loop” over one “email address” column name at a time (as listed out in emFieldNames
).
Any code “inside” this “loop” will be able to tell which “column name” we’re currently inspecting, because for the duration of a given “pass” of the loop, an appropriate value ('PersonalEmail'
, 'WorkEmail'
, or 'AlternateEmail'
) will be stored in the variable fieldName
.
for fieldName in emFieldNames:
Select the cells that need to be cleared out
First thing we’ll do within each “loop” is build a special list (again, a Pandas “Series”) of whether or not (True
or False
) we find each row of the df
table “interesting.” We’ll save that list into a variable called needsClearingTFSer
.
In this case, “interesting” means that for a given row of our df
table, its value in the PreferredEmail
column is not the column we’re currently “looping” over (as specified in fieldName
), but that the email address found in the actual column specified by the current value of fieldName
is spelled exactly the same as the email address found in the same row’s PrefEmAddr
field that we built earlier.
In other words, “It looks the same as the preferred email address, but it isn’t the preferred email address.”
In other other words, “Is it a redundant copy of the email address?”
needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
During the 'PersonalEmail'
pass of our loop, for example, needsClearingTFSer
looks like this:
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
Clear out cells that had redundant values
Next we’ll clear out any values found in cells at the intersections of the rows indicated by True
values in needsClearingTFSer
and the column indicated by fieldName
for our current “loop pass.”
We’ll also set ChangedAnything
for a given row to True
if we … well … changed anything.
df.loc[needsClearingTFSer,fieldName] = None
df.loc[needsClearingTFSer,'ChangedAnything'] = True
After the entire “loop” has run through all 3 email address columns, our table df
looks like this:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail | ChangedAnything | PrefEmAddr |
---|---|---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | True | [email protected] | |
003…02 | DDDD | Eleanor | Shellstrop | Personal | [email protected] | False | [email protected] | ||
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | True | [email protected] | ||
003…04 | FFFF | Eleanor | Shellstrop | Alternate | [email protected] | [email protected] | False | [email protected] | |
003…05 | GGGG | Janet | Della-Denunzio | Alternate | [email protected] | [email protected] | False | [email protected] | |
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | True | [email protected] | |
003…07 | IIII | Chidi | Anagonye | Work | [email protected] | False | [email protected] | ||
003…08 | JJJJ | Mindy | St. Claire | Personal | [email protected] | False | [email protected] | ||
003…09 | KKKK | Kamilah | Al-Jamil | Personal | [email protected] | False | [email protected] | ||
003…10 | AAAA | Eleanor | Shellstrop | Alternate | [email protected] | False | [email protected] | ||
003…11 | CCCC | Ellie | Shellstrop | Work | [email protected] | False | [email protected] | ||
003…12 | LLLL | Mindy | St. Claire | Personal | [email protected] | [email protected] | [email protected] | False | [email protected] |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | True | [email protected] | ||
003…14 | NNNN | Tahani | Al-Jamil | Alternate | [email protected] | False | [email protected] |
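Here’s that same two-line masked assignment at work on an invented two-row frame:

```python
import pandas

toy = pandas.DataFrame({'WorkEmail': ['[email protected]', '[email protected]'],
                        'ChangedAnything': [False, False]})
mask = pandas.Series([True, False])  # pretend row 0 held a redundant copy

toy.loc[mask, 'WorkEmail'] = None    # clears only the row(s) where mask is True
toy.loc[mask, 'ChangedAnything'] = True

print(toy['WorkEmail'].isna().tolist())  # [True, False]
print(toy['ChangedAnything'].tolist())   # [True, False]
```

Note that .loc[mask, column] touches only the intersection of the True rows and the named column; everything else in the table is left alone.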
Only show changed lines
Matthew asked that we only show lines where we changed something, so we’ll create a new copy of df
and call it output1df
. This copy will be filtered to only contain rows from df
where ChangedAnything
was True
.
output1df = df[df['ChangedAnything']]
output1df
looks like this:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail | ChangedAnything | PrefEmAddr |
---|---|---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | True | [email protected] | |
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | True | [email protected] | ||
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | True | [email protected] | |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | True | [email protected] |
Delete the PrefEmAddr
& ChangedAnything
columns
We don’t need our PrefEmAddr
and ChangedAnything
columns anymore, so let’s get rid of them:
output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])
Now our output1df
looks the way Matthew asked for it to look:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | |
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | ||
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] |
Save “output1.csv”
Finally, we’ll save output1df
to our hard drive (or a “no records found” message instead) with the following code.
The index=False
setting means “don’t bother to write pandas
’s internal row numbers as their own dedicated column in the output.”
The quoting=1
means “put quotes around everything in the output CSV file, just in case our data itself contains commas.”
if len(output1df) > 0:
output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
with open(output1NoneFoundFilePath, 'w') as f:
f.writelines('No records found')
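To see why quoting matters, here’s a made-up one-row table with a comma inside a value (written to an in-memory buffer rather than a file, so you can inspect the result):

```python
import io
import pandas

tiny = pandas.DataFrame({'Name': ['St. Claire, Mindy'], 'Email': ['[email protected]']})
buf = io.StringIO()
tiny.to_csv(buf, index=False, quoting=1)  # quoting=1 is csv.QUOTE_ALL
print(buf.getvalue())
# "Name","Email"
# "St. Claire, Mindy","[email protected]"
```

Without the quotes, the comma in “St. Claire, Mindy” would split that name across two columns when the CSV was read back in.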
Dummy-checking the time taken
Finally, as a nice touch, we’ll put some text on our screen to announce that the code is done running and to say how long it took to run.
print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
“Output2” explanation
Import “time,” start the clock
First we import the time
extension to Python and start a clock running (so we can tell, at the end, how long our script took to run).
import time
start_time = time.time()
Import “OrderedDict”
To ensure that we get the right Salesforce Contact IDs assigned to “Master” vs. “Duplicate 1” columns, we’ll need to ensure that we keep track of certain discoveries we make in a specific sequence.
Python’s OrderedDict
extension will let us do this.
from collections import OrderedDict
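(In Python 3.7 and later, plain dicts also preserve insertion order, so OrderedDict is belt-and-braces here, but it makes the intent explicit.) A tiny sketch of the behavior we’re relying on, with invented IDs:

```python
from collections import OrderedDict

fact = OrderedDict()
fact['Master'] = '003...01'       # first Contact seen becomes the Master
fact['Duplicate_1'] = '003...07'  # later Contacts become numbered Duplicates

print(list(fact.keys()))  # ['Master', 'Duplicate_1']: keys stay in insertion order
```

Because the keys keep their order, the columns of the eventual output CSV come out as Master, then Duplicate_1, Duplicate_2, and so on.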
Import “pandas,” set display options
Next we import the pandas
(spreadsheet-processing) extension to Python and say that, should we display any of its data on our screen with a print(...)
command (although in this script we don’t), we want to make use of our full screen width before trying to “wrap” excessive columns onto overflow lines.
import pandas
pandas.set_option('expand_frame_repr', False)
Specify “email address” column names
We’ll create a “variable” (nickname for use later in our code) called emFieldNames
, and in it, we’ll store a list of the names of columns where email addresses can be found.
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
Map out “alternate spellings” of email address fields
In a variable called pickToField
, we’ll create a “dictionary,” or “dict,” that lists out the values we might find in our PreferredEmail
column, and clarifies what “email address” column headers they each correspond to.
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
Specify where “contacts.csv” is
In the variable inputFilePath
, we’ll specify where on our hard drive our file “contacts.csv” can be found.
inputFilePath = r'c:\example\contacts.csv'
If you’re on a Mac, the file path might look more like r'/Users/your_username/example/contacts.csv'
Specify output files
In the variables output2aFilePath and output2bFilePath, we’ll specify where to save our output files (if there is data), and in output2aNoneFoundFilePath and output2bNoneFoundFilePath we’ll specify where to leave us a note saying there was no data.
output2aFilePath = r'c:\example\output2a.csv'
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt'
output2bFilePath = r'c:\example\output2b.csv'
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt'
Read “contacts.csv” into pandas
and call the table df
Now we’ll use pandas
to read “contacts.csv” into Python.
We’ll call Python’s copy of the table df
because it’s short for the jargon pandas
uses to describe a table, which is “DataFrame.”
df = pandas.read_csv(inputFilePath, dtype=object)
This time, when I show you what df
looks like, as below, I’m going to include pandas
’s internal “row IDs,” because we’re going to do some trickery by changing them. I want you to be able to watch them change. Upon importing df
into Python from our CSV file, it looks like this:
(row ID) | SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|---|
(0) | 003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | [email protected] |
(1) | 003…02 | DDDD | Eleanor | Shellstrop | Personal | [email protected] | | |
(2) | 003…03 | EEEE | Jason | Mendoza | Work | [email protected] | [email protected] | [email protected] |
(3) | 003…04 | FFFF | Eleanor | Shellstrop | Alternate | [email protected] | | [email protected] |
(4) | 003…05 | GGGG | Janet | Della-Denunzio | Alternate | [email protected] | | [email protected] |
(5) | 003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | [email protected] |
(6) | 003…07 | IIII | Chidi | Anagonye | Work | | [email protected] | |
(7) | 003…08 | JJJJ | Mindy | St. Claire | Personal | [email protected] | | |
(8) | 003…09 | KKKK | Kamilah | Al-Jamil | Personal | [email protected] | | |
(9) | 003…10 | AAAA | Eleanor | Shellstrop | Alternate | | | [email protected] |
(10) | 003…11 | CCCC | Ellie | Shellstrop | Work | | [email protected] | |
(11) | 003…12 | LLLL | Mindy | St. Claire | Personal | [email protected] | [email protected] | [email protected] |
(12) | 003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | [email protected] | [email protected] |
(13) | 003…14 | NNNN | Tahani | Al-Jamil | Alternate | | | [email protected] |
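As a quick aside, `dtype=object` in the upcoming `read_csv` call matters: it keeps every cell as plain text instead of letting pandas guess numeric types. Here's a minimal sketch with a made-up two-column CSV (a hypothetical `ZIP` column, not part of Matthew's real export) showing why that's what you want for ID-like data:

```python
import io

import pandas

# Hypothetical CSV text; read as numbers, "01234" would lose its leading zero.
csv_text = "SalesforceID,ZIP\n003...01,01234\n"
df = pandas.read_csv(io.StringIO(csv_text), dtype=object)

# With dtype=object, every cell stays a plain string
zip_value = df.loc[0, "ZIP"]
```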
Sort df
on Human-Readable ID
Matthew specified that the alphabetically earliest Human-Readable ID
value for a given “grouping” of “likely duplicates” should be the “Master” for that grouping.
Since a lot of the tricks we’ll play in this code read things in the order we started with them in df
, we’ll just re-sort df
. Notice that at this point, the “row IDs” start with 9, then go to 0, then 10, then 1, etc. We’ll fix that next.
df = df.sort_values('Human-Readable ID')
(row ID) | SalesforceID | Human-Readable ID | FirstName | … |
---|---|---|---|---|
(9) | 003…10 | AAAA | Eleanor | … |
(0) | 003…01 | BBBB | Chidi | … |
(10) | 003…11 | CCCC | Ellie | … |
(1) | 003…02 | DDDD | Eleanor | … |
(2) | 003…03 | EEEE | Jason | … |
(3) | 003…04 | FFFF | Eleanor | … |
(4) | 003…05 | GGGG | Janet | … |
(5) | 003…06 | HHHH | Tahani | … |
(6) | 003…07 | IIII | Chidi | … |
(7) | 003…08 | JJJJ | Mindy | … |
(8) | 003…09 | KKKK | Kamilah | … |
(11) | 003…12 | LLLL | Mindy | … |
(12) | 003…13 | MMMM | Mindy | … |
(13) | 003…14 | NNNN | Tahani | … |
Re-set “row IDs” to start at 0
Told you we’d fix the row IDs.
df = df.reset_index(drop=True)
(row ID) | SalesforceID | Human-Readable ID | FirstName | … |
---|---|---|---|---|
(0) | 003…10 | AAAA | Eleanor | … |
(1) | 003…01 | BBBB | Chidi | … |
(2) | 003…11 | CCCC | Ellie | … |
(3) | 003…02 | DDDD | Eleanor | … |
(4) | 003…03 | EEEE | Jason | … |
(5) | 003…04 | FFFF | Eleanor | … |
(6) | 003…05 | GGGG | Janet | … |
(7) | 003…06 | HHHH | Tahani | … |
(8) | 003…07 | IIII | Chidi | … |
(9) | 003…08 | JJJJ | Mindy | … |
(10) | 003…09 | KKKK | Kamilah | … |
(11) | 003…12 | LLLL | Mindy | … |
(12) | 003…13 | MMMM | Mindy | … |
(13) | 003…14 | NNNN | Tahani | … |
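Here's a tiny stand-in table (hypothetical rows, not the real contacts.csv) showing the two-step dance we just did: `sort_values()` keeps the old row IDs, and `reset_index(drop=True)` renumbers them from 0.

```python
import pandas

# Three hypothetical rows, deliberately out of alphabetical order
df = pandas.DataFrame(
    {"Human-Readable ID": ["BBBB", "AAAA", "CCCC"],
     "FirstName": ["Chidi", "Eleanor", "Ellie"]}
)

df = df.sort_values("Human-Readable ID")
ids_after_sort = list(df.index)   # the old row IDs, now shuffled

df = df.reset_index(drop=True)
ids_after_reset = list(df.index)  # renumbered from 0
```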
Move all data except email addresses into the “row IDs.”
Now we’re going to play a nifty trick: we’re going to move any data, except the email addresses themselves, that we’ll need when building our output files, into the “row IDs.”
The order in which we list the fields that should move into “row ID” matters.
We need to make sure that the first item listed when we do this is something that uniquely identifies each row (if we couldn’t find such a piece of data, we’d want to include the old numeric “row ID” as the first item in this list – but I’m not going to show that code right now). That will help ensure that when we do a sort of “pivot”-like operation, we know that any “row ID” we find is truly the ID of a row, and not of a group of rows.
Just … trust me. 🤷
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
Our df
table now looks something like this (the row IDs don’t literally look like that with the //
characters, but it’s a decent representation of what’s going on).
(row ID) | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) | [email protected] | ||
(BBBB // 003…01 // Chidi // Anagonye // Work) | [email protected] | [email protected] | [email protected] |
(CCCC // 003…11 // Ellie // Shellstrop // Work) | [email protected] | ||
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) | [email protected] | ||
(EEEE // 003…03 // Jason // Mendoza // Work) | [email protected] | [email protected] | [email protected] |
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) | [email protected] | [email protected] | |
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) | [email protected] | [email protected] | |
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) | [email protected] | [email protected] | [email protected] |
(IIII // 003…07 // Chidi // Anagonye // Work) | [email protected] | ||
(JJJJ // 003…08 // Mindy // St. Claire // Personal) | [email protected] | ||
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) | [email protected] | ||
(LLLL // 003…12 // Mindy // St. Claire // Personal) | [email protected] | [email protected] | [email protected] |
(MMMM // 003…13 // Mindy // St. Claire // Work) | [email protected] | [email protected] | [email protected] |
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) | [email protected] |
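A minimal sketch (hypothetical rows and addresses) of what `set_index()` with a list of columns does: the listed columns move into the "row ID," and only the email columns remain as ordinary columns.

```python
import pandas

df = pandas.DataFrame(
    {"Human-Readable ID": ["AAAA", "BBBB"],
     "FirstName": ["Eleanor", "Chidi"],
     "WorkEmail": [None, "chidi@work.example"]}  # hypothetical address
)
# Move the identifying columns into the (multi-level) row ID
df = df.set_index(["Human-Readable ID", "FirstName"])

remaining_columns = list(df.columns)
index_levels = list(df.index.names)
```

After this, a whole row can be looked up by its compound "row ID" tuple, which is exactly the trick the main script leans on later.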
“Stack” the email addresses
When Matthew first asked me about solving this problem, I said
Well, if I were trying to find duplicates in Excel, I suppose I’d do it this way:
- make 3 copies of the file
- delete 2 of the 3 email columns
- add a column indicating which email column remained
- cut/paste (concatenate) the rows of each of the 3 files one after the other into a big spreadsheet
- run “Find Duplicates” on that
Pandas’s stack
operation against the 3 column names we previously set aside in our emFieldNames
variable works something like those first 4 steps.
It produces a special type of list specific to Pandas called a “Series,” which we’ll set aside into a variable called allEmailsSeries
.
allEmailsSeries = df[emFieldNames].stack()
As you can see in the output of allEmailsSeries
below, it’s a little funny-looking. It’s almost like grouped cells, or cells with line breaks in the text, in Excel. (Only a lot easier to work with using code, as we’ll soon see.)
(row ID) | (this column has no name) |
---|---|
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) | AlternateEmail: [email protected] |
(BBBB // 003…01 // Chidi // Anagonye // Work) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(CCCC // 003…11 // Ellie // Shellstrop // Work) | WorkEmail: [email protected] |
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) | PersonalEmail: [email protected] |
(EEEE // 003…03 // Jason // Mendoza // Work) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) | PersonalEmail: [email protected] AlternateEmail: [email protected] |
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) | PersonalEmail: [email protected] AlternateEmail: [email protected] |
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(IIII // 003…07 // Chidi // Anagonye // Work) | WorkEmail: [email protected] |
(JJJJ // 003…08 // Mindy // St. Claire // Personal) | PersonalEmail: [email protected] |
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) | PersonalEmail: [email protected] |
(LLLL // 003…12 // Mindy // St. Claire // Personal) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(MMMM // 003…13 // Mindy // St. Claire // Work) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) | AlternateEmail: [email protected] |
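Here's a toy version (hypothetical addresses, two rows, two email columns) of that `stack()` call. The classic implementation silently drops empty cells and tacks the column name onto the end of each entry's row ID:

```python
import pandas

df = pandas.DataFrame(
    {"PersonalEmail": ["eleanor@p.example", None],
     "WorkEmail": [None, "chidi@w.example"]},
    index=["AAAA", "BBBB"],
)
# Fold the email columns into one long Series; each entry's row ID becomes
# (original row ID, column name)
stacked = df[["PersonalEmail", "WorkEmail"]].stack()

eleanor_cell = stacked.loc[("AAAA", "PersonalEmail")]
chidi_cell = stacked.loc[("BBBB", "WorkEmail")]
```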
Group allEmailsSeries
by the email addresses it contains
You’ll have to trust me, but the next line of code is “Python magic” that produces a list of email addresses found in allEmailsSeries
and, for each email address, details about the rows where it was found.
We’ll save that information aside in a variable called groupedEmailSeriesGroupItems
.
groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()
groupedEmailSeriesGroupItems
isn’t pretty, but here it is, just in case you’re curious.
Trust me – the computer can read it, even if it’s hard for us to read.
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[1, 8], [0, 6], [0, 0], [1, 1], [2, 2], [0, 1]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[1, 1], [0, 0], [0, 0], [1, 1], [2, 2], [1, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[0, 2, 3, 5], [9, 10, 1, 3], [1, 2, 1, 1], [4, 4, 4, 4], [0, 2, 1, 0], [2, 1, 0, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[5], [3], [1], [4], [0], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[6], [4], [3], [2], [0], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[4, 4, 4, 6], [2, 2, 2, 4], [4, 4, 4, 3], [3, 3, 3, 2], [2, 2, 2, 0], [0, 1, 2, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[10], [8], [5], [0], [1], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[11, 11], [11, 11], [6, 6], [5, 5], [1, 1], [1, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[9], [7], [6], [5], [1], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[11], [11], [6], [5], [1], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[12, 12, 12], [12, 12, 12], [6, 6, 6], [5, 5, 5], [2, 2, 2], [0, 1, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[7, 7, 13], [5, 5, 13], [7, 7, 7], [0, 0, 0], [0, 0, 0], [0, 2, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[7], [5], [7], [0], [0], [1]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
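If the `MultiIndex` dump above makes your eyes glaze over, here's the same "group a Series by its own values" trick on a toy Series (hypothetical addresses): the result maps each distinct value to the row IDs where it appeared.

```python
import pandas

s = pandas.Series(
    ["shellstrop@example.com", "anagonye@example.com", "shellstrop@example.com"],
    index=["AAAA", "BBBB", "CCCC"],
)
# Dict-like mapping: each value -> an Index of the row IDs containing it
groups = s.groupby(s).groups

shellstrop_rows = list(groups["shellstrop@example.com"])
anagonye_rows = list(groups["anagonye@example.com"])
```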
Filter groupedEmailSeriesGroupItems
into emsAndTheirCellsFound
Since groupedEmailSeriesGroupItems
was so terrible to read, we’re going to do more “Python magic” that will:
- Transform the “email addresses and the cells of
df
in which we found them” into something more legible
- Throw away any email addresses that only appeared in one cell of
df
We’ll save this new information into a “dictionary,” or “dict,” variable called emsAndTheirCellsFound
.
In this explanation, I’m still referring to “cells of df
,” but honestly, we’re never going to use df
again. We don’t need it.
Everything we need to know is in emsAndTheirCellsFound
, thanks to us changing the “row IDs” of df
to include the data we’d want later.
Avoiding the need to go back and consult df
for details like first name, last name, Salesforce ID, etc. helps our script run faster.
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}
emsAndTheirCellsFound
, printed to be more legible, looks like this:
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'PersonalEmail'), ('IIII', '003...07', 'Chidi', 'Anagonye', 'Work', 'WorkEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'WorkEmail'), ('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail'), ('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work', 'WorkEmail'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal', 'PersonalEmail'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'WorkEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'AlternateEmail'), ('GGGG', '003...05', 'Janet', 'Della-Denunzio', 'Alternate', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'WorkEmail'), ('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'PersonalEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'WorkEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'PersonalEmail'), ('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail'), ('NNNN', '003...14', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail')]
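The dict comprehension that built `emsAndTheirCellsFound` can look like line noise, so here's a stripped-down sketch of the same pattern on hypothetical input: keep only the keys whose value-lists have more than one entry.

```python
# Hypothetical grouped output: address -> list of cells where it was found
grouped_items = {
    "shellstrop@example.com": ["cellA", "cellC"],
    "mendoza@example.com": ["cellB"],
}.items()

# Same shape as the emsAndTheirCellsFound line: rebuild each value as a plain
# list, but only for addresses found in more than one cell
dupes_only = {em: [x for x in cells] for em, cells in grouped_items if len(cells) > 1}
```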
Create empty “dumping spots” output2aList
and output2bList
We create 2 empty “lists” as a place to store data we build inside the upcoming “loop” that we’ll want in our final output (each item in the “list” will become a row of our output CSV file).
output2aList = []
output2bList = []
“Loop” over the email addresses
Now we’ll “loop” over one “email address” (e.g. [email protected]
) at a time.
Any code “inside” this “loop” will be able to tell which “email address” we’re currently inspecting, because for the duration of a given “pass” of the loop, its value will be stored in the variable emAddr.
Also accessible to us will be a “list” of details about each “cell” of df
where that email address was found, accessible under the variable name foundCells
.
For example, while we look at [email protected]
, foundCells
will contain a 2-item list telling us that it was found at the intersection of LLLL
and WorkEmail
and at the intersection of LLLL
and AlternateEmail
(along with some other details about those “cells,” such as the FirstName, LastName, etc. for the row where LLLL
was the Human-Readable ID
).
for emAddr, foundCells in emsAndTheirCellsFound.items():
Inspect a given email address
First thing we’ll do the moment we pull up a given emAddr
and its foundCells
list is set up a blank place to store “per-email-address”-level information we decide we want to set aside, as well as take a snapshot of how many items are in foundCells.
seenFirstNamesAndTheirContacts = {}
numCellsFound = len(foundCells)
output2bFact = OrderedDict()
Check if there even are duplicate “found cells” for the email address
If numCellsFound
(the length of foundCells
) isn’t >1
, the email address isn’t duplicated at all in our Contacts CSV, which means we can skip right along to the next emAddr
in our “loop”:
if numCellsFound > 1:
“Loop” over the “cells where this email address was found” list
Presuming there actually are several “cells” of df
where an email address was found, now we need to loop over those cells.
Any code “inside” this “loop” will be able to tell which “cell” from foundCells
we’re currently inspecting, because for the duration of a given “pass” of the loop, its value will be stored in the variable cell
.
Also accessible to us will be a plain old number indicating which “pass” of this “inner loop” we’re on, stored in cellLoopCounter
. `cellLoopCounter` starts at 0, not at 1, on its first pass.
Our program can also still see the “outer loop” information like what the current contents of emAddr
are.
for cellLoopCounter, cell in enumerate(foundCells):
Ask: Is this “found cell
” the first time we’ve seen the FirstName
it contains?
cell
is a type of Python data called a “tuple” that looks something like ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail')
. You can fetch any one of those things between the commas by putting its “position number” right after the word cell
in square brackets … just remember that the first one is 0, not 1.
So, for example, when ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail')
is the current value of cell
, then cell[2]
is Jason
.
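In other words (using the same example tuple from the text):

```python
# Tuple positions count from 0, and [:-1] slices off the last item
cell = ("EEEE", "003...03", "Jason", "Mendoza", "Work", "PersonalEmail")

first_name = cell[2]       # the third item
contact_part = cell[:-1]   # everything except the email-column name
```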
We’re going to keep a running list of “the first time we’ve seen a given FirstName
for a given emAddr
.”
(Each time we pull up a new emAddr
with our “outer loop,” we set up a fresh, empty seenFirstNamesAndTheirContacts
to help us do just that.)
if cell[2] not in seenFirstNamesAndTheirContacts:
If we haven’t yet seen a FirstName
before (within a given email address), we set it aside in seenFirstNamesAndTheirContacts
along with a copy of cell
that contains all of its values except that last one (the one specifying which email address column the email address was found in). That’s what the [:-1]
means. In other words, we’re just grabbing the info from cell
that identifies the row or the person (Contact), not every last detail about where they had that email address.
seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
Furthermore, if it’s our first time seeing a FirstName
for a given e-mail address, we’ll check if it’s our first pass through the foundCells
for that e-mail address at all.
if cellLoopCounter == 0:
If it is, we’ll grab some details about that e-mail address and set them aside in an empty “dictionary” called output2bFact
that we created when we first pulled up the current emAddr
.
output2bFact['emAddr'] = emAddr
output2bFact['Master'] = cell[1]
output2bFact['Master_FN'] = cell[2]
output2bFact['Master_LN'] = cell[3]
If it’s our first time seeing a FirstName
for a given e-mail address, yet it’s not our first cell
for the e-mail address altogether, then it must be a duplicate record (e.g. Ellie
when we’ve already seen Eleanor
), so we’ll keep fleshing out output2bFact
with details about this new record. We’ll decide if it’s “duplicate 1,” “duplicate 2,” etc. by counting how many FirstName
s we’ve already seen for the current emAddr
and subtracting (the first one was “master” and sort of counts as “0,” but the count will be “1,” so now we have to subtract 1 to get our numbers right).
else:
uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
Ask: And if cell
is not the first time we’ve seen this FirstName
for an email address?
On the other hand, if we’ve seen this FirstName
before for this e-mail address, we’ll just create a copy of cell
that contains all of its values except that last one (the one specifying which email address column the email address was found in) and append it to the end of the list we’re keeping in seenFirstNamesAndTheirContacts
.
Subtle issue: if a given e-mail address is found in multiple columns for the same Contact (which could happen if we didn’t first run our “Output1,” fix the data with Data Loader, and re-fetch our Contacts), we will end up with redundant copies of the “Contact” data (e.g. ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work')
) stored in seenFirstNamesAndTheirContacts
.
That’s not a big deal, though, because Python is really good at detecting duplicates of that sort, and we’ll deduplicate for that later in our code.
seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
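About that promised later deduplication: one common way to do it (my sketch, not necessarily the author's exact cleanup code) is `dict.fromkeys()`, since dict keys are unique and, in modern Python, insertion-ordered.

```python
# Hypothetical list with an exact-duplicate contact tuple
contacts = [
    ("EEEE", "003...03", "Jason", "Mendoza", "Work"),
    ("EEEE", "003...03", "Jason", "Mendoza", "Work"),
    ("GGGG", "003...05", "Janet", "Della-Denunzio", "Alternate"),
]

# Duplicate keys collapse; first-seen order is preserved
deduped = list(dict.fromkeys(contacts))
```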
Clean up at the end of the last pass through the foundCells
for an emAddr
We have some code that only makes sense to run when we’ve finished looping through the last cell
in foundCells
for a given emAddr
.
if cellLoopCounter == numCellsFound-1:
For example, now that we know we’ve finished building output2bFact
(because we’re all the way through all the records for an emAddr
), we can check to see if it even had any “duplicates” by looking for “Duplicate_1”.
If such a property of output2bFact
exists, we can go ahead and dump output2bFact
into our “temporary holding list” output2bList
.
if 'Duplicate_1' in output2bFact:
output2bList.append(output2bFact)
output2bFact
, for example, looks something like this (I enhanced it for legibility) at this point for [email protected]
:
Key: emAddr
Value: [email protected]
Key: Master
Value: 003...10
Key: Master_FN
Value: Eleanor
Key: Master_LN
Value: Shellstrop
Key: Duplicate_1
Value: 003...11
Key: Duplicate_1_FN
Value: Ellie
Key: Duplicate_1_LN
Value: Shellstrop
Furthermore, it’s time to do that promised cleanup of seenFirstNamesAndTheirContacts
, the “dict” into which we’ve been dumping all our unique “first names seen” and, for each of them, just the “contact” part …
“Loop” over the “first names seen” for a given emAddr
for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
output2aFact = OrderedDict()
if len(contactsForFN) > 1:
for contactLoopCounter, contact in enumerate(contactsForFN):
if contactLoopCounter == 0:
output2aFact['Master'] = contact[1]
else:
output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
if 'Duplicate_1' in output2aFact:
output2aList.append(output2aFact)
I’ve run out of “table of contents” indents (I can only go 6 levels deep), so we’ll have to talk about the entirety of this block of code here.
Firstly, while we are still within the “loop” for a given email address, such as [email protected]
, we will “loop over” the “first names” contained in seenFirstNamesAndTheirContacts
, which looks something like this:
Key: 'Eleanor'
Value: [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate')]
Key: 'Ellie'
Value: [('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work')]
Here’s the line of code responsible for making us loop over 'Eleanor'
and 'Ellie'
within seenFirstNamesAndTheirContacts
:
for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
For each of those “loop passes” (that is, one for 'Eleanor'
and one for 'Ellie'
), the first thing we’ll do is set aside a place to store anything interesting we find:
output2aFact = OrderedDict()
Next, we’ll say that we really don’t care about data like the 'Ellie'
data, where there’s just one row in our dataset where the first name 'Ellie'
was found for the emAddr
'[email protected]'
, and care only about data like 'Eleanor'
if len(contactsForFN) > 1:
In such cases, we’ll say that we want to do another loop to inspect each and every “Contact” record found for a combination of “email address” & “first name” like [email protected]
and Eleanor
:
for contactLoopCounter, contact in enumerate(contactsForFN):
Each time we go through that “loop,” the value of contact
looks something like this:
('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate')
If we’re dealing with the first contact
for a given firstName
, we’ll set it aside in our empty “dictionary” called output2aFact
that we created when we first pulled up the current firstName
.
if contactLoopCounter == 0:
output2aFact['Master'] = contact[1]
Otherwise, the contact
in question must be a duplicate record for the current emAddr
and firstName
combination, so we’ll keep fleshing out output2aFact
with details about this contact
. We’ll decide if it’s “duplicate 1,” “duplicate 2,” etc. by counting how many contact
s we’ve already seen for the current emAddr
+firstName
(fortunately, since our “loop counter” contactLoopCounter
started counting with “Master” as “loop count 0,” we’re good to go to start w/ “1” at our first duplicate):
else:
output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
output2aFact
, for example, looks something like this (I enhanced it for legibility) at this point for [email protected]
+Eleanor
:
Key: Master
Value: 003...10
Key: Duplicate_1
Value: 003...02
Key: Duplicate_2
Value: 003...04
Now that we know we’ve finished building output2aFact
(because we’re all the way through all the records for a firstName
within an emAddr
), we can dummy-check whether it even had any “duplicates” by looking for “Duplicate_1”.
Since we didn’t throw away any data in the output2aFact
loop the way we did in the output2bFact
loop, we probably don’t actually need to do this…
If such a key exists in `output2aFact`, we can go ahead and dump `output2aFact` into our “temporary holding list” `output2aList`.
if 'Duplicate_1' in output2aFact:
output2aList.append(output2aFact)
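Pieced together, the inner loop above works like this end to end. The sample `contactsForFN` tuples below are made up for illustration, shaped like the `('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate')` tuple shown earlier:

```python
from collections import OrderedDict

# Hypothetical sample data: (sort key, Salesforce ID, first name, last name, email type)
contactsForFN = [
    ('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate'),
    ('AAAB', '003...02', 'Eleanor', 'Shellstrop', 'Personal'),
    ('AAAC', '003...04', 'Eleanor', 'Shellstrop', 'Work'),
]

output2aList = []
output2aFact = OrderedDict()
for contactLoopCounter, contact in enumerate(contactsForFN):
    if contactLoopCounter == 0:
        # The first record seen becomes the "Master"; enumerate starts at 0
        output2aFact['Master'] = contact[1]
    else:
        # Every later record becomes Duplicate_1, Duplicate_2, ...
        output2aFact['Duplicate_' + str(contactLoopCounter)] = contact[1]

# Only keep the group if at least one duplicate was actually found
if 'Duplicate_1' in output2aFact:
    output2aList.append(output2aFact)

print(output2aList)
# → [OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'), ('Duplicate_2', '003...04')])]
```

Because `enumerate` counts the “Master” record as 0, the counter value doubles as the duplicate number with no extra bookkeeping.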
Save “output2a.csv” and “output2b.csv”
Phew! All done with loops!
We now have some “holding data” that looks like this:
`output2aList`:
[OrderedDict([('Master', '003...01'), ('Duplicate_1', '003...07')]), OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'), ('Duplicate_2', '003...04')]), OrderedDict([('Master', '003...06'), ('Duplicate_1', '003...14')])]
`output2bList`:
[OrderedDict([('emAddr', '[email protected]'), ('Master', '003...10'), ('Master_FN', 'Eleanor'), ('Master_LN', 'Shellstrop'), ('Duplicate_1', '003...11'), ('Duplicate_1_FN', 'Ellie'), ('Duplicate_1_LN', 'Shellstrop')])]
We’ll dump them both to files on our hard drive, writing a plain `.txt` file with a message if there are no output rows, but writing a `.csv` file if there are.
In the case of outputting a CSV file, we’ll make it easy by converting `output2aList` into a Pandas “DataFrame” called `output2adf` (and doing the equivalent for `output2bList`), since Pandas “DataFrames” are so easy to convert to CSV.
The `index=False` setting means “don’t bother to write `pandas`’s internal row numbers as their own dedicated column in the output.”
The `quoting=1` setting means “put quotes around everything in the output CSV file, just in case our data itself contains commas.”
if len(output2aList) > 0:
output2adf = pandas.DataFrame(output2aList)
output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
with open(output2aNoneFoundFilePath, 'w') as f:
f.writelines('No records found')
if len(output2bList) > 0:
output2bdf = pandas.DataFrame(output2bList)
output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
with open(output2bNoneFoundFilePath, 'w') as f:
f.writelines('No records found')
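To see what `index=False` and `quoting=1` actually produce, you can call `to_csv()` with no file path, which makes `pandas` return the CSV text as a string instead of writing to disk (the sample rows here are illustrative):

```python
from collections import OrderedDict
import pandas

# Two rows shaped like output2aList; the first has no Duplicate_2
rows = [
    OrderedDict([('Master', '003...01'), ('Duplicate_1', '003...07')]),
    OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'),
                 ('Duplicate_2', '003...04')]),
]

output2adf = pandas.DataFrame(rows)
# No path argument -> to_csv returns the CSV as a string
csvText = output2adf.to_csv(index=False, quoting=1)
print(csvText)
```

Note that rows missing a key (like `Duplicate_2` in the first row) simply get an empty cell; `pandas` fills the gap with `NaN` and writes it out as an empty string.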
In this case, we do get output.
Here’s “output 2a”:
Master | Duplicate_1 | Duplicate_2 |
---|---|---|
003…01 | 003…07 | |
003…10 | 003…02 | 003…04 |
003…06 | 003…14 | |
And here’s “output 2b”:
emAddr | Master | Master_FN | Master_LN | Duplicate_1 | Duplicate_1_FN | Duplicate_1_LN |
---|---|---|---|---|---|---|
[email protected] | 003…10 | Eleanor | Shellstrop | 003…11 | Ellie | Shellstrop |
[email protected] | 003…03 | Jason | Mendoza | 003…05 | Janet | Della-Denunzio |
Dummy-checking the time taken
Finally, as a nice touch, we’ll put some text on our screen to announce that the code is done running and to say how long it took to run.
print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
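The timing pattern is just “subtract the start timestamp from the current one.” A minimal standalone sketch, with a `sleep` standing in for the real work:

```python
import time

start_time = time.time()  # clock started at the top of the script
time.sleep(0.05)          # stand-in for the real work
elapsed = round(time.time() - start_time, 2)
print('Done running script. Took ' + str(elapsed) + ' seconds.')
```

`time.time()` returns seconds since the Unix epoch as a float, so the difference is the elapsed wall-clock time; `round(..., 2)` trims it to two decimal places for display.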