
Salesforce, Python, SQL, & other ways to put your data where you need it -- a bilingual blog in English & French

Dedupe Salesforce NPSP Emails with Python

23 Nov 2018 🔖 salesforce python questions tutorials intermediate
💬 EN


Salesforce #AwesomeAdmin Matthew (@matthewsalibi) asked how he could use Python to deduplicate Salesforce NPSP email addresses across several fields. Today I’ll provide you a script you can copy, paste, and try on your own data. And for the geeks, I’ll break it down and explain how every line works.

Data & Expectations

Input Data

As a recap, Matthew proposed that we start with a CSV-formatted export of his “Contacts” table, “contacts.csv,” that looked something like this (I consolidated his examples and added a bit of complexity):

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com ca@hotmail.com
003…02 DDDD Eleanor Shellstrop Personal es@gp.com    
003…03 EEEE Jason Mendoza Work jm@gp.com jm@gp.com jm@gp.com
003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com   es@gp.com
003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com   jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate ta@gp.com ta@yahoo.com ta@gp.com
003…07 IIII Chidi Anagonye Work   ca@gp.com  
003…08 JJJJ Mindy St. Claire Personal ms@mp.com    
003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org    
003…10 AAAA Eleanor Shellstrop Alternate     es@gp.com
003…11 CCCC Ellie Shellstrop Work   es@gp.com  
003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com
003…13 MMMM Mindy St. Claire Work ms@z.com ms@z.com ms@z.com
003…14 NNNN Tahani Al-Jamil Alternate     ta@gp.com

Output File #1

Matthew wanted to see an “output1.csv” that would identify duplicates within rows and provide a “cleanup” file, ready for Data Loader (D.L. docs), that looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com  
003…03 EEEE Jason Mendoza Work   jm@gp.com  
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com
003…13 MMMM Mindy St. Claire Work   ms@z.com  

Output File #2A

To identify entire rows suspected of being redundant Contacts, Matthew proposed a “ready for DemandTools” file, “output2a.csv,” that would look like the data below.

This data is compatible with the “Select Master / Non-Master records from file” functionality of DemandTools’s “Single Table Dedupe” module.

Each row indicates a “same email address, same first name” pairing that was found in multiple Contact records (where the “email address” could have appeared in any of the “Email” fields that come with the NPSP).

Each cell is the Salesforce ID of a Contact in which the email address was found.

Master Duplicate_1 Duplicate_2
003…01 003…07  
003…10 003…02 003…04
003…06 003…14  

Output File #2B

For sets of Contacts that shared an email address but not a first name, Matthew wanted to see a bit more detail in the output file “output2b.csv”, to facilitate human review. He suggested it look like the output below.

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
es@gp.com 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
jm@gp.com 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

Running The Code Yourself

Unfortunately, I haven’t yet made a blog post about running Python code on your own computer, so if you’re totally new to Python, you’re a bit stuck – but show this code to a colleague who does know how!

Or, if you’re adventurous, download an “Anaconda suite” installer for your computer’s operating system and give it a try.

  • Don’t bother installing VSCode when it asks if you want to.
  • Once it’s done, bring up the “Anaconda Navigator” and open “Spyder.”
    • Type print('hello') in the code-editing panel at the left, hit “run” near the top (if a box pops up, hit “cancel” and do it again), and then check to see if hello is printed at the right-hand side of your screen in an output console (probably bottom right).
    • If so, congrats – you just wrote a program! Backspace out print('hello') and copy/paste my code in instead, then start tweaking it.

And stay tuned for my “office hours” and “hands-on trainings,” where I teach total non-programmers to run code like this.


The Code

There’s no shame in simply copying this code, pasting it into your Python “IDE,” and making a few tweaks!

You will, of course, have to do a bit of “Find & Replace” to fix up column-name references (which, by the way, are case-sensitive to match their capitalization in your input CSV file), if your columns aren’t precisely spelled “PersonalEmail,” “WorkEmail,” “Human-Readable ID,” etc.
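If you’d rather fix up the column names in code than in the CSV itself, pandas’s rename can do that “Find & Replace” for you. (The “before” column name below is made up for illustration – substitute whatever your export actually uses.)

```python
import pandas

# Hypothetical example: suppose your export calls a column
# "Personal Email Address" instead of the "PersonalEmail" this script expects.
df = pandas.DataFrame({'Personal Email Address': ['a@b.com']})

# rename() takes an {old_name: new_name} dict and leaves other columns alone.
df = df.rename(columns={'Personal Email Address': 'PersonalEmail'})
print(list(df.columns))  # ['PersonalEmail']
```

You’d do this right after the read_csv line, before any of the deduplication logic runs.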

Code to produce “Output1” only

import time
start_time = time.time()
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
# Note: DataFrame.lookup requires pandas older than 2.0 (it was removed in pandas 2.0)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Code to produce “Output2A” and “Output2B” only

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running


df = pandas.read_csv(inputFilePath, dtype=object)

df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
    
if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Code to produce all 3 output files at once

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

# ### Output1 portion starts here ###
df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
# Note: DataFrame.lookup requires pandas older than 2.0 (it was removed in pandas 2.0)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output1 portion ends here ###

# ### Output2 portion starts here ###
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
    
if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output2 portion ends here ###

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Stop reading here if you don’t want to learn Python

You don’t have to thoroughly understand the code above to make use of it.

Just be sure that your “output” files have different filenames from your “input” file, and you can’t do too much damage.

Worst-case scenario, you’ll freeze your computer running the program against too large of a file – but several million rows is nothing (a minute or few), so get up and stretch. It’s good for you!

For the curious, the rest of this post is total geekery peeking under the hood.


Code Explanations

“Output1” explanation

Import “time,” start the clock

First we import the time extension to Python and start a clock running (so we can tell, at the end, how long our script took to run).

import time
start_time = time.time()

Import “pandas,” set display options

Next we import the pandas (spreadsheet-processing) extension to Python and say that, should we display any of its data on our screen with a print(...) command (although in this script we don’t), we want to make use of our full screen width before trying to “wrap” excessive columns onto overflow lines.

import pandas
pandas.set_option('expand_frame_repr', False)

Specify “email address” column names

We’ll create a “variable” (nickname for use later in our code) called emFieldNames, and in it, we’ll store a list of the names of columns where email addresses can be found.

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']

Map out “alternate spellings” of email address fields

In a variable called pickToField, we’ll create a “dictionary,” or “dict,” that lists out the values we might find in our PreferredEmail column, and clarifies what “email address” column headers they each correspond to.

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
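To see what that dict does in practice, here’s a tiny sketch of Series.map() translating picklist values into column names (the 'Oops' value is a deliberate typo, to show what happens to unrecognized values):

```python
import pandas

# pickToField translates a PreferredEmail picklist value into the name
# of the column where that email address actually lives.
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

prefs = pandas.Series(['Work', 'Personal', 'Oops'])  # 'Oops' is a deliberate typo
mapped = prefs.map(pickToField)
print(mapped.tolist())  # unmatched values come out as NaN (missing)
```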

Specify where “contacts.csv” is

In the variable inputFilePath, we’ll specify where on our hard drive our file “contacts.csv” can be found.

The letter r before the open-quote tells Python that the \ characters in our code are meant to be interpreted literally, not as any sort of “special character.” (Usually, they’re special characters within Python text inside of quotes.)

inputFilePath = r'c:\example\contacts.csv'

If you’re on a Mac, the file path might look more like r'/Users/your_username/example/contacts.csv'
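If you’re curious why the r matters, here’s a quick demonstration of what happens without it:

```python
# In an ordinary string, "\n" is a special "newline" character.
# The r prefix turns that interpretation off, keeping backslashes literal.
plain = 'c:\new\notes'   # each "\n" here silently becomes a newline!
raw = r'c:\new\notes'    # the path stays exactly as typed

print(len(plain))  # 10 -- two characters were "eaten" by the escapes
print(len(raw))    # 12 -- all twelve characters survived
```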

Specify output files

In the variable output1FilePath, we’ll specify where to save our output file (if there is data), and in output1NoneFoundFilePath we’ll specify where to leave us a note saying there was no data.

output1FilePath = r'c:\example\output1.csv'
output1NoneFoundFilePath = r'c:\example\nonefound1.txt'

Read “contacts.csv” into pandas and call the table df

Now we’ll use pandas to read “contacts.csv” into Python.

We’ll call Python’s copy of the table df because it’s short for the jargon pandas uses to describe a table, which is “DataFrame.”

df = pandas.read_csv(inputFilePath, dtype=object)

Add a ChangedAnything column to df

Right away, we’ll add a column to our table df called ChangedAnything, with values of False all the way down (since we haven’t yet changed anything).

df['ChangedAnything'] = False

df looks like this now:

SalesforceID Human-Readable ID FirstName ChangedAnything
003…01 BBBB Chidi False
003…02 DDDD Eleanor False
003…03 EEEE Jason False
003…04 FFFF Eleanor False
003…05 GGGG Janet False
003…06 HHHH Tahani False
003…07 IIII Chidi False
003…08 JJJJ Mindy False
003…09 KKKK Kamilah False
003…10 AAAA Eleanor False
003…11 CCCC Ellie False
003…12 LLLL Mindy False
003…13 MMMM Mindy False
003…14 NNNN Tahani False

Grab df rows containing valid PreferredEmail values

Next, we’ll build a special type of list specific to Pandas called a “Series” that will indicate, for each record in PreferredEmail, whether the value found there is a legitimate name of a column in df (always good to check for typos).

We’ll set that aside in a variable named validPrefEmailIndicatorTFSeries.

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)

validPrefEmailIndicatorTFSeries looks like this:

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
Name: PreferredEmail, dtype: bool
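Here’s a miniature sketch of that .map().isin() combination catching a typo (the data below is made up; note the misspelled 'Personel'):

```python
import pandas

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

df = pandas.DataFrame({
    'PreferredEmail': ['Work', 'Personel', 'Alternate'],  # 'Personel' is a typo
    'PersonalEmail': ['a@x.com', 'b@x.com', 'c@x.com'],
    'WorkEmail': ['a@w.com', 'b@w.com', 'c@w.com'],
    'AlternateEmail': [None, None, 'c@a.com'],
})

# The typo maps to NaN, and NaN isn't among df's column names, so that
# row comes out False and gets skipped by the next step.
mask = df['PreferredEmail'].map(pickToField).isin(df.columns)
print(mask.tolist())  # [True, False, True]
```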

Add a PrefEmAddr column to df

This next line is … a bit of “Python magic” … but what it does is add a new PrefEmAddr column to our df table that contains the actual preferred email address of each Contact record.

df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

df now looks like this:

SalesforceID Human-Readable ID FirstName ChangedAnything PrefEmAddr
003…01 BBBB Chidi False ca@hotmail.com
003…02 DDDD Eleanor False es@gp.com
003…03 EEEE Jason False jm@gp.com
003…04 FFFF Eleanor False es@gp.com
003…05 GGGG Janet False jm@gp.com
003…06 HHHH Tahani False ta@gp.com
003…07 IIII Chidi False ca@gp.com
003…08 JJJJ Mindy False ms@mp.com
003…09 KKKK Kamilah False ka@kafoundation.org
003…10 AAAA Eleanor False es@gp.com
003…11 CCCC Ellie False es@gp.com
003…12 LLLL Mindy False ms@yahoo.com
003…13 MMMM Mindy False ms@z.com
003…14 NNNN Tahani False ta@gp.com
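A heads-up if you’re on a newer pandas: DataFrame.lookup was removed in pandas 2.0, so the “magic” line above will error there. Here’s a sketch of an equivalent row-by-row approach (with made-up two-row data) that works on any pandas version:

```python
import pandas

df = pandas.DataFrame({
    'PreferredEmail': ['Work', 'Personal'],
    'PersonalEmail': ['p1@x.com', 'p2@x.com'],
    'WorkEmail': ['w1@x.com', 'w2@x.com'],
})
pickToField = {'Personal': 'PersonalEmail', 'Work': 'WorkEmail'}

# Row-wise equivalent of df.lookup(): for each row, read the cell in
# the column named by that row's mapped PreferredEmail value.
cols = df['PreferredEmail'].map(pickToField)
df['PrefEmAddr'] = [df.at[i, c] for i, c in cols.items()]
print(df['PrefEmAddr'].tolist())  # ['w1@x.com', 'p2@x.com']
```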

“Loop” over the email addresses

Now we’ll “loop” over one “email address” column name at a time (as listed out in emFieldNames).

Any code “inside” this “loop” will be able to tell which “column name” we’re currently inspecting, because for the duration of a given “pass” of the loop, an appropriate value ('PersonalEmail', 'WorkEmail', or 'AlternateEmail') will be stored in the variable fieldName.

for fieldName in emFieldNames:

Select the cells that need to be cleared out

First thing we’ll do within each “loop” is build a special list (again, a Pandas “Series“) of whether or not (True or False) we find each row of the df table “interesting.” We’ll save that list into a variable called needsClearingTFSer.

In this case, “interesting” means that for a given row of our df table, its value in the PreferredEmail column is not the column we’re currently “looping” over (as specified in fieldName), but that the email address found in the actual column specified by the current value of fieldName is the spelled exactly the same as the email address found in the same row’s PrefEmAddr field that we built earlier.

In other words, “It looks the same as the preferred email address, but it isn’t the preferred email address.”

In other other words, “Is it a redundant copy of the email address?”

needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])

During the 'PersonalEmail' pass of our loop, for example, needsClearingTFSer looks like this:

0     False
1     False
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
dtype: bool

Clear out cells that had redundant values

Next we’ll clear out any values found in cells at the intersections of the rows indicated by True values in needsClearingTFSer and the column indicated by fieldName for our current “loop pass.”

We’ll also set ChangedAnything for a given row to True if we … well … changed anything.

    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

After the entire “loop” has run through all 3 email address columns, our table df looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail ChangedAnything PrefEmAddr
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com   True ca@hotmail.com
003…02 DDDD Eleanor Shellstrop Personal es@gp.com     False es@gp.com
003…03 EEEE Jason Mendoza Work   jm@gp.com   True jm@gp.com
003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com   es@gp.com False es@gp.com
003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com   jm@gp.com False jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com True ta@gp.com
003…07 IIII Chidi Anagonye Work   ca@gp.com   False ca@gp.com
003…08 JJJJ Mindy St. Claire Personal ms@mp.com     False ms@mp.com
003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org     False ka@kafoundation.org
003…10 AAAA Eleanor Shellstrop Alternate     es@gp.com False es@gp.com
003…11 CCCC Ellie Shellstrop Work   es@gp.com   False es@gp.com
003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com False ms@yahoo.com
003…13 MMMM Mindy St. Claire Work   ms@z.com   True ms@z.com
003…14 NNNN Tahani Al-Jamil Alternate     ta@gp.com False ta@gp.com

Only show changed lines

Matthew asked that we only show lines where we changed something, so we’ll create a new copy of df and call it output1df. This copy will be filtered to only contain rows from df where ChangedAnything was True.

output1df = df[df['ChangedAnything']]

output1df looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail ChangedAnything PrefEmAddr
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com   True ca@hotmail.com
003…03 EEEE Jason Mendoza Work   jm@gp.com   True jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com True ta@gp.com
003…13 MMMM Mindy St. Claire Work   ms@z.com   True ms@z.com
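That df[df['ChangedAnything']] pattern is pandas’s “boolean mask” filtering, and it’s worth knowing on its own. A tiny sketch with made-up data:

```python
import pandas

df = pandas.DataFrame({'name': ['a', 'b', 'c'],
                       'ChangedAnything': [True, False, True]})

# Passing a True/False column inside the square brackets keeps
# only the rows where the value is True.
changed = df[df['ChangedAnything']]
print(changed['name'].tolist())  # ['a', 'c']
```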

Delete the PrefEmAddr & ChangedAnything columns

We don’t need our PrefEmAddr and ChangedAnything columns anymore, so let’s get rid of them:

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

Now our output1df looks the way Matthew asked for it to look:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com  
003…03 EEEE Jason Mendoza Work   jm@gp.com  
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com
003…13 MMMM Mindy St. Claire Work   ms@z.com  

Save “output1.csv”

Finally, we’ll save output1df to our hard drive (or a “no records found” message instead) with the following code.

The index=False setting means “don’t bother to write pandas’s internal row numbers as their own dedicated column in the output.”

The quoting=1 means “put quotes around everything in the output CSV file, just in case our data itself contains commas.”

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
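Incidentally, quoting=1 is just the numeric value of the csv module’s QUOTE_ALL constant; spelling it out makes the intent clearer. Here’s a sketch (with made-up data) writing to an in-memory buffer so you can see the quoting in action:

```python
import csv
import io
import pandas

df = pandas.DataFrame({'LastName': ['St. Claire'], 'City': ['Phoenix, AZ']})

buf = io.StringIO()  # an in-memory "file," so nothing hits the hard drive
df.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)  # quoting=1 means the same thing
print(buf.getvalue())
# "LastName","City"
# "St. Claire","Phoenix, AZ"
```

Because every value is quoted, the comma inside “Phoenix, AZ” can’t be mistaken for a column separator.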

Dummy-checking the time taken

Finally, as a nice touch, we’ll put some text on our screen to announce that the code is done running and to say how long it took to run.

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

“Output2” explanation

Import “time,” start the clock

First we import the time extension to Python and start a clock running (so we can tell, at the end, how long our script took to run).

import time
start_time = time.time()

Import “OrderedDict”

To ensure that we get the right Salesforce Contact IDs assigned to “Master” vs. “Duplicate 1” columns, we’ll need to ensure that we keep track of certain discoveries we make in a specific sequence.

Python’s OrderedDict extension will let us do this.

from collections import OrderedDict
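(In Python 3.7 and later, plain dicts also preserve insertion order, so OrderedDict is mostly about making that intent explicit.) A quick demonstration of the order-keeping we’re relying on:

```python
from collections import OrderedDict

# Keys come back in exactly the order we inserted them --
# which is what keeps "Master" ahead of the "Duplicate" columns.
fact = OrderedDict()
fact['Master'] = '003...01'
fact['Duplicate_1'] = '003...07'
print(list(fact.keys()))  # ['Master', 'Duplicate_1']
```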

Import “pandas,” set display options

Next we import the pandas (spreadsheet-processing) extension to Python and say that, should we display any of its data on our screen with a print(...) command (although in this script we don’t), we want to make use of our full screen width before trying to “wrap” excessive columns onto overflow lines.

import pandas
pandas.set_option('expand_frame_repr', False)

Specify “email address” column names

We’ll create a “variable” (nickname for use later in our code) called emFieldNames, and in it, we’ll store a list of the names of columns where email addresses can be found.

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']

Map out “alternate spellings” of email address fields

In a variable called pickToField, we’ll create a “dictionary,” or “dict,” that lists out the values we might find in our PreferredEmail column, and clarifies what “email address” column headers they each correspond to.

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

Specify where “contacts.csv” is

In the variable inputFilePath, we’ll specify where on our hard drive our file “contacts.csv” can be found.

inputFilePath = r'c:\example\contacts.csv'

If you’re on a Mac, the file path might look more like r'/Users/your_username/example/contacts.csv'

Specify output files

In the variables output2aFilePath and output2bFilePath, we’ll specify where to save our output files (if there is data), and in output2aNoneFoundFilePath and output2bNoneFoundFilePath we’ll specify where to leave us a note saying there was no data.

output2aFilePath = r'c:\example\output2a.csv'
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt'
output2bFilePath = r'c:\example\output2b.csv'
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt'

Read “contacts.csv” into pandas and call the table df

Now we’ll use pandas to read “contacts.csv” into Python.

We’ll call Python’s copy of the table df because it’s short for the jargon pandas uses to describe a table, which is “DataFrame.”

df = pandas.read_csv(inputFilePath, dtype=object)

This time, when I show you what df looks like, as below, I’m going to include pandas’s internal “row IDs,” because we’re going to do some trickery by changing them. I want you to be able to watch them change. Upon importing df into Python from our CSV file, it looks like this:

(row ID) SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
(0) 003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com ca@hotmail.com
(1) 003…02 DDDD Eleanor Shellstrop Personal es@gp.com    
(2) 003…03 EEEE Jason Mendoza Work jm@gp.com jm@gp.com jm@gp.com
(3) 003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com   es@gp.com
(4) 003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com   jm@gp.com
(5) 003…06 HHHH Tahani Al-Jamil Alternate ta@gp.com ta@yahoo.com ta@gp.com
(6) 003…07 IIII Chidi Anagonye Work   ca@gp.com  
(7) 003…08 JJJJ Mindy St. Claire Personal ms@mp.com    
(8) 003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org    
(9) 003…10 AAAA Eleanor Shellstrop Alternate     es@gp.com
(10) 003…11 CCCC Ellie Shellstrop Work   es@gp.com  
(11) 003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com
(12) 003…13 MMMM Mindy St. Claire Work ms@z.com ms@z.com ms@z.com
(13) 003…14 NNNN Tahani Al-Jamil Alternate     ta@gp.com

Sort df on Human-Readable ID

Matthew specified that the alphabetically earliest Human-Readable ID value for a given “grouping” of “likely duplicates” should be the “Master” for that grouping.

Since a lot of the tricks we’ll play in this code read things in the order we started with them in df, we’ll just re-sort df. Notice that at this point, the “row IDs” start with 9, then go to 0, then 10, then 1, etc. We’ll fix that next.

df = df.sort_values('Human-Readable ID')

(row ID) SalesforceID Human-Readable ID FirstName
(9) 003…10 AAAA Eleanor
(0) 003…01 BBBB Chidi
(10) 003…11 CCCC Ellie
(1) 003…02 DDDD Eleanor
(2) 003…03 EEEE Jason
(3) 003…04 FFFF Eleanor
(4) 003…05 GGGG Janet
(5) 003…06 HHHH Tahani
(6) 003…07 IIII Chidi
(7) 003…08 JJJJ Mindy
(8) 003…09 KKKK Kamilah
(11) 003…12 LLLL Mindy
(12) 003…13 MMMM Mindy
(13) 003…14 NNNN Tahani

Re-set “row IDs” to start at 0

Told you we’d fix the row IDs.

df = df.reset_index(drop=True)

(row ID) SalesforceID Human-Readable ID FirstName
(0) 003…10 AAAA Eleanor
(1) 003…01 BBBB Chidi
(2) 003…11 CCCC Ellie
(3) 003…02 DDDD Eleanor
(4) 003…03 EEEE Jason
(5) 003…04 FFFF Eleanor
(6) 003…05 GGGG Janet
(7) 003…06 HHHH Tahani
(8) 003…07 IIII Chidi
(9) 003…08 JJJJ Mindy
(10) 003…09 KKKK Kamilah
(11) 003…12 LLLL Mindy
(12) 003…13 MMMM Mindy
(13) 003…14 NNNN Tahani

Move all data except email addresses into the “row IDs.”

Now we’re going to play a nifty trick: we’re going to move any data, except the email addresses themselves, that we’ll need when building our output files, into the “row IDs.”

The order in which we list the fields that should move into “row ID” matters.

We need to make sure that the first item listed when we do this is something that uniquely identifies each row (if we couldn’t find such a piece of data, we’d want to include the old numeric “row ID” as the first item in this list – but I’m not going to show that code right now). That will help ensure that when we do a sort of “pivot”-like operation, we know that any “row ID” we find is truly the ID of a row, and not of a group of rows.

Just … trust me. 🤷

df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

Our df table now looks something like this (the row IDs don’t literally look like that with the // characters, but it’s a decent representation of what’s going on).

(row ID) PersonalEmail WorkEmail AlternateEmail
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate)     es@gp.com
(BBBB // 003…01 // Chidi // Anagonye // Work) ca@gp.com ca@hotmail.com ca@hotmail.com
(CCCC // 003…11 // Ellie // Shellstrop // Work)   es@gp.com  
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) es@gp.com    
(EEEE // 003…03 // Jason // Mendoza // Work) jm@gp.com jm@gp.com jm@gp.com
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) es@hotmail.com   es@gp.com
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) jd@gp.com   jm@gp.com
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) ta@gp.com ta@yahoo.com ta@gp.com
(IIII // 003…07 // Chidi // Anagonye // Work)   ca@gp.com  
(JJJJ // 003…08 // Mindy // St. Claire // Personal) ms@mp.com    
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) ka@kafoundation.org    
(LLLL // 003…12 // Mindy // St. Claire // Personal) ms@yahoo.com ms@hotmail.com ms@hotmail.com
(MMMM // 003…13 // Mindy // St. Claire // Work) ms@z.com ms@z.com ms@z.com
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate)     ta@gp.com
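If you’d like to see the “move data into the row IDs” trick in miniature, here’s a sketch with a tiny made-up table. The column names mirror the real ones, but the values are invented:

```python
import pandas as pd

# A tiny made-up stand-in for df
df = pd.DataFrame({
    'Human-Readable ID': ['AAAA', 'BBBB'],
    'SalesforceID': ['003...10', '003...01'],
    'PersonalEmail': [None, 'ca@gp.com'],
})

# Move everything except the email column into the "row IDs"
df = df.set_index(['Human-Readable ID', 'SalesforceID'])

# Each "row ID" is now a tuple of the moved-in values
print(df.index.tolist())
print(list(df.columns))
```

Those tuples are what the `//`-separated “row IDs” in the table above are standing in for.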

“Stack” the email addresses

When Matthew first asked me about solving this problem, I said

Well, if I were trying to find duplicates in Excel, I suppose I’d do it this way:

  1. make 3 copies of the file
  2. delete 2 of the 3 email columns
  3. add a column indicating which email column remained
  4. cut/paste (concatenate) the rows of each of the 3 files one after the other into a big spreadsheet
  5. run “Find Duplicates” on that

Pandas’s stack operation against the 3 column names we previously set aside in our emFieldNames variable works something like those first 4 steps.

It produces a special type of list specific to Pandas called a “Series,” which we’ll set aside into a variable called allEmailsSeries.

allEmailsSeries = df[emFieldNames].stack()

As you can see in the output of allEmailsSeries below, it’s a little funny-looking. It’s almost like grouped cells, or cells with line breaks in the text, in Excel. (Only a lot easier to work with using code, as we’ll soon see.)

(row ID) (this column has no name)
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) AlternateEmail: es@gp.com
(BBBB // 003…01 // Chidi // Anagonye // Work) PersonalEmail: ca@gp.com
WorkEmail: ca@hotmail.com
AlternateEmail: ca@hotmail.com
(CCCC // 003…11 // Ellie // Shellstrop // Work) WorkEmail: es@gp.com
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) PersonalEmail: es@gp.com
(EEEE // 003…03 // Jason // Mendoza // Work) PersonalEmail: jm@gp.com
WorkEmail: jm@gp.com
AlternateEmail: jm@gp.com
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) PersonalEmail: es@hotmail.com
AlternateEmail: es@gp.com
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) PersonalEmail: jd@gp.com
AlternateEmail: jm@gp.com
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) PersonalEmail: ta@gp.com
WorkEmail: ta@yahoo.com
AlternateEmail: ta@gp.com
(IIII // 003…07 // Chidi // Anagonye // Work) WorkEmail: ca@gp.com
(JJJJ // 003…08 // Mindy // St. Claire // Personal) PersonalEmail: ms@mp.com
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) PersonalEmail: ka@kafoundation.org
(LLLL // 003…12 // Mindy // St. Claire // Personal) PersonalEmail: ms@yahoo.com
WorkEmail: ms@hotmail.com
AlternateEmail: ms@hotmail.com
(MMMM // 003…13 // Mindy // St. Claire // Work) PersonalEmail: ms@z.com
WorkEmail: ms@z.com
AlternateEmail: ms@z.com
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) AlternateEmail: ta@gp.com
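Here’s the stack trick in miniature, on a two-row made-up table, so you can see that empty cells simply vanish from the result:

```python
import pandas as pd

emFieldNames = ['PersonalEmail', 'WorkEmail', 'AlternateEmail']

# A two-row made-up stand-in for df, with some empty cells
df = pd.DataFrame(
    {'PersonalEmail': ['ca@gp.com', None],
     'WorkEmail': ['ca@hotmail.com', 'es@gp.com'],
     'AlternateEmail': [None, None]},
    index=['BBBB', 'CCCC'])

allEmailsSeries = df[emFieldNames].stack()

# Empty cells are dropped; every remaining value is keyed by (row ID, column name)
print(allEmailsSeries)
```

That (row ID, column name) pairing is exactly what makes the “funny-looking” grouped output above possible.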

Group allEmailsSeries by the email addresses it contains

You’ll have to trust me, but the next line of code is “Python magic” that produces a list of email addresses found in allEmailsSeries and, for each email address, details about the rows where it was found.

We’ll save that information aside in a variable called groupedEmailSeriesGroupItems.

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

groupedEmailSeriesGroupItems isn’t pretty, but here it is, just in case you’re curious.

Trust me – the computer can read it, even if it’s hard for us to read.

Group key:   ca@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[1, 8], [0, 6], [0, 0], [1, 1], [2, 2], [0, 1]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ca@hotmail.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[1, 1], [0, 0], [0, 0], [1, 1], [2, 2], [1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   es@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[0, 2, 3, 5], [9, 10, 1, 3], [1, 2, 1, 1], [4, 4, 4, 4], [0, 2, 1, 0], [2, 1, 0, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   es@hotmail.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[5], [3], [1], [4], [0], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   jd@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[6], [4], [3], [2], [0], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   jm@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[4, 4, 4, 6], [2, 2, 2, 4], [4, 4, 4, 3], [3, 3, 3, 2], [2, 2, 2, 0], [0, 1, 2, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ka@kafoundation.org
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[10], [8], [5], [0], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@hotmail.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[11, 11], [11, 11], [6, 6], [5, 5], [1, 1], [1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@mp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[9], [7], [6], [5], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@yahoo.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[11], [11], [6], [5], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@z.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[12, 12, 12], [12, 12, 12], [6, 6, 6], [5, 5, 5], [2, 2, 2], [0, 1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ta@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[7, 7, 13], [5, 5, 13], [7, 7, 7], [0, 0, 0], [0, 0, 0], [0, 2, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ta@yahoo.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[7], [5], [7], [0], [0], [1]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
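If you want to see the groupby-a-Series-by-its-own-values trick on something tiny, here’s a sketch with a made-up three-cell Series:

```python
import pandas as pd

# A made-up three-cell Series: row IDs r1..r3, two of which share an email
s = pd.Series(['a@x.com', 'b@x.com', 'a@x.com'], index=['r1', 'r2', 'r3'])

# Group the Series by its own values: each email maps to the row IDs that hold it
groups = s.groupby(s).groups

print(dict(groups))
```

With our real data, those “row IDs” are the big tuples we built with set_index, which is why the group values above print as MultiIndex objects.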

Filter groupedEmailSeriesGroupItems into emsAndTheirCellsFound

Since groupedEmailSeriesGroupItems was so terrible to read, we’re going to do more “Python magic” that will:

  1. Transform the “email addresses and the cells of df in which we found them” into something more legible
  2. Throw away any email addresses that only appeared in one cell of df

We’ll save this new information into a “dictionary,” or “dict,” variable called emsAndTheirCellsFound.

In this explanation, I’m still referring to “cells of df,” but honestly, we’re never going to use df again. We don’t need it.

Everything we need to know is in emsAndTheirCellsFound, thanks to us changing the “row IDs” of df to include the data we’d want later.

Avoiding the need to go back and consult df for details like first name, last name, Salesforce ID, etc. helps our script run faster.

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

emsAndTheirCellsFound, printed to be more legible, looks like this:

Group key we'll soon call "emAddr" while looping through these groups:
         ca@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'PersonalEmail'), ('IIII', '003...07', 'Chidi', 'Anagonye', 'Work', 'WorkEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ca@hotmail.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'WorkEmail'), ('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         es@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail'), ('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work', 'WorkEmail'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal', 'PersonalEmail'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         jm@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'WorkEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'AlternateEmail'), ('GGGG', '003...05', 'Janet', 'Della-Denunzio', 'Alternate', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ms@hotmail.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'WorkEmail'), ('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ms@z.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'PersonalEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'WorkEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ta@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'PersonalEmail'), ('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail'), ('NNNN', '003...14', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail')]
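The dict comprehension itself is easier to follow on made-up data; the `if len(indices) > 1` clause is what throws away the one-cell addresses:

```python
# Made-up stand-in for groupby(...).groups.items(): email -> cells where it appears
groupedItems = {
    'ca@gp.com': [('BBBB', 'PersonalEmail'), ('IIII', 'WorkEmail')],
    'jd@gp.com': [('GGGG', 'PersonalEmail')],
}.items()

# Keep only addresses found in more than one cell; convert each group to a plain list
emsAndTheirCellsFound = {emaddr: [x for x in indices]
                         for emaddr, indices in groupedItems
                         if len(indices) > 1}

print(emsAndTheirCellsFound)
```

Note how jd@gp.com, which only appeared once, is gone from the result.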

Create empty “dumping spots” output2aList and output2bList

We create 2 empty “lists” as a place to store data we build inside the upcoming “loop” that we’ll want in our final output (each item in the “list” will become a row of our output CSV file).

output2aList = []
output2bList = []

“Loop” over the email addresses

Now we’ll “loop” over one “email address” (e.g. ca@gp.com) at a time.

Any code “inside” this “loop” will be able to tell which “email address” we’re currently inspecting, because for the duration of a given “pass” of the loop, its value will be stored in the variable emAddr.

Also accessible to us will be a “list” of details about each “cell” of df where that email address was found, accessible under the variable name foundCells.

For example, while we look at ms@hotmail.com, foundCells will contain a 2-item list, telling us that it was found at the intersection of LLLL and WorkEmail and at the intersection of LLLL and AlternateEmail (along with some other details about those “cells,” such as the FirstName, LastName, etc. for the row where LLLL was the Human-Readable ID).

for emAddr, foundCells in emsAndTheirCellsFound.items():
Inspect a given email address

First thing we’ll do the moment we pull up a given emAddr and its foundCells list is set up a blank place to store “per-email-address”-level information we decide we want to set aside, as well as take a snapshot of how many items are in foundCells.

    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
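(OrderedDict comes from the Python standard library’s collections module, presumably imported near the top of the full script, outside the lines shown here. The point of using it instead of a plain dict is that it remembers the order keys were first added, which keeps our eventual CSV columns in a sensible order:)

```python
from collections import OrderedDict

# An OrderedDict remembers the order its keys were first added, which keeps
# the eventual CSV columns in the order we build them (Master before Duplicate_1)
fact = OrderedDict()
fact['Master'] = '003...10'
fact['Duplicate_1'] = '003...02'

print(list(fact.keys()))
```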
Check if there even are duplicate “found cells” for the email address

If numCellsFound (the length of foundCells) isn’t >1, the email address isn’t duplicated at all in our Contacts CSV, which means we can skip right along to the next emAddr in our “loop.” (Strictly speaking, the filter we applied while building emsAndTheirCellsFound already threw those out, so this check is just a belt-and-suspenders safeguard.)

    if numCellsFound > 1:
“Loop” over the “cells where this email address was found” list

Presuming there actually are several “cells” of df where an email address was found, now we need to loop over those cells.

Any code “inside” this “loop” will be able to tell which “cell” from foundCells we’re currently inspecting, because for the duration of a given “pass” of the loop, its value will be stored in the variable cell.

Also accessible to us will be a plain old number indicating which “pass” of this “inner loop” we’re on, stored in cellLoopCounter. cellLoopCounter starts at 0, not at 1, on its first pass.

Our program can also still see the “outer loop” information like what the current contents of emAddr are.

        for cellLoopCounter, cell in enumerate(foundCells):
Ask: Is this “found cell” the first time we’ve seen the FirstName it contains?

cell is a type of Python data called a “tuple” that looks something like ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'). You can fetch any one of those things between the commas by putting its “position number” right after the word cell in square brackets … just remember that the first one is 0, not 1.

So, for example, when ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail') is the current value of cell, then cell[2] is Jason.
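Here’s that indexing, plus the [:-1] slicing we’re about to rely on, demonstrated on a hypothetical cell tuple:

```python
# A hypothetical "cell" tuple like the ones stored in foundCells
cell = ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail')

print(cell[2])    # position 2 (counting from 0) is the FirstName
print(cell[:-1])  # everything but the last item: the "who," not the "which column"
```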

We’re going to keep a running list of “the first time we’ve seen a given FirstName” for a given emAddr.

(Each time we pull up a new emAddr with our “outer loop,” we set up a fresh, empty seenFirstNamesAndTheirContacts to help us do just that.)

            if cell[2] not in seenFirstNamesAndTheirContacts:

If we haven’t yet seen a FirstName before (within a given email address), we set it aside in seenFirstNamesAndTheirContacts along with a copy of cell that contains all of its values except that last one (the one specifying which email address column the email address was found in). That’s what the [:-1] means. In other words, we’re just grabbing the info from cell that identifies the row or the person (Contact), not every last detail about where they had that email address.

                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]

Furthermore, if it’s our first time seeing a FirstName for a given e-mail address, we’ll check if it’s our first pass through the foundCells for that e-mail address at all.

                if cellLoopCounter == 0:

If it is, we’ll grab some details about that e-mail address and set them aside in an empty “dictionary” called output2bFact that we created when we first pulled up the current emAddr.

                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]

If it’s our first time seeing a FirstName for a given email address, yet it’s not our first cell for the email address altogether, then it must be a duplicate record (e.g. Ellie when we’ve already seen Eleanor), so we’ll keep fleshing out output2bFact with details about this new record. We’ll decide whether it’s “duplicate 1,” “duplicate 2,” etc. by counting how many FirstNames we’ve already seen for the current emAddr and subtracting 1 (the first one was the “master,” so it doesn’t count as a duplicate).

                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str] = cell[1]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_LN'] = cell[3]
Ask: And if cell is not the first time we’ve seen this FirstName for an email address?

On the other hand, if we’ve seen this FirstName before for this e-mail address, we’ll just create a copy of cell that contains all of its values except that last one (the one specifying which email address column the email address was found in) and append it to the end of the list we’re keeping in seenFirstNamesAndTheirContacts.

Subtle issue: if a given e-mail address is found in multiple columns for the same Contact (which could happen if we didn’t first run our “Output1,” fix the data with Data Loader, and re-fetch our Contacts), we will end up with redundant copies of the “Contact” data (e.g. ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work')) stored in seenFirstNamesAndTheirContacts.

That’s not a big deal, though, because Python is really good at detecting duplicates of that sort, and we’ll deduplicate for that later in our code.

            else:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
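The create-or-append pattern, in miniature, with made-up three-field tuples (the real cells have six fields):

```python
# Made-up cells: the same person's email found in two different columns
cells = [('EEEE', 'Jason', 'PersonalEmail'),
         ('EEEE', 'Jason', 'WorkEmail')]

seen = {}
for cell in cells:
    if cell[1] not in seen:
        seen[cell[1]] = [cell[:-1]]        # first sighting: start a new list
    else:
        seen[cell[1]].append(cell[:-1])    # repeat sighting: append to it

print(seen)
```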
Clean up at the end of the last pass through the foundCells for an emAddr

We have some code that only makes sense to run when we’ve finished looping through the last cell in foundCells for a given emAddr.

            if cellLoopCounter == numCellsFound-1:

For example, now that we know we’ve finished building output2bFact (because we’re all the way through all the records for an emAddr), we can check to see if it even had any “duplicates” by looking for “Duplicate_1”.

If such a property of output2bFact exists, we can go ahead and dump output2bFact into our “temporary holding list” output2bList.

                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)

output2bFact, for example, looks something like this (I enhanced it for legibility) at this point for es@gp.com:

Key:   emAddr
Value:   es@gp.com

Key:   Master
Value:   003...10

Key:   Master_FN
Value:   Eleanor

Key:   Master_LN
Value:   Shellstrop

Key:   Duplicate_1
Value:   003...11

Key:   Duplicate_1_FN
Value:   Ellie

Key:   Duplicate_1_LN
Value:   Shellstrop

Furthermore, it’s time to do that promised cleanup of seenFirstNamesAndTheirContacts, where we’ve been dumping a “dict” that contains all our unique “first names seen” and, for each of them, just the “contact” part …

“Loop” over the “first names seen” for a given emAddr
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

I’ve run out of “table of contents” indents (I can only go 6 levels deep), so we’ll have to talk about the entirety of this block of code here.

Firstly, while we are still within the “loop” for a given email address, such as es@gp.com, we will “loop over” the “first names” contained in seenFirstNamesAndTheirContacts, which looks something like this:

Key:  'Eleanor'
Value:  [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate')]

Key:  'Ellie'
Value:  [('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work')]

Here’s the line of code responsible for making us loop over 'Eleanor' and 'Ellie' within seenFirstNamesAndTheirContacts:

                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():

For each of those “loop passes” (that is, one for 'Eleanor' and one for 'Ellie'), the first thing we’ll do is set aside a place to store anything interesting we find:

                    output2aFact = OrderedDict()

Next, we’ll say that we don’t care about data like the 'Ellie' data, where there’s just one row in our dataset where the first name 'Ellie' was found for the emAddr 'es@gp.com', and that we only care about data like 'Eleanor', where the same first name turned up on several rows:

                    if len(contactsForFN) > 1:

In such cases, we’ll say that we want to do another loop to inspect each and every “Contact” record found for a combination of “email address” & “first name” like es@gp.com and Eleanor:

                        for contactLoopCounter, contact in enumerate(contactsForFN):

Each time we go through that “loop,” the value of contact looks something like this:

('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate')

If we’re dealing with the first contact for a given firstName, we’ll set it aside in the empty “dictionary” called output2aFact that we created when we first pulled up the current firstName.

                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]

Otherwise, the contact in question must be a duplicate record for the current emAddr and firstName combination, so we’ll keep fleshing out output2aFact with details about this contact. We’ll decide if it’s “duplicate 1,” “duplicate 2,” etc. by counting how many contacts we’ve already seen for the current emAddr+firstName (fortunately, since our “loop counter” contactLoopCounter started counting with “Master” as “loop count 0,” we’re good to go to start w/ “1” at our first duplicate):

                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]

output2aFact, for example, looks something like this (I enhanced it for legibility) at this point for es@gp.com+Eleanor:

Key:  Master
Value:  003...10

Key:  Duplicate_1
Value:  003...02

Key:  Duplicate_2
Value:  003...04

Now that we know we’ve finished building output2aFact (because we’re all the way through all the records for a firstName within an emAddr), we can dummy-check whether it actually has any “duplicates” by looking for “Duplicate_1”.

Since we didn’t throw away any data in the output2aFact loop the way we did in the output2bFact loop, we probably don’t actually need to do this…

If such a property of output2aFact exists, we can go ahead and dump output2aFact into our “temporary holding list” output2aList.

                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

Save “output2a.csv” and “output2b.csv”

Phew! All done with loops!

We now have some “holding data” that looks like this:

output2aList:

[OrderedDict([('Master', '003...01'), ('Duplicate_1', '003...07')]), OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'), ('Duplicate_2', '003...04')]), OrderedDict([('Master', '003...06'), ('Duplicate_1', '003...14')])]

output2bList:

[OrderedDict([('emAddr', 'es@gp.com'), ('Master', '003...10'), ('Master_FN', 'Eleanor'), ('Master_LN', 'Shellstrop'), ('Duplicate_1', '003...11'), ('Duplicate_1_FN', 'Ellie'), ('Duplicate_1_LN', 'Shellstrop')])]

We’ll dump them both to files on our hard drive: a .csv file if there are output rows, or a plain .txt file with a “none found” message if there aren’t.

In the case of outputting a CSV file, we’ll make it easy by converting output2aList into a Pandas “DataFrame” called output2adf and doing an equivalent operation for output2bList, since Pandas “DataFrames” are so easy to convert to CSV.

The index=False setting means “don’t bother to write pandas’s internal row numbers as their own dedicated column in the output.”

The quoting=1 means “put quotes around everything in the output CSV file, just in case our data itself contains commas.”
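If the bare 1 bothers you, it’s the numeric value of the csv module’s QUOTE_ALL constant; here’s a quick sketch writing a one-row DataFrame to an in-memory buffer to show the effect:

```python
import csv
import io

import pandas

# quoting=1 is just the numeric value of csv.QUOTE_ALL
assert csv.QUOTE_ALL == 1

buf = io.StringIO()
pandas.DataFrame([{'Master': '003...10', 'Duplicate_1': '003...02'}]).to_csv(
    buf, index=False, quoting=csv.QUOTE_ALL)

# Every field, including the headers, comes out wrapped in quotes
print(buf.getvalue())
```

Spelling out csv.QUOTE_ALL in your own scripts reads a bit more clearly than the magic number.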

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
    
if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

In this case, we do get output.

Here’s “output 2a”:

Master Duplicate_1 Duplicate_2
003…01 003…07  
003…10 003…02 003…04
003…06 003…14  

And here’s “output 2b”:

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
es@gp.com 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
jm@gp.com 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

Dummy-checking the time taken

Finally, as a nice touch, we’ll put some text on our screen to announce that the code is done running and to say how long it took to run.

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
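(start_time was presumably captured with time.time() near the top of the full script, outside the lines shown in this post. The pattern, in miniature:)

```python
import time

start_time = time.time()  # in the full script, this happens before any work

# ... the whole dedupe pipeline runs here ...

elapsed = round(time.time() - start_time, 2)
print('Done running script.  Took ' + str(elapsed) + ' seconds.')
```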