
Salesforce, Python, SQL, & other ways to put your data where you need it -- a bilingual blog in English & French

Dedupe Salesforce NPSP Emails with Python

23 Nov 2018 🔖 salesforce python questions tutorials intermediate
💬 EN


Salesforce #AwesomeAdmin Matthew (@matthewsalibi) asked how he could use Python to deduplicate Salesforce NPSP email addresses across several fields. Today I’ll provide you a script you can copy, paste, and try on your own data. And for the geeks, I’ll break it down and explain how every line works.

Data & Expectations

Input Data

As a recap, Matthew proposed that we start with a CSV-formatted export of his “Contacts” table, “contacts.csv,” that looked something like this (I consolidated his examples and added a bit of complexity):

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com ca@hotmail.com
003…02 DDDD Eleanor Shellstrop Personal es@gp.com    
003…03 EEEE Jason Mendoza Work jm@gp.com jm@gp.com jm@gp.com
003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com   es@gp.com
003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com   jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate ta@gp.com ta@yahoo.com ta@gp.com
003…07 IIII Chidi Anagonye Work   ca@gp.com  
003…08 JJJJ Mindy St. Claire Personal ms@mp.com    
003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org    
003…10 AAAA Eleanor Shellstrop Alternate     es@gp.com
003…11 CCCC Ellie Shellstrop Work   es@gp.com  
003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com
003…13 MMMM Mindy St. Claire Work ms@z.com ms@z.com ms@z.com
003…14 NNNN Tahani Al-Jamil Alternate     ta@gp.com

Output File #1

Matthew wanted to see an “output1.csv” that would identify duplicates within rows and provide a “cleanup” file, ready for Data Loader (D.L. docs), that looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com  
003…03 EEEE Jason Mendoza Work   jm@gp.com  
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com
003…13 MMMM Mindy St. Claire Work   ms@z.com  

Output File #2A

To identify entire rows suspected of being redundant Contacts, Matthew proposed a “ready for DemandTools” file, “output2a.csv,” that would look like the data below.

This data is compatible with the “Select Master / Non-Master records from file” functionality of DemandTools’s “Single Table Dedupe” module.

Each row indicates a “same email address, same first name” pairing that was found in multiple Contact records (where the “email address” could have appeared in any of the “Email” fields that come with the NPSP).

Each cell is the Salesforce ID of a Contact in which the email address was found.

Master Duplicate_1 Duplicate_2
003…01 003…07  
003…10 003…02 003…04
003…06 003…14  

Output File #2B

For sets of Contacts that shared an email address but not a first name, Matthew wanted to see a bit more detail in the output file “output2b.csv”, to facilitate human review. He suggested it look like the output below.

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
es@gp.com 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
jm@gp.com 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

Running The Code Yourself

Unfortunately, I haven’t yet made a blog post about running Python code on your own computer, so if you’re totally new to Python, you’re a bit stuck – but show this code to a colleague who does know how!

Or, if you’re adventurous, download an “Anaconda suite” installer for your computer’s operating system and give it a try.

  • Don’t bother installing VSCode when it asks if you want to.
  • Once it’s done, bring up the “Anaconda Navigator” and open “Spyder.”
    • Type print('hello') in the code-editing panel at the left, hit “run” near the top (if a box pops up, hit “cancel” and do it again), and then check to see if hello is printed at the right-hand side of your screen in an output console (probably bottom right).
    • If so, congrats – you just wrote a program! Backspace out print('hello') and copy/paste my code in instead, then start tweaking it.

And stay tuned for my “office hours” and “hands-on trainings,” where I teach total non-programmers to run code like this.


The Code

There’s no shame in simply copying this code, pasting it into your Python “IDE,” and making a few tweaks!

You will, of course, have to do a bit of “Find & Replace” to fix up column-name references (which, by the way, are case-sensitive to match their capitalization in your input CSV file), if your columns aren’t precisely spelled “PersonalEmail,” “WorkEmail,” “Human-Readable ID,” etc.
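If you’d rather fix up the column names in code than in the CSV itself, pandas’s rename can do that “Find & Replace” for you. (The “before” column name below is made up for illustration – substitute whatever your export actually uses.)

```python
import pandas

# Hypothetical example: suppose your export calls a column
# "Personal Email Address" instead of the "PersonalEmail" this script expects.
df = pandas.DataFrame({'Personal Email Address': ['a@b.com']})

# rename() takes an {old_name: new_name} dict and leaves other columns alone.
df = df.rename(columns={'Personal Email Address': 'PersonalEmail'})
print(list(df.columns))  # ['PersonalEmail']
```

You’d do this right after the read_csv line, before any of the deduplication logic runs.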

Code to produce “Output1” only

import time
start_time = time.time()
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
# Note: DataFrame.lookup requires pandas older than 2.0 (it was removed in pandas 2.0)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Code to produce “Output2A” and “Output2B” only

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running


df = pandas.read_csv(inputFilePath, dtype=object)

df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
    
if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Code to produce all 3 output files at once

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

# ### Output1 portion starts here ###
df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
# Note: DataFrame.lookup requires pandas older than 2.0 (it was removed in pandas 2.0)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output1 portion ends here ###

# ### Output2 portion starts here ###
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
    
if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output2 portion ends here ###

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Stop reading here if you don’t want to learn Python

You don’t have to thoroughly understand the code above to make use of it.

Just be sure that your “output” files have different filenames from your “input” file, and you can’t do too much damage.

Worst-case scenario, you’ll freeze your computer running the program against too large of a file – but several million rows is nothing (a minute or few), so get up and stretch. It’s good for you!

For the curious, the rest of this post is total geekery peeking under the hood.


Code Explanations

“Output1” explanation

Import “time,” start the clock

First we import the time extension to Python and start a clock running (so we can tell, at the end, how long our script took to run).

import time
start_time = time.time()

Import “pandas,” set display options

Next we import the pandas (spreadsheet-processing) extension to Python and say that, should we display any of its data on our screen with a print(...) command (although in this script we don’t), we want to make use of our full screen width before trying to “wrap” excessive columns onto overflow lines.

import pandas
pandas.set_option('expand_frame_repr', False)

Specify “email address” column names

We’ll create a “variable” (nickname for use later in our code) called emFieldNames, and in it, we’ll store a list of the names of columns where email addresses can be found.

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']

Map out “alternate spellings” of email address fields

In a variable called pickToField, we’ll create a “dictionary,” or “dict,” that lists out the values we might find in our PreferredEmail column, and clarifies what “email address” column headers they each correspond to.

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
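To see what that dict does in practice, here’s a tiny sketch of Series.map() translating picklist values into column names (the 'Oops' value is a deliberate typo, to show what happens to unrecognized values):

```python
import pandas

# pickToField translates a PreferredEmail picklist value into the name
# of the column where that email address actually lives.
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

prefs = pandas.Series(['Work', 'Personal', 'Oops'])  # 'Oops' is a deliberate typo
mapped = prefs.map(pickToField)
print(mapped.tolist())  # unmatched values come out as NaN (missing)
```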

Specify where “contacts.csv” is

In the variable inputFilePath, we’ll specify where on our hard drive our file “contacts.csv” can be found.

The letter r before the open-quote tells Python that the \ characters in our code are meant to be interpreted literally, not as any sort of “special character.” (Usually, they’re special characters within Python text inside of quotes.)

inputFilePath = r'c:\example\contacts.csv'

If you’re on a Mac, the file path might look more like r'/Users/your_username/example/contacts.csv'
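If you’re curious why the r matters, here’s a quick demonstration of what happens without it:

```python
# In an ordinary string, "\n" is a special "newline" character.
# The r prefix turns that interpretation off, keeping backslashes literal.
plain = 'c:\new\notes'   # each "\n" here silently becomes a newline!
raw = r'c:\new\notes'    # the path stays exactly as typed

print(len(plain))  # 10 -- two characters were "eaten" by the escapes
print(len(raw))    # 12 -- all twelve characters survived
```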

Specify output files

In the variable output1FilePath, we’ll specify where to save our output file (if there is data), and in output1NoneFoundFilePath we’ll specify where to leave us a note saying there was no data.

output1FilePath = r'c:\example\output1.csv'
output1NoneFoundFilePath = r'c:\example\nonefound1.txt'

Read “contacts.csv” into pandas and call the table df

Now we’ll use pandas to read “contacts.csv” into Python.

We’ll call Python’s copy of the table df because it’s short for the jargon pandas uses to describe a table, which is “DataFrame.”

df = pandas.read_csv(inputFilePath, dtype=object)

Add a ChangedAnything column to df

Right away, we’ll add a column to our table df called ChangedAnything, with values of False all the way down (since we haven’t yet changed anything).

df['ChangedAnything'] = False

df looks like this now:

SalesforceID Human-Readable ID FirstName ChangedAnything
003…01 BBBB Chidi False
003…02 DDDD Eleanor False
003…03 EEEE Jason False
003…04 FFFF Eleanor False
003…05 GGGG Janet False
003…06 HHHH Tahani False
003…07 IIII Chidi False
003…08 JJJJ Mindy False
003…09 KKKK Kamilah False
003…10 AAAA Eleanor False
003…11 CCCC Ellie False
003…12 LLLL Mindy False
003…13 MMMM Mindy False
003…14 NNNN Tahani False

Grab df rows containing valid PreferredEmail values

Next, we’ll build a special type of list specific to Pandas called a “Series” that will indicate, for each record in PreferredEmail, whether the value found there is a legitimate name of a column in df (always good to check for typos).

We’ll set that aside in a variable named validPrefEmailIndicatorTFSeries.

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)

validPrefEmailIndicatorTFSeries looks like this:

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
Name: PreferredEmail, dtype: bool
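Here’s a miniature sketch of that .map().isin() combination catching a typo (the data below is made up; note the misspelled 'Personel'):

```python
import pandas

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

df = pandas.DataFrame({
    'PreferredEmail': ['Work', 'Personel', 'Alternate'],  # 'Personel' is a typo
    'PersonalEmail': ['a@x.com', 'b@x.com', 'c@x.com'],
    'WorkEmail': ['a@w.com', 'b@w.com', 'c@w.com'],
    'AlternateEmail': [None, None, 'c@a.com'],
})

# The typo maps to NaN, and NaN isn't among df's column names, so that
# row comes out False and gets skipped by the next step.
mask = df['PreferredEmail'].map(pickToField).isin(df.columns)
print(mask.tolist())  # [True, False, True]
```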

Add a PrefEmAddr column to df

This next line is … a bit of “Python magic” … but what it does is add a new PrefEmAddr column to our df table that contains the actual preferred email address of each Contact record.

df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

df now looks like this:

SalesforceID Human-Readable ID FirstName ChangedAnything PrefEmAddr
003…01 BBBB Chidi False ca@hotmail.com
003…02 DDDD Eleanor False es@gp.com
003…03 EEEE Jason False jm@gp.com
003…04 FFFF Eleanor False es@gp.com
003…05 GGGG Janet False jm@gp.com
003…06 HHHH Tahani False ta@gp.com
003…07 IIII Chidi False ca@gp.com
003…08 JJJJ Mindy False ms@mp.com
003…09 KKKK Kamilah False ka@kafoundation.org
003…10 AAAA Eleanor False es@gp.com
003…11 CCCC Ellie False es@gp.com
003…12 LLLL Mindy False ms@yahoo.com
003…13 MMMM Mindy False ms@z.com
003…14 NNNN Tahani False ta@gp.com
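A heads-up if you’re on a newer pandas: DataFrame.lookup was removed in pandas 2.0, so the “magic” line above will error there. Here’s a sketch of an equivalent row-by-row approach (with made-up two-row data) that works on any pandas version:

```python
import pandas

df = pandas.DataFrame({
    'PreferredEmail': ['Work', 'Personal'],
    'PersonalEmail': ['p1@x.com', 'p2@x.com'],
    'WorkEmail': ['w1@x.com', 'w2@x.com'],
})
pickToField = {'Personal': 'PersonalEmail', 'Work': 'WorkEmail'}

# Row-wise equivalent of df.lookup(): for each row, read the cell in
# the column named by that row's mapped PreferredEmail value.
cols = df['PreferredEmail'].map(pickToField)
df['PrefEmAddr'] = [df.at[i, c] for i, c in cols.items()]
print(df['PrefEmAddr'].tolist())  # ['w1@x.com', 'p2@x.com']
```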

“Loop” over the email addresses

Now we’ll “loop” over one “email address” column name at a time (as listed out in emFieldNames).

Any code “inside” this “loop” will be able to tell which “column name” we’re currently inspecting, because for the duration of a given “pass” of the loop, an appropriate value ('PersonalEmail', 'WorkEmail', or 'AlternateEmail') will be stored in the variable fieldName.

for fieldName in emFieldNames:

Select the cells that need to be cleared out

First thing we’ll do within each “loop” is build a special list (again, a Pandas “Series“) of whether or not (True or False) we find each row of the df table “interesting.” We’ll save that list into a variable called needsClearingTFSer.

In this case, “interesting” means that for a given row of our df table, its value in the PreferredEmail column is not the column we’re currently “looping” over (as specified in fieldName), but that the email address found in the actual column specified by the current value of fieldName is the spelled exactly the same as the email address found in the same row’s PrefEmAddr field that we built earlier.

In other words, “It looks the same as the preferred email address, but it isn’t the preferred email address.”

In other other words, “Is it a redundant copy of the email address?”

needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])

During the 'PersonalEmail' pass of our loop, for example, needsClearingTFSer looks like this:

0     False
1     False
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
dtype: bool

Clear out cells that had redundant values

Next we’ll clear out any values found in cells at the intersections of the rows indicated by True values in needsClearingTFSer and the column indicated by fieldName for our current “loop pass.”

We’ll also set ChangedAnything for a given row to True if we … well … changed anything.

    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

After the entire “loop” has run through all 3 email address columns, our table df looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail ChangedAnything PrefEmAddr
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com   True ca@hotmail.com
003…02 DDDD Eleanor Shellstrop Personal es@gp.com     False es@gp.com
003…03 EEEE Jason Mendoza Work   jm@gp.com   True jm@gp.com
003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com   es@gp.com False es@gp.com
003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com   jm@gp.com False jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com True ta@gp.com
003…07 IIII Chidi Anagonye Work   ca@gp.com   False ca@gp.com
003…08 JJJJ Mindy St. Claire Personal ms@mp.com     False ms@mp.com
003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org     False ka@kafoundation.org
003…10 AAAA Eleanor Shellstrop Alternate     es@gp.com False es@gp.com
003…11 CCCC Ellie Shellstrop Work   es@gp.com   False es@gp.com
003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com False ms@yahoo.com
003…13 MMMM Mindy St. Claire Work   ms@z.com   True ms@z.com
003…14 NNNN Tahani Al-Jamil Alternate     ta@gp.com False ta@gp.com

Only show changed lines

Matthew asked that we only show lines where we changed something, so we’ll create a new copy of df and call it output1df. This copy will be filtered to only contain rows from df where ChangedAnything was True.

output1df = df[df['ChangedAnything']]

output1df looks like this:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail ChangedAnything PrefEmAddr
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com   True ca@hotmail.com
003…03 EEEE Jason Mendoza Work   jm@gp.com   True jm@gp.com
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com True ta@gp.com
003…13 MMMM Mindy St. Claire Work   ms@z.com   True ms@z.com
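That df[df['ChangedAnything']] pattern is pandas’s “boolean mask” filtering, and it’s worth knowing on its own. A tiny sketch with made-up data:

```python
import pandas

df = pandas.DataFrame({'name': ['a', 'b', 'c'],
                       'ChangedAnything': [True, False, True]})

# Passing a True/False column inside the square brackets keeps
# only the rows where the value is True.
changed = df[df['ChangedAnything']]
print(changed['name'].tolist())  # ['a', 'c']
```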

Delete the PrefEmAddr & ChangedAnything columns

We don’t need our PrefEmAddr and ChangedAnything columns anymore, so let’s get rid of them:

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

Now our output1df looks the way Matthew asked for it to look:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com  
003…03 EEEE Jason Mendoza Work   jm@gp.com  
003…06 HHHH Tahani Al-Jamil Alternate   ta@yahoo.com ta@gp.com
003…13 MMMM Mindy St. Claire Work   ms@z.com  

Save “output1.csv”

Finally, we’ll save output1df to our hard drive (or a “no records found” message instead) with the following code.

The index=False setting means “don’t bother to write pandas’s internal row numbers as their own dedicated column in the output.”

The quoting=1 means “put quotes around everything in the output CSV file, just in case our data itself contains commas.”

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
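Incidentally, quoting=1 is just the numeric value of the csv module’s QUOTE_ALL constant; spelling it out makes the intent clearer. Here’s a sketch (with made-up data) writing to an in-memory buffer so you can see the quoting in action:

```python
import csv
import io
import pandas

df = pandas.DataFrame({'LastName': ['St. Claire'], 'City': ['Phoenix, AZ']})

buf = io.StringIO()  # an in-memory "file," so nothing hits the hard drive
df.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)  # quoting=1 means the same thing
print(buf.getvalue())
# "LastName","City"
# "St. Claire","Phoenix, AZ"
```

Because every value is quoted, the comma inside “Phoenix, AZ” can’t be mistaken for a column separator.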

Dummy-checking the time taken

Finally, as a nice touch, we’ll put some text on our screen to announce that the code is done running and to say how long it took to run.

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

“Output2” explanation

Import “time,” start the clock

First we import the time extension to Python and start a clock running (so we can tell, at the end, how long our script took to run).

import time
start_time = time.time()

Import “OrderedDict”

To ensure that we get the right Salesforce Contact IDs assigned to “Master” vs. “Duplicate 1” columns, we’ll need to ensure that we keep track of certain discoveries we make in a specific sequence.

Python’s OrderedDict extension will let us do this.

from collections import OrderedDict
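(In Python 3.7 and later, plain dicts also preserve insertion order, so OrderedDict is mostly about making that intent explicit.) A quick demonstration of the order-keeping we’re relying on:

```python
from collections import OrderedDict

# Keys come back in exactly the order we inserted them --
# which is what keeps "Master" ahead of the "Duplicate" columns.
fact = OrderedDict()
fact['Master'] = '003...01'
fact['Duplicate_1'] = '003...07'
print(list(fact.keys()))  # ['Master', 'Duplicate_1']
```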

Import “pandas,” set display options

Next we import the pandas (spreadsheet-processing) extension to Python and say that, should we display any of its data on our screen with a print(...) command (although in this script we don’t), we want to make use of our full screen width before trying to “wrap” excessive columns onto overflow lines.

import pandas
pandas.set_option('expand_frame_repr', False)

Specify “email address” column names

We’ll create a “variable” (nickname for use later in our code) called emFieldNames, and in it, we’ll store a list of the names of columns where email addresses can be found.

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']

Map out “alternate spellings” of email address fields

In a variable called pickToField, we’ll create a “dictionary,” or “dict,” that lists out the values we might find in our PreferredEmail column, and clarifies what “email address” column headers they each correspond to.

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

Specify where “contacts.csv” is

In the variable inputFilePath, we’ll specify where on our hard drive our file “contacts.csv” can be found.

inputFilePath = r'c:\example\contacts.csv'

If you’re on a Mac, the file path might look more like r'/Users/your_username/example/contacts.csv'

Specify output files

In the variables output2aFilePath and output2bFilePath, we’ll specify where to save our output files (if there is data), and in output2aNoneFoundFilePath and output2bNoneFoundFilePath we’ll specify where to leave us a note saying there was no data.

output2aFilePath = r'c:\example\output2a.csv'
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt'
output2bFilePath = r'c:\example\output2b.csv'
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt'

Read “contacts.csv” into pandas and call the table df

Now we’ll use pandas to read “contacts.csv” into Python.

We’ll call Python’s copy of the table df because it’s short for the jargon pandas uses to describe a table, which is “DataFrame.”

df = pandas.read_csv(inputFilePath, dtype=object)

This time, when I show you what df looks like, as below, I’m going to include pandas’s internal “row IDs,” because we’re going to do some trickery by changing them. I want you to be able to watch them change. Upon importing df into Python from our CSV file, it looks like this:

(row ID) SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
(0) 003…01 BBBB Chidi Anagonye Work ca@gp.com ca@hotmail.com ca@hotmail.com
(1) 003…02 DDDD Eleanor Shellstrop Personal es@gp.com    
(2) 003…03 EEEE Jason Mendoza Work jm@gp.com jm@gp.com jm@gp.com
(3) 003…04 FFFF Eleanor Shellstrop Alternate es@hotmail.com   es@gp.com
(4) 003…05 GGGG Janet Della-Denunzio Alternate jd@gp.com   jm@gp.com
(5) 003…06 HHHH Tahani Al-Jamil Alternate ta@gp.com ta@yahoo.com ta@gp.com
(6) 003…07 IIII Chidi Anagonye Work   ca@gp.com  
(7) 003…08 JJJJ Mindy St. Claire Personal ms@mp.com    
(8) 003…09 KKKK Kamilah Al-Jamil Personal ka@kafoundation.org    
(9) 003…10 AAAA Eleanor Shellstrop Alternate     es@gp.com
(10) 003…11 CCCC Ellie Shellstrop Work   es@gp.com  
(11) 003…12 LLLL Mindy St. Claire Personal ms@yahoo.com ms@hotmail.com ms@hotmail.com
(12) 003…13 MMMM Mindy St. Claire Work ms@z.com ms@z.com ms@z.com
(13) 003…14 NNNN Tahani Al-Jamil Alternate     ta@gp.com

Sort df on Human-Readable ID

Matthew specified that the alphabetically earliest Human-Readable ID value for a given “grouping” of “likely duplicates” should be the “Master” for that grouping.

Since a lot of the tricks we’ll play in this code read things in the order we started with them in df, we’ll just re-sort df. Notice that at this point, the “row IDs” start with 9, then go to 0, then 10, then 1, etc. We’ll fix that next.

df = df.sort_values('Human-Readable ID')

(row ID) SalesforceID Human-Readable ID FirstName
(9) 003…10 AAAA Eleanor
(0) 003…01 BBBB Chidi
(10) 003…11 CCCC Ellie
(1) 003…02 DDDD Eleanor
(2) 003…03 EEEE Jason
(3) 003…04 FFFF Eleanor
(4) 003…05 GGGG Janet
(5) 003…06 HHHH Tahani
(6) 003…07 IIII Chidi
(7) 003…08 JJJJ Mindy
(8) 003…09 KKKK Kamilah
(11) 003…12 LLLL Mindy
(12) 003…13 MMMM Mindy
(13) 003…14 NNNN Tahani

Re-set “row IDs” to start at 0

Told you we’d fix the row IDs.

df = df.reset_index(drop=True)

(row ID) SalesforceID Human-Readable ID FirstName
(0) 003…10 AAAA Eleanor
(1) 003…01 BBBB Chidi
(2) 003…11 CCCC Ellie
(3) 003…02 DDDD Eleanor
(4) 003…03 EEEE Jason
(5) 003…04 FFFF Eleanor
(6) 003…05 GGGG Janet
(7) 003…06 HHHH Tahani
(8) 003…07 IIII Chidi
(9) 003…08 JJJJ Mindy
(10) 003…09 KKKK Kamilah
(11) 003…12 LLLL Mindy
(12) 003…13 MMMM Mindy
(13) 003…14 NNNN Tahani

Move all data except email addresses into the “row IDs.”

Now we’re going to play a nifty trick: we’re going to move any data, except the email addresses themselves, that we’ll need when building our output files, into the “row IDs.”

The order in which we list the fields that should move into “row ID” matters.

We need to make sure that the first item listed when we do this is something that uniquely identifies each row (if we couldn’t find such a piece of data, we’d want to include the old numeric “row ID” as the first item in this list – but I’m not going to show that code right now). That will help ensure that when we do a sort of “pivot”-like operation, we know that any “row ID” we find is truly the ID of a row, and not of a group of rows.

Just … trust me. 🤷

df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

Our df table now looks something like this (the row IDs don’t literally look like that with the // characters, but it’s a decent representation of what’s going on).

(row ID) PersonalEmail WorkEmail AlternateEmail
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate)     es@gp.com
(BBBB // 003…01 // Chidi // Anagonye // Work) ca@gp.com ca@hotmail.com ca@hotmail.com
(CCCC // 003…11 // Ellie // Shellstrop // Work)   es@gp.com  
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) es@gp.com    
(EEEE // 003…03 // Jason // Mendoza // Work) jm@gp.com jm@gp.com jm@gp.com
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) es@hotmail.com   es@gp.com
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) jd@gp.com   jm@gp.com
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) ta@gp.com ta@yahoo.com ta@gp.com
(IIII // 003…07 // Chidi // Anagonye // Work)   ca@gp.com  
(JJJJ // 003…08 // Mindy // St. Claire // Personal) ms@mp.com    
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) ka@kafoundation.org    
(LLLL // 003…12 // Mindy // St. Claire // Personal) ms@yahoo.com ms@hotmail.com ms@hotmail.com
(MMMM // 003…13 // Mindy // St. Claire // Work) ms@z.com ms@z.com ms@z.com
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate)     ta@gp.com
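If you’d like to see the “move data into the row IDs” trick in miniature, here’s a sketch with a tiny made-up table. The column names mirror the real ones, but the values are invented:

```python
import pandas as pd

# A tiny made-up stand-in for df
df = pd.DataFrame({
    'Human-Readable ID': ['AAAA', 'BBBB'],
    'SalesforceID': ['003...10', '003...01'],
    'PersonalEmail': [None, 'ca@gp.com'],
})

# Move everything except the email column into the "row IDs"
df = df.set_index(['Human-Readable ID', 'SalesforceID'])

# Each "row ID" is now a tuple of the moved-in values
print(df.index.tolist())
print(list(df.columns))
```

Those tuples are what the `//`-separated “row IDs” in the table above are standing in for.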

“Stack” the email addresses

When Matthew first asked me about solving this problem, I said

Well, if I were trying to find duplicates in Excel, I suppose I’d do it this way:

  1. make 3 copies of the file
  2. delete 2 of the 3 email columns
  3. add a column indicating which email column remained
  4. cut/paste (concatenate) the rows of each of the 3 files one after the other into a big spreadsheet
  5. run “Find Duplicates” on that

Pandas’s stack operation against the 3 column names we previously set aside in our emFieldNames variable works something like those first 4 steps.

It produces a special type of list specific to Pandas called a “Series,” which we’ll set aside into a variable called allEmailsSeries.

allEmailsSeries = df[emFieldNames].stack()

As you can see in the output of allEmailsSeries below, it’s a little funny-looking. It’s almost like grouped cells, or cells with line breaks in the text, in Excel. (Only a lot easier to work with using code, as we’ll soon see.)

(row ID) (this column has no name)
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) AlternateEmail: es@gp.com
(BBBB // 003…01 // Chidi // Anagonye // Work) PersonalEmail: ca@gp.com
WorkEmail: ca@hotmail.com
AlternateEmail: ca@hotmail.com
(CCCC // 003…11 // Ellie // Shellstrop // Work) WorkEmail: es@gp.com
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) PersonalEmail: es@gp.com
(EEEE // 003…03 // Jason // Mendoza // Work) PersonalEmail: jm@gp.com
WorkEmail: jm@gp.com
AlternateEmail: jm@gp.com
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) PersonalEmail: es@hotmail.com
AlternateEmail: es@gp.com
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) PersonalEmail: jd@gp.com
AlternateEmail: jm@gp.com
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) PersonalEmail: ta@gp.com
WorkEmail: ta@yahoo.com
AlternateEmail: ta@gp.com
(IIII // 003…07 // Chidi // Anagonye // Work) WorkEmail: ca@gp.com
(JJJJ // 003…08 // Mindy // St. Claire // Personal) PersonalEmail: ms@mp.com
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) PersonalEmail: ka@kafoundation.org
(LLLL // 003…12 // Mindy // St. Claire // Personal) PersonalEmail: ms@yahoo.com
WorkEmail: ms@hotmail.com
AlternateEmail: ms@hotmail.com
(MMMM // 003…13 // Mindy // St. Claire // Work) PersonalEmail: ms@z.com
WorkEmail: ms@z.com
AlternateEmail: ms@z.com
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) AlternateEmail: ta@gp.com
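Here’s the stack trick in miniature, on a two-row made-up table, so you can see that empty cells simply vanish from the result:

```python
import pandas as pd

emFieldNames = ['PersonalEmail', 'WorkEmail', 'AlternateEmail']

# A two-row made-up stand-in for df, with some empty cells
df = pd.DataFrame(
    {'PersonalEmail': ['ca@gp.com', None],
     'WorkEmail': ['ca@hotmail.com', 'es@gp.com'],
     'AlternateEmail': [None, None]},
    index=['BBBB', 'CCCC'])

allEmailsSeries = df[emFieldNames].stack()

# Empty cells are dropped; every remaining value is keyed by (row ID, column name)
print(allEmailsSeries)
```

That (row ID, column name) pairing is exactly what makes the “funny-looking” grouped output above possible.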

Group allEmailsSeries by the email addresses it contains

You’ll have to trust me, but the next line of code is “Python magic” that produces a list of email addresses found in allEmailsSeries and, for each email address, details about the rows where it was found.

We’ll save that information aside in a variable called groupedEmailSeriesGroupItems.

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

groupedEmailSeriesGroupItems isn’t pretty, but here it is, just in case you’re curious.

Trust me – the computer can read it, even if it’s hard for us to read.

Group key:   ca@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[1, 8], [0, 6], [0, 0], [1, 1], [2, 2], [0, 1]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ca@hotmail.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[1, 1], [0, 0], [0, 0], [1, 1], [2, 2], [1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   es@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[0, 2, 3, 5], [9, 10, 1, 3], [1, 2, 1, 1], [4, 4, 4, 4], [0, 2, 1, 0], [2, 1, 0, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   es@hotmail.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[5], [3], [1], [4], [0], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   jd@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[6], [4], [3], [2], [0], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   jm@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[4, 4, 4, 6], [2, 2, 2, 4], [4, 4, 4, 3], [3, 3, 3, 2], [2, 2, 2, 0], [0, 1, 2, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ka@kafoundation.org
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[10], [8], [5], [0], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@hotmail.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[11, 11], [11, 11], [6, 6], [5, 5], [1, 1], [1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@mp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[9], [7], [6], [5], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@yahoo.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[11], [11], [6], [5], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ms@z.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[12, 12, 12], [12, 12, 12], [6, 6, 6], [5, 5, 5], [2, 2, 2], [0, 1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ta@gp.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[7, 7, 13], [5, 5, 13], [7, 7, 7], [0, 0, 0], [0, 0, 0], [0, 2, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   ta@yahoo.com
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[7], [5], [7], [0], [0], [1]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
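If you want to see the groupby-a-Series-by-its-own-values trick on something tiny, here’s a sketch with a made-up three-cell Series:

```python
import pandas as pd

# A made-up three-cell Series: row IDs r1..r3, two of which share an email
s = pd.Series(['a@x.com', 'b@x.com', 'a@x.com'], index=['r1', 'r2', 'r3'])

# Group the Series by its own values: each email maps to the row IDs that hold it
groups = s.groupby(s).groups

print(dict(groups))
```

With our real data, those “row IDs” are the big tuples we built with set_index, which is why the group values above print as MultiIndex objects.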

Filter groupedEmailSeriesGroupItems into emsAndTheirCellsFound

Since groupedEmailSeriesGroupItems was so terrible to read, we’re going to do more “Python magic” that will:

  1. Transform the “email addresses and the cells of df in which we found them” into something more legible
  2. Throw away any email addresses that only appeared in one cell of df

We’ll save this new information into a “dictionary,” or “dict,” variable called emsAndTheirCellsFound.

In this explanation, I’m still referring to “cells of df,” but honestly, we’re never going to use df again. We don’t need it.

Everything we need to know is in emsAndTheirCellsFound, thanks to us changing the “row IDs” of df to include the data we’d want later.

Avoiding the need to go back and consult df for details like first name, last name, Salesforce ID, etc. helps our script run faster.

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

emsAndTheirCellsFound, printed to be more legible, looks like this:

Group key we'll soon call "emAddr" while looping through these groups:
         ca@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'PersonalEmail'), ('IIII', '003...07', 'Chidi', 'Anagonye', 'Work', 'WorkEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ca@hotmail.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'WorkEmail'), ('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         es@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail'), ('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work', 'WorkEmail'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal', 'PersonalEmail'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         jm@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'WorkEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'AlternateEmail'), ('GGGG', '003...05', 'Janet', 'Della-Denunzio', 'Alternate', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ms@hotmail.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'WorkEmail'), ('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ms@z.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'PersonalEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'WorkEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         ta@gp.com
Group value we'll soon call "foundCells" while looping through these groups:
         [('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'PersonalEmail'), ('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail'), ('NNNN', '003...14', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail')]
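The dict comprehension itself is easier to follow on made-up data; the `if len(indices) > 1` clause is what throws away the one-cell addresses:

```python
# Made-up stand-in for groupby(...).groups.items(): email -> cells where it appears
groupedItems = {
    'ca@gp.com': [('BBBB', 'PersonalEmail'), ('IIII', 'WorkEmail')],
    'jd@gp.com': [('GGGG', 'PersonalEmail')],
}.items()

# Keep only addresses found in more than one cell; convert each group to a plain list
emsAndTheirCellsFound = {emaddr: [x for x in indices]
                         for emaddr, indices in groupedItems
                         if len(indices) > 1}

print(emsAndTheirCellsFound)
```

Note how jd@gp.com, which only appeared once, is gone from the result.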

Create empty “dumping spots” output2aList and output2bList

We create 2 empty “lists” as a place to store data we build inside the upcoming “loop” that we’ll want in our final output (each item in the “list” will become a row of our output CSV file).

output2aList = []
output2bList = []

“Loop” over the email addresses

Now we’ll “loop” over one “email address” (e.g. ca@gp.com) at a time.

Any code “inside” this “loop” will be able to tell which “email address” we’re currently inspecting, because for the duration of a given “pass” of the loop, its value will be stored in the variable emAddr.

Also accessible to us will be a “list” of details about each “cell” of df where that email address was found, accessible under the variable name foundCells.

For example, while we look at ms@hotmail.com, foundCells will contain a 2-item list, telling us that it was found at the intersection of LLLL and WorkEmail and at the intersection of LLLL and AlternateEmail (along with some other details about those “cells,” such as the FirstName, LastName, etc. for the row where LLLL was the Human-Readable ID).

for emAddr, foundCells in emsAndTheirCellsFound.items():
Inspect a given email address

First thing we’ll do the moment we pull up a given emAddr and its foundCells list is set up a blank place to store “per-email-address”-level information we decide we want to set aside, as well as take a snapshot of how many items are in foundCells.

    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
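(OrderedDict comes from the Python standard library’s collections module, presumably imported near the top of the full script, outside the lines shown here. The point of using it instead of a plain dict is that it remembers the order keys were first added, which keeps our eventual CSV columns in a sensible order:)

```python
from collections import OrderedDict

# An OrderedDict remembers the order its keys were first added, which keeps
# the eventual CSV columns in the order we build them (Master before Duplicate_1)
fact = OrderedDict()
fact['Master'] = '003...10'
fact['Duplicate_1'] = '003...02'

print(list(fact.keys()))
```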
Check if there even are duplicate “found cells” for the email address

If numCellsFound (the length of foundCells) isn’t >1, the email address isn’t duplicated at all in our Contacts CSV, which means we can skip right along to the next emAddr in our “loop.” (Strictly speaking, the filter we applied while building emsAndTheirCellsFound already threw those out, so this check is just a belt-and-suspenders safeguard.)

    if numCellsFound > 1:
“Loop” over the “cells where this email address was found” list

Presuming there actually are several “cells” of df where an email address was found, now we need to loop over those cells.

Any code “inside” this “loop” will be able to tell which “cell” from foundCells we’re currently inspecting, because for the duration of a given “pass” of the loop, its value will be stored in the variable cell.

Also accessible to us will be a plain old number indicating which “pass” of this “inner loop” we’re on, stored in cellLoopCounter. cellLoopCounter starts at 0, not at 1, on its first pass.

Our program can also still see the “outer loop” information like what the current contents of emAddr are.

        for cellLoopCounter, cell in enumerate(foundCells):
Ask: Is this “found cell” the first time we’ve seen the FirstName it contains?

cell is a type of Python data called a “tuple” that looks something like ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'). You can fetch any one of those things between the commas by putting its “position number” right after the word cell in square brackets … just remember that the first one is 0, not 1.

So, for example, when ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail') is the current value of cell, then cell[2] is Jason.
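Here’s that indexing, plus the [:-1] slicing we’re about to rely on, demonstrated on a hypothetical cell tuple:

```python
# A hypothetical "cell" tuple like the ones stored in foundCells
cell = ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail')

print(cell[2])    # position 2 (counting from 0) is the FirstName
print(cell[:-1])  # everything but the last item: the "who," not the "which column"
```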

We’re going to keep a running list of “the first time we’ve seen a given FirstName” for a given emAddr.

(Each time we pull up a new emAddr with our “outer loop,” we set up a fresh, empty seenFirstNamesAndTheirContacts to help us do just that.)

            if cell[2] not in seenFirstNamesAndTheirContacts:

If we haven’t yet seen a FirstName before (within a given email address), we set it aside in seenFirstNamesAndTheirContacts along with a copy of cell that contains all of its values except that last one (the one specifying which email address column the email address was found in). That’s what the [:-1] means. In other words, we’re just grabbing the info from cell that identifies the row or the person (Contact), not every last detail about where they had that email address.

                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]

Furthermore, if it’s our first time seeing a FirstName for a given e-mail address, we’ll check if it’s our first pass through the foundCells for that e-mail address at all.

                if cellLoopCounter == 0:

If it is, we’ll grab some details about that e-mail address and set them aside in an empty “dictionary” called output2bFact that we created when we first pulled up the current emAddr.

                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]

If it’s our first time seeing a FirstName for a given email address, yet it’s not our first cell for the email address altogether, then it must be a duplicate record (e.g. Ellie when we’ve already seen Eleanor), so we’ll keep fleshing out output2bFact with details about this new record. We’ll decide whether it’s “duplicate 1,” “duplicate 2,” etc. by counting how many FirstNames we’ve already seen for the current emAddr and subtracting 1 (the first one was the “master,” so it doesn’t count as a duplicate).

                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str] = cell[1]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_LN'] = cell[3]
Ask: And if cell is not the first time we’ve seen this FirstName for an email address?

On the other hand, if we’ve seen this FirstName before for this e-mail address, we’ll just create a copy of cell that contains all of its values except that last one (the one specifying which email address column the email address was found in) and append it to the end of the list we’re keeping in seenFirstNamesAndTheirContacts.

Subtle issue: if a given e-mail address is found in multiple columns for the same Contact (which could happen if we didn’t first run our “Output1,” fix the data with Data Loader, and re-fetch our Contacts), we will end up with redundant copies of the “Contact” data (e.g. ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work')) stored in seenFirstNamesAndTheirContacts.

That’s not a big deal, though, because Python is really good at detecting duplicates of that sort, and we’ll deduplicate for that later in our code.

            else:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
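The create-or-append pattern, in miniature, with made-up three-field tuples (the real cells have six fields):

```python
# Made-up cells: the same person's email found in two different columns
cells = [('EEEE', 'Jason', 'PersonalEmail'),
         ('EEEE', 'Jason', 'WorkEmail')]

seen = {}
for cell in cells:
    if cell[1] not in seen:
        seen[cell[1]] = [cell[:-1]]        # first sighting: start a new list
    else:
        seen[cell[1]].append(cell[:-1])    # repeat sighting: append to it

print(seen)
```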
Clean up at the end of the last pass through the foundCells for an emAddr

We have some code that only makes sense to run when we’ve finished looping through the last cell in foundCells for a given emAddr.

            if cellLoopCounter == numCellsFound-1:

For example, now that we know we’ve finished building output2bFact (because we’re all the way through all the records for an emAddr), we can check to see if it even had any “duplicates” by looking for “Duplicate_1”.

If such a property of output2bFact exists, we can go ahead and dump output2bFact into our “temporary holding list” output2bList.

                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)

output2bFact, for example, looks something like this (I enhanced it for legibility) at this point for es@gp.com:

Key:   emAddr
Value:   es@gp.com

Key:   Master
Value:   003...10

Key:   Master_FN
Value:   Eleanor

Key:   Master_LN
Value:   Shellstrop

Key:   Duplicate_1
Value:   003...11

Key:   Duplicate_1_FN
Value:   Ellie

Key:   Duplicate_1_LN
Value:   Shellstrop

Furthermore, it’s time to do that promised cleanup of seenFirstNamesAndTheirContacts, where we’ve been dumping a “dict” that contains all our unique “first names seen” and, for each of them, just the “contact” part …

“Loop” over the “first names seen” for a given emAddr
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

I’ve run out of “table of contents” indents (I can only go 6 levels deep), so we’ll have to talk about the entirety of this block of code here.

Firstly, while we are still within the “loop” for a given email address, such as es@gp.com, we will “loop over” the “first names” contained in seenFirstNamesAndTheirContacts, which looks something like this:

Key:  'Eleanor'
Value:  [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate')]

Key:  'Ellie'
Value:  [('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work')]

Here’s the line of code responsible for making us loop over 'Eleanor' and 'Ellie' within seenFirstNamesAndTheirContacts:

                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():

For each of those “loop passes” (that is, one for 'Eleanor' and one for 'Ellie'), the first thing we’ll do is set aside a place to store anything interesting we find:

                    output2aFact = OrderedDict()

Next, we’ll say that we don’t care about data like the 'Ellie' data, where there’s just one row in our dataset where the first name 'Ellie' was found for the emAddr 'es@gp.com', and that we only care about data like 'Eleanor', where the same first name turned up on several rows:

                    if len(contactsForFN) > 1:

In such cases, we’ll say that we want to do another loop to inspect each and every “Contact” record found for a combination of “email address” & “first name” like es@gp.com and Eleanor:

                        for contactLoopCounter, contact in enumerate(contactsForFN):

Each time we go through that “loop,” the value of contact looks something like this:

('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate')

If we’re dealing with the first contact for a given firstName, we’ll set it aside in the empty “dictionary” called output2aFact that we created when we first pulled up the current firstName.

                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]

Otherwise, the contact in question must be a duplicate record for the current emAddr and firstName combination, so we’ll keep fleshing out output2aFact with details about this contact. We’ll decide if it’s “duplicate 1,” “duplicate 2,” etc. by counting how many contacts we’ve already seen for the current emAddr+firstName (fortunately, since our “loop counter” contactLoopCounter started counting with “Master” as “loop count 0,” we’re good to go to start w/ “1” at our first duplicate):

                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]

output2aFact, for example, looks something like this (I enhanced it for legibility) at this point for es@gp.com+Eleanor:

Key:  Master
Value:  003...10

Key:  Duplicate_1
Value:  003...02

Key:  Duplicate_2
Value:  003...04

Now that we know we’ve finished building output2aFact (because we’re all the way through all the records for a firstName within an emAddr), we can dummy-check whether it actually has any “duplicates” by looking for “Duplicate_1”.

Since we didn’t throw away any data in the output2aFact loop the way we did in the output2bFact loop, we probably don’t actually need to do this…

If such a property of output2aFact exists, we can go ahead and dump output2aFact into our “temporary holding list” output2aList.

                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

Save “output2a.csv” and “output2b.csv”

Phew! All done with loops!

We now have some “holding data” that looks like this:

output2aList:

[OrderedDict([('Master', '003...01'), ('Duplicate_1', '003...07')]), OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'), ('Duplicate_2', '003...04')]), OrderedDict([('Master', '003...06'), ('Duplicate_1', '003...14')])]

output2bList:

[OrderedDict([('emAddr', 'es@gp.com'), ('Master', '003...10'), ('Master_FN', 'Eleanor'), ('Master_LN', 'Shellstrop'), ('Duplicate_1', '003...11'), ('Duplicate_1_FN', 'Ellie'), ('Duplicate_1_LN', 'Shellstrop')])]

We’ll dump them both to files on our hard drive: a .csv file if there are output rows, or a plain .txt file with a “none found” message if there aren’t.

In the case of outputting a CSV file, we’ll make it easy by converting output2aList into a Pandas “DataFrame” called output2adf and doing an equivalent operation for output2bList, since Pandas “DataFrames” are so easy to convert to CSV.

The index=False setting means “don’t bother to write pandas’s internal row numbers as their own dedicated column in the output.”

The quoting=1 means “put quotes around everything in the output CSV file, just in case our data itself contains commas.”
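If the bare 1 bothers you, it’s the numeric value of the csv module’s QUOTE_ALL constant; here’s a quick sketch writing a one-row DataFrame to an in-memory buffer to show the effect:

```python
import csv
import io

import pandas

# quoting=1 is just the numeric value of csv.QUOTE_ALL
assert csv.QUOTE_ALL == 1

buf = io.StringIO()
pandas.DataFrame([{'Master': '003...10', 'Duplicate_1': '003...02'}]).to_csv(
    buf, index=False, quoting=csv.QUOTE_ALL)

# Every field, including the headers, comes out wrapped in quotes
print(buf.getvalue())
```

Spelling out csv.QUOTE_ALL in your own scripts reads a bit more clearly than the magic number.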

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
    
if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

In this case, we do get output.

Here’s “output 2a”:

Master Duplicate_1 Duplicate_2
003…01 003…07  
003…10 003…02 003…04
003…06 003…14  

And here’s “output 2b”:

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
es@gp.com 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
jm@gp.com 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

Dummy-checking the time taken

Finally, as a nice touch, we’ll put some text on our screen to announce that the code is done running and to say how long it took to run.

print('Done running script.  Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
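(start_time was presumably captured with time.time() near the top of the full script, outside the lines shown in this post. The pattern, in miniature:)

```python
import time

start_time = time.time()  # in the full script, this happens before any work

# ... the whole dedupe pipeline runs here ...

elapsed = round(time.time() - start_time, 2)
print('Done running script.  Took ' + str(elapsed) + ' seconds.')
```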