Salesforce, Python, SQL, & other ways to put your data where you need it

Need event music? 🎸

Live and recorded jazz, pop, and meditative music for your virtual conference / Zoom wedding / yoga class / private party with quality sound and a smooth technical experience

Python for Salesforce Real-Life Challenge: NPSP Email Deduping

12 Nov 2018 🔖 salesforce python questions
💬 EN

Post Header Image

Table of Contents

Salesforce #AwesomeAdmin, reader, and “Office Hours” attendee Matthew (@matthewsalibi) got Python up and running on his computer, and is ready to dive into looking for duplicates in his Salesforce “Contacts” table. But he’s on the Nonprofit Starter Pack (NPSP), and a given email address might be in any of several fields. Today we’ll lay out his question … stay tuned for the answer!

Thanks, Matthew, for submitting the questions below (and for teaching me not only that Janet the robot has a last name, 😲 but that that there’s a “load known duplicates from a CSV file” option in DemandTools – that is going to save me hours): 🙌


Click here for the answer to this problem

Salesforce NPSP: How Email Addresses Work

In NPSP (NPSP=Nonprofit Starter Pack, the open-source package that runs 27,000 nonprofits!), there are three named email fields, which, combined with a “Preferred Email” picklist, populate the default Salesforce email field (using Apex code):

(So, if I populate the PersonalEmail field and set the picklist value to Personal, an NPSP trigger writes the PersonalEmail value to the default Email field. The consequence of this is just basically that we ignore the default email field in NPSP, since it’s always populated by automation.)

De-duping named email Fields within a single record

Using Python, it would be cool to be able to look across these named email fields, find places where the same email address is duplicated within a single record, and null out the non-preferred version.

So, taking a dataset like:

FirstName LastName Preferred Email PersonalEmail WorkEmail AlternateEmail
Chidi Anagonye Personal [email protected] [email protected] [email protected]
Tahani Al-Jamil Personal [email protected] [email protected] [email protected]
Jason Mendoza Work [email protected] [email protected] [email protected]

We could output a version where the duplicates that are not in the “preferred” version of the field are nulled out:

FirstName LastName Preferred Email PersonalEmail WorkEmail AlternateEmail
Chidi Anagonye Personal [email protected]   [email protected]
Tahani Al-Jamil Personal [email protected] [email protected]  
Jason Mendoza Work   [email protected]  

If it’s possible to only output changed rows, that would be extra-awesome!

De-duping named email fields across records

One limitation of native Salesforce duplicate management, and also of DemandTools, is that you can only check a field against itself; that is, you can only compare the PersonalEmail field on one record to the PersonalEmail field on another record. So, if [email protected] is in the PersonalEmail field on one record but in the WorkEmail field on a duplicated record, you could not easily see that using DemandTools or native duplicate record jobs.

So, taking a data set like:

SalesforceID Human-Readable ID FirstName LastName PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye [email protected]    
003…02 DDDD Eleanor Shellstrop [email protected]    
003…03 EEEE Jason Mendoza [email protected]    
003…04 FFFF Eleanor Shellstrop [email protected]   [email protected]
003…05 GGGG Janet Della-Denunzio [email protected]   [email protected]
003…06 HHHH Tahani Al-Jamil [email protected]    
003…07 IIII Chidi Anagonye   [email protected]  
003…08 JJJJ Mindy St. Claire [email protected]    
003…09 KKKK Kamilah Al-Jamil [email protected]    
003…10 AAAA Eleanor Shellstrop     [email protected]
003…11 CCCC Ellie Shellstrop   [email protected]  

We should:

  1. Check across the named email fields for duplicates
  2. For identified duplicates, check the first names
  3. If the first names are an exact match, output a file for DemandTools with the Salesforce IDs that correspond to the (alphabetically) lowest Human-Readable ID in the “Master” column, and any duplicates’ Salesforce IDs in “duplicate” columns, like this:
Master Duplicate1 Duplicate2
003…10 003…02 003…04
003…01 003…07  

This produces a file that is ready to go into DemandTools for single table dedupe using the advanced field-level merge criteria we’ve already built, woohoo!

For those records where a duplicate email is found, but there is no FirstName match, it would be great to output an exception report, that looks like this:

Duplicated Email Master MasterFirstName Master LastName Duplicate1 Duplicate1FirstName Duplicate1LastName
[email protected] 003…03 Jason Mendoza 003…05 Janet Della-Denunzio
[email protected] 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop

A human user can look at this exception report, and guess that on line 2, what’s likely happened is that Janet sometimes uses Jason’s email, but the email is more closely identified with Jason (because the email is “jm@”, and Jason’s last name starts with “M”). The user may then decide to remove the address from Janet’s record.

Looking at row 3, a human user may decide that those records are true duplicates. The user could review this output file, mark the “true dupes” and sort them out, then remove all columns except for “Master” and “Duplicate”, and feed that to DemandTools to complete a de-dupe using an existing DemandTools scenario.

--- ---