Français
Presentations About Resources

Salesforce, Python, SQL, & other ways to put your data where you need it -- a bilingual blog in English & French

Python for Salesforce Real-Life Challenge: NPSP Email Deduping

12 Nov 2018 🔖 salesforce python questions
💬 EN

Post Header Image

Table of Contents

Salesforce #AwesomeAdmin, reader, and “Office Hours” attendee Matthew (@matthewsalibi) got Python up and running on his computer, and is ready to dive into looking for duplicates in his Salesforce “Contacts” table. But he’s on the Nonprofit Starter Pack (NPSP), and a given email address might be in any of several fields. Today we’ll lay out his question … stay tuned for the answer!

Thanks, Matthew, for submitting the questions below (and for teaching me not only that Janet the robot has a last name, 😲 but that that there’s a “load known duplicates from a CSV file” option in DemandTools – that is going to save me hours): 🙌


Answer

Click here for the answer to this problem

Salesforce NPSP: How Email Addresses Work

In NPSP (NPSP=Nonprofit Starter Pack, the open-source package that runs 27,000 nonprofits!), there are three named email fields, which, combined with a “Preferred Email” picklist, populate the default Salesforce email field (using Apex code):

(So, if I populate the PersonalEmail field and set the picklist value to Personal, an NPSP trigger writes the PersonalEmail value to the default Email field. The consequence of this is just basically that we ignore the default email field in NPSP, since it’s always populated by automation.)

De-duping named email Fields within a single record

Using Python, it would be cool to be able to look across these named email fields, find places where the same email address is duplicated within a single record, and null out the non-preferred version.

So, taking a dataset like:

FirstName LastName Preferred Email PersonalEmail WorkEmail AlternateEmail
Chidi Anagonye Personal ca@gp.com ca@gp.com ca@hotmail.com
Tahani Al-Jamil Personal ta@gp.com ta@yahoo.com ta@gp.com
Jason Mendoza Work jm@gp.com jm@gp.com jm@gp.com

We could output a version where the duplicates that are not in the “preferred” version of the field are nulled out:

FirstName LastName Preferred Email PersonalEmail WorkEmail AlternateEmail
Chidi Anagonye Personal ca@gp.com   ca@hotmail.com
Tahani Al-Jamil Personal ta@gp.com ta@yahoo.com  
Jason Mendoza Work   jm@gp.com  

If it’s possible to only output changed rows, that would be extra-awesome!


De-duping named email fields across records

One limitation of native Salesforce duplicate management, and also of DemandTools, is that you can only check a field against itself; that is, you can only compare the PersonalEmail field on one record to the PersonalEmail field on another record. So, if ca@gp.com is in the PersonalEmail field on one record but in the WorkEmail field on a duplicated record, you could not easily see that using DemandTools or native duplicate record jobs.

So, taking a data set like:

SalesforceID Human-Readable ID FirstName LastName PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye ca@gp.com    
003…02 DDDD Eleanor Shellstrop es@gp.com    
003…03 EEEE Jason Mendoza jm@gp.com    
003…04 FFFF Eleanor Shellstrop es@hotmail.com   es@gp.com
003…05 GGGG Janet Della-Denunzio jd@gp.com   jm@gp.com
003…06 HHHH Tahani Al-Jamil ta@gp.com    
003…07 IIII Chidi Anagonye   ca@gp.com  
003…08 JJJJ Mindy St. Claire ms@mp.com    
003…09 KKKK Kamilah Al-Jamil ka@kafoundation.org    
003…10 AAAA Eleanor Shellstrop     es@gp.com
003…11 CCCC Ellie Shellstrop   es@gp.com  

We should:

  1. Check across the named email fields for duplicates
  2. For identified duplicates, check the first names
  3. If the first names are an exact match, output a file for DemandTools with the Salesforce IDs that correspond to the (alphabetically) lowest Human-Readable ID in the “Master” column, and any duplicates’ Salesforce IDs in “duplicate” columns, like this:
Master Duplicate1 Duplicate2
003…10 003…02 003…04
003…01 003…07  

This produces a file that is ready to go into DemandTools for single table dedupe using the advanced field-level merge criteria we’ve already built, woohoo!

For those records where a duplicate email is found, but there is no FirstName match, it would be great to output an exception report, that looks like this:

Duplicated Email Master MasterFirstName Master LastName Duplicate1 Duplicate1FirstName Duplicate1LastName
jm@gp.com 003…03 Jason Mendoza 003…05 Janet Della-Denunzio
es@gp.com 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop

A human user can look at this exception report, and guess that on line 2, what’s likely happened is that Janet sometimes uses Jason’s email, but the email is more closely identified with Jason (because the email is “jm@”, and Jason’s last name starts with “M”). The user may then decide to remove the address from Janet’s record.

Looking at row 3, a human user may decide that those records are true duplicates. The user could review this output file, mark the “true dupes” and sort them out, then remove all columns except for “Master” and “Duplicate”, and feed that to DemandTools to complete a de-dupe using an existing DemandTools scenario.

--- ---