Python for Salesforce Real-Life Challenge: NPSP Email Deduping

12 Nov 2018 🔖 salesforce python questions
💬 EN

Answer
Salesforce NPSP: How Email Addresses Work
De-duping named email Fields within a single record
De-duping named email fields across records

Salesforce #AwesomeAdmin, reader, and “Office Hours” attendee Matthew (@matthewsalibi) got Python up and running on his computer, and is ready to dive into looking for duplicates in his Salesforce “Contacts” table. But he’s on the Nonprofit Starter Pack (NPSP), and a given email address might be in any of several fields. Today we’ll lay out his question … stay tuned for the answer!

Thanks, Matthew, for submitting the questions below (and for teaching me not only that Janet the robot has a last name, 😲 but that that there’s a “load known duplicates from a CSV file” option in DemandTools – that is going to save me hours): 🙌

Answer

Click here for the answer to this problem

Salesforce NPSP: How Email Addresses Work

In NPSP (NPSP=Nonprofit Starter Pack, the open-source package that runs 27,000 nonprofits!), there are three named email fields, which, combined with a “Preferred Email” picklist, populate the default Salesforce email field (using Apex code):

(So, if I populate the PersonalEmail field and set the picklist value to Personal, an NPSP trigger writes the PersonalEmail value to the default Email field. The consequence of this is just basically that we ignore the default email field in NPSP, since it’s always populated by automation.)

De-duping named email Fields within a single record

Using Python, it would be cool to be able to look across these named email fields, find places where the same email address is duplicated within a single record, and null out the non-preferred version.

So, taking a dataset like:

FirstName	LastName	Preferred Email	PersonalEmail	WorkEmail	AlternateEmail
Chidi	Anagonye	Personal	ca@gp.com	ca@gp.com	ca@hotmail.com
Tahani	Al-Jamil	Personal	ta@gp.com	ta@yahoo.com	ta@gp.com
Jason	Mendoza	Work	jm@gp.com	jm@gp.com	jm@gp.com

We could output a version where the duplicates that are not in the “preferred” version of the field are nulled out:

FirstName	LastName	Preferred Email	PersonalEmail	WorkEmail	AlternateEmail
Chidi	Anagonye	Personal	ca@gp.com		ca@hotmail.com
Tahani	Al-Jamil	Personal	ta@gp.com	ta@yahoo.com
Jason	Mendoza	Work		jm@gp.com

If it’s possible to only output changed rows, that would be extra-awesome!

De-duping named email fields across records

One limitation of native Salesforce duplicate management, and also of DemandTools, is that you can only check a field against itself; that is, you can only compare the PersonalEmail field on one record to the PersonalEmail field on another record. So, if ca@gp.com is in the PersonalEmail field on one record but in the WorkEmail field on a duplicated record, you could not easily see that using DemandTools or native duplicate record jobs.

So, taking a data set like:

SalesforceID	Human-Readable ID	FirstName	LastName	PersonalEmail	WorkEmail	AlternateEmail
003…01	BBBB	Chidi	Anagonye	ca@gp.com
003…02	DDDD	Eleanor	Shellstrop	es@gp.com
003…03	EEEE	Jason	Mendoza	jm@gp.com
003…04	FFFF	Eleanor	Shellstrop	es@hotmail.com		es@gp.com
003…05	GGGG	Janet	Della-Denunzio	jd@gp.com		jm@gp.com
003…06	HHHH	Tahani	Al-Jamil	ta@gp.com
003…07	IIII	Chidi	Anagonye		ca@gp.com
003…08	JJJJ	Mindy	St. Claire	ms@mp.com
003…09	KKKK	Kamilah	Al-Jamil	ka@kafoundation.org
003…10	AAAA	Eleanor	Shellstrop			es@gp.com
003…11	CCCC	Ellie	Shellstrop		es@gp.com

We should:

Check across the named email fields for duplicates
For identified duplicates, check the first names
If the first names are an exact match, output a file for DemandTools with the Salesforce IDs that correspond to the (alphabetically) lowest Human-Readable ID in the “Master” column, and any duplicates’ Salesforce IDs in “duplicate” columns, like this:

Master	Duplicate1	Duplicate2
003…10	003…02	003…04
003…01	003…07

This produces a file that is ready to go into DemandTools for single table dedupe using the advanced field-level merge criteria we’ve already built, woohoo!

For those records where a duplicate email is found, but there is no FirstName match, it would be great to output an exception report, that looks like this:

Duplicated Email	Master	MasterFirstName	Master LastName	Duplicate1	Duplicate1FirstName	Duplicate1LastName
jm@gp.com	003…03	Jason	Mendoza	003…05	Janet	Della-Denunzio
es@gp.com	003…10	Eleanor	Shellstrop	003…11	Ellie	Shellstrop

A human user can look at this exception report, and guess that on line 2, what’s likely happened is that Janet sometimes uses Jason’s email, but the email is more closely identified with Jason (because the email is “jm@”, and Jason’s last name starts with “M”). The user may then decide to remove the address from Janet’s record.

Looking at row 3, a human user may decide that those records are true duplicates. The user could review this output file, mark the “true dupes” and sort them out, then remove all columns except for “Master” and “Duplicate”, and feed that to DemandTools to complete a de-dupe using an existing DemandTools scenario.

Tweet Share on LinkedIn

Previous Post Next Post

--- ---

Salesforce, Python, SQL, & other ways to put your data where you need it

Need event music? 🎸

Python for Salesforce Real-Life Challenge: NPSP Email Deduping

Table of Contents

Answer

Salesforce NPSP: How Email Addresses Work

De-duping named email Fields within a single record

De-duping named email fields across records