Python for Salesforce Real-Life Challenge: NPSP Email Deduping
12 Nov 2018
Salesforce #AwesomeAdmin, reader, and “Office Hours” attendee Matthew (@matthewsalibi) got Python up and running on his computer, and is ready to dive into looking for duplicates in his Salesforce “Contacts” table. But he’s on the Nonprofit Starter Pack (NPSP), and a given email address might be in any of several fields. Today we’ll lay out his question … stay tuned for the answer!
Thanks, Matthew, for submitting the questions below (and for teaching me not only that Janet the robot has a last name, 😲 but that that there’s a “load known duplicates from a CSV file” option in DemandTools – that is going to save me hours): 🙌
Answer
Click here for the answer to this problem
Salesforce NPSP: How Email Addresses Work
In NPSP (NPSP=Nonprofit Starter Pack, the open-source package that runs 27,000 nonprofits!), there are three named email fields, which, combined with a “Preferred Email” picklist, populate the default Salesforce email field (using Apex code):
(So, if I populate the PersonalEmail field and set the picklist value to Personal, an NPSP trigger writes the PersonalEmail value to the default Email field. The consequence of this is just basically that we ignore the default email field in NPSP, since it’s always populated by automation.)
De-duping named email Fields within a single record
Using Python, it would be cool to be able to look across these named email fields, find places where the same email address is duplicated within a single record, and null out the non-preferred version.
So, taking a dataset like:
FirstName | LastName | Preferred Email | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|
Chidi | Anagonye | Personal | [email protected] | [email protected] | [email protected] |
Tahani | Al-Jamil | Personal | [email protected] | [email protected] | [email protected] |
Jason | Mendoza | Work | [email protected] | [email protected] | [email protected] |
We could output a version where the duplicates that are not in the “preferred” version of the field are nulled out:
FirstName | LastName | Preferred Email | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|
Chidi | Anagonye | Personal | [email protected] | [email protected] | |
Tahani | Al-Jamil | Personal | [email protected] | [email protected] | |
Jason | Mendoza | Work | [email protected] |
If it’s possible to only output changed rows, that would be extra-awesome!
De-duping named email fields across records
One limitation of native Salesforce duplicate management, and also of DemandTools, is that you can only check a field against itself; that is, you can only compare the PersonalEmail field on one record to the PersonalEmail field on another record. So, if [email protected] is in the PersonalEmail field on one record but in the WorkEmail field on a duplicated record, you could not easily see that using DemandTools or native duplicate record jobs.
So, taking a data set like:
SalesforceID | Human-Readable ID | FirstName | LastName | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | [email protected] | ||
003…02 | DDDD | Eleanor | Shellstrop | [email protected] | ||
003…03 | EEEE | Jason | Mendoza | [email protected] | ||
003…04 | FFFF | Eleanor | Shellstrop | [email protected] | [email protected] | |
003…05 | GGGG | Janet | Della-Denunzio | [email protected] | [email protected] | |
003…06 | HHHH | Tahani | Al-Jamil | [email protected] | ||
003…07 | IIII | Chidi | Anagonye | [email protected] | ||
003…08 | JJJJ | Mindy | St. Claire | [email protected] | ||
003…09 | KKKK | Kamilah | Al-Jamil | [email protected] | ||
003…10 | AAAA | Eleanor | Shellstrop | [email protected] | ||
003…11 | CCCC | Ellie | Shellstrop | [email protected] |
We should:
- Check across the named email fields for duplicates
- For identified duplicates, check the first names
- If the first names are an exact match, output a file for DemandTools with the Salesforce IDs that correspond to the (alphabetically) lowest Human-Readable ID in the “Master” column, and any duplicates’ Salesforce IDs in “duplicate” columns, like this:
Master | Duplicate1 | Duplicate2 |
---|---|---|
003…10 | 003…02 | 003…04 |
003…01 | 003…07 |
This produces a file that is ready to go into DemandTools for single table dedupe using the advanced field-level merge criteria we’ve already built, woohoo!
For those records where a duplicate email is found, but there is no FirstName match, it would be great to output an exception report, that looks like this:
Duplicated Email | Master | MasterFirstName | Master LastName | Duplicate1 | Duplicate1FirstName | Duplicate1LastName |
---|---|---|---|---|---|---|
[email protected] | 003…03 | Jason | Mendoza | 003…05 | Janet | Della-Denunzio |
[email protected] | 003…10 | Eleanor | Shellstrop | 003…11 | Ellie | Shellstrop |
A human user can look at this exception report, and guess that on line 2, what’s likely happened is that Janet sometimes uses Jason’s email, but the email is more closely identified with Jason (because the email is “jm@”, and Jason’s last name starts with “M”). The user may then decide to remove the address from Janet’s record.
Looking at row 3, a human user may decide that those records are true duplicates. The user could review this output file, mark the “true dupes” and sort them out, then remove all columns except for “Master” and “Duplicate”, and feed that to DemandTools to complete a de-dupe using an existing DemandTools scenario.