Salesforce, Python, SQL, & other ways to put your data where you need it

Need event music? 🎸

Live and recorded jazz, pop, and meditative music for your virtual conference / Zoom wedding / yoga class / private party with quality sound and a smooth technical experience

Dedupe Salesforce NPSP Emails with Python

15 Nov 2018 🔖 salesforce python questions tutorials intermediate
💬 EN

Table of Contents

Introduction

Blah blah blah

Matthew dit:

Super cool!!! Le code fonctionne avec 5,5 millions d’enregistrements en moins d’une minute, et il a produit un fichier CSV prêt à charger en Salesforce immédiatement. Magnifique x3 !

Les données “input” (“contacts.csv”)

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work [email protected] [email protected] [email protected]
003…02 DDDD Eleanor Shellstrop Personal [email protected]    
003…03 EEEE Jason Mendoza Work [email protected] [email protected] [email protected]
003…04 FFFF Eleanor Shellstrop Alternate [email protected]   [email protected]
003…05 GGGG Janet Della-Denunzio Alternate [email protected]   [email protected]
003…06 HHHH Tahani Al-Jamil Alternate [email protected] [email protected] [email protected]
003…07 IIII Chidi Anagonye Work   [email protected]  
003…08 JJJJ Mindy St. Claire Personal [email protected]    
003…09 KKKK Kamilah Al-Jamil Personal [email protected]    
003…10 AAAA Eleanor Shellstrop Alternate     [email protected]
003…11 CCCC Ellie Shellstrop Work   [email protected]  
003…12 LLLL Mindy St. Claire Personal [email protected] [email protected] [email protected]
003…13 MMMM Mindy St. Claire Work [email protected] [email protected] [email protected]
003…14 NNNN Tahani Al-Jamil Alternate     [email protected]

Les données “output” désirées

“Output1” (“output1.csv”)

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work [email protected] [email protected]  
003…03 EEEE Jason Mendoza Work   [email protected]  
003…06 HHHH Tahani Al-Jamil Alternate   [email protected] [email protected]
003…13 MMMM Mindy St. Claire Work   [email protected]  

“Output2A” (“output2a.csv”)

Master Duplicate_1 Duplicate_2
003…01 003…07  
003…10 003…02 003…04
003…06 003…14  

“Output2B” (“output2b.csv”)

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
[email protected] 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
[email protected] 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

Les script

Le code pour générer “Output1”

import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
inputFilePath = r'c:\example\contacts.csv'
output1FilePath = r'c:\example\output1.csv'

df = pandas.read_csv(contactsCSVString, dtype=object)

df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df.drop(columns=['PrefEmAddr','ChangedAnything'], inplace=True)

output1df.to_csv(output1FilePath, index=False, quoting=1)

Le code pour générer “Output2A” et “Output2B”

INSERT CODE HERE

Le code pour générer “Output1” avec “Output2A” et “Output2B”

INSERT CODE HERE

Explications du code

Explication du code “Output1”

Préciser que l’on se sert de Pandas

On va préciser que l’on va utiliser Pandas, et on mettra certains paramètres de sa sortie visuelle:

import pandas
pandas.set_option('expand_frame_repr', False)

Préciser les noms des champs “mail”

D’abord, on va mettre de côté une liste des noms des colonnes qui représent les adresses mail dans une variable qui s’appelle emFieldNames:

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']

Préciser certaines correspondances d’orthographie

On construira aussi, dans une variable pickToField, un dictionnaire des valeurs de la liste de sélection PreferredEmail, avec les noms des colonnes qu’ils indiquent:

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

Préciser où trouver “contacts.csv”

Avec la variable inputFilePath, on indiquera où trouver, sur notre disque dur, le fichier CSV que l’on va manipuler.

inputFilePath = r'c:\example\contacts.csv'

Préciser où enregistrer “output1.csv”

On précise aussi où enregistrer notre premier fichier de sortie (output1FilePath).

output1FilePath = r'c:\example\output1.csv'

Charger “contacts.csv” au tableau df

Et, bien sûr, il faut lire le fichier CSV avec Python et sa module Pandas:

df = pandas.read_csv(inputfilepath, dtype=object)

Ajouter une colonne ChangedAnything

Après avoir chargé le tableau de contacts avec Python, on y ajoute une nouvelle colonne ChangedAnything, avec toutes ses valeurs False (faux).

df['ChangedAnything'] = False

Nos données rassemblent maintenant à celles-ci:

SalesforceID Human-Readable ID FirstName ChangedAnything
003…01 BBBB Chidi False
003…02 DDDD Eleanor False
003…03 EEEE Jason False
003…04 FFFF Eleanor False
003…05 GGGG Janet False
003…06 HHHH Tahani False
003…07 IIII Chidi False
003…08 JJJJ Mindy False
003…09 KKKK Kamilah False
003…10 AAAA Eleanor False
003…11 CCCC Ellie False
003…12 LLLL Mindy False
003…13 MMMM Mindy False
003…14 NNNN Tahani False

Selectionner les lignes de df contenant de bons valeurs PreferredEmail

Maintenant, on va construire une liste (une série Pandas) des lignes de notre tableau qui ont un valeur sous PreferredEmail valide. C’est à dire, on vérifie si le valeur soit le nom d’une colonne de notre tableau.

On met de côté cette liste sous le nom validPrefEmailIndicatorTFSeries.

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)

Les contenus de validPrefEmailIndicatorTFSeries sont:

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
Name: PreferredEmail, dtype: bool

Ajouter une colonne PrefEmAddr à df

C’est un peu de magie comment marche la ligne de code suivante, mais elle ajoute une nouvelle colonne PrefEmAddr à notre tableau qui contient l’adresse préférée de chaque contact.

df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

Nos données du tableau df rassemblent maintenant à celles-ci:

SalesforceID Human-Readable ID FirstName ChangedAnything PrefEmAddr
003…01 BBBB Chidi False [email protected]
003…02 DDDD Eleanor False [email protected]
003…03 EEEE Jason False [email protected]
003…04 FFFF Eleanor False [email protected]
003…05 GGGG Janet False [email protected]
003…06 HHHH Tahani False [email protected]
003…07 IIII Chidi False [email protected]
003…08 JJJJ Mindy False [email protected]
003…09 KKKK Kamilah False [email protected]
003…10 AAAA Eleanor False [email protected]
003…11 CCCC Ellie False [email protected]
003…12 LLLL Mindy False [email protected]
003…13 MMMM Mindy False [email protected]
003…14 NNNN Tahani False [email protected]

Boucler sur les adresses mail

Ensuite, nous parcourons notre liste de colonnes concernant les adresses. Pour chaque colonne (PersonalEmail,WorkEmail,AlternateEmail). Dans la boucle, le nom de la colonne qu’on est en train d’inspecter est accessible sous le nom fieldName.

for fieldName in emFieldNames:
Selectionner les cellules qui doivent être supprimés

D’abord, pendant chaque passage en boucle, on construit une liste (une série Pandas) des lignes de notre tableau qui sont intéressantes (parce que la ligne porte le même valeur, dans cette colonne fieldName, qu’elle porte dans la colonne PrefEmAddr, sans que fieldName soit la colonne indiquée par PreferredEmail pour cette ligne:

needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])

Pendant le passage du boucle “PersonalEmail”, par exemple. needsClearingTFSer rassemble à:

0     False
1     False
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
dtype: bool
Supprimer les valeurs aux cellules qui en ont besoin

Puis on supprime chaque valeur trouvée aux intersection des lignes de needsClearingTFSer et de la colonne indiquée par fieldName (soit PersonalEmail, soit WorkEmail, soit AlternateEmail, dépendant du passage du boucle).

On fait True (“vrai”) chaque valeur trouvée aux intersection des lignes de needsClearingTFSer et de la colonne ‘ChangedAnything’ (“qqc est changée”).

    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

Après la fin du boucle, nos données du tableau df rassemblent à celles-ci:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail ChangedAnything PrefEmAddr
003…01 BBBB Chidi Anagonye Work [email protected] [email protected]   True [email protected]
003…02 DDDD Eleanor Shellstrop Personal [email protected]     False [email protected]
003…03 EEEE Jason Mendoza Work   [email protected]   True [email protected]
003…04 FFFF Eleanor Shellstrop Alternate [email protected]   [email protected] False [email protected]
003…05 GGGG Janet Della-Denunzio Alternate [email protected]   [email protected] False [email protected]
003…06 HHHH Tahani Al-Jamil Alternate   [email protected] [email protected] True [email protected]
003…07 IIII Chidi Anagonye Work   [email protected]   False [email protected]
003…08 JJJJ Mindy St. Claire Personal [email protected]     False [email protected]
003…09 KKKK Kamilah Al-Jamil Personal [email protected]     False [email protected]
003…10 AAAA Eleanor Shellstrop Alternate     [email protected] False [email protected]
003…11 CCCC Ellie Shellstrop Work   [email protected]   False [email protected]
003…12 LLLL Mindy St. Claire Personal [email protected] [email protected] [email protected] False [email protected]
003…13 MMMM Mindy St. Claire Work   [email protected]   True [email protected]
003…14 NNNN Tahani Al-Jamil Alternate     [email protected] False [email protected]

Ne montrer que les lignes changées

Matthew n’a voulu voir que les lignes où quelque chose est changée, donc on créer une nouvelle variable output1df avec une copie filtrée des contenus de df (pour n’inclure que les lignes où changedAnything soit True):

output1df = df[df['ChangedAnything']]

Nos données du tableau output1df rassemblent à celles-ci:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail ChangedAnything PrefEmAddr
003…01 BBBB Chidi Anagonye Work [email protected] [email protected]   True [email protected]
003…03 EEEE Jason Mendoza Work   [email protected]   True [email protected]
003…06 HHHH Tahani Al-Jamil Alternate   [email protected] [email protected] True [email protected]
003…13 MMMM Mindy St. Claire Work   [email protected]   True [email protected]

Supprimer les colonnes PrefEmAddr et ChangedAnything

On n’a plus besoin des colonnes PrefEmAddr et ChangedAnything, donc on peut les éliminer de notre tableau df:

output1df.drop(columns=['PrefEmAddr','ChangedAnything'], inplace=True)

Enfin, le tableau output1df contient nos données désirées:

SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
003…01 BBBB Chidi Anagonye Work [email protected] [email protected]  
003…03 EEEE Jason Mendoza Work   [email protected]  
003…06 HHHH Tahani Al-Jamil Alternate   [email protected] [email protected]
003…13 MMMM Mindy St. Claire Work   [email protected]  

Enregistrer “output1.csv”

On exporte les contenus de output1df en fichier CSV sur la disque dur avec ce commande:

output1df.to_csv(output1FilePath, index=False, quoting=1)

Explication du code “Output2A” et “Output2B”

Préciser que l’on se sert de Pandas

On va préciser que l’on va utiliser “OrderedDict” et “Pandas,” et on mettra certains paramètres de la sortie visuelle de Pandas:

from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

Préciser les noms des champs “mail”

D’abord, on va mettre de côté une liste des noms des colonnes qui représent les adresses mail dans une variable qui s’appelle emFieldNames:

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']

Préciser certaines correspondances d’orthographie

On construira aussi, dans une variable pickToField, un dictionnaire des valeurs de la liste de sélection PreferredEmail, avec les noms des colonnes qu’ils indiquent:

pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

Préciser où trouver “contacts.csv”

Avec la variable inputFilePath, on indiquera où trouver, sur notre disque dur, le fichier CSV que l’on va manipuler.

inputFilePath = r'c:\example\contacts.csv'

Préciser où enregistrer “output2a.csv” et “output2b.csv”

On précise aussi où enregistrer les fichiers de sortie (output2aFilePath et output2bFilePath).

output2aFilePath = r'c:\example\output2a.csv'
output2bFilePath = r'c:\example\output2b.csv'

Charger “contacts.csv” au tableau df

Et, bien sûr, il faut lire le fichier CSV avec Python et sa module Pandas:

df = pandas.read_csv(inputfilepath, dtype=object)
(identifiant de la ligne) SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
(0) 003…01 BBBB Chidi Anagonye Work [email protected] [email protected] [email protected]
(1) 003…02 DDDD Eleanor Shellstrop Personal [email protected]    
(2) 003…03 EEEE Jason Mendoza Work [email protected] [email protected] [email protected]
(3) 003…04 FFFF Eleanor Shellstrop Alternate [email protected]   [email protected]
(4) 003…05 GGGG Janet Della-Denunzio Alternate [email protected]   [email protected]
(5) 003…06 HHHH Tahani Al-Jamil Alternate [email protected] [email protected] [email protected]
(6) 003…07 IIII Chidi Anagonye Work   [email protected]  
(7) 003…08 JJJJ Mindy St. Claire Personal [email protected]    
(8) 003…09 KKKK Kamilah Al-Jamil Personal [email protected]    
(9) 003…10 AAAA Eleanor Shellstrop Alternate     [email protected]
(10) 003…11 CCCC Ellie Shellstrop Work   [email protected]  
(11) 003…12 LLLL Mindy St. Claire Personal [email protected] [email protected] [email protected]
(12) 003…13 MMMM Mindy St. Claire Work [email protected] [email protected] [email protected]
(13) 003…14 NNNN Tahani Al-Jamil Alternate     [email protected]

Blah blah

df = df.sort_values('Human-Readable ID')
(identifiant de la ligne) SalesforceID Human-Readable ID FirstName FirstName
(9) 003…10 AAAA Eleanor
(0) 003…01 BBBB Chidi
(10) 003…11 CCCC Ellie
(1) 003…02 DDDD Eleanor
(2) 003…03 EEEE Jason
(3) 003…04 FFFF Eleanor
(4) 003…05 GGGG Janet
(5) 003…06 HHHH Tahani
(6) 003…07 IIII Chidi
(7) 003…08 JJJJ Mindy
(8) 003…09 KKKK Kamilah
(11) 003…12 LLLL Mindy
(12) 003…13 MMMM Mindy
(13) 003…14 NNNN Tahani

Blah blah

df = df.reset_index(drop=True)
(identifiant de la ligne) SalesforceID Human-Readable ID FirstName
(0) 003…10 AAAA Eleanor
(1) 003…01 BBBB Chidi
(2) 003…11 CCCC Ellie
(3) 003…02 DDDD Eleanor
(4) 003…03 EEEE Jason
(5) 003…04 FFFF Eleanor
(6) 003…05 GGGG Janet
(7) 003…06 HHHH Tahani
(8) 003…07 IIII Chidi
(9) 003…08 JJJJ Mindy
(10) 003…09 KKKK Kamilah
(11) 003…12 LLLL Mindy
(12) 003…13 MMMM Mindy
(13) 003…14 NNNN Tahani

Blah blah

df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
(identifiant de la ligne) PersonalEmail WorkEmail AlternateEmail
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate)     [email protected]
(BBBB // 003…01 // Chidi // Anagonye // Work) [email protected] [email protected] [email protected]
(CCCC // 003…11 // Ellie // Shellstrop // Work)   [email protected]  
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) [email protected]    
(EEEE // 003…03 // Jason // Mendoza // Work) [email protected] [email protected] [email protected]
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) [email protected]   [email protected]
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) [email protected]   [email protected]
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) [email protected] [email protected] [email protected]
(IIII // 003…07 // Chidi // Anagonye // Work)   [email protected]  
(JJJJ // 003…08 // Mindy // St. Claire // Personal) [email protected]    
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) [email protected]    
(LLLL // 003…12 // Mindy // St. Claire // Personal) [email protected] [email protected] [email protected]
(MMMM // 003…13 // Mindy // St. Claire // Work) [email protected] [email protected] [email protected]
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate)     [email protected]

Blah blah

allEmailsSeries = df[emFieldNames].stack()

Série allEmailsSeries:

(identifiant de la ligne) (cette colonne n’a pas de nom)
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) AlternateEmail: [email protected]
(BBBB // 003…01 // Chidi // Anagonye // Work) PersonalEmail: [email protected]
WorkEmail: [email protected]
AlternateEmail: [email protected]
(CCCC // 003…11 // Ellie // Shellstrop // Work) WorkEmail: [email protected]
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) PersonalEmail: [email protected]
(EEEE // 003…03 // Jason // Mendoza // Work) PersonalEmail: [email protected]
WorkEmail: [email protected]
AlternateEmail: [email protected]
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) PersonalEmail: [email protected]
AlternateEmail: [email protected]
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) PersonalEmail: [email protected]
AlternateEmail: [email protected]
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) PersonalEmail: [email protected]
WorkEmail: [email protected]</br>AlternateEmail: [email protected]
(IIII // 003…07 // Chidi // Anagonye // Work) WorkEmail: [email protected]
(JJJJ // 003…08 // Mindy // St. Claire // Personal) PersonalEmail: [email protected]
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) PersonalEmail: [email protected]
(LLLL // 003…12 // Mindy // St. Claire // Personal) PersonalEmail: [email protected]
WorkEmail: [email protected]
AlternateEmail: [email protected]
(MMMM // 003…13 // Mindy // St. Claire // Work) PersonalEmail: [email protected]
WorkEmail: [email protected]
AlternateEmail: [email protected]
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) AlternateEmail: [email protected]

Blah blah blah

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

Contenus de groupedEmailSeriesGroupItems:

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[1, 8], [0, 6], [0, 0], [1, 1], [2, 2], [0, 1]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[1, 1], [0, 0], [0, 0], [1, 1], [2, 2], [1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[0, 2, 3, 5], [9, 10, 1, 3], [1, 2, 1, 1], [4, 4, 4, 4], [0, 2, 1, 0], [2, 1, 0, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[5], [3], [1], [4], [0], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[6], [4], [3], [2], [0], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[4, 4, 4, 6], [2, 2, 2, 4], [4, 4, 4, 3], [3, 3, 3, 2], [2, 2, 2, 0], [0, 1, 2, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[10], [8], [5], [0], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[11, 11], [11, 11], [6, 6], [5, 5], [1, 1], [1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[9], [7], [6], [5], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[11], [11], [6], [5], [1], [0]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[12, 12, 12], [12, 12, 12], [6, 6, 6], [5, 5, 5], [2, 2, 2], [0, 1, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[7, 7, 13], [5, 5, 13], [7, 7, 7], [0, 0, 0], [0, 0, 0], [0, 2, 2]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Group key:   [email protected]
Group value:   MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
           labels=[[7], [5], [7], [0], [0], [1]],
           names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])

Blah blah blah

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

Contenus de emsAndTheirCellsFound:

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'PersonalEmail'), ('IIII', '003...07', 'Chidi', 'Anagonye', 'Work', 'WorkEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'WorkEmail'), ('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail'), ('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work', 'WorkEmail'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal', 'PersonalEmail'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'WorkEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'AlternateEmail'), ('GGGG', '003...05', 'Janet', 'Della-Denunzio', 'Alternate', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'WorkEmail'), ('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'PersonalEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'WorkEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'AlternateEmail')]

Group key we'll soon call "emAddr" while looping through these groups:
         [email protected]
Group value we'll soon call "foundCells" while looping through these groups:
         [('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'PersonalEmail'), ('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail'), ('NNNN', '003...14', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail')]

Blah blah

Pendant le boucle [email protected]:, juste après if cellLoopCounter == numCellsFound-1:, seenFirstNamesAndTheirContacts ressemble à:

Group key:   Eleanor
Group value:   [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate')]

Group key:   Ellie
Group value:   [('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work')]

Et en ce moment, output2bFact ressemble à:

Group key:   emAddr
Group value:   [email protected]

Group key:   Master
Group value:   003...10

Group key:   Master_FN
Group value:   Eleanor

Group key:   Master_LN
Group value:   Shellstrop

Group key:   Duplicate_1
Group value:   003...11

Group key:   Duplicate_1_FN
Group value:   Ellie

Group key:   Duplicate_1_LN
Group value:   Shellstrop

Blah blah

Après la boucle for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():, avant if 'Duplicate_1' in output2aFact:, output2aFact ressemble à:

Group key:   Master
Group value:   003...10

Group key:   Duplicate_1
Group value:   003...02

Group key:   Duplicate_2
Group value:   003...04

Blah blah apres le boucle

output2aList:

[OrderedDict([('Master', '003...01'), ('Duplicate_1', '003...07')]), OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'), ('Duplicate_2', '003...04')]), OrderedDict([('Master', '003...06'), ('Duplicate_1', '003...14')])]

output2bList:

[OrderedDict([('emAddr', '[email protected]'), ('Master', '003...10'), ('Master_FN', 'Eleanor'), ('Master_LN', 'Shellstrop'), ('Duplicate_1', '003...11'), ('Duplicate_1_FN', 'Ellie'), ('Duplicate_1_LN', 'Shellstrop')])]

Blah blah blah donc en forme de tableau:

Master Duplicate_1 Duplicate_2
003…01 003…07  
003…10 003…02 003…04
003…06 003…14  

et

emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
[email protected] 003…10 Eleanor Shellstrop 003…11 Ellie Shellstrop
[email protected] 003…03 Jason Mendoza 003…05 Janet Della-Denunzio

For Matthew

Instructions:

  1. For your own data, you don’t need the from io import StringIO or the contactsCSVString builder
  2. For your own data, replace the contactsCSVString builder with inputFilePath = 'PATH_TO_YOUR_FILE' and replace contactsCSVString with inputFilePath in the pandas.read_csv(...)
from io import StringIO

contactsCSVString = StringIO("""
SalesforceID,Human-Readable ID,FirstName,LastName,PreferredEmail,PersonalEmail,WorkEmail,AlternateEmail
003...01,BBBB,Chidi,Anagonye,Work,[email protected],[email protected],[email protected]
003...02,DDDD,Eleanor,Shellstrop,Personal,[email protected],,
003...03,EEEE,Jason,Mendoza,Work,[email protected],[email protected],[email protected]
003...04,FFFF,Eleanor,Shellstrop,Alternate,[email protected],,[email protected]
003...05,GGGG,Janet,Della-Denunzio,Alternate,[email protected],,[email protected]
003...06,HHHH,Tahani,Al-Jamil,Alternate,[email protected],[email protected],[email protected]
003...07,IIII,Chidi,Anagonye,Work,,[email protected],
003...08,JJJJ,Mindy,St. Claire,Personal,[email protected],,
003...09,KKKK,Kamilah,Al-Jamil,Personal,[email protected],,
003...10,AAAA,Eleanor,Shellstrop,Alternate,,,[email protected]
003...11,CCCC,Ellie,Shellstrop,Work,,[email protected],
003...12,LLLL,Mindy,St. Claire,Personal,[email protected],[email protected],[email protected]
003...13,MMMM,Mindy,St. Claire,Work,[email protected],[email protected],[email protected]
003...14,NNNN,Tahani,Al-Jamil,Alternate,,,[email protected]
    """)

import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}

df = pandas.read_csv(contactsCSVString, dtype=object)

print('\r\n---Here is what our CSV file looks like the moment we read it in---')
print(df)

# Add a "ChangedAnything" temporary column, filled in False all the way down
df['ChangedAnything'] = False

# Add a "PrefEmAddr" temporary column, filled with the actual value of the preferred email address all the way down.
# Credit to https://stackoverflow.com/a/45487394/2321195 for the ".lookup()" function that lets us decide, dynamically, which column to fetch a value from.
# Note that .lookup() doesn't work well if there's a null value, or a misspelled value, in "PreferredEmail," so we have to do some extra backflips to skip over such values.
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

# Loop over each of the 3 email columns, one at a time.
for fieldName in emFieldNames:
    # For the given column we're inspecting, build a True/False "series" indicating, for each row, whether the column is of interest to us.
    # "Of interest" means that this column is NOT the "preferred" address for a given row,
    # but that the value for that row in this column is the same as the actual preferred email address.
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    # If we've established that the row is "of interest,"
    # then clear out the email address and flip the "ChangedAnything" column's value
    # for this row to True.
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

# Replace "df" with a table of just the rows where something changed.
# (Note that "LLLL"'s 2 hotmails weren't "fixed" because we have no "rule" by which to decide which to keep.)
out1df = df[df['ChangedAnything']]

# Get rid of our temporary columns
out1df = out1df.drop(['PrefEmAddr','ChangedAnything'], axis=1)

print('\r\n---Here is what our output CSV file looks like---')
print('---  Be sure to Data-Load this promptly after creating it from freshly-downloaded data, ---')
print('---    because some email addresses that merely "started blank"---')
print('---    may have been filled in by users in the meantime, and---')
print('---    we\'re about to do a "don\'t ignore nulls" data load.---')
print(out1df)


# ================================================

from collections import OrderedDict

df = df.sort_values('Human-Readable ID').reset_index(drop=True).set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in df[emFieldNames].stack().to_frame('ems').groupby('ems')['ems'].groups.items() if len(indices) > 1}

all01s = []
all01sExtended = []
all02s = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    #print('\r\n---foundCells for emAddr ' + emAddr + ' are---','\r\n',foundCells) # Inspect this item from "emsAndTheirCellsFound"
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output1Fact = OrderedDict()
    output1FactExtended = OrderedDict()
    output2Fact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2Fact['emAddr'] = emAddr
                    output2Fact['Master'] = cell[1]
                    output2Fact['Master_FN'] = cell[2]
                    output2Fact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2Fact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2Fact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2Fact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2Fact:
                    all02s.append(output2Fact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output1FactExtended['Dummy_emAddr'] = emAddr
                                output1FactExtended['Dummy_FN'] = contact[2]
                                output1Fact['Master'] = contact[1]
                                output1FactExtended['Master'] = contact[1]
                                output1FactExtended['Dummy_Master_LN'] = contact[3]
                            else:
                                output1Fact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                                output1FactExtended['Duplicate_'+str(contactLoopCounter)]=contact[1]
                                output1FactExtended['Dummy_Duplicate_'+str(contactLoopCounter)+'_LN']=contact[3]
                if 'Duplicate_1' in output1Fact:
                    all01s.append(output1Fact)
                    all01sExtended.append(output1FactExtended)
    #print('\r\n',emAddr,seenFirstNamesAndTheirContacts) # Make sure we built "seenFirstNamesAndTheirContacts" nicely
    #print('\r\n', output1Fact) # Make sure we built "output1Fact" nicely
    #print('\r\n', output2Fact) # Make sure we built "output2Fact" nicely
#print('\r\n---output2A is---')
#[print('\r\n', x) for x in all01s] # Make sure we built "all02s" nicely
#print('\r\n---output2B is---')
#[print('\r\n', x) for x in all02s] # Make sure we built "all02s" nicely


print('\r\n---output2A, ready for DemandTools, is---')
if len(all01s) > 0:
    out1df = pandas.DataFrame(all01s)
    print(out1df)
else:
    print('Empty output2A')
    
print('\r\n---output2A-Extended, ready for human, is---')
if len(all01sExtended) > 0:
    out1Extendeddf = pandas.DataFrame(all01sExtended)
    print(out1Extendeddf)
else:
    print('Empty output2A-Extended')
    
print('\r\n---output2B (email-only dupes), ready for human, is---')
if len(all02s) > 0:
    out2df = pandas.DataFrame(all02s)
    print(out2df)
else:
    print('Empty output2B')

Output:

---Here is what our CSV file looks like the moment we read it in---
   SalesforceID Human-Readable ID FirstName        LastName PreferredEmail        PersonalEmail       WorkEmail  AlternateEmail
0      003...01              BBBB     Chidi        Anagonye           Work            [email protected]  [email protected]  [email protected]
1      003...02              DDDD   Eleanor      Shellstrop       Personal            [email protected]             NaN             NaN
2      003...03              EEEE     Jason         Mendoza           Work            [email protected]       [email protected]       [email protected]
3      003...04              FFFF   Eleanor      Shellstrop      Alternate       [email protected]             NaN       [email protected]
4      003...05              GGGG     Janet  Della-Denunzio      Alternate            [email protected]             NaN       [email protected]
5      003...06              HHHH    Tahani        Al-Jamil      Alternate            [email protected]    [email protected]       [email protected]
6      003...07              IIII     Chidi        Anagonye           Work                  NaN       [email protected]             NaN
7      003...08              JJJJ     Mindy      St. Claire       Personal            [email protected]             NaN             NaN
8      003...09              KKKK   Kamilah        Al-Jamil       Personal  [email protected]             NaN             NaN
9      003...10              AAAA   Eleanor      Shellstrop      Alternate                  NaN             NaN       [email protected]
10     003...11              CCCC     Ellie      Shellstrop           Work                  NaN       [email protected]             NaN
11     003...12              LLLL     Mindy      St. Claire       Personal         [email protected]  [email protected]  [email protected]
12     003...13              MMMM     Mindy      St. Claire           Work             [email protected]        [email protected]        [email protected]
13     003...14              NNNN    Tahani        Al-Jamil      Alternate                  NaN             NaN       [email protected]

---Here is what our output CSV file looks like---
---  Be sure to Data-Load this promptly after creating it from freshly-downloaded data, ---
---    because some email addresses that merely "started blank"---
---    may have been filled in by users in the meantime, and---
---    we're about to do a "don't ignore nulls" data load.---
   SalesforceID Human-Readable ID FirstName    LastName PreferredEmail PersonalEmail       WorkEmail AlternateEmail
0      003...01              BBBB     Chidi    Anagonye           Work     [email protected]  [email protected]           None
2      003...03              EEEE     Jason     Mendoza           Work          None       [email protected]           None
5      003...06              HHHH    Tahani    Al-Jamil      Alternate          None    [email protected]      [email protected]
12     003...13              MMMM     Mindy  St. Claire           Work          None        [email protected]           None

---output2A, ready for DemandTools, is---
     Master Duplicate_1 Duplicate_2
0  003...01    003...07         NaN
1  003...10    003...02    003...04
2  003...06    003...14         NaN

---output2A, ready for human, is---
  Dummy_emAddr Dummy_FN    Master Dummy_Master_LN Duplicate_1 Dummy_Duplicate_1_LN Duplicate_2 Dummy_Duplicate_2_LN
0    [email protected]    Chidi  003...01        Anagonye    003...07             Anagonye         NaN                  NaN
1    [email protected]  Eleanor  003...10      Shellstrop    003...02           Shellstrop    003...04           Shellstrop
2    [email protected]   Tahani  003...06        Al-Jamil    003...14             Al-Jamil         NaN                  NaN

---output2B (email-only dupes), ready for human, is---
      emAddr    Master Master_FN   Master_LN Duplicate_1 Duplicate_1_FN  Duplicate_1_LN
0  [email protected]  003...10   Eleanor  Shellstrop    003...11          Ellie      Shellstrop
1  [email protected]  003...03     Jason     Mendoza    003...05          Janet  Della-Denunzio
--- ---