Dedupe Salesforce NPSP Emails with Python
15 Nov 2018
Table of Contents
- Introduction
- Les données “input” (“contacts.csv”)
- Les données “output” désirées
- Les script
- Explications du code
- Explication du code “Output1”
- Préciser que l’on se sert de Pandas
- Préciser les noms des champs “mail”
- Préciser certaines correspondances d’orthographie
- Préciser où trouver “contacts.csv”
- Préciser où enregistrer “output1.csv”
- Charger “contacts.csv” au tableau
df
- Ajouter une colonne
ChangedAnything
- Selectionner les lignes de
df
contenant de bons valeursPreferredEmail
- Ajouter une colonne
PrefEmAddr
àdf
- Boucler sur les adresses mail
- Ne montrer que les lignes changées
- Supprimer les colonnes
PrefEmAddr
etChangedAnything
- Enregistrer “output1.csv”
- Explication du code “Output2A” et “Output2B”
- Préciser que l’on se sert de Pandas
- Préciser les noms des champs “mail”
- Préciser certaines correspondances d’orthographie
- Préciser où trouver “contacts.csv”
- Préciser où enregistrer “output2a.csv” et “output2b.csv”
- Charger “contacts.csv” au tableau
df
- Blah blah
- Blah blah
- Blah blah
- Blah blah
- Blah blah blah
- Blah blah blah
- Blah blah
- Blah blah
- Blah blah apres le boucle
- Blah blah blah donc en forme de tableau:
- Explication du code “Output1”
- For Matthew
Introduction
Blah blah blah
Matthew dit:
Super cool!!! Le code fonctionne avec 5,5 millions d’enregistrements en moins d’une minute, et il a produit un fichier CSV prêt à charger en Salesforce immédiatement. Magnifique x3 !
Les données “input” (“contacts.csv”)
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | [email protected] |
003…02 | DDDD | Eleanor | Shellstrop | Personal | [email protected] | ||
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | [email protected] | [email protected] |
003…04 | FFFF | Eleanor | Shellstrop | Alternate | [email protected] | [email protected] | |
003…05 | GGGG | Janet | Della-Denunzio | Alternate | [email protected] | [email protected] | |
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | [email protected] |
003…07 | IIII | Chidi | Anagonye | Work | [email protected] | ||
003…08 | JJJJ | Mindy | St. Claire | Personal | [email protected] | ||
003…09 | KKKK | Kamilah | Al-Jamil | Personal | [email protected] | ||
003…10 | AAAA | Eleanor | Shellstrop | Alternate | [email protected] | ||
003…11 | CCCC | Ellie | Shellstrop | Work | [email protected] | ||
003…12 | LLLL | Mindy | St. Claire | Personal | [email protected] | [email protected] | [email protected] |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | [email protected] | [email protected] |
003…14 | NNNN | Tahani | Al-Jamil | Alternate | [email protected] |
Les données “output” désirées
“Output1” (“output1.csv”)
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | |
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | ||
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] |
“Output2A” (“output2a.csv”)
Master | Duplicate_1 | Duplicate_2 |
---|---|---|
003…01 | 003…07 | |
003…10 | 003…02 | 003…04 |
003…06 | 003…14 |
“Output2B” (“output2b.csv”)
emAddr | Master | Master_FN | Master_LN | Duplicate_1 | Duplicate_1_FN | Duplicate_1_LN |
---|---|---|---|---|---|---|
[email protected] | 003…10 | Eleanor | Shellstrop | 003…11 | Ellie | Shellstrop |
[email protected] | 003…03 | Jason | Mendoza | 003…05 | Janet | Della-Denunzio |
Les script
Le code pour générer “Output1”
import pandas
pandas.set_option('expand_frame_repr', False)
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
inputFilePath = r'c:\example\contacts.csv'
output1FilePath = r'c:\example\output1.csv'
df = pandas.read_csv(contactsCSVString, dtype=object)
df['ChangedAnything'] = False
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))
for fieldName in emFieldNames:
needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
df.loc[needsClearingTFSer,fieldName] = None
df.loc[needsClearingTFSer,'ChangedAnything'] = True
output1df = df[df['ChangedAnything']]
output1df.drop(columns=['PrefEmAddr','ChangedAnything'], inplace=True)
output1df.to_csv(output1FilePath, index=False, quoting=1)
Le code pour générer “Output2A” et “Output2B”
INSERT CODE HERE
Le code pour générer “Output1” avec “Output2A” et “Output2B”
INSERT CODE HERE
Explications du code
Explication du code “Output1”
Préciser que l’on se sert de Pandas
On va préciser que l’on va utiliser Pandas, et on mettra certains paramètres de sa sortie visuelle:
import pandas
pandas.set_option('expand_frame_repr', False)
Préciser les noms des champs “mail”
D’abord, on va mettre de côté une liste des noms des colonnes qui représent les adresses mail dans une variable qui s’appelle emFieldNames
:
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
Préciser certaines correspondances d’orthographie
On construira aussi, dans une variable pickToField
, un dictionnaire des valeurs de la liste de sélection PreferredEmail
, avec les noms des colonnes qu’ils indiquent:
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
Préciser où trouver “contacts.csv”
Avec la variable inputFilePath
, on indiquera où trouver, sur notre disque dur, le fichier CSV que l’on va manipuler.
inputFilePath = r'c:\example\contacts.csv'
Préciser où enregistrer “output1.csv”
On précise aussi où enregistrer notre premier fichier de sortie (output1FilePath
).
output1FilePath = r'c:\example\output1.csv'
Charger “contacts.csv” au tableau df
Et, bien sûr, il faut lire le fichier CSV avec Python et sa module Pandas:
df = pandas.read_csv(inputfilepath, dtype=object)
Ajouter une colonne ChangedAnything
Après avoir chargé le tableau de contacts avec Python, on y ajoute une nouvelle colonne ChangedAnything
, avec toutes ses valeurs False
(faux).
df['ChangedAnything'] = False
Nos données rassemblent maintenant à celles-ci:
SalesforceID | Human-Readable ID | FirstName | … | ChangedAnything |
---|---|---|---|---|
003…01 | BBBB | Chidi | … | False |
003…02 | DDDD | Eleanor | … | False |
003…03 | EEEE | Jason | … | False |
003…04 | FFFF | Eleanor | … | False |
003…05 | GGGG | Janet | … | False |
003…06 | HHHH | Tahani | … | False |
003…07 | IIII | Chidi | … | False |
003…08 | JJJJ | Mindy | … | False |
003…09 | KKKK | Kamilah | … | False |
003…10 | AAAA | Eleanor | … | False |
003…11 | CCCC | Ellie | … | False |
003…12 | LLLL | Mindy | … | False |
003…13 | MMMM | Mindy | … | False |
003…14 | NNNN | Tahani | … | False |
Selectionner les lignes de df
contenant de bons valeurs PreferredEmail
Maintenant, on va construire une liste (une série Pandas) des lignes de notre tableau qui ont un valeur sous PreferredEmail
valide. C’est à dire, on vérifie si le valeur soit le nom d’une colonne de notre tableau.
On met de côté cette liste sous le nom validPrefEmailIndicatorTFSeries
.
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
Les contenus de validPrefEmailIndicatorTFSeries
sont:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
Name: PreferredEmail, dtype: bool
Ajouter une colonne PrefEmAddr
à df
C’est un peu de magie comment marche la ligne de code suivante, mais elle ajoute une nouvelle colonne PrefEmAddr
à notre tableau qui contient l’adresse préférée de chaque contact.
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))
Nos données du tableau df
rassemblent maintenant à celles-ci:
SalesforceID | Human-Readable ID | FirstName | … | ChangedAnything | PrefEmAddr |
---|---|---|---|---|---|
003…01 | BBBB | Chidi | … | False | [email protected] |
003…02 | DDDD | Eleanor | … | False | [email protected] |
003…03 | EEEE | Jason | … | False | [email protected] |
003…04 | FFFF | Eleanor | … | False | [email protected] |
003…05 | GGGG | Janet | … | False | [email protected] |
003…06 | HHHH | Tahani | … | False | [email protected] |
003…07 | IIII | Chidi | … | False | [email protected] |
003…08 | JJJJ | Mindy | … | False | [email protected] |
003…09 | KKKK | Kamilah | … | False | [email protected] |
003…10 | AAAA | Eleanor | … | False | [email protected] |
003…11 | CCCC | Ellie | … | False | [email protected] |
003…12 | LLLL | Mindy | … | False | [email protected] |
003…13 | MMMM | Mindy | … | False | [email protected] |
003…14 | NNNN | Tahani | … | False | [email protected] |
Boucler sur les adresses mail
Ensuite, nous parcourons notre liste de colonnes concernant les adresses. Pour chaque colonne (PersonalEmail
,WorkEmail
,AlternateEmail
). Dans la boucle, le nom de la colonne qu’on est en train d’inspecter est accessible sous le nom fieldName
.
for fieldName in emFieldNames:
Selectionner les cellules qui doivent être supprimés
D’abord, pendant chaque passage en boucle, on construit une liste (une série Pandas) des lignes de notre tableau qui sont intéressantes (parce que la ligne porte le même valeur, dans cette colonne fieldName
, qu’elle porte dans la colonne PrefEmAddr
, sans que fieldName
soit la colonne indiquée par PreferredEmail
pour cette ligne:
needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
Pendant le passage du boucle “PersonalEmail
”, par exemple. needsClearingTFSer
rassemble à:
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
dtype: bool
Supprimer les valeurs aux cellules qui en ont besoin
Puis on supprime chaque valeur trouvée aux intersection des lignes de needsClearingTFSer
et de la colonne indiquée par fieldName
(soit PersonalEmail
, soit WorkEmail
, soit AlternateEmail
, dépendant du passage du boucle).
On fait True
(“vrai”) chaque valeur trouvée aux intersection des lignes de needsClearingTFSer
et de la colonne ‘ChangedAnything’ (“qqc est changée”).
df.loc[needsClearingTFSer,fieldName] = None
df.loc[needsClearingTFSer,'ChangedAnything'] = True
Après la fin du boucle, nos données du tableau df
rassemblent à celles-ci:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail | ChangedAnything | PrefEmAddr |
---|---|---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | True | [email protected] | |
003…02 | DDDD | Eleanor | Shellstrop | Personal | [email protected] | False | [email protected] | ||
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | True | [email protected] | ||
003…04 | FFFF | Eleanor | Shellstrop | Alternate | [email protected] | [email protected] | False | [email protected] | |
003…05 | GGGG | Janet | Della-Denunzio | Alternate | [email protected] | [email protected] | False | [email protected] | |
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | True | [email protected] | |
003…07 | IIII | Chidi | Anagonye | Work | [email protected] | False | [email protected] | ||
003…08 | JJJJ | Mindy | St. Claire | Personal | [email protected] | False | [email protected] | ||
003…09 | KKKK | Kamilah | Al-Jamil | Personal | [email protected] | False | [email protected] | ||
003…10 | AAAA | Eleanor | Shellstrop | Alternate | [email protected] | False | [email protected] | ||
003…11 | CCCC | Ellie | Shellstrop | Work | [email protected] | False | [email protected] | ||
003…12 | LLLL | Mindy | St. Claire | Personal | [email protected] | [email protected] | [email protected] | False | [email protected] |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | True | [email protected] | ||
003…14 | NNNN | Tahani | Al-Jamil | Alternate | [email protected] | False | [email protected] |
Ne montrer que les lignes changées
Matthew n’a voulu voir que les lignes où quelque chose est changée, donc on créer une nouvelle variable output1df
avec une copie filtrée des contenus de df
(pour n’inclure que les lignes où changedAnything
soit True
):
output1df = df[df['ChangedAnything']]
Nos données du tableau output1df
rassemblent à celles-ci:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail | ChangedAnything | PrefEmAddr |
---|---|---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | True | [email protected] | |
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | True | [email protected] | ||
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | True | [email protected] | |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | True | [email protected] |
Supprimer les colonnes PrefEmAddr
et ChangedAnything
On n’a plus besoin des colonnes PrefEmAddr
et ChangedAnything
, donc on peut les éliminer de notre tableau df
:
output1df.drop(columns=['PrefEmAddr','ChangedAnything'], inplace=True)
Enfin, le tableau output1df
contient nos données désirées:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | |
003…03 | EEEE | Jason | Mendoza | Work | [email protected] | ||
003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | |
003…13 | MMMM | Mindy | St. Claire | Work | [email protected] |
Enregistrer “output1.csv”
On exporte les contenus de output1df
en fichier CSV sur la disque dur avec ce commande:
output1df.to_csv(output1FilePath, index=False, quoting=1)
Explication du code “Output2A” et “Output2B”
Préciser que l’on se sert de Pandas
On va préciser que l’on va utiliser “OrderedDict” et “Pandas,” et on mettra certains paramètres de la sortie visuelle de Pandas:
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)
Préciser les noms des champs “mail”
D’abord, on va mettre de côté une liste des noms des colonnes qui représent les adresses mail dans une variable qui s’appelle emFieldNames
:
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
Préciser certaines correspondances d’orthographie
On construira aussi, dans une variable pickToField
, un dictionnaire des valeurs de la liste de sélection PreferredEmail
, avec les noms des colonnes qu’ils indiquent:
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
Préciser où trouver “contacts.csv”
Avec la variable inputFilePath
, on indiquera où trouver, sur notre disque dur, le fichier CSV que l’on va manipuler.
inputFilePath = r'c:\example\contacts.csv'
Préciser où enregistrer “output2a.csv” et “output2b.csv”
On précise aussi où enregistrer les fichiers de sortie (output2aFilePath
et output2bFilePath
).
output2aFilePath = r'c:\example\output2a.csv'
output2bFilePath = r'c:\example\output2b.csv'
Charger “contacts.csv” au tableau df
Et, bien sûr, il faut lire le fichier CSV avec Python et sa module Pandas:
df = pandas.read_csv(inputfilepath, dtype=object)
(identifiant de la ligne) | SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|---|
(0) | 003…01 | BBBB | Chidi | Anagonye | Work | [email protected] | [email protected] | [email protected] |
(1) | 003…02 | DDDD | Eleanor | Shellstrop | Personal | [email protected] | ||
(2) | 003…03 | EEEE | Jason | Mendoza | Work | [email protected] | [email protected] | [email protected] |
(3) | 003…04 | FFFF | Eleanor | Shellstrop | Alternate | [email protected] | [email protected] | |
(4) | 003…05 | GGGG | Janet | Della-Denunzio | Alternate | [email protected] | [email protected] | |
(5) | 003…06 | HHHH | Tahani | Al-Jamil | Alternate | [email protected] | [email protected] | [email protected] |
(6) | 003…07 | IIII | Chidi | Anagonye | Work | [email protected] | ||
(7) | 003…08 | JJJJ | Mindy | St. Claire | Personal | [email protected] | ||
(8) | 003…09 | KKKK | Kamilah | Al-Jamil | Personal | [email protected] | ||
(9) | 003…10 | AAAA | Eleanor | Shellstrop | Alternate | [email protected] | ||
(10) | 003…11 | CCCC | Ellie | Shellstrop | Work | [email protected] | ||
(11) | 003…12 | LLLL | Mindy | St. Claire | Personal | [email protected] | [email protected] | [email protected] |
(12) | 003…13 | MMMM | Mindy | St. Claire | Work | [email protected] | [email protected] | [email protected] |
(13) | 003…14 | NNNN | Tahani | Al-Jamil | Alternate | [email protected] |
Blah blah
df = df.sort_values('Human-Readable ID')
(identifiant de la ligne) | SalesforceID | Human-Readable ID | FirstName | FirstName |
---|---|---|---|---|
(9) | 003…10 | AAAA | Eleanor | … |
(0) | 003…01 | BBBB | Chidi | … |
(10) | 003…11 | CCCC | Ellie | … |
(1) | 003…02 | DDDD | Eleanor | … |
(2) | 003…03 | EEEE | Jason | … |
(3) | 003…04 | FFFF | Eleanor | … |
(4) | 003…05 | GGGG | Janet | … |
(5) | 003…06 | HHHH | Tahani | … |
(6) | 003…07 | IIII | Chidi | … |
(7) | 003…08 | JJJJ | Mindy | … |
(8) | 003…09 | KKKK | Kamilah | … |
(11) | 003…12 | LLLL | Mindy | … |
(12) | 003…13 | MMMM | Mindy | … |
(13) | 003…14 | NNNN | Tahani | … |
Blah blah
df = df.reset_index(drop=True)
(identifiant de la ligne) | SalesforceID | Human-Readable ID | FirstName | … |
---|---|---|---|---|
(0) | 003…10 | AAAA | Eleanor | … |
(1) | 003…01 | BBBB | Chidi | … |
(2) | 003…11 | CCCC | Ellie | … |
(3) | 003…02 | DDDD | Eleanor | … |
(4) | 003…03 | EEEE | Jason | … |
(5) | 003…04 | FFFF | Eleanor | … |
(6) | 003…05 | GGGG | Janet | … |
(7) | 003…06 | HHHH | Tahani | … |
(8) | 003…07 | IIII | Chidi | … |
(9) | 003…08 | JJJJ | Mindy | … |
(10) | 003…09 | KKKK | Kamilah | … |
(11) | 003…12 | LLLL | Mindy | … |
(12) | 003…13 | MMMM | Mindy | … |
(13) | 003…14 | NNNN | Tahani | … |
Blah blah
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
(identifiant de la ligne) | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) | [email protected] | ||
(BBBB // 003…01 // Chidi // Anagonye // Work) | [email protected] | [email protected] | [email protected] |
(CCCC // 003…11 // Ellie // Shellstrop // Work) | [email protected] | ||
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) | [email protected] | ||
(EEEE // 003…03 // Jason // Mendoza // Work) | [email protected] | [email protected] | [email protected] |
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) | [email protected] | [email protected] | |
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) | [email protected] | [email protected] | |
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) | [email protected] | [email protected] | [email protected] |
(IIII // 003…07 // Chidi // Anagonye // Work) | [email protected] | ||
(JJJJ // 003…08 // Mindy // St. Claire // Personal) | [email protected] | ||
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) | [email protected] | ||
(LLLL // 003…12 // Mindy // St. Claire // Personal) | [email protected] | [email protected] | [email protected] |
(MMMM // 003…13 // Mindy // St. Claire // Work) | [email protected] | [email protected] | [email protected] |
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) | [email protected] |
Blah blah
allEmailsSeries = df[emFieldNames].stack()
Série allEmailsSeries
:
(identifiant de la ligne) | (cette colonne n’a pas de nom) |
---|---|
(AAAA // 003…10 // Eleanor // Shellstrop // Alternate) | AlternateEmail: [email protected] |
(BBBB // 003…01 // Chidi // Anagonye // Work) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(CCCC // 003…11 // Ellie // Shellstrop // Work) | WorkEmail: [email protected] |
(DDDD // 003…02 // Eleanor // Shellstrop // Personal) | PersonalEmail: [email protected] |
(EEEE // 003…03 // Jason // Mendoza // Work) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(FFFF // 003…04 // Eleanor // Shellstrop // Alternate) | PersonalEmail: [email protected] AlternateEmail: [email protected] |
(GGGG // 003…05 // Janet // Della-Denunzio // Alternate) | PersonalEmail: [email protected] AlternateEmail: [email protected] |
(HHHH // 003…06 // Tahani // Al-Jamil // Alternate) | PersonalEmail: [email protected] WorkEmail: [email protected]</br>AlternateEmail: [email protected] |
(IIII // 003…07 // Chidi // Anagonye // Work) | WorkEmail: [email protected] |
(JJJJ // 003…08 // Mindy // St. Claire // Personal) | PersonalEmail: [email protected] |
(KKKK // 003…09 // Kamilah // Al-Jamil // Personal) | PersonalEmail: [email protected] |
(LLLL // 003…12 // Mindy // St. Claire // Personal) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(MMMM // 003…13 // Mindy // St. Claire // Work) | PersonalEmail: [email protected] WorkEmail: [email protected] AlternateEmail: [email protected] |
(NNNN // 003…14 // Tahani // Al-Jamil // Alternate) | AlternateEmail: [email protected] |
Blah blah blah
groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()
Contenus de groupedEmailSeriesGroupItems
:
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[1, 8], [0, 6], [0, 0], [1, 1], [2, 2], [0, 1]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[1, 1], [0, 0], [0, 0], [1, 1], [2, 2], [1, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[0, 2, 3, 5], [9, 10, 1, 3], [1, 2, 1, 1], [4, 4, 4, 4], [0, 2, 1, 0], [2, 1, 0, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[5], [3], [1], [4], [0], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[6], [4], [3], [2], [0], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[4, 4, 4, 6], [2, 2, 2, 4], [4, 4, 4, 3], [3, 3, 3, 2], [2, 2, 2, 0], [0, 1, 2, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[10], [8], [5], [0], [1], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[11, 11], [11, 11], [6, 6], [5, 5], [1, 1], [1, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[9], [7], [6], [5], [1], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[11], [11], [6], [5], [1], [0]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[12, 12, 12], [12, 12, 12], [6, 6, 6], [5, 5, 5], [2, 2, 2], [0, 1, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[7, 7, 13], [5, 5, 13], [7, 7, 7], [0, 0, 0], [0, 0, 0], [0, 2, 2]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Group key: [email protected]
Group value: MultiIndex(levels=[['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF', 'GGGG', 'HHHH', 'IIII', 'JJJJ', 'KKKK', 'LLLL', 'MMMM', 'NNNN'], ['003...01', '003...02', '003...03', '003...04', '003...05', '003...06', '003...07', '003...08', '003...09', '003...10', '003...11', '003...12', '003...13', '003...14'], ['Chidi', 'Eleanor', 'Ellie', 'Janet', 'Jason', 'Kamilah', 'Mindy', 'Tahani'], ['Al-Jamil', 'Anagonye', 'Della-Denunzio', 'Mendoza', 'Shellstrop', 'St. Claire'], ['Alternate', 'Personal', 'Work'], ['PersonalEmail', 'WorkEmail', 'AlternateEmail']],
labels=[[7], [5], [7], [0], [0], [1]],
names=['Human-Readable ID', 'SalesforceID', 'FirstName', 'LastName', 'PreferredEmail', None])
Blah blah blah
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}
Contenus de emsAndTheirCellsFound
:
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'PersonalEmail'), ('IIII', '003...07', 'Chidi', 'Anagonye', 'Work', 'WorkEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'WorkEmail'), ('BBBB', '003...01', 'Chidi', 'Anagonye', 'Work', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail'), ('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work', 'WorkEmail'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal', 'PersonalEmail'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'PersonalEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'WorkEmail'), ('EEEE', '003...03', 'Jason', 'Mendoza', 'Work', 'AlternateEmail'), ('GGGG', '003...05', 'Janet', 'Della-Denunzio', 'Alternate', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'WorkEmail'), ('LLLL', '003...12', 'Mindy', 'St. Claire', 'Personal', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'PersonalEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'WorkEmail'), ('MMMM', '003...13', 'Mindy', 'St. Claire', 'Work', 'AlternateEmail')]
Group key we'll soon call "emAddr" while looping through these groups:
[email protected]
Group value we'll soon call "foundCells" while looping through these groups:
[('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'PersonalEmail'), ('HHHH', '003...06', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail'), ('NNNN', '003...14', 'Tahani', 'Al-Jamil', 'Alternate', 'AlternateEmail')]
Blah blah
Pendant le boucle [email protected]
:, juste après if cellLoopCounter == numCellsFound-1:
, seenFirstNamesAndTheirContacts
ressemble à:
Group key: Eleanor
Group value: [('AAAA', '003...10', 'Eleanor', 'Shellstrop', 'Alternate'), ('DDDD', '003...02', 'Eleanor', 'Shellstrop', 'Personal'), ('FFFF', '003...04', 'Eleanor', 'Shellstrop', 'Alternate')]
Group key: Ellie
Group value: [('CCCC', '003...11', 'Ellie', 'Shellstrop', 'Work')]
Et en ce moment, output2bFact
ressemble à:
Group key: emAddr
Group value: [email protected]
Group key: Master
Group value: 003...10
Group key: Master_FN
Group value: Eleanor
Group key: Master_LN
Group value: Shellstrop
Group key: Duplicate_1
Group value: 003...11
Group key: Duplicate_1_FN
Group value: Ellie
Group key: Duplicate_1_LN
Group value: Shellstrop
Blah blah
Après la boucle for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
, avant if 'Duplicate_1' in output2aFact:
, output2aFact
ressemble à:
Group key: Master
Group value: 003...10
Group key: Duplicate_1
Group value: 003...02
Group key: Duplicate_2
Group value: 003...04
Blah blah apres le boucle
output2aList
:
[OrderedDict([('Master', '003...01'), ('Duplicate_1', '003...07')]), OrderedDict([('Master', '003...10'), ('Duplicate_1', '003...02'), ('Duplicate_2', '003...04')]), OrderedDict([('Master', '003...06'), ('Duplicate_1', '003...14')])]
output2bList
:
[OrderedDict([('emAddr', '[email protected]'), ('Master', '003...10'), ('Master_FN', 'Eleanor'), ('Master_LN', 'Shellstrop'), ('Duplicate_1', '003...11'), ('Duplicate_1_FN', 'Ellie'), ('Duplicate_1_LN', 'Shellstrop')])]
Blah blah blah donc en forme de tableau:
Master | Duplicate_1 | Duplicate_2 |
---|---|---|
003…01 | 003…07 | |
003…10 | 003…02 | 003…04 |
003…06 | 003…14 |
et
emAddr | Master | Master_FN | Master_LN | Duplicate_1 | Duplicate_1_FN | Duplicate_1_LN |
---|---|---|---|---|---|---|
[email protected] | 003…10 | Eleanor | Shellstrop | 003…11 | Ellie | Shellstrop |
[email protected] | 003…03 | Jason | Mendoza | 003…05 | Janet | Della-Denunzio |
For Matthew
Instructions:
- For your own data, you don’t need the
from io import StringIO
or thecontactsCSVString
builder - For your own data, replace the
contactsCSVString
builder withinputFilePath = 'PATH_TO_YOUR_FILE'
and replacecontactsCSVString
withinputFilePath
in thepandas.read_csv(...)
from io import StringIO
contactsCSVString = StringIO("""
SalesforceID,Human-Readable ID,FirstName,LastName,PreferredEmail,PersonalEmail,WorkEmail,AlternateEmail
003...01,BBBB,Chidi,Anagonye,Work,[email protected],[email protected],[email protected]
003...02,DDDD,Eleanor,Shellstrop,Personal,[email protected],,
003...03,EEEE,Jason,Mendoza,Work,[email protected],[email protected],[email protected]
003...04,FFFF,Eleanor,Shellstrop,Alternate,[email protected],,[email protected]
003...05,GGGG,Janet,Della-Denunzio,Alternate,[email protected],,[email protected]
003...06,HHHH,Tahani,Al-Jamil,Alternate,[email protected],[email protected],[email protected]
003...07,IIII,Chidi,Anagonye,Work,,[email protected],
003...08,JJJJ,Mindy,St. Claire,Personal,[email protected],,
003...09,KKKK,Kamilah,Al-Jamil,Personal,[email protected],,
003...10,AAAA,Eleanor,Shellstrop,Alternate,,,[email protected]
003...11,CCCC,Ellie,Shellstrop,Work,,[email protected],
003...12,LLLL,Mindy,St. Claire,Personal,[email protected],[email protected],[email protected]
003...13,MMMM,Mindy,St. Claire,Work,[email protected],[email protected],[email protected]
003...14,NNNN,Tahani,Al-Jamil,Alternate,,,[email protected]
""")
import pandas
pandas.set_option('expand_frame_repr', False)
emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail']
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'}
df = pandas.read_csv(contactsCSVString, dtype=object)
print('\r\n---Here is what our CSV file looks like the moment we read it in---')
print(df)
# Add a "ChangedAnything" temporary column, filled in False all the way down
df['ChangedAnything'] = False
# Add a "PrefEmAddr" temporary column, filled with the actual value of the preferred email address all the way down.
# Credit to https://stackoverflow.com/a/45487394/2321195 for the ".lookup()" function that lets us decide, dynamically, which column to fetch a value from.
# Note that .lookup() doesn't work well if there's a null value, or a misspelled value, in "PreferredEmail," so we have to do some extra backflips to skip over such values.
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))
# Loop over each of the 3 email columns, one at a time.
for fieldName in emFieldNames:
# For the given column we're inspecting, build a True/False "series" indicating, for each row, whether the column is of interest to us.
# "Of interest" means that this column is NOT the "preferred" address for a given row,
# but that the value for that row in this column is the same as the actual preferred email address.
needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
# If we've established that the row is "of interest,"
# then clear out the email address and flip the "ChangedAnything" column's value
# for this row to True.
df.loc[needsClearingTFSer,fieldName] = None
df.loc[needsClearingTFSer,'ChangedAnything'] = True
# Replace "df" with a table of just the rows where something changed.
# (Note that "LLLL"'s 2 hotmails weren't "fixed" because we have no "rule" by which to decide which to keep.)
out1df = df[df['ChangedAnything']]
# Get rid of our temporary columns
out1df = out1df.drop(['PrefEmAddr','ChangedAnything'], axis=1)
print('\r\n---Here is what our output CSV file looks like---')
print('--- Be sure to Data-Load this promptly after creating it from freshly-downloaded data, ---')
print('--- because some email addresses that merely "started blank"---')
print('--- may have been filled in by users in the meantime, and---')
print('--- we\'re about to do a "don\'t ignore nulls" data load.---')
print(out1df)
# ================================================
from collections import OrderedDict
df = df.sort_values('Human-Readable ID').reset_index(drop=True).set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in df[emFieldNames].stack().to_frame('ems').groupby('ems')['ems'].groups.items() if len(indices) > 1}
all01s = []
all01sExtended = []
all02s = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
#print('\r\n---foundCells for emAddr ' + emAddr + ' are---','\r\n',foundCells) # Inspect this item from "emsAndTheirCellsFound"
seenFirstNamesAndTheirContacts = {}
numCellsFound = len(foundCells)
output1Fact = OrderedDict()
output1FactExtended = OrderedDict()
output2Fact = OrderedDict()
if numCellsFound > 1:
for cellLoopCounter, cell in enumerate(foundCells):
if cell[2] not in seenFirstNamesAndTheirContacts:
seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
if cellLoopCounter == 0:
output2Fact['emAddr'] = emAddr
output2Fact['Master'] = cell[1]
output2Fact['Master_FN'] = cell[2]
output2Fact['Master_LN'] = cell[3]
else:
uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
output2Fact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
output2Fact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
output2Fact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
if cellLoopCounter == numCellsFound-1:
if 'Duplicate_1' in output2Fact:
all02s.append(output2Fact)
for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
if len(contactsForFN) > 1:
for contactLoopCounter, contact in enumerate(contactsForFN):
if contactLoopCounter == 0:
output1FactExtended['Dummy_emAddr'] = emAddr
output1FactExtended['Dummy_FN'] = contact[2]
output1Fact['Master'] = contact[1]
output1FactExtended['Master'] = contact[1]
output1FactExtended['Dummy_Master_LN'] = contact[3]
else:
output1Fact['Duplicate_'+str(contactLoopCounter)]=contact[1]
output1FactExtended['Duplicate_'+str(contactLoopCounter)]=contact[1]
output1FactExtended['Dummy_Duplicate_'+str(contactLoopCounter)+'_LN']=contact[3]
if 'Duplicate_1' in output1Fact:
all01s.append(output1Fact)
all01sExtended.append(output1FactExtended)
#print('\r\n',emAddr,seenFirstNamesAndTheirContacts) # Make sure we built "seenFirstNamesAndTheirContacts" nicely
#print('\r\n', output1Fact) # Make sure we built "output1Fact" nicely
#print('\r\n', output2Fact) # Make sure we built "output2Fact" nicely
#print('\r\n---output2A is---')
#[print('\r\n', x) for x in all01s] # Make sure we built "all02s" nicely
#print('\r\n---output2B is---')
#[print('\r\n', x) for x in all02s] # Make sure we built "all02s" nicely
print('\r\n---output2A, ready for DemandTools, is---')
if len(all01s) > 0:
out1df = pandas.DataFrame(all01s)
print(out1df)
else:
print('Empty output2A')
print('\r\n---output2A-Extended, ready for human, is---')
if len(all01sExtended) > 0:
out1Extendeddf = pandas.DataFrame(all01sExtended)
print(out1Extendeddf)
else:
print('Empty output2A-Extended')
print('\r\n---output2B (email-only dupes), ready for human, is---')
if len(all02s) > 0:
out2df = pandas.DataFrame(all02s)
print(out2df)
else:
print('Empty output2B')
Output:
---Here is what our CSV file looks like the moment we read it in---
SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
0 003...01 BBBB Chidi Anagonye Work [email protected] [email protected] [email protected]
1 003...02 DDDD Eleanor Shellstrop Personal [email protected] NaN NaN
2 003...03 EEEE Jason Mendoza Work [email protected] [email protected] [email protected]
3 003...04 FFFF Eleanor Shellstrop Alternate [email protected] NaN [email protected]
4 003...05 GGGG Janet Della-Denunzio Alternate [email protected] NaN [email protected]
5 003...06 HHHH Tahani Al-Jamil Alternate [email protected] [email protected] [email protected]
6 003...07 IIII Chidi Anagonye Work NaN [email protected] NaN
7 003...08 JJJJ Mindy St. Claire Personal [email protected] NaN NaN
8 003...09 KKKK Kamilah Al-Jamil Personal [email protected] NaN NaN
9 003...10 AAAA Eleanor Shellstrop Alternate NaN NaN [email protected]
10 003...11 CCCC Ellie Shellstrop Work NaN [email protected] NaN
11 003...12 LLLL Mindy St. Claire Personal [email protected] [email protected] [email protected]
12 003...13 MMMM Mindy St. Claire Work [email protected] [email protected] [email protected]
13 003...14 NNNN Tahani Al-Jamil Alternate NaN NaN [email protected]
---Here is what our output CSV file looks like---
--- Be sure to Data-Load this promptly after creating it from freshly-downloaded data, ---
--- because some email addresses that merely "started blank"---
--- may have been filled in by users in the meantime, and---
--- we're about to do a "don't ignore nulls" data load.---
SalesforceID Human-Readable ID FirstName LastName PreferredEmail PersonalEmail WorkEmail AlternateEmail
0 003...01 BBBB Chidi Anagonye Work [email protected] [email protected] None
2 003...03 EEEE Jason Mendoza Work None [email protected] None
5 003...06 HHHH Tahani Al-Jamil Alternate None [email protected] [email protected]
12 003...13 MMMM Mindy St. Claire Work None [email protected] None
---output2A, ready for DemandTools, is---
Master Duplicate_1 Duplicate_2
0 003...01 003...07 NaN
1 003...10 003...02 003...04
2 003...06 003...14 NaN
---output2A, ready for human, is---
Dummy_emAddr Dummy_FN Master Dummy_Master_LN Duplicate_1 Dummy_Duplicate_1_LN Duplicate_2 Dummy_Duplicate_2_LN
0 [email protected] Chidi 003...01 Anagonye 003...07 Anagonye NaN NaN
1 [email protected] Eleanor 003...10 Shellstrop 003...02 Shellstrop 003...04 Shellstrop
2 [email protected] Tahani 003...06 Al-Jamil 003...14 Al-Jamil NaN NaN
---output2B (email-only dupes), ready for human, is---
emAddr Master Master_FN Master_LN Duplicate_1 Duplicate_1_FN Duplicate_1_LN
0 [email protected] 003...10 Eleanor Shellstrop 003...11 Ellie Shellstrop
1 [email protected] 003...03 Jason Mendoza 003...05 Janet Della-Denunzio