Français
Presentations About Resources

Salesforce, Python, SQL, & other ways to put your data where you need it -- a bilingual blog in English & French

Python for Salesforce: List unique values found across similar Salesforce fields

31 Jan 2019 🔖 python pandas csv excel tutorials salesforce
💬 EN ( Lire cet article en français )

Post Header Image

Table of Contents

Salesforce SOQL’s SELECT...COUNT...GROUP BY works great to find unique values present in a single field, but what if you have 50 nearly-identical fields and want to find unique values used in any of them? Python to the rescue – read on!

My Problem

I know it’s a strange data model, but in a Salesforce org I manage, we have over 50 custom fields that start with the word “Status,” one per program that we offer.

When it comes time to deduplicate Contact fields, sometimes for a given program, there are conflicting values. Our student workers who do deduplication need a clear guide about which value trumps which.

So, before sending an email to the business unit that requested these fields asking them to write this guide, I wanted to know what values were in use among these fields.

I also wanted to give them a “chart” they could fill in, like one of those city-distance charts in the back of a road atlas, indicating which value should “win” for every possible combination of values (I figured this would help them avoid accidentally forgetting to pick a “winner” for each value-pairing).

  • In today’s examples, there are merely 6 custom fields and only 5 values are found among them.
  • In my real-life org, there were over 50 custom fields and over 20 unique values found among them, across all 100,000+ Contacts.

That’s 5 million cells I analyzed in a split second, looking for the 20 unique values found among them!

Note that if I’d been looking for unique values within a single custom field, such as StatusProgram01__c, I could have found what I was looking for natively in Salesforce with SOQL:

SELECT StatusProgram01__c
FROM Contact
WHERE StatusProgram01__c != NULL
GROUP BY StatusProgram01__c
ORDER BY StatusProgram01__c

It’s the fact that I wanted to “vertically stack” / “concatenate” the contents of 50+ custom fields and then filter for uniqueness that made me need to export every row in my Contacts table to CSV and parse the CSV file with Python.

Step 1: Export my data from Salesforce

  1. Go to https://workbench.developerforce.com/query.php
  2. Object: Contact
  3. Fields: Pick all that begin with “Status”
  4. The resulting query is:
    SELECT 
      StatusProgram01__c,StatusProgram02__c,
      StatusProgram03__c,StatusProgram04__c,
      StatusProgram05__c,StatusProgram06__c
    FROM Contact
    
  5. View as: Bulk CSV
  6. Click the “Query” button
  7. Under “Batches,” when “Status” becomes “Completed”, click the download icon ( download icon ) to the left of the word “Id”
  8. Save it to my hard drive at c:\example\allcs_statusfields.csv

Note: The Python code below would work just as well if I had simply exported all rows + all columns, and if I had instead used Data Loader to do the export.
(I exported with Workbench because it lists my fields in alphabetical order when building my SOQL query.)

However, I would need to modify my code to look for “STATUS” and “__C” when building the contents of “statusColNames” instead of “Status” and “__c,” since Data Loader exports column names in all-caps.


Step 2 (Python): Verify total row count, column count, & “Status…” column count before proceeding

Before proceeding, I want to make sure I didn’t download my data wrong – I’ll look at total row count, total column count, and “total count of columns that start with ‘Status’” before I move on to make sure I don’t see any surprises.

(I know my data, so this is all I need as verification.)

Python code

import pandas
df = pandas.read_csv('c:\\example\\allcs_statusfields.csv', dtype=object)

print(len(df))
print(len(df.columns))
statusColNames = [x for x in df.columns if x.startswith('Status') and x.endswith('__c')]
print(len(statusColNames))

Output

103829
6
6

Step 3 (Python): List & count unique values found

Python code

import pandas
df = pandas.read_csv('c:\\example\\allcs_statusfields.csv', dtype=object)
statusColNames = [x for x in df.columns if x.startswith('Status') and x.endswith('__c')]

valuesUsed = sorted(df[statusColNames].stack().unique(), key=lambda x: x.lower()) # Quick check of unique values present
print(valuesUsed)
print()
print(len(valuesUsed))

Output

['Cancelled','Completed','Inquired','No Show','Registered']

5

How it works

Note: sorted(df[statusColNames].stack().unique() is the heart of our code.

  • df[statusColNames] pulls a sub-table of our CSV file that only includes the column names that start with “Status” and end with “__c.”
    • In our case, that’s all of them, but I promised you this would work even if you exported extra columns from Salesforce.
  • .stack() is a function that all “Pandas DataFrame”-typed data in Python has. It means:
    Take every column in this table and concatenate them into one big column, deleting any cells with a blank/null value.
    (Note: the “data type” of the output of that operation is called a “Pandas Series.”)
  • .unique() is a function that all “Pandas Series”-typed data in Python has. It means:
    Give me a listing of every unique value found in this column.
  • sorted(...) is a function that all list-like data can be put within in Python. It means what it sounds like!
    In this case, we gave it a little extra setting, , key=lambda x: x.lower(), which made sure it sorted things case-insensitively.
    Its output is a plain-old Python list (which gets displayed as a comma-separated listing between square brackets).
  • Finally, we saved this “plain-old Python list” into a “variable” I decided to call “valuesUsed.”
    You’ll see later that we can refer to this list in our code by the name “valuesUsed.”

Step 4 (Python): Create a “road atlas”-style CSV file of “value-found” “intersections”

Python code

import pandas
df = pandas.read_csv('c:\\example\\allcs_statusfields.csv', dtype=object)
statusColNames = [x for x in df.columns if x.startswith('Status') and x.endswith('__c')]
valuesUsed = sorted(df[statusColNames].stack().unique(), key=lambda x: x.lower()) # Quick check of unique values present

df2 = pandas.DataFrame(columns=valuesUsed, index=valuesUsed)
alreadyDid = []
for x in valuesUsed:
    for y in valuesUsed:
        if x == y:
            df2[x][y] = 'N/A - same value'
        else:
            if (x,y) not in alreadyDid:
                df2[x][y] = 'FILL ME IN'
                alreadyDid.append((y,x))
            else:
                df2[x][y] = 'xxx'

df2.to_csv('c:\\example\\atlaschart.csv')

Contents of atlaschart.csv on my hard drive

  Cancelled Completed Inquired No Show Registered
Cancelled N/A - same value xxx xxx xxx xxx
Completed FILL ME IN N/A - same value xxx xxx xxx
Inquired FILL ME IN FILL ME IN N/A - same value xxx xxx
No Show FILL ME IN FILL ME IN FILL ME IN N/A - same value xxx
Registered FILL ME IN FILL ME IN FILL ME IN FILL ME IN N/A - same value

How it works

The code I wrote to build this table might not be the most efficient way to do things, but any time you program, there’s more than one way to skin a cat, and I found this easy to write.

  1. I create an empty table (“Pandas DataFrame”) with the contents of valuesUsed going across the top as columns and down the left as row IDs and call that table “df2”:
    df2 = pandas.DataFrame(columns=valuesUsed, index=valuesUsed)
  2. I create an empty plain-old Python “list” and call it “alreadyDid.”
  3. I do a nested double loop through the contents of valuesUsed so that I loop over all possible pairs of values (e.g. “Cancelled-Cancelled, Cancelled-Completed, Cancelled-Inquired…“). I call the first item in a pair “x” and the second “y.”
  4. If x and y are the same word, I jump to the cell of df2 indicating the “intersection” of “Column x” and “Row y” and fill it in with the phrase “N/A - same value.”
  5. Otherwise, I check to see if I’ve already seen this pairing of “x” & “y” before (because in this case, there’s no realistic difference between “Cancelled-Completed” and “Completed-Cancelled”) by checking if it’s in the “alreadyDid” list.
  6. If I haven’t seen this pair before, I jump to the cell of df2 indicating the “intersection” of “Column x” and “Row y” and fill it in with the phrase “FILL ME IN.”
    Before moving on, I archive “x” & “y” into “alreadyDid”, only I flip them around as “y & x” so that when I get to them in reverse, the REVERSE values of “what I just did” will be in “alreadyDid.”
  7. If I have done this pair of values before, I jump to the cell of df2 indicating the “intersection” of “Column x” and “Row y” and fill it in with the phrase “xxx” to indicate that it’s redundant.

As I said, this script chewed through 5,000,000 cells in a split second. Let me know if you use this code!

--- ---