Français
Presentations About Resources

Salesforce, Python, SQL, & other ways to put your data where you need it -- a bilingual blog in English & French

Proper-Casing CSV/XLSX Data With Python

10 Jan 2019 🔖 python pandas csv excel spreadsheet tutorials salesforce
💬 EN

Table of Contents

Helen Anderson’s Dev.to article “Excel Series: Formulas and Functions” went around Twitter today, and her first steps were so beautifully simple I thought, “Hey, let’s see what they look like in Python for a little compare-and-contrast.”

I need to get back to work, so I only got as far as her very first example, =PROPER(A1), but that alone was quite interesting.

The Sample Data

First, I set up a very small table to work with and saved it inside a variable I decided to call df:

fname company
anush apple
benita boiron
cathy cvs
darweesh dell

In my case, the Python code to do this was as follows, because I included my sample data as part of the code:

from collections import OrderedDict
import pandas

initdata = OrderedDict([ ('fname', ['anush','benita','cathy','darweesh']),
          ('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)

But usually it’d look more like this, if my data were inside a spreadsheet called sample.xlsx:

import pandas
df = pandas.read_excel('sample.xlsx')

Displayed in my console with print(df) (after having first included a line of code pandas.set_option('expand_frame_repr', False) to prettify the output), it looks like this:

      fname company
0     anush   apple
1    benita  boiron
2     cathy     cvs
3  darweesh    dell

Proper-Casing Everybody Is “Quicker” In Python Than In Excel

Python’s Pandas module (“module” = extended set of commands) is geared at using minimal code to repeat the same procedure against an entire set of data, such as a column, at once.

Consequently, whereas Helen’s =PROPER(A1) Excel formula will just populate whatever single cell you put it into with a “title-cased” copy of the contents of cell “A1” alone, Pandas commands are simplest when you add a whole new column to your spreadsheet with every first name title-cased, all the way down.

Note that in Pandas, you typically refer to columns by their names, not by letters of the alphabet indicating the order they come in.

This Python code:

df['fname_proper'] = df['fname'].str.title()
print(df)

Modifies the contents of the table stored in the variable df and displays this output:

      fname company fname_proper
0     anush   apple        Anush
1    benita  boiron       Benita
2     cathy     cvs        Cathy
3  darweesh    dell     Darweesh

Which is this data:

fname company fname_proper
anush apple Anush
benita boiron Benita
cathy cvs Cathy
darweesh dell Darweesh

A word on “speed”

Slicing bread with a chainsaw

I put “quicker” in quotes in this section’s heading because it’s debateable whether writing an entire Python program to add 1 column to a spreadsheet is really “quicker” than typing a formula into Excel in extra blank space and copying the formula down to the end of the page.

No, you know what? It’s not actually debateable.

In the real world, it’d be slicing bread with a chainsaw.

What I mean by “quicker” is that Python’s Pandas module’s commands are designed to facilitate thinking in large datasets, whereas when you think about how to use Excel commands, you work more “one cell at a time.”

Note: Where things do start to get legitimately faster in Python is with repetition or complexity. Examples:

  • You have a 300-column spreadsheet, 100 of which have the word “Name” in the column header. You need to create a copy of every single column that has the word “Name” in its header and title-case the contents of the copy. You’re going to save a lot of time by coding that rather than doing it by hand. And the Python I just showed you would be part of that code.
  • You were already processing your spreadsheet with Python because you were doing some other sort of complicated operation (like a multi-column VLOOKUP between spreadsheets). Why not throw in one more operation to your Python program, rather than leave yourself work to do manually in Excel after Python finishes the “difficult” work? I often find this to be the case when, say, combining “people” data with “campaign” data to generate a “CampaignMember” file to insert into Salesforce.

Only adding “Anush”

But perhaps I’m being a bit presumptuous in assuming Helen ever intended to “copy down” =PROPER(A1) to title-case more rows.

Maybe she really did just want to make a title-cased copy of a single cell.

Let’s look at two very similar approaches to this in Python.

The differences between them are negligible for a beginner – there are simply slightly different ways you have to type the “right side” of the = command depending on how you type the “left side,” due to a notion called “data types.

What the two approaches have in common is that they’re both complicated.

(Python’s “Pandas” commands are not optimized for easy typing when it comes to doing single-cell operations like this. That’s considered the exception, not the norm, for what most people want to do.)

Also, they both refer to what was, in Helen’s example, “row 1,” as row “0.”

Approach 1

This Python code:

df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
print(df)

Modifies the contents of the table stored in the variable df (which, remember, now also has a fname_proper column) and displays this output:

      fname company fname_proper fname_1st_row_only_proper_v1
0     anush   apple        Anush                        Anush
1    benita  boiron       Benita                          NaN
2     cathy     cvs        Cathy                          NaN
3  darweesh    dell     Darweesh                          NaN

Which is this data:

fname company fname_proper fname_1st_row_only_proper_v1
anush apple Anush Anush
benita boiron Benita  
cathy cvs Cathy  
darweesh dell Darweesh  

Approach 2

This Python code:

df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
print(df)

Modifies the contents of the table stored in the variable df (which, remember, already has 4 columns before we run the command) and displays this output:

      fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0     anush   apple        Anush                        Anush                        Anush
1    benita  boiron       Benita                          NaN                          NaN
2     cathy     cvs        Cathy                          NaN                          NaN
3  darweesh    dell     Darweesh                          NaN                          NaN

Which is this data:

fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
anush apple Anush Anush Anush
benita boiron Benita    
cathy cvs Cathy    
darweesh dell Darweesh    

Takeaway

Lessons I hope you’ll take away from this blog post are:

  1. Certain spreadsheet-modification tasks really aren’t much more complicated to code in Python than they are in Excel, once you’re familiar with them. And now you know one! (Adding a new title-cased copy of a column.)
  2. Pick the right tool for the job. Not only is there overhead to coding with Python vs. clicking cells in Excel and editing their formulas, but some commands are simpler than others in Python, just like in Excel.

Full code

Here’s the full script, which you can copy and paste into a Repl.it editor to run for yourself:

from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

def p(x):
    print(x)
    print('---DIVIDER---')

initdata = OrderedDict([ ('fname', ['anush','benita','cathy','darweesh']),
          ('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)
p(df)

# "=PROPER(A1)" imitation #1 of 3 (entire "fname" column)
df['fname_proper'] = df['fname'].str.title()
p(df)

# "=PROPER(A1)" imitation #2 of 3 (single cell approach #1 of 2)
df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
p(df)

# "=PROPER(A1)" imitation #3 of 3 (single cell approach #2 of 2)
df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
p(df)

And its full output:

      fname company
0     anush   apple
1    benita  boiron
2     cathy     cvs
3  darweesh    dell
---DIVIDER---
      fname company fname_proper
0     anush   apple        Anush
1    benita  boiron       Benita
2     cathy     cvs        Cathy
3  darweesh    dell     Darweesh
---DIVIDER---
      fname company fname_proper fname_1st_row_only_proper_v1
0     anush   apple        Anush                        Anush
1    benita  boiron       Benita                          NaN
2     cathy     cvs        Cathy                          NaN
3  darweesh    dell     Darweesh                          NaN
---DIVIDER---
      fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0     anush   apple        Anush                        Anush                        Anush
1    benita  boiron       Benita                          NaN                          NaN
2     cathy     cvs        Cathy                          NaN                          NaN
3  darweesh    dell     Darweesh                          NaN                          NaN
---DIVIDER---
--- ---