Proper-Casing CSV/XLSX Data With Python

10 Jan 2019 🔖 python pandas csv excel spreadsheet tutorials salesforce
💬 EN

The Sample Data
Proper-Casing Everybody Is “Quicker” In Python Than In Excel
- A word on “speed”
Only adding “Anush”
- Approach 1
- Approach 2
Takeaway
Full code

Helen Anderson’s Dev.to article “Excel Series: Formulas and Functions” went around Twitter today, and her first steps were so beautifully simple I thought, “Hey, let’s see what they look like in Python for a little compare-and-contrast.”

I need to get back to work, so I only got as far as her very first example, =PROPER(A1), but that alone was quite interesting.

The Sample Data

First, I set up a very small table to work with and saved it inside a variable I decided to call df:

fname	company
anush	apple
benita	boiron
cathy	cvs
darweesh	dell

In my case, the Python code to do this was as follows, because I included my sample data as part of the code:

from collections import OrderedDict
import pandas

initdata = OrderedDict([ ('fname', ['anush','benita','cathy','darweesh']),
          ('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)

But usually it’d look more like this, if my data were inside a spreadsheet called sample.xlsx:

import pandas
df = pandas.read_excel('sample.xlsx')

Displayed in my console with print(df) (after having first included a line of code pandas.set_option('expand_frame_repr', False) to prettify the output), it looks like this:

      fname company
   anush   apple
  benita  boiron
   cathy     cvs
darweesh    dell

Proper-Casing Everybody Is “Quicker” In Python Than In Excel

Python’s Pandas module (“module” = extended set of commands) is geared at using minimal code to repeat the same procedure against an entire set of data, such as a column, at once.

Consequently, whereas Helen’s =PROPER(A1) Excel formula will just populate whatever single cell you put it into with a “title-cased” copy of the contents of cell “A1” alone, Pandas commands are simplest when you add a whole new column to your spreadsheet with every first name title-cased, all the way down.

Note that in Pandas, you typically refer to columns by their names, not by letters of the alphabet indicating the order they come in.

This Python code:

df['fname_proper'] = df['fname'].str.title()
print(df)

Modifies the contents of the table stored in the variable df and displays this output:

      fname company fname_proper
   anush   apple        Anush
  benita  boiron       Benita
   cathy     cvs        Cathy
darweesh    dell     Darweesh

Which is this data:

fname	company	fname_proper
anush	apple	Anush
benita	boiron	Benita
cathy	cvs	Cathy
darweesh	dell	Darweesh

A word on “speed”

Slicing bread with a chainsaw

I put “quicker” in quotes in this section’s heading because it’s debateable whether writing an entire Python program to add 1 column to a spreadsheet is really “quicker” than typing a formula into Excel in extra blank space and copying the formula down to the end of the page.

No, you know what? It’s not actually debateable.

In the real world, it’d be slicing bread with a chainsaw.

What I mean by “quicker” is that Python’s Pandas module’s commands are designed to facilitate thinking in large datasets, whereas when you think about how to use Excel commands, you work more “one cell at a time.”

Note: Where things do start to get legitimately faster in Python is with repetition or complexity. Examples:

You have a 300-column spreadsheet, 100 of which have the word “Name” in the column header. You need to create a copy of every single column that has the word “Name” in its header and title-case the contents of the copy. You’re going to save a lot of time by coding that rather than doing it by hand. And the Python I just showed you would be part of that code.
You were already processing your spreadsheet with Python because you were doing some other sort of complicated operation (like a multi-column VLOOKUP between spreadsheets). Why not throw in one more operation to your Python program, rather than leave yourself work to do manually in Excel after Python finishes the “difficult” work? I often find this to be the case when, say, combining “people” data with “campaign” data to generate a “CampaignMember” file to insert into Salesforce.

Only adding “Anush”

But perhaps I’m being a bit presumptuous in assuming Helen ever intended to “copy down” =PROPER(A1) to title-case more rows.

Maybe she really did just want to make a title-cased copy of a single cell.

Let’s look at two very similar approaches to this in Python.

The differences between them are negligible for a beginner – there are simply slightly different ways you have to type the “right side” of the = command depending on how you type the “left side,” due to a notion called “data types.

What the two approaches have in common is that they’re both complicated.

(Python’s “Pandas” commands are not optimized for easy typing when it comes to doing single-cell operations like this. That’s considered the exception, not the norm, for what most people want to do.)

Also, they both refer to what was, in Helen’s example, “row 1,” as row “0.”

Approach 1

This Python code:

df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
print(df)

Modifies the contents of the table stored in the variable df (which, remember, now also has a fname_proper column) and displays this output:

      fname company fname_proper fname_1st_row_only_proper_v1
   anush   apple        Anush                        Anush
  benita  boiron       Benita                          NaN
   cathy     cvs        Cathy                          NaN
darweesh    dell     Darweesh                          NaN

Which is this data:

fname	company	fname_proper	fname_1st_row_only_proper_v1
anush	apple	Anush	Anush
benita	boiron	Benita
cathy	cvs	Cathy
darweesh	dell	Darweesh

Approach 2

This Python code:

df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
print(df)

Modifies the contents of the table stored in the variable df (which, remember, already has 4 columns before we run the command) and displays this output:

      fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
   anush   apple        Anush                        Anush                        Anush
  benita  boiron       Benita                          NaN                          NaN
   cathy     cvs        Cathy                          NaN                          NaN
darweesh    dell     Darweesh                          NaN                          NaN

Which is this data:

fname	company	fname_proper	fname_1st_row_only_proper_v1	fname_1st_row_only_proper_v2
anush	apple	Anush	Anush	Anush
benita	boiron	Benita
cathy	cvs	Cathy
darweesh	dell	Darweesh

Takeaway

Lessons I hope you’ll take away from this blog post are:

Certain spreadsheet-modification tasks really aren’t much more complicated to code in Python than they are in Excel, once you’re familiar with them. And now you know one! (Adding a new title-cased copy of a column.)
Pick the right tool for the job. Not only is there overhead to coding with Python vs. clicking cells in Excel and editing their formulas, but some commands are simpler than others in Python, just like in Excel.

Full code

Here’s the full script, which you can copy and paste into a Repl.it editor to run for yourself:

from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

def p(x):
    print(x)
    print('---DIVIDER---')

initdata = OrderedDict([ ('fname', ['anush','benita','cathy','darweesh']),
          ('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)
p(df)

# "=PROPER(A1)" imitation #1 of 3 (entire "fname" column)
df['fname_proper'] = df['fname'].str.title()
p(df)

# "=PROPER(A1)" imitation #2 of 3 (single cell approach #1 of 2)
df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
p(df)

# "=PROPER(A1)" imitation #3 of 3 (single cell approach #2 of 2)
df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
p(df)

And its full output:

      fname company
0     anush   apple
1    benita  boiron
2     cathy     cvs
3  darweesh    dell
---DIVIDER---
      fname company fname_proper
0     anush   apple        Anush
1    benita  boiron       Benita
2     cathy     cvs        Cathy
3  darweesh    dell     Darweesh
---DIVIDER---
      fname company fname_proper fname_1st_row_only_proper_v1
0     anush   apple        Anush                        Anush
1    benita  boiron       Benita                          NaN
2     cathy     cvs        Cathy                          NaN
3  darweesh    dell     Darweesh                          NaN
---DIVIDER---
      fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0     anush   apple        Anush                        Anush                        Anush
1    benita  boiron       Benita                          NaN                          NaN
2     cathy     cvs        Cathy                          NaN                          NaN
3  darweesh    dell     Darweesh                          NaN                          NaN
---DIVIDER---

Tweet Share on LinkedIn

Previous Post Next Post

--- ---

Salesforce, Python, SQL, & other ways to put your data where you need it

Need event music? 🎸

Proper-Casing CSV/XLSX Data With Python

Table of Contents

The Sample Data

Proper-Casing Everybody Is “Quicker” In Python Than In Excel

A word on “speed”

Only adding “Anush”

Approach 1

Approach 2

Takeaway

Full code