Proper-Casing CSV/XLSX Data With Python
10 Jan 2019
Helen Anderson’s Dev.to article “Excel Series: Formulas and Functions” went around Twitter today, and her first steps were so beautifully simple I thought, “Hey, let’s see what they look like in Python for a little compare-and-contrast.”
I need to get back to work, so I only got as far as her very first example, =PROPER(A1)
, but that alone was quite interesting.
The Sample Data
First, I set up a very small table to work with and saved it inside a variable I decided to call df
:
fname | company |
---|---|
anush | apple |
benita | boiron |
cathy | cvs |
darweesh | dell |
In my case, the Python code to do this was as follows, because I included my sample data as part of the code:
from collections import OrderedDict
import pandas
initdata = OrderedDict([ ('fname', ['anush','benita','cathy','darweesh']),
('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)
But usually it’d look more like this, if my data were inside a spreadsheet called sample.xlsx
:
import pandas
df = pandas.read_excel('sample.xlsx')
Displayed in my console with print(df)
(after having first included a line of code pandas.set_option('expand_frame_repr', False)
to prettify the output), it looks like this:
fname company
0 anush apple
1 benita boiron
2 cathy cvs
3 darweesh dell
Proper-Casing Everybody Is “Quicker” In Python Than In Excel
Python’s Pandas module (“module” = extended set of commands) is geared at using minimal code to repeat the same procedure against an entire set of data, such as a column, at once.
Consequently, whereas Helen’s =PROPER(A1)
Excel formula will just populate whatever single cell you put it into with a “title-cased” copy of the contents of cell “A1” alone, Pandas commands are simplest when you add a whole new column to your spreadsheet with every first name title-cased, all the way down.
Note that in Pandas, you typically refer to columns by their names, not by letters of the alphabet indicating the order they come in.
This Python code:
df['fname_proper'] = df['fname'].str.title()
print(df)
Modifies the contents of the table stored in the variable df
and displays this output:
fname company fname_proper
0 anush apple Anush
1 benita boiron Benita
2 cathy cvs Cathy
3 darweesh dell Darweesh
Which is this data:
fname | company | fname_proper |
---|---|---|
anush | apple | Anush |
benita | boiron | Benita |
cathy | cvs | Cathy |
darweesh | dell | Darweesh |
A word on “speed”
I put “quicker” in quotes in this section’s heading because it’s debateable whether writing an entire Python program to add 1 column to a spreadsheet is really “quicker” than typing a formula into Excel in extra blank space and copying the formula down to the end of the page.
No, you know what? It’s not actually debateable.
In the real world, it’d be slicing bread with a chainsaw.
What I mean by “quicker” is that Python’s Pandas module’s commands are designed to facilitate thinking in large datasets, whereas when you think about how to use Excel commands, you work more “one cell at a time.”
Note: Where things do start to get legitimately faster in Python is with repetition or complexity. Examples:
- You have a 300-column spreadsheet, 100 of which have the word “Name” in the column header. You need to create a copy of every single column that has the word “Name” in its header and title-case the contents of the copy. You’re going to save a lot of time by coding that rather than doing it by hand. And the Python I just showed you would be part of that code.
- You were already processing your spreadsheet with Python because you were doing some other sort of complicated operation (like a multi-column
VLOOKUP
between spreadsheets). Why not throw in one more operation to your Python program, rather than leave yourself work to do manually in Excel after Python finishes the “difficult” work? I often find this to be the case when, say, combining “people” data with “campaign” data to generate a “CampaignMember” file to insert into Salesforce.
Only adding “Anush”
But perhaps I’m being a bit presumptuous in assuming Helen ever intended to “copy down” =PROPER(A1)
to title-case more rows.
Maybe she really did just want to make a title-cased copy of a single cell.
Let’s look at two very similar approaches to this in Python.
The differences between them are negligible for a beginner – there are simply slightly different ways you have to type the “right side” of the =
command depending on how you type the “left side,” due to a notion called “data types.
What the two approaches have in common is that they’re both complicated.
(Python’s “Pandas” commands are not optimized for easy typing when it comes to doing single-cell operations like this. That’s considered the exception, not the norm, for what most people want to do.)
Also, they both refer to what was, in Helen’s example, “row 1
,” as row “0
.”
Approach 1
This Python code:
df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
print(df)
Modifies the contents of the table stored in the variable df
(which, remember, now also has a fname_proper
column) and displays this output:
fname company fname_proper fname_1st_row_only_proper_v1
0 anush apple Anush Anush
1 benita boiron Benita NaN
2 cathy cvs Cathy NaN
3 darweesh dell Darweesh NaN
Which is this data:
fname | company | fname_proper | fname_1st_row_only_proper_v1 |
---|---|---|---|
anush | apple | Anush | Anush |
benita | boiron | Benita | |
cathy | cvs | Cathy | |
darweesh | dell | Darweesh |
Approach 2
This Python code:
df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
print(df)
Modifies the contents of the table stored in the variable df
(which, remember, already has 4 columns before we run the command) and displays this output:
fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0 anush apple Anush Anush Anush
1 benita boiron Benita NaN NaN
2 cathy cvs Cathy NaN NaN
3 darweesh dell Darweesh NaN NaN
Which is this data:
fname | company | fname_proper | fname_1st_row_only_proper_v1 | fname_1st_row_only_proper_v2 |
---|---|---|---|---|
anush | apple | Anush | Anush | Anush |
benita | boiron | Benita | ||
cathy | cvs | Cathy | ||
darweesh | dell | Darweesh |
Takeaway
Lessons I hope you’ll take away from this blog post are:
- Certain spreadsheet-modification tasks really aren’t much more complicated to code in Python than they are in Excel, once you’re familiar with them. And now you know one! (Adding a new title-cased copy of a column.)
- Pick the right tool for the job. Not only is there overhead to coding with Python vs. clicking cells in Excel and editing their formulas, but some commands are simpler than others in Python, just like in Excel.
Full code
Here’s the full script, which you can copy and paste into a Repl.it editor to run for yourself:
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)
def p(x):
print(x)
print('---DIVIDER---')
initdata = OrderedDict([ ('fname', ['anush','benita','cathy','darweesh']),
('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)
p(df)
# "=PROPER(A1)" imitation #1 of 3 (entire "fname" column)
df['fname_proper'] = df['fname'].str.title()
p(df)
# "=PROPER(A1)" imitation #2 of 3 (single cell approach #1 of 2)
df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
p(df)
# "=PROPER(A1)" imitation #3 of 3 (single cell approach #2 of 2)
df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
p(df)
And its full output:
fname company
0 anush apple
1 benita boiron
2 cathy cvs
3 darweesh dell
---DIVIDER---
fname company fname_proper
0 anush apple Anush
1 benita boiron Benita
2 cathy cvs Cathy
3 darweesh dell Darweesh
---DIVIDER---
fname company fname_proper fname_1st_row_only_proper_v1
0 anush apple Anush Anush
1 benita boiron Benita NaN
2 cathy cvs Cathy NaN
3 darweesh dell Darweesh NaN
---DIVIDER---
fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0 anush apple Anush Anush Anush
1 benita boiron Benita NaN NaN
2 cathy cvs Cathy NaN NaN
3 darweesh dell Darweesh NaN NaN
---DIVIDER---