Python quick CSV / Excel diff
16 Apr 2019

Table of Contents
Let’s say I have log files from 2 different days. I know the columns are always the same, and that 99% of the rows will be exact replicas from each other.
I just want to see people who appear/disappeared from the log on any given day, or who changed between one day and the next.
Here’s “yesterday’s” dataset:
ssn | name_lf | ph | em | company |
---|---|---|---|---|
A1A1A1 | Amjit, Anush | 1111111 | [email protected] | Apple |
B2B2B2 | Borges, Benita | 2222222 | [email protected] | Boiron |
C3C3C3 | Combs, Cathy | 3333333 | [email protected] | CVS |
D4D4D4 | Daher, Darweesh | 4444444 | [email protected] | Dell |
E5E5E5 | Ellis, Ezra | 5555555 | [email protected] | EDF |
F6F6F6 | Fulvia, Frances | 6666666 | [email protected] | Firestone |
And here’s “today’s” dataset:
ssn | name_lf | ph | em | company |
---|---|---|---|---|
C3C3C3 | Combs, Cathy | 3333333 | [email protected] | CVS |
D4D4D4 | Daher, Darweesh | 4444444 | [email protected] | Dell |
E5E5E5 | Ellis, Ezra | 5555555 | [email protected] | EDF |
F6F6F6 | Fulvia, Frances | 6666666 | [email protected] | Firestone |
G7G7G7 | Gao, Grace | 7777777 | [email protected] | GAF |
H8H8H8 | Hopper, Helen | 8888888 | [email protected] | Heinz |
Notepad++ has a great “compare” tool that can make it easy to skim for changes, but I had a particular case where I wanted to see only lines that were new (like Grace & Helen), had disappeared (like Amit or Benita), or had changed (like Cathy’s e-mail address).
Unfortunately, the plugin doesn’t isolate changes like that. Notepad++ would simply show me the entire datasets side-by-side.
If these tables were Oracle database tables “convention1” (yesterday) & “convention2” (today), the code for “new or changed today” in SQL would be a very simple “MINUS” operation:
select * from convention2
minus
select * from convention1;
ssn | name_lf | ph | em | company |
---|---|---|---|---|
A1A1A1 | Amjit, Anush | 1111111 | [email protected] | Apple |
B2B2B2 | Borges, Benita | 2222222 | [email protected] | Boiron |
C3C3C3 | Combs, Cathy | 3333333 | [email protected] | CVS |
And the code for “dropped or changed from yesterday’s file” in SQL would be:
select * from convention1
minus
select * from convention2;
ssn | name_lf | ph | em | company |
---|---|---|---|---|
C3C3C3 | Combs, Cathy | 3333333 | [email protected] | CVS |
G7G7G7 | Gao, Grace | 7777777 | [email protected] | GAF |
H8H8H8 | Hopper, Helen | 8888888 | [email protected] | Heinz |
But these aren’t database tables. They’re CSV files on my hard drive.
So how do I get something similar against files on my hard drive?
Python and Pandas to the rescue! (Thanks to EdChum and piRSquared on StackOverflow for the code.)
import pandas
pandas.set_option('expand_frame_repr', False)
df1 = pandas.read_csv('c:\\example\\convention1.csv')
df2 = pandas.read_csv('c:\\example\\convention2.csv')
newin2 = df1.merge(df2, indicator=True, how='outer').query('_merge == "right_only"').sort_values(by=['name_lf'])
droppedin2 = df1.merge(df2, indicator=True, how='outer').query('_merge == "left_only"').sort_values(by=['name_lf'])
print('New in 2')
print(newin2)
print()
print('Lost from 1')
print(droppedin2)
(Note that I sorted my output by the name_lf
column to help with readability – in our file, things already are sorted, but in the file I was working with this morning, they weren’t.)
TO DO: PUT OUTPUT Here
Further Resources
Playing with Oracle
If you want to play with the Oracle SQL, sign up for a free account and try Oracle’s Live SQL tool.
You’ll need to run the following code to create and populate the two tables (once you run it once, you can comment it out by surrounding it with /*
and */
):
/*drop table CONVENTION1;
create table CONVENTION1 (
ssn varchar2(6) not null
, name_lf varchar2(30) not null
, ph varchar2(7) not null
, em varchar(15) not null
, company varchar(15) not null
);
insert into CONVENTION1 values ('A1A1A1', 'Amjit, Anush', '1111111', '[email protected]', 'Apple');
insert into CONVENTION1 values ('B2B2B2', 'Borges, Benita', '2222222', '[email protected]', 'Boiron');
insert into CONVENTION1 values ('C3C3C3', 'Combs, Cathy', '3333333', '[email protected]', 'CVS');
insert into CONVENTION1 values ('D4D4D4', 'Daher, Darweesh', '4444444', '[email protected]', 'Dell');
insert into CONVENTION1 values ('E5E5E5', 'Ellis, Ezra', '5555555', '[email protected]', 'EDF');
insert into CONVENTION1 values ('F6F6F6', 'Fulvia, Frances', '6666666', '[email protected]', 'Firestone');
drop table CONVENTION2;
create table CONVENTION2 (
ssn varchar2(6) not null
, name_lf varchar2(30) not null
, ph varchar2(7) not null
, em varchar(15) not null
, company varchar(15) not null
);
insert into CONVENTION2 values ('C3C3C3', 'Combs, Cathy', '3333333', '[email protected]', 'CVS');
insert into CONVENTION2 values ('D4D4D4', 'Daher, Darweesh', '4444444', '[email protected]', 'Dell');
insert into CONVENTION2 values ('E5E5E5', 'Ellis, Ezra', '5555555', '[email protected]', 'EDF');
insert into CONVENTION2 values ('F6F6F6', 'Fulvia, Frances', '6666666', '[email protected]', 'Firestone');
insert into CONVENTION2 values ('G7G7G7', 'Gao, Grace', '7777777', '[email protected]', 'GAF');
insert into CONVENTION2 values ('H8H8H8', 'Hopper, Helen', '8888888', '[email protected]', 'Heinz');
*/
Playing with Python
For a fancier Python-based “diff” tool, see also https://pbpython.com/excel-diff-pandas-update.html
For repl.it: