Filter a large CSV file with Python
08 Jan 2019
Table of Contents
The problem with large files in Excel isn’t generally the size, per se. It’s that Excel tries to show you all its work. To modify huge CSV or XLSX files, such as exports from your Salesforce “Task” and “Contact” tables, consider writing code with a language like Python.
Why Excel modifies large files slowly
If you work with huge spreadsheets, you’ve probably frozen Excel by trying to filter a file and delete certain rows.
For example, download the file “100000 Sales Records - 3.54 MB” from the site “E for Excel.”
- Open it in Excel.
- Filter on “Country” and show only “Algeria,” “Armenia,” “Australia,” & “Barbados.”
- Select all the visible lines.
- Right-click on a number at right and delete all visible lines.
It takes a few seconds, right? My computer’s pretty great – from 2017 with a 4-core 2.50GHz processor and 16GB of RAM, et Excel froze for 6 seconds.
However, it’s not deleting the rows that slows Excel down.
It’s re-drawing the spreadsheet on your screen that eats up your computer.
Python can modify large files faster
Because programs without a graphical user interface can work without showing you what they’re up to, a little bit of Python code can make this edit in less than a tenth of a second.
Check out the results of this Python script:
import pandas
import datetime
df = pandas.read_csv('100000 Sales Records.csv')
badcountries = ['Algeria','Armenia','Australia','Barbados']
print('There are ' + str(len(df)) + ' rows in the dataset before deleting the "bad" country rows.')
print('Il y a ' + str(len(df)) + ' enregistrements dans les données avant de supprimer ceux des pays non-désirées.')
print()
t1 = datetime.datetime.now()
df = df[~df['Country'].isin(badcountries)]
t2 = datetime.datetime.now()
tm = round((t2 - t1).total_seconds(), 2)
print('It took ' + str(tm) + ' seconds to delete the "bad" country rows.')
print('Il a fallu ' + str(tm) + ' seconds pour suppprimer les données des pays non-désirées.')
print()
print('There are ' + str(len(df)) + ' rows in the dataset after deleting the "bad" country rows.')
print('Il y a ' + str(len(df)) + ' enregistrements dans les données après avoir supprimé ceux des pays non-désirées.')
There are 100000 rows in the dataset before deleting the "bad" country rows.
Il y a 100000 enregistrements dans les données avant de supprimer ceux des pays non-désirées.
It took 0.08 seconds to delete the "bad" country rows.
Il a fallu 0.08 seconds pour suppprimer les données des pays non-désirées.
There are 97885 rows in the dataset after deleting the "bad" country rows.
Il y a 97885 enregistrements dans les données après avoir supprimé ceux des pays non-désirées
It’s not necessary to choose Python as your programming language for such jobs, but it is important to consider a code-based approach to transforming your data. Something that lets the machine work without stopping to show you what it’s doing.
(I do, however, think Python’s a pretty darned simple programming language to learn.)
Now you know!