A Chunk of Pandas

As many will already know, pandas is the go-to solution for data analysis in Python. With its rich collection of convenient utilities and its underlying performance enhancements, its popularity is not only well deserved but also responsible in no small way for the growing popularity of Python.

Nevertheless as computer memory is finite, reading and processing large data files poses problems for pandas as much as any program. And so here we're going to look at the quick and easy techniques that pandas offers for working with large data.

The Panda in the RAM

Once read from file, pandas stores data in RAM, because RAM has fast data transfer rates, even compared to SSD. But RAM is scarce and the overhead of implementing all of pandas's functionality requires approximately two to three bytes of RAM for every byte of data. In short, pandas pays a price for speed by limiting its capacity to analyse large data to the RAM of its host computer.

Thus, in many ways, we often don't have a large data problem, but rather a RAM shortage problem, which means the solution equally often is simple: get more RAM.

For approximately $10 per hour on AWS one can have around a terabyte (1000 GiB) of RAM. Done! Right?

pandas Techniques - A good craftsperson doesn't blame their tools

Unfortunately, supercomputing cloud solutions are not always available or economical, and we often must work with the tools at hand. Enter the venerable pd.read_csv() function: the old dog who still has a few tricks to show us (just look at how long its doc-string is).

Indeed the key to most of what's covered in this article is that pd.read_csv() is both responsible for reading data from disk into RAM (it's a gatekeeper between disk and RAM, if you will), and, pd.read_csv() is an optimised parser of data. This means it is the first tool to reach for when you want to reduce the burden on RAM and to do so speedily.

Below, we'll cover the following approaches:

  • Reducing the number of columns read into RAM
  • Reducing the number of rows read into RAM
  • Processing data piece by piece.

And finally, we'll touch on some related time performance concerns and additional tools that may be worth exploring.

Shrinking Data - Cutting Down to Size

For managing large data with pandas, we have the same performance concerns any program does: memory and time. Memory is often the show-stopper, so we'll address it first and look at time complexity later.

If you have a large data file that is simply too big to fit into RAM, the first option you have is to truncate the data and proceed with your analysis on the more manageable smaller portion.

Removing Columns - Slimming Down

The easy wins here are when you don't need all the columns in your data file. Ideally, tabular data should be long and narrow (ie many rows and few columns), but this is not always the reality we live in. Fortunately pandas provides us with the usecols argument in pd.read_csv() which allows us to include only the columns that we need and so "narrow" our data.

Columns to be included can be identified by name.

df = pd.read_csv('my_data.csv', usecols=['colA', 'colB', 'colC'])

Or, they can be identified by position, starting at 0 for the first column, which is especially useful for when you have too many columns to deal with them by name.

# reading particular columns
df = pd.read_csv('my_data.csv', usecols=[0, 1, 4, 5])

# reading the first 50 columns
df = pd.read_csv('my_data.csv', usecols=range(50))

When you don't know which columns to include but need to inspect them beforehand, the nrows argument tells pd.read_csv() how many rows to read from the top. Taking only the first few or even only the first row allows one to inspect, analyse and filter the columns by name or by the nature of their data, while minimising the data you've loaded. Creating a list of column names or positions to be provided back to usecols is all that is needed to then filter columns according to this analysis. One workflow might be:

  1. Read only the top 3 rows to keep the dataframe small
    df_top = pd.read_csv('my_data.csv', nrows=3)
    
  2. Inspect only the columns as a single array
    df_top.columns
    
  3. Include only columns with alphabetical names (ie with no numbers)
    col_mask = df_top.columns.str.isalpha() # builtin string method
    alpha_cols_include = df_top.columns[col_mask] # masking
    
    df = pd.read('my_data.csv', usecols=alpha_cols_include)
    
  4. Include only columns with numerical data
    numeric_cols_include = df_top.select_dtypes(include='number').columns
    # read the pd.DataFrame.select_dtypes doc string for further tips
    
    df = pd.read('my_data.csv', usecols=numeric_cols_include)
    
  5. Combine the filters to include only columns with an alphanumeric name and numeric data
    cols_include = alpha_cols_include.intersection(numeric_cols_include)
    
    df = pd.read('my_data.csv', usecols=cols_include)
    

Removing Rows - Getting Shorty

Sometimes it's the rows that we want to clean out. We've already seen nrows in action, which allows us to read in only a certain amount of rows from the top, effectively cutting off the remaining rows from the bottom of the data file. Additionally, skiprows does the opposite by cutting off a certain number of rows from the top and reading only what's left at the bottom.

# Read first rows
df = pd.read('my_data.csv', nrows=1000)
# skip first 1000 rows
df = pd.read('my_data.csv', skiprows=1000)

# find number of rows
n_rows = len(pd.read('my_data.csv', usecols=[0]))
# take last 100 rows
df = pd.read('my_data.csv', skiprows = n_rows - 100)

# Alternative for finding the number of rows
# read each line lazily and sum
n_rows = sum(1 for line in open('my_data.csv'))

Beyond this, skiprows also works like usecols but with two main differences. Rows can only be skipped by their position (starting at 0), not by their name or label. Also, unlike usecols, skiprows excludes rows while usecols includes columns. So watch out for that unfriendly reversal and have your logical operator for NOT on arrays or Series (ie ~) handy to help you keep things straight. For example:

# from above - columns with alphabetical names
col_mask = df_top.columns.str.isalpha()
# tilde (~) is NOT on an array or pandas Series
# flips True-->False and False-->True
not_alpha_col_mask = ~col_mask

Note: the similarly-geared skipfooter argument is not optimised in pandas and will result in a much slower file read especially for large files. If you know the number of rows at the end of the file to skip, often the operating system tools will do the job very easily. For example, in unix to trim off the last two lines of a file:

head -n -2 my_data.csv > my_trimmed_data.csv

If you're able to filter out rows by analysing only a small subset of the columns in your data, you can invert the approach above for filtering columns:

  1. read only the columns you need to analyse for filtering your rows
    row_filt_data = pd.read_csv('employee_data.csv', usecols=['name', 'status'])
    
  2. from these columns, generate a list of which rows need to be skipped, and,
    # find which rows have 'leave' in the 'status' column
    on_leave = row_filt_data['status'].str.contains('leave', case=False)
    
    # extract index for rows that DO NOT have have 'leave' in 'status' col
    # NOTE: must add 1 as header counts as a row when reading a csv
    not_on_leave_rows = row_filt_data.index[~on_leave] + 1
    
  3. read in only the rows you need
    df_on_leave = pd.read_csv('employee_data.csv', skiprows=not_on_leave_rows)
    

A couple of things to note about the above code, especially the not_on_leave_rows line under point 2:

  • After having found which rows contain employees on leave with on_leave, we need to create a list or array that contains the position (starting from 0) for each of these rows.
    • To do this, we access the index of row_filt_data, which is the array of labels for each row, and which always default to their position (ie [0, 1, 2, ...]).
    • We then use filtering or masking with our on_leave, which is a Series of booleans, to get the desired positions from the index.
    • BUT, because skiprows excludes rows, we need to find the positions of all the rows for employees not on leave, thus the NOT operator ~ in ~on_leave to flip the True and False values.
  • The + 1 we add to our not_on_leave_rows might be confusing. Let's explain:
    • Because the first row of the csv file is used for the column names, our dataframe rows have a position that is one less than the line they originate from in the file. That is, row 11 of our dataframe, actually comes from line 12 of file.
    • And so, to compensate for this offset, we need to add 1 to our dataframe row positions so that they line up with the file.

Both Rows and Columns - Double Takes

Make sure to remember that both columns and rows can be excluded simultaneously and that usecols, nrows, skiprows and all of the above filtering techniques can be combined into one single read of you csv file. As pd.read_csv() is well optimised, filtering both the columns and rows simultaneously in a single call will perform speedily.

Chunking Data - Taking Things One Step at a Time

If you can't simply exclude whole portions of your data file, then the next option is to read your datafile one chunk at a time.

Similarly to nrows above, the argument chunksize defines how many rows will be read from the top. What's different though, is that chunksize tells pandas to create an iterator for you, which you can use to read a data file a certain number of rows at a time.

# data with 1000 rows
chunked_data = pd.read_csv('data_1000_rows.csv', chunksize=400)
# read data file 400 rows at a time
for chunk in chunked_data:
    print(f'rows: {len(chunk)}, first_row: {chunk.index[0]}, second_row: {chunk.index[-1]}')
# rows: 400, first_row: 0, second_row: 399
# rows: 400, first_row: 400, second_row: 799
# rows: 200, first_row: 800, second_row: 999

The great thing about this approach is that you're completely in control of how much RAM gets used. The downside is that performing calculations becomes more difficult as one must be thoughtful about how the calculations on each chunk will be aggregated.

Sometimes this can be straightforward. For instance, with the employee data from above, counting the total amount of people on leave requires only counting within each chunk adding this figure to a running total.

## Counting how many employees on leave

# chunk 1000 rows at a time
chunked_dfs = pd.read_csv('employee_data.csv', chunksize=1000)

on_leave_count = 0 # initialise counter before iteration
# Loop through each chunk of rows
for chunk in chunked_dfs:
    # as df.str.contains() returns booleans,
    # sum returns number of occurrences of True
    temp_on_leave_count = sum(chunk['status'].str.contains('leave'))
    # Add count to running total
    on_leave_count += temp_on_leave_count

But, when the calculation requires additional information about the whole dataset and about each chunk, stitching that information together along with the calculation can be tricky.

For instance, let's say we want to know how many employees are on leave for each department. A typical approach, without chunking, would be as follows.

employee_df = pd.read_csv('employee_data.csv')
# Filter for employees on leave
on_leave = employee_df['status'].str.contains('leave')
on_leave_data = employee_df[on_leave]

# group on_leave_data according to department and count
on_leave_by_dept = on_leave_data.groupby('department').count()
#               status  ...
# department
# IT            122     ...
# HR            54      ...
Note: if you're not familiar with groupby and the related pivot_table functions, they're powerful ways of splitting up your data according to the unique values in a column, to then performing a calculation, like count for each unique value. Here, using groupby, we've split our data by department before counting the number of on_leave employees.

To now do this chunk by chunk, we have to keep track of which departments occur in each chunk and then aggregate the counts by department.

# read data file 1000 rows at a time
chunked_dfs = pd.read_csv('employee_data.csv', chunksize=1000)

# prepare empty list for storing the results for each chunk
chunked_results = []
for chunk in chunked_dfs:
    # Filter for employees on leave
    on_leave_data = chunk[chunk['status'].str.contains('leave')]
    # group according to department and count like above
    on_leave_by_dept = on_leave_data.groupby('department').count()
    # add result, including the department data, to chunked_results list
    chunked_results.append(on_leave_by_dept)

# Concatenate each chunked result into a single DataFrame
agg_data = pd.concat(chunked_results)
# Add up count for each chunk according to the department, using groupby again
agg_data.groupby('department').sum()

Fortunately, the groupby function is well suited to solving this problem. But there is a (small) learning curve to using groupby and the way in which the results of each chunk are aggregated will vary depending on the kind of calculation being done. pandas does provide the tools however, which is why it's such a useful. Moreover, using chunksize is a reliable method for reducing the memory load of your analysis, so the above is definitely a pattern to keep in mind when dealing with large data.

Improving Performance - Needs for Speed

Sometimes our problem isn't memory but computational time. This can especially be the case when pandas is being used within a larger work-flow or as part of a broader application. In such cases large data files can simply slow things down. As pd.read_csv() is a well optimised csv reader, leaning into the above methods of filtering data by skipping rows etc, which operate at read and parse time, can ensure that said filtering occurs quickly.

If, for instance, you're filtering a large dataframe by the values in a particular column, one can obtain an improvement in performance by first reading and analysing only the required columns, and then using pd.read_csv() and the skiprows argument to filter the rows. When your datasets have 1000 or more columns, and you can anticipate filtering 50% or more of the rows in your work-flow, using the above methods to put these tasks into pd.read_csv() as much as possible can make your code run up to twice as fast (~10-50% reductions in time).

Going Further

Categorical Columns

Categorical Data is a pandas data type for discrete data, where each value is stored as a reference to a set of possible values. When a column contains relatively few unique values repeated many times, treating that column as categorical in pandas leads to memory optimisations, as the real unique values are only stored once while the whole column of data contains mere references to these values. When your large dataset contains many repeated values, using the categorical dtype can reduce its memory footprint.

Dask

Dask is a python package for easily working with pandas (and numpy) in a parallelised fashion. As a result of this approach, large data gets chunked automatically without substantial changes to your code base.

For example, the above chunking task on employee data would like the following:

# dask interface to pandas dataframes
from dask import dataframe as ddf
# read data normally
employee_ddf = ddf.read_csv('employee_data.csv')
# perform analysis with normal code
employee_ddf = (
  employee_ddf[employee_ddf['status'].str.contains('leave')]
    .groupby('department').count()
    )
# Analysis is not performed until here when compute() is run
# dask now does chunking for us
employee_ddf = employee_ddf.compute()

This approach requires learning another tool, which also involves learning the differences between the dask dataframe and the pandas dataframe. But the equivalence between dask and pandas is substantial such that dask can often be a drop-in replacement. If processing large data chunk by chunk is a recurrent problem, dask should be considered as a potential solution.

HDF5 Data Format

HDF5 is a data format optimised for large data and which pandas handles well. As a result, one can both chunk data and apply filtering logic directly at the data file reading stage when reading files in the HDF5 format. This leads immediately to both memory and performance improvements that can be well beyond anything achieved by the examples above (see the pandas docs here). Though we're not always in control of the data formats we receive, if large data files are a persistent problem, striving for data to be in the HDF5 format, either through an upstream change or a data conversion work flow, is certainly worth considering.