Leveraging data aggregations in Python: using Pandas groupby method for data analysis

I share the intuition that helped me understand Pandas' groupby method, along with examples of how useful it can be when analyzing data.


SQL and spreadsheet thinking to level up your Pandas game

When analyzing data as a beginner in Python programming, I would often write for loops or complicated functions to achieve things I could do faster using Pandas’ groupby method, especially when computing data aggregations.

The problem was I didn’t fully understand how it worked and how powerful it was.

But working with spreadsheets and SQL helped me get a better grasp of what it does. Yes, I learned from Excel tips and the kind of things I thought I wanted to move away from.

So here I’m going to share how I came to better understand what Pandas’ groupby method does, and how we can benefit from it, using a few examples.

What are data aggregations?

Aggregations are computations we use to obtain information about a potentially large amount of data points.

Think about it like a summary.

When we use methods like max(), min(), sum(), mean() and median(), they return a single number that provides useful information about a whole column or row. Those are some of the basic data aggregation methods.

They are easy to understand because we think about them like we were taught in school: “to compute the mean, take the sum of all the elements divided by how many elements you have”.
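For instance, here’s a minimal sketch (with made-up numbers) of what those methods look like on a Pandas Series:

import pandas as pd

revenues = pd.Series([10.0, 25.0, 40.0, 25.0])

print(revenues.max())     # 40.0
print(revenues.min())     # 10.0
print(revenues.sum())     # 100.0
print(revenues.mean())    # 25.0 -> same as revenues.sum() / len(revenues)
print(revenues.median())  # 25.0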

Pandas dataframes are, essentially, tables. So many of the methods they provide for data aggregation are very similar to those in Excel and SQL.

In particular, groupby is reminiscent of Excel’s pivot tables and SQL’s own GROUP BY clause.
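To make the analogy concrete, here’s a small sketch on a made-up mini-table, where groupby and pivot_table both produce the totals a SQL GROUP BY would:

# SQL equivalent: SELECT customer, SUM(revenue) FROM sales GROUP BY customer;
import pandas as pd

sales = pd.DataFrame({'customer': ['A', 'A', 'B', 'B', 'B'],
                      'revenue':  [10,   20,  5,   5,   15]})

# groupby, like SQL's GROUP BY...
print(sales.groupby('customer')['revenue'].sum())

# ...and pivot_table, like an Excel pivot table, give the same totals here
print(sales.pivot_table(index='customer', values='revenue', aggfunc='sum'))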

Split, apply, and combine

This is the mantra I tell myself whenever I’m struggling to understand how a groupby operation works. Essentially, we can see the output of a groupby as the result of three simpler operations that are done sequentially.

These are illustrated in the figure below.

Split, apply, and combine operations used when computing aggregations

First, we split the data into parts according to the keys we pass as an argument to the groupby method.

Next, we apply the desired aggregation to each one of the parts, separately. For instance, we count them, or take the mean value of them, or any other computation we want.

Finally, we combine the results in a new, aggregated table which shows the computation result for each sub-group.

It becomes easier to understand when you think of the operation as a sequence of steps. At least it did for me.
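To make the three steps concrete, here’s a rough sketch on a tiny made-up DataFrame, first doing split, apply, and combine “by hand” and then letting groupby do all of it in one line:

import pandas as pd

df = pd.DataFrame({'group': ['a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4]})

# 1. Split: groupby partitions the rows by the key(s) we pass
groups = df.groupby('group')

# 2. Apply: compute the aggregation on each part separately
partial_sums = {key: part['value'].sum() for key, part in groups}

# 3. Combine: gather the per-group results into a single table
combined = pd.Series(partial_sums, name='value')
print(combined)

# The one-liner does all three steps for us
print(df.groupby('group')['value'].sum())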

Why are aggregations powerful when analyzing data?

To illustrate how powerful this is, let's start with an example data frame.

Let’s say we are analyzing data for some business and we have a dataset about customer transactions. We can create a synthetic dataset just for testing:

import pandas as pd
import numpy as np
import time

# Generate a large synthetic dataset with 10k rows
num_rows = 10000
data = {'CustomerID': np.random.randint(1, num_rows // 4 + 1, num_rows),
        'TransactionID': np.arange(num_rows),
        'Revenue': np.random.rand(num_rows) * 100}

transactions = pd.DataFrame(data)
transactions.head()

Out[1]: 
   CustomerID  TransactionID    Revenue
0          55              0  31.403065
1         198              1  59.796718
2        2431              2  22.002152
3         960              3   3.743878
4         120              4  40.094364

Now, imagine we want to know the total revenue for each customer. We could get the results by using a for loop:

def extract_with_for_loop(df):

    revenue_by_ID = pd.DataFrame(columns=['CustomerID', 'TotalRevenue'])

    # Loop over each customer, filter their transactions, and append the sum
    for customer in df.CustomerID.unique():
        customer_transactions = df[df.CustomerID == customer]
        row = pd.DataFrame({'CustomerID': [customer],
                            'TotalRevenue': [customer_transactions.Revenue.sum()]})
        revenue_by_ID = pd.concat([revenue_by_ID, row], ignore_index=True)

    revenue_by_ID = revenue_by_ID.sort_values('CustomerID')

    return revenue_by_ID


start_time = time.time()
print(extract_with_for_loop(transactions).head())
end_time = time.time()
print('Time taken using for loop:', end_time - start_time, 'seconds')


Out[2]: 
   CustomerID  TotalRevenue
0         1.0    197.259255
1         2.0    179.829982
2         3.0    152.390251
3         4.0    313.157241
4         5.0    109.242505
Time taken using for loop: 9.578813314437866 seconds

But what if I told you we can use groupby to do exactly that with a single line of code, and much faster?

# Define another function to extract the same data, this time using groupby
def extract_with_groupby(df):
    revenue_by_ID = df.groupby('CustomerID')['Revenue'].sum().reset_index(name='TotalRevenue')
    return revenue_by_ID

# Test the time when using groupby
start_time = time.time()
print(extract_with_groupby(transactions).head())
end_time = time.time()
print('Time taken using groupby:', end_time - start_time, 'seconds')


Out[3]: 
   CustomerID  TotalRevenue
0           1    197.259255
1           2    179.829982
2           3    152.390251
3           4    313.157241
4           5    109.242505
Time taken using groupby: 0.01568460464477539 seconds

That is awesome! Less typing, and we obtained the same results in a fraction of the time.

But why would I want to mess with groupby if I can just wait a few seconds more?

Well, I’m glad you asked…

What happens when you have more data?

The problem with using for loops for this kind of data extraction is that your code has to filter the rows and compute the aggregation for each customer individually.

This means that execution time will grow as the size of your dataset grows.

Let’s measure that.

Now, I’m going to do something contradictory. I will prove how slow for loops can be, by using a for loop!

I'll create 10 dataframes with increasing sizes, and use timeit to measure the execution time of the two functions I showed above when running them 5 times for each dataframe size.

# Test the performance of the two methods for varying DataFrame sizes
import matplotlib.pyplot as plt
import timeit



df_sizes = np.linspace(100, 10000, num=10, dtype=int)
groupby_times = []
for_loop_times = []

for size in df_sizes:
    # Create example dataframes with varying sizes for each iteration
    data = {'CustomerID': np.random.randint(1, size // 4 + 1, size),
            'TransactionID': np.arange(size),
            'Revenue': np.random.rand(size) * 100}

    df = pd.DataFrame(data)
    
    groupby_time = timeit.timeit('extract_with_groupby(df)',
                                 setup='from __main__ import extract_with_groupby, df',
                                 number=5)
    groupby_times.append(groupby_time)
    
    for_loop_time = timeit.timeit('extract_with_for_loop(df)',
                                  setup='from __main__ import extract_with_for_loop, df',
                                  number=5)
    for_loop_times.append(for_loop_time)

# Plot the results
fig = plt.figure(figsize= (8,6), dpi = 72)
ax = fig.add_subplot(1,1,1)
ax.plot(df_sizes, groupby_times, label='groupby')
ax.plot(df_sizes, for_loop_times, label='For Loop')
ax.set_xlabel('DataFrame Size')
ax.set_ylabel('Execution Time (Seconds)')
ax.set_title('Execution Time Comparison for DataFrame Extraction')
ax.legend(frameon = False)
plt.tight_layout()
fig.show()
fig.savefig('comparison.png', dpi = 300)
print('Created figure')
Execution time comparison for DataFrame extraction: for loop vs. groupby, for increasing DataFrame sizes

As you can see, the for loop version of this code is much slower than the groupby version, especially for large data sets.

This is because the for loop has to filter the data frame and perform the calculation for each customer individually, whereas groupby operates on whole groups at once using Pandas’ optimized, vectorized routines, making it much faster and more efficient.
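As a side note, the grouped approach isn’t limited to a single sum. Here’s a quick sketch using the transactions DataFrame from above, computing several aggregations per group in one call with agg:

# Count, total, and average revenue per customer, all in one grouped call
summary = transactions.groupby('CustomerID')['Revenue'].agg(['count', 'sum', 'mean'])
print(summary.head())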

Conclusion

Aggregations are the bread and butter of many Data Analysis workflows, and they become more important as the amount of data we handle grows.

Sadly, though, as a Mechanical Engineer coming from Matlab, it took me time to realize this.

Working with tables and aggregations seemed more suitable for people analyzing different kinds of data, like in finance, marketing, sales or insurance companies.

Certainly not data coming from physical systems (which is the only thing I used to work with).

But I was wrong. If you are interested, I’ve written before about how I leveraged it in my job when analyzing data from a test bench.

I hope this short post helped you get a better understanding of Pandas’ groupby method, and how you can use it to avoid time-consuming tasks and improve the efficiency of your data analysis.

Have a good one!


Check out my other posts for more insights about using and learning Python for tasks you’d usually do with Matlab, or just general tips about Data Analytics and programming.

Consider subscribing (it's free! and I hate spam) to receive new posts about these topics.