Python – Apply Multiple Functions to Multiple Groupby Columns

aggregate-functionsgroup-bypandaspython

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.

What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.

For example, I've tried something like

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum()},
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

Best Answer

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you desire to work with two separate columns at the same time I would suggest using the apply method which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': max_min})

              a                   b         c             d
            sum       max      mean       sum Max minus Min
group                                                      
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401

Using apply and returning a Series

Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

         a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

    def f_mi(x):
        d = []
        d.append(x['a'].sum())
        d.append(x['a'].max())
        d.append(x['b'].mean())
        d.append((x['c'] * x['d']).sum())
        return pd.Series(d, index=[['a', 'a', 'b', 'c_d'], 
                                   ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494
Related Question