Pandas – How to Apply Multiple Functions to Groupby Apply

Tags: aggregate, apply, function, pandas, pandas-groupby

I have a DataFrame that should be grouped, with several functions then applied to each group. Normally I would do this with groupby().agg() (cf. Apply multiple functions to multiple groupby columns), but the functions I'm interested in take multiple columns as input rather than a single column.

I learned that when a single function takes multiple columns as input, I need apply (cf. Pandas DataFrame aggregate function using multiple columns).
But what do I need when I have multiple functions that each take multiple columns as input?

import pandas as pd
df = pd.DataFrame({'x':[2, 3, -10, -10], 'y':[10, 13, 20, 30], 'id':['a', 'a', 'b', 'b']})

def mindist(data):  # of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])

I would expect something like df.groupby('id').apply([mindist, maxdist]) to return

    min  max
id
a     8   10
b    30   40

(achieved with pd.DataFrame({'mindist': df.groupby('id').apply(mindist), 'maxdist': df.groupby('id').apply(maxdist)}), which obviously isn't very handy if I have a dozen functions to apply to the grouped DataFrame). Initially I thought this OP had the same question, but he seems to be fine with aggregate, meaning his functions take only one column as input.

Best Answer

For this specific issue, how about taking the difference first and grouping afterwards? Note that the difference must be y - x to match mindist and maxdist:

(df['y'] - df['x']).groupby(df['id']).agg(['min', 'max'])

More generally, you could wrap the functions in a lambda that returns a Series, one entry per function:

df.groupby('id').apply(lambda x: pd.Series({'min': mindist(x), 'max': maxdist(x)}))
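For a dozen functions, writing the dictionary by hand gets tedious. The following is a minimal sketch of one way to generalize the lambda above; apply_all is a hypothetical helper name, not a pandas function, and it assumes every function takes the group DataFrame and returns a scalar. The value columns are selected explicitly so that the grouping column is not handed to the functions.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 3, -10, -10],
                   'y': [10, 13, 20, 30],
                   'id': ['a', 'a', 'b', 'b']})

def mindist(data):
    return (data['y'] - data['x']).min()

def maxdist(data):
    return (data['y'] - data['x']).max()

def apply_all(funcs):
    """Hypothetical helper: combine several group-level functions into one."""
    def wrapper(group):
        # One Series entry per function, labelled with the function's name
        return pd.Series({f.__name__: f(group) for f in funcs})
    return wrapper

# Each group arrives as a DataFrame holding only the 'x' and 'y' columns
result = df.groupby('id')[['x', 'y']].apply(apply_all([mindist, maxdist]))
print(result)
```

Because the column labels come from each function's __name__, adding another function to the list adds another column without further changes.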

Related Solutions

Python – Apply Multiple Functions to Multiple Groupby Columns

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one from above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': max_min})

              a                   b         c             d
            sum       max      mean       sum Max minus Min
group                                                      
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401
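In pandas 0.25 and later, named aggregation is another way to control the output column names without touching __name__. The sketch below assumes such a version; the seeded random generator and the output column names (a_sum, d_range, and so on) are illustrative choices, so the numbers will differ from the tables above.

```python
import numpy as np
import pandas as pd

# Seeded data so the example is reproducible (values differ from the tables above)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 4)), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

# Each keyword is an output column name; each tuple is (input column, function)
result = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    d_range=('d', lambda x: x.max() - x.min()),
)
print(result)
```

This yields flat column names instead of a MultiIndex, but each aggregation still sees only one column at a time.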

Using `apply` and returning a Series

Now, if you have multiple columns that need to interact, you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

         a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494
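One caveat: pandas 2.2 deprecated passing the grouping column to the function given to apply, so the plain call above may emit a warning on recent versions. A minimal sketch that sidesteps the question is to select the value columns before calling apply. The seeded data and the value_cols name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 4)), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

def f(x):
    # x is the group's DataFrame restricted to the selected columns
    return pd.Series({
        'a_sum': x['a'].sum(),
        'a_max': x['a'].max(),
        'b_mean': x['b'].mean(),
        'c_d_prodsum': (x['c'] * x['d']).sum(),
    })

# Selecting columns first keeps 'group' out of the frame passed to f
value_cols = ['a', 'b', 'c', 'd']
result = df.groupby('group')[value_cols].apply(f)
print(result)
```

Building the Series from a dict also makes the separate index argument unnecessary, since dicts preserve insertion order.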

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

Python Pandas – Groupby Apply or Aggregate with Custom Function with Multiple Inputs

Since the two other inputs are constants, you can simply use a lambda expression:

df_cpk = df.groupby(['a','b','c'])['value'].agg(lambda x: cpk2(x, 50, 150)).reset_index()
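To make the one-liner concrete, here is a small runnable sketch. The cpk2 defined below is a hypothetical stand-in (a simple process capability index with lower and upper limits), since the original function is not shown, and the sample data is invented for illustration.

```python
import pandas as pd

def cpk2(values, lsl, usl):
    """Hypothetical stand-in for the original cpk2: capability index of one group."""
    mean = values.mean()
    sigma = values.std()  # sample standard deviation (ddof=1)
    return min(usl - mean, mean - lsl) / (3 * sigma)

df = pd.DataFrame({
    'a': ['p', 'p', 'p', 'q', 'q', 'q'],
    'b': [1, 1, 1, 1, 1, 1],
    'c': ['u', 'u', 'u', 'u', 'u', 'u'],
    'value': [90.0, 100.0, 110.0, 110.0, 125.0, 140.0],
})

# The lambda fixes the two constant arguments; agg supplies each group's Series
df_cpk = df.groupby(['a', 'b', 'c'])['value'].agg(lambda x: cpk2(x, 50, 150)).reset_index()
print(df_cpk)
```

The lambda receives one group's 'value' column at a time, so only the first argument of cpk2 varies between groups.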
