Python Pandas – Groupby Apply or Aggregate with Custom Function with Multiple Inputs

Tags: aggregate-functions, apply, group-by, pandas, python

I want to apply a custom function to a pandas groupby.

This works when the custom function takes only one input, the grouped values.

I have a dataframe like this:

a     b     c      value
a1    b1    c1      v1
a2    b2    c2      v2
a3    b3    c3      v3

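Working version:

import numpy as np

def cpk(a):
    # Flatten the grouped values into a 1-D array
    arr = np.asarray(a)
    arr = arr.ravel()
    sigma = np.std(arr)
    m = np.mean(arr)

    # Spec limits are hard-coded here: upper = 150, lower = 50
    Cpu = float(150 - m) / (3*sigma)
    Cpl = float(m - 50) / (3*sigma)
    Cpk = np.min([Cpu, Cpl])
    return Cpk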


df_cpk = df_result.groupby(['a','b','c'])['value'].agg(cpk).reset_index()

As the code above shows, the grouped 'value' column is passed automatically as the input to the cpk function.

What I want to know is how to apply the function below:

def cpk2(a,lsl,usl):
    arr = np.asarray(a)
    arr = arr.ravel()
    sigma = np.std(arr)
    m = np.mean(arr)

    Cpu = float(usl - m) / (3*sigma)
    Cpl = float(m - lsl) / (3*sigma)
    Cpk = np.min([Cpu, Cpl])
    return Cpk

# df_cpk = df_result.groupby(['a','b','c'])['value'].agg(cpk2(?,?,?)).reset_index()

Here the function takes multiple inputs, one of which is the grouped values.
Is there a simple way to do this?

Best Answer

Since the two other inputs are constants, you can simply use a lambda expression:

df_cpk = df_result.groupby(['a','b','c'])['value'].agg(lambda x: cpk2(x, 50, 150)).reset_index()
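As a sketch of an equivalent approach, functools.partial can bind the constant arguments instead of a lambda. The sample frame below is made up for illustration (the question does not show real values), and cpk2 is restated in a slightly condensed form so the snippet runs on its own:

```python
from functools import partial

import numpy as np
import pandas as pd


def cpk2(a, lsl, usl):
    # Same calculation as in the question, with the limits as parameters
    arr = np.asarray(a, dtype=float).ravel()
    sigma = np.std(arr)
    m = np.mean(arr)
    cpu = (usl - m) / (3 * sigma)
    cpl = (m - lsl) / (3 * sigma)
    return min(cpu, cpl)


# Hypothetical data standing in for df_result
df_result = pd.DataFrame({
    'a': ['a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'b': ['b1'] * 6,
    'c': ['c1'] * 6,
    'value': [90.0, 100.0, 110.0, 70.0, 80.0, 90.0],
})

grouped = df_result.groupby(['a', 'b', 'c'])['value']

# The lambda from the answer: the group is the first argument, the rest are constants
via_lambda = grouped.agg(lambda x: cpk2(x, 50, 150)).reset_index()

# partial pre-fills lsl and usl, leaving a one-argument callable for agg
via_partial = grouped.agg(partial(cpk2, lsl=50, usl=150)).reset_index()
```

Both calls should give the same frame. The partial version may be handy when the same limits are reused in several places.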

Related Solutions

Python – Apply Multiple Functions to Multiple Groupby Columns

The second half of the currently accepted answer is outdated and relies on two deprecated features. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix; that indexer has been removed from pandas.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one above:

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': max_min})

              a                   b         c             d
            sum       max      mean       sum Max minus Min
group                                                      
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401
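Another way to control the output column names, not mentioned above, is named aggregation, where each keyword becomes a flat column name and the value is a (column, function) pair. The sketch below uses small made-up numbers rather than the random frame above, and the output names (a_sum, d_range and so on) are only illustrative:

```python
import pandas as pd

# Hypothetical fixed data with the same layout as the frame above
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
    'c': [1.0, 1.0, 2.0, 2.0],
    'd': [5.0, 8.0, 1.0, 7.0],
    'group': [0, 0, 1, 1],
})

# keyword = (source column, aggregation); the keyword is the new column name
result = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    d_range=('d', lambda x: x.max() - x.min()),
)
```

This avoids both the MultiIndex columns and the need to patch __name__, at the cost of spelling out every output column.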

Using `apply` and returning a Series

Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

         a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494
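apply also forwards extra positional and keyword arguments to the function, which ties back to the question of custom functions with multiple inputs. The following is a minimal sketch with made-up data; prodsum and its left/right parameters are hypothetical names chosen for this example:

```python
import pandas as pd

# Hypothetical fixed data; only the columns needed here
df = pd.DataFrame({
    'c': [1.0, 1.0, 2.0, 2.0],
    'd': [5.0, 8.0, 1.0, 7.0],
    'group': [0, 0, 1, 1],
})


def prodsum(group, left, right):
    # group is the sub-DataFrame for one group; left and right are column names
    return (group[left] * group[right]).sum()


# Selecting the columns first keeps the grouping column out of the frames
# handed to the function; the extra keyword arguments are passed through by apply
result = df.groupby('group')[['c', 'd']].apply(prodsum, left='c', right='d')
```

Because prodsum returns a scalar per group, the result here is a Series indexed by group rather than a DataFrame.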

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

Python Pandas Group By – Using Multiple Functions in a Group By with Pandas

You can pass a dictionary to agg with column names as keys and the functions you want as values.

import pandas as pd
import numpy as np

# Create some randomised data
N = 20
date_range = pd.date_range('01/01/2015', periods=N, freq='W')
df = pd.DataFrame({'ages':np.arange(N), 'payments':np.arange(N)*10}, index=date_range)

print(df.head())
#             ages  payments
# 2015-01-04     0         0
# 2015-01-11     1        10
# 2015-01-18     2        20
# 2015-01-25     3        30
# 2015-02-01     4        40

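# Apply the mean to the ages column and the sum to the payments column.
# String names are used because recent pandas versions warn when NumPy
# functions such as np.mean are passed to agg.
agg_funcs = {'ages': 'mean', 'payments': 'sum'}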

# Groupby each individual month and then apply the funcs in agg_funcs
grouped = df.groupby(df.index.to_period('M')).agg(agg_funcs)

print(grouped)
#          ages  payments
# 2015-01   1.5        60
# 2015-02   5.5       220
# 2015-03  10.0       500
# 2015-04  14.5       580
# 2015-05  18.0       540
