My data has ages, and also payments per month.
I'm trying to aggregate by summing the payments, but without summing the ages (averaging them would work).
Is it possible to use different functions for different columns?
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})
              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:
def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})
              a                   b         c             d
            sum       max      mean       sum Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401
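Another way to control the output column names, and the usual replacement for the removed dictionary-of-dictionaries form, is named aggregation (available since pandas 0.25). The sketch below is only an illustration: the values are made up so the result is easy to check by hand, and the output names a_sum, a_max and d_range are arbitrary choices, not something from the answer above.

```python
import pandas as pd

# Small deterministic frame (illustrative values, not the random data above)
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 5.0],
    'd': [4.0, 1.0, 7.0, 2.0],
    'group': [0, 0, 1, 1],
})

# Named aggregation: each keyword is the output column name,
# each value is a (source column, aggregation function) pair.
named = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    d_range=('d', lambda x: x.max() - x.min()),
)
print(named)
```

The result has flat column names, so no MultiIndex and no <lambda> label appear.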
Using apply and returning a Series
Now, if you had multiple columns that needed to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group gets passed into the function as a DataFrame.
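To see the difference for yourself, here is a minimal sketch that records the type each method hands to the function. The probe functions and the seen dictionary are hypothetical helpers written for this illustration, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'group': [0, 0, 1, 1],
})

# Records the type names that each method passes to our functions
seen = {'agg': set(), 'apply': set()}

def agg_probe(x):
    # Called once per group with a single column
    seen['agg'].add(type(x).__name__)
    return x.sum()

def apply_probe(x):
    # Called once per group with all selected columns
    seen['apply'].add(type(x).__name__)
    return (x['a'] * x['b']).sum()

by_agg = df.groupby('group').agg({'a': agg_probe})
# Selecting the value columns explicitly keeps the grouping column out of the group frame
by_apply = df.groupby('group')[['a', 'b']].apply(apply_probe)

print(seen)
```

Because apply_probe receives a DataFrame, it can multiply column a by column b, which agg_probe has no way to do.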
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)
          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)
              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494
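When the interaction between columns is a simple row-wise expression, as with c * d here, a different technique can avoid apply altogether: compute the combined column first, then aggregate it like any other. This is a sketch of that alternative, not the approach the answer recommends; the helper column name c_d and the data values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [1.0, 2.0, 3.0, 4.0],
    'd': [4.0, 1.0, 7.0, 2.0],
    'group': [0, 0, 1, 1],
})

out = (
    df.assign(c_d=df['c'] * df['d'])   # row-wise product, computed before grouping
      .groupby('group')
      .agg(
          a_sum=('a', 'sum'),
          a_max=('a', 'max'),
          b_mean=('b', 'mean'),
          c_d_prodsum=('c_d', 'sum'),  # sum of the precomputed products per group
      )
)
print(out)
```

This only works when the cross-column logic can be expressed per row before grouping; anything that needs the whole group at once still calls for apply.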
Starting from the result data frame, you can transform it in two steps into the format you need:
# collapse multi index column to single level column
result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]
# split the idxmax column into two columns
result = result.assign(
    max_score_element = result.idxmax_Score.str[0],
    max_score_case = result.idxmax_Score.str[1]
).drop('idxmax_Score', axis=1)  # axis must be passed by keyword in pandas 2.0+
result
#Group max_Score min_Evaluation max_score_case max_score_element
#0 A 9.19 0.41 y 1
#1 B 9.12 0.10 x 2
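The column-collapsing step can be tried in isolation on a small frame. This is a sketch under assumptions: the scores frame and its values are invented for illustration, and 'mean' stands in for 'idxmax' to keep the example short.

```python
import pandas as pd

# Made-up data, only to show how the column labels change
scores = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Score': [9.0, 3.0, 8.0, 1.5],
    'Evaluation': [0.9, 0.4, 0.5, 0.1],
})

result = (
    scores.groupby('Group')
          .agg({'Score': ['max', 'mean'], 'Evaluation': 'min'})
          .reset_index()
)
# At this point the columns are two-level tuples such as ('Score', 'max'),
# and the former index shows up as ('Group', '').

# Join the two levels as "<function>_<column>", leaving 'Group' unchanged
result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]
print(result)
```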
An alternative, starting from the original df and using join, which may not be as efficient but is less verbose, similar to @tarashypka's idea:
(df.groupby('Group')
   .agg({'Score': 'idxmax', 'Evaluation': 'min'})
   .set_index('Score')
   .join(df.drop('Evaluation', axis=1))
   .reset_index(drop=True))
#Evaluation Group Element Case Score
#0 0.41 A 1 y 9.19
#1 0.10 B 2 x 9.12
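The same idea, looking up the row that holds each group's maximum, can also be written with idxmax and .loc. This is a sketch with invented data; the frame layout mirrors the column names above, but the values and the merge step are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Element': [1, 2, 1, 2],
    'Case': ['y', 'x', 'y', 'x'],
    'Score': [9.0, 3.0, 1.5, 8.0],
    'Evaluation': [0.9, 0.4, 0.1, 0.5],
})

# idxmax returns the row label of the highest Score in each group;
# .loc then pulls those full rows from the original frame.
best_rows = df.loc[df.groupby('Group')['Score'].idxmax()]

# Minimum Evaluation per group, computed separately
min_eval = df.groupby('Group')['Evaluation'].min().reset_index()

# Replace each best row's own Evaluation with the group minimum
summary = best_rows.drop(columns='Evaluation').merge(min_eval, on='Group')
print(summary)
```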
Naive timing with the example data set:
%%timeit
(df.groupby('Group')
   .agg({'Score': 'idxmax', 'Evaluation': 'min'})
   .set_index('Score')
   .join(df.drop('Evaluation', axis=1))
   .reset_index(drop=True))
# 100 loops, best of 3: 3.47 ms per loop
%%timeit
result = (
    df.set_index(['Element', 'Case'])
      .groupby('Group')
      .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
      .reset_index()
)
result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]
result = result.assign(
    max_score_element = result.idxmax_Score.str[0],
    max_score_case = result.idxmax_Score.str[1]
).drop('idxmax_Score', axis=1)
# 100 loops, best of 3: 7.61 ms per loop
Best Answer
You can pass a dictionary to agg with column names as keys and the functions you want as values.
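Applied to the question, that could look like the sketch below. The frame and the column names person, age and payment are assumptions, since the question does not show its data.

```python
import pandas as pd

# Hypothetical data: one row per person per monthly payment
people = pd.DataFrame({
    'person': ['ann', 'ann', 'bob', 'bob', 'bob'],
    'age': [30, 30, 45, 45, 45],
    'payment': [100.0, 150.0, 80.0, 20.0, 50.0],
})

# Average the ages, sum the payments
summary = people.groupby('person').agg({'age': 'mean', 'payment': 'sum'})
print(summary)
```

Each key in the dictionary picks a column and each value picks the function for it, so the ages are averaged while the payments are summed.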