Situation
I have a pandas DataFrame defined as follows:
import pandas as pd
headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation']
data = [
['A', 1, 'x', 1.40, 0.59],
['A', 1, 'y', 9.19, 0.52],
['A', 2, 'x', 8.82, 0.80],
['A', 2, 'y', 7.18, 0.41],
['B', 1, 'x', 1.38, 0.22],
['B', 1, 'y', 7.14, 0.10],
['B', 2, 'x', 9.12, 0.28],
['B', 2, 'y', 4.11, 0.97],
]
df = pd.DataFrame(data, columns=headers)
which looks like this in console output:
Group Element Case Score Evaluation
0 A 1 x 1.40 0.59
1 A 1 y 9.19 0.52
2 A 2 x 8.82 0.80
3 A 2 y 7.18 0.41
4 B 1 x 1.38 0.22
5 B 1 y 7.14 0.10
6 B 2 x 9.12 0.28
7 B 2 y 4.11 0.97
Problem
I'd like to perform a grouping-and-aggregation operation on df
that will give me the following result dataframe:
Group Max_score_value Max_score_element Max_score_case Min_evaluation
0 A 9.19 1 y 0.41
1 B 9.12 2 x 0.10
To clarify in more detail: I'd like to group by the Group column, and then apply aggregation to get the following result columns:

Max_score_value: the group-maximum value from the Score column.
Max_score_element: the value from the Element column that corresponds to the group-maximum Score value.
Max_score_case: the value from the Case column that corresponds to the group-maximum Score value.
Min_evaluation: the group-minimum value from the Evaluation column.
Tried thus far
I've come up with the following code for the grouping-and-aggregation:
result = (
df.set_index(['Element', 'Case'])
.groupby('Group')
.agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
.reset_index()
)
print(result)
which gives as output:
Group Score Evaluation
max idxmax min
0 A 9.19 (1, y) 0.41
1 B 9.12 (2, x) 0.10
As you can see, the basic data is all there, but it's not yet in the format I need. It's this last step that I'm struggling with. Does anyone have a good idea for producing a result dataframe in the format I'm looking for?
Best Answer
Starting from the result data frame, you can transform it in two steps to the format you need.

An alternative that starts from the original df and uses join may not be as efficient, but it is less verbose; it is similar to @tarashypka's idea.

Naive timing with the example data set: