I am working on a dataframe similar to below sample:
import pandas as pd
import numpy as np
np.random.seed(0)
np.random.seed(0)
df = pd.DataFrame({'date' : np.tile(['2024-05-01', '2024-06-01'], 4),
'State' : np.repeat(['fl', 'ny', 'mi', 'nc'], 2),
'Rev' : [21000, 18200, 51200, 48732, 5676, 6798, 24012, 25005],
'Score' : np.random.normal(size = 8),
'Value' : np.random.randint(10, 50, size = 8)})
df
date State Rev Score Value
0 2024-05-01 fl 21000 1.764052 34
1 2024-06-01 fl 18200 0.400157 22
2 2024-05-01 ny 51200 0.978738 11
3 2024-06-01 ny 48732 2.240893 48
4 2024-05-01 mi 5676 1.867558 49
5 2024-06-01 mi 6798 -0.977278 33
6 2024-05-01 nc 24012 0.950088 34
7 2024-06-01 nc 25005 -0.151357 27
Expected output should be the dataframe
sorted by Rev
, largest to the smallest, and within each State
, the date
column should be sorted from in ascending order.
Tried below code:
(df.sort_values(by = ['Rev'], ascending = [False]).
groupby('State', as_index = False).
apply(lambda x : x.sort_values('date')).reset_index(drop = True))
But it's not giving me the required output.
date State Rev Score Value
0 2024-05-01 fl 21000 1.764052345967664 34
1 2024-06-01 fl 18200 0.4001572083672233 22
2 2024-05-01 mi 5676 1.8675579901499675 49
3 2024-06-01 mi 6798 -0.977277879876411 33
4 2024-05-01 nc 24012 0.9500884175255894 34
5 2024-06-01 nc 25005 -0.1513572082976979 27
6 2024-05-01 ny 51200 0.9787379841057392 11
7 2024-06-01 ny 48732 2.240893199201458 48
The output should be NY, NC, FL and MI in that order based on the Rev
and date
columns.
i.e. for a State
group, the Rev
value for 2024-05-01
will decide which state will take precedence in the final output order.
Can someone help me with the code.
Expected Output:
df.iloc[[2,3, 6,7, 0,1, 4,5], : ]
date State Rev Score Value
2 2024-05-01 ny 51200 0.978738 11
3 2024-06-01 ny 48732 2.240893 48
6 2024-05-01 nc 24012 0.950088 34
7 2024-06-01 nc 25005 -0.151357 27
0 2024-05-01 fl 21000 1.764052 34
1 2024-06-01 fl 18200 0.400157 22
4 2024-05-01 mi 5676 1.867558 49
5 2024-06-01 mi 6798 -0.977278 33
Best Answer
IMO, the easiest and most explicit approach to perform "complex"/multi-condition sorts is to use
numpy.lexsort
and pass the constraints in reverse order of preference:Which reads (in reverse order with
lexsort
):In case two States could have the same max Rev, if you want to ensure having separate groups, add
df['State']
as an intermediate condition:Output: