Python Pandas – How to Sort a Pandas DataFrame Using Group By

pandas, python

I am working with a DataFrame similar to the sample below:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'date' : np.tile(['2024-05-01', '2024-06-01'], 4),
                  'State' : np.repeat(['fl', 'ny', 'mi', 'nc'], 2),
                  'Rev' : [21000, 18200, 51200, 48732, 5676, 6798, 24012, 25005],
                  'Score' : np.random.normal(size = 8),
                  'Value' : np.random.randint(10, 50, size = 8)})
df

    date        State   Rev     Score       Value
0   2024-05-01  fl      21000   1.764052    34
1   2024-06-01  fl      18200   0.400157    22
2   2024-05-01  ny      51200   0.978738    11
3   2024-06-01  ny      48732   2.240893    48
4   2024-05-01  mi       5676   1.867558    49
5   2024-06-01  mi       6798   -0.977278   33
6   2024-05-01  nc      24012   0.950088    34
7   2024-06-01  nc      25005   -0.151357   27

The expected output is the DataFrame sorted by Rev, from largest to smallest, with the date column sorted in ascending order within each State.

I tried the code below:

(df.sort_values(by = ['Rev'], ascending = [False]).
     groupby('State', as_index = False).
     apply(lambda x : x.sort_values('date')).reset_index(drop = True))

But it's not giving me the required output.

    date        State   Rev     Score               Value
0   2024-05-01  fl      21000   1.764052345967664   34
1   2024-06-01  fl      18200   0.4001572083672233  22
2   2024-05-01  mi       5676   1.8675579901499675  49
3   2024-06-01  mi       6798   -0.977277879876411  33
4   2024-05-01  nc      24012   0.9500884175255894  34
5   2024-06-01  nc      25005   -0.1513572082976979 27
6   2024-05-01  ny      51200   0.9787379841057392  11
7   2024-06-01  ny      48732   2.240893199201458   48

The States should appear in the order NY, NC, FL, MI, based on the Rev and date columns.
That is, for each State group, the Rev value for 2024-05-01 decides which State takes precedence in the final output order.

Can someone help me with the code?

Expected Output:

df.iloc[[2,3, 6,7, 0,1, 4,5], : ]


    date        State   Rev     Score       Value
2   2024-05-01  ny      51200   0.978738    11
3   2024-06-01  ny      48732   2.240893    48
6   2024-05-01  nc      24012   0.950088    34
7   2024-06-01  nc      25005   -0.151357   27
0   2024-05-01  fl      21000   1.764052    34
1   2024-06-01  fl      18200   0.400157    22
4   2024-05-01  mi       5676   1.867558    49
5   2024-06-01  mi       6798   -0.977278   33

Best Answer

IMO, the easiest and most explicit approach to perform "complex"/multi-condition sorts is to use numpy.lexsort and pass the constraints in reverse order of preference:

out = df.iloc[np.lexsort([df['date'],
                          -df.groupby('State')['Rev'].transform('max')])]

Which reads (in reverse order with lexsort):

1. Sort first by decreasing max Rev per State.
2. In case of a tie, sort by increasing date.
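The "reverse order" convention of lexsort can be checked on a toy example. The sketch below uses two made-up arrays (the names primary and secondary are illustrative only, not from the question) and shows that the last key passed to np.lexsort is the one that dominates the ordering:

```python
import numpy as np

# Hypothetical toy data: 'primary' contains ties that 'secondary' must break.
primary = np.array([2, 1, 2, 1])
secondary = np.array([9, 8, 7, 6])

# np.lexsort treats the LAST key in the sequence as the primary sort key,
# so keys are listed from least to most important.
order = np.lexsort([secondary, primary])

print(order)             # [3 1 2 0]
print(primary[order])    # [1 1 2 2]
print(secondary[order])  # [6 8 7 9]
```

The returned array holds positions, which is why the answer feeds it to df.iloc rather than df.loc.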

If two States could share the same max Rev and you want each State's rows to stay together, add df['State'] as an intermediate key:

1. Sort first by decreasing max Rev per State.
2. In case of a tie in max Rev, sort by State name (you could use another condition, such as total Rev per State).
3. In case of a further tie, sort by increasing date.

out = df.iloc[np.lexsort([df['date'],
                          df['State'],
                          -df.groupby('State')['Rev'].transform('max')])]

Output:

         date State    Rev     Score  Value
2  2024-05-01    ny  51200  0.978738     11
3  2024-06-01    ny  48732  2.240893     48
6  2024-05-01    nc  24012  0.950088     34
7  2024-06-01    nc  25005 -0.151357     27
0  2024-05-01    fl  21000  1.764052     34
1  2024-06-01    fl  18200  0.400157     22
4  2024-05-01    mi   5676  1.867558     49
5  2024-06-01    mi   6798 -0.977278     33
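The answer ranks States by their max Rev, whereas the question states that the Rev on 2024-05-01 (the earliest date) should decide. On the sample data both rules give the same order, but they can differ in general. The following is a minimal sketch of the second rule, assuming the earliest date per State is the deciding row; first_rev is a hypothetical helper column name, and the random Score and Value columns are left out for brevity:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (random columns omitted).
df = pd.DataFrame({'date': np.tile(['2024-05-01', '2024-06-01'], 4),
                   'State': np.repeat(['fl', 'ny', 'mi', 'nc'], 2),
                   'Rev': [21000, 18200, 51200, 48732, 5676, 6798, 24012, 25005]})

# Rev on each State's earliest date, broadcast to every row of that State.
# The result keeps the original index labels, so assign() can align it.
first_rev = (df.sort_values('date')
               .groupby('State')['Rev']
               .transform('first'))

out = (df.assign(first_rev=first_rev)   # hypothetical helper column
         .sort_values(['first_rev', 'State', 'date'],
                      ascending=[False, True, True])
         .drop(columns='first_rev'))

print(out.index.tolist())  # [2, 3, 6, 7, 0, 1, 4, 5]
```

As for why the attempt in the question lost the Rev ordering: groupby sorts the group keys by default (sort=True), so the groups came back in alphabetical State order regardless of the preceding sort_values. Passing sort=False to groupby keeps groups in order of first appearance instead.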

Related Solutions

Python Pandas – How to Iterate Over Rows in a DataFrame

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index()  # make sure indexes pair with number of rows

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

10 100
11 110
12 120

Obligatory disclaimer from the documentation

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …

When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.

If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.

Other answers in this thread delve into greater depth on alternatives to iter* functions if you are interested to learn more.
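As a rough illustration of the alternatives listed above, the sketch below reuses the small c1/c2 frame and shows a vectorized operation, a row-wise apply, and itertuples for the case where an explicit loop is still wanted. The total and label columns are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Vectorized: operate on whole columns at once, no Python-level loop.
df['total'] = df['c1'] + df['c2']

# apply along rows: for functions that need one row at a time.
df['label'] = df.apply(lambda row: f"{row['c1']}-{row['c2']}", axis=1)

# itertuples: if a loop is unavoidable, this is generally faster than iterrows.
pairs = [(row.c1, row.c2) for row in df.itertuples(index=False)]

print(df)
print(pairs)
```

The vectorized form is usually the one to try first; apply and itertuples are fallbacks when the per-row logic cannot be expressed on whole columns.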

Python Pandas DataFrame – How to Select Rows Based on Column Values

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which raises a ValueError with the message "The truth value of a Series is ambiguous".
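The precedence issue can be reproduced on a small frame. The sketch below uses made-up data and hypothetical bounds A and B; it also shows Series.between, which expresses the same inclusive range without needing parentheses:

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10, 15]})
A, B = 4, 12  # hypothetical bounds for illustration

# Parenthesized comparisons combine element-wise with &.
in_range = df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

# Series.between covers the same inclusive range (both ends included by default).
same = df.loc[df['column_name'].between(A, B)]

# Without parentheses the expression becomes a chained comparison,
# which tries to evaluate a Series as a single bool and raises ValueError.
try:
    df['column_name'] >= A & df['column_name'] <= B
    raised = False
except ValueError:
    raised = True

print(in_range['column_name'].tolist())  # [5, 10]
print(raised)                            # True
```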

To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df = df.loc[~df['column_name'].isin(some_values)] # .loc returns a new object, so assign the result

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
