Create new column with subtotal for first instance of unique value in column, 0 otherwise

Problem description:

I have a data set that is as follows:

  ClientID        Date  Total sales  Cumulative Sales
0        A  01-01-2000          100               300
1        A  01-02-2000          100                 0
2        A  01-03-2000          100                 0
3        B  01-01-2000           50               100
4        B  01-02-2000           50                 0
5        C  01-01-2000           70               210
6        C  01-02-2000           70                 0
7        C  01-03-2000           70                 0
8        D  01-01-2000           20                40
9        D  01-02-2000           20                 0

I want to create the Cumulative Sales column: on the first row of each unique ClientID it should hold the sum of Total sales for that client, and on the remaining rows it should be 0.

I found a solution in Excel by creating a new column R that marks the first distinct value, and then using IF with SUMIF to calculate the sum:

=IF(COUNTIFS($F$2:$F8, $F8)=1, SUMIF($F$2:$F$1065, R8, $P$2:$P$1065), "0")
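
For comparison, the two parts of the Excel formula can be mirrored in pandas. This is only a sketch under assumptions: the sample frame is rebuilt by hand from the table above (dates omitted), and the names is_first and client_total are illustrative, not from the question. The running COUNTIFS corresponds roughly to a per-client running count, and SUMIF to a per-client total broadcast to every row.

```python
import pandas as pd

# Sample data copied from the question (Date column omitted for brevity).
df = pd.DataFrame({
    "ClientID": list("AAABBCCCDD"),
    "Total sales": [100, 100, 100, 50, 50, 70, 70, 70, 20, 20],
})

# Running count per client: 0 on the first row of each ClientID.
# This plays the role of COUNTIFS($F$2:$F8, $F8) = 1.
is_first = df.groupby("ClientID").cumcount().eq(0)

# Per-client total repeated on every row, the counterpart of SUMIF.
client_total = df.groupby("ClientID")["Total sales"].transform("sum")

# Keep the total on first rows only; all other rows become 0.
df["Cumulative Sales"] = client_total.where(is_first, 0)
print(df)
```

Unlike the Excel version, the result here is the number 0 rather than the text "0", so the column stays numeric.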

Solution 1: [1]

Use numpy.where: Series.duplicated builds a mask that is True for every repeated ClientID, so those rows are set to 0, while the first row of each client gets the group sum from GroupBy.transform:

import numpy as np

# 0 for repeated ClientID rows, the per-client total for the first row
df['Cumulative Sales'] = np.where(df['ClientID'].duplicated(),
                                  0,
                                  df.groupby('ClientID')['Total sales'].transform('sum'))
print(df)
  ClientID        Date  Total sales  Cumulative Sales
0        A  01-01-2000          100               300
1        A  01-02-2000          100                 0
2        A  01-03-2000          100                 0
3        B  01-01-2000           50               100
4        B  01-02-2000           50                 0
5        C  01-01-2000           70               210
6        C  01-02-2000           70                 0
7        C  01-03-2000           70                 0
8        D  01-01-2000           20                40
9        D  01-02-2000           20                 0

Solution 2: [2]

Use a groupby.transform and mask the non-first values based on the duplicated IDs:

# The column is named 'ClientID' (no space) in the question's data
df['Cumulative Sales'] = (df.groupby('ClientID')['Total sales'].transform('sum')
                            .mask(df['ClientID'].duplicated(), 0)
                         )

Output:

  ClientID        Date  Total sales  Cumulative Sales
0        A  01-01-2000          100               300
1        A  01-02-2000          100                 0
2        A  01-03-2000          100                 0
3        B  01-01-2000           50               100
4        B  01-02-2000           50                 0
5        C  01-01-2000           70               210
6        C  01-02-2000           70                 0
7        C  01-03-2000           70                 0
8        D  01-01-2000           20                40
9        D  01-02-2000           20                 0
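
Both solutions rely on duplicated(), which flags repeats in row order, so they should also work when a client's rows are not stored next to each other. A small check, assuming a hypothetical interleaved data set that is not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical interleaved data: clients are not in contiguous blocks.
df = pd.DataFrame({
    "ClientID": ["A", "B", "A", "C", "B", "A"],
    "Total sales": [100, 50, 100, 70, 50, 100],
})

# Per-client total on every row, and a mask of repeated ClientID rows.
totals = df.groupby("ClientID")["Total sales"].transform("sum")
dup = df["ClientID"].duplicated()

via_where = np.where(dup, 0, totals)  # approach from Solution 1
via_mask = totals.mask(dup, 0)        # approach from Solution 2

# Both approaches agree row by row.
assert list(via_where) == via_mask.tolist()
print(via_mask.tolist())
```

The main practical difference is the return type: np.where gives a NumPy array, while mask returns a pandas Series that keeps the original index.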

References:

Copyright Notice: This article follows StackOverflow’s copyright notice requirements and is licensed under CC BY-SA 3.0.

Article Source: StackOverflow

[1] jezrael

[2] mozway
