问题描述:
I have a data set that is as follows:
ClientID Date Total sales Cumulative Sales
0 A 01-01-2000 100 300
1 A 01-02-2000 100 0
2 A 01-03-2000 100 0
3 B 01-01-2000 50 100
4 B 01-02-2000 50 0
5 C 01-01-2000 70 210
6 C 01-02-2000 70 0
7 C 01-03-2000 70 0
8 D 01-01-2000 20 40
9 D 01-02-2000 20 0
I want to figure out a way to create the Cumulative Sales
column, that will be the sum of Total sales
for the first row of each unique ClientID
, and 0
for the rest.
I figured out the solution in excel by creating a new column R for counting first distinct value and then using the IF(SUMIF)
function to calculate the sum:
=IF(COUNTIFS($F$2:$F8, $F8)=1, SUMIF($F$2:$F$1065, R8, $P$2:$P$1065), "0")
解决方案 1:[1]
Use numpy.where
with set 0
by mask for duplicated values by Series.duplicated
and for first values set sum
by GroupBy.transform
:
df['Cumulative Sales'] = np.where(df['ClientID'].duplicated(),
0,
df.groupby('ClientID')['Total sales'].transform('sum'))
print (df)
ClientID Date Total sales Cumulative Sales
0 A 01-01-2000 100 300
1 A 01-02-2000 100 0
2 A 01-03-2000 100 0
3 B 01-01-2000 50 100
4 B 01-02-2000 50 0
5 C 01-01-2000 70 210
6 C 01-02-2000 70 0
7 C 01-03-2000 70 0
8 D 01-01-2000 20 40
9 D 01-02-2000 20 0
解决方案 2:[2]
Use a groupby.transform
and mask
the non-first values based on the duplicated
IDs:
df['Cumulative Sales'] = (df.groupby('Client ID')['Total sales'].transform('sum')
.mask(df['Client ID'].duplicated(), 0)
)
Output:
Client ID Date Total sales Cumulative Sales
0 A 01-01-2000 100 300
1 A 01-02-2000 100 0
2 A 01-03-2000 100 0
3 B 01-01-2000 50 100
4 B 01-02-2000 50 0
5 C 01-01-2000 70 210
6 C 01-02-2000 70 0
7 C 01-03-2000 70 0
8 D 01-01-2000 20 40
9 D 01-02-2000 20 0
参考链接:
Copyright Notice: This article follows StackOverflow’s copyright notice requirements and is licensed under CC BY-SA 3.0.
Article Source: StackOverflow
[1] jezrael
[2] mozway