How to apply a function to multiple MultiIndex columns in pandas?
Given a DataFrame with MultiIndex columns:
a ...
E1 ... E3
g1 g2 g3 ... g1 g2 g3
0 0.548814 0.715189 0.602763 ... 0.437587 0.891773 0.963663
1 0.383442 0.791725 0.528895 ... 0.087129 0.020218 0.832620
2 0.778157 0.870012 0.978618 ... 0.118274 0.639921 0.143353
3 0.944669 0.521848 0.414662 ... 0.568434 0.018790 0.617635
4 0.612096 0.616934 0.943748 ... 0.697631 0.060225 0.666767
5 0.670638 0.210383 0.128926 ... 0.438602 0.988374 0.102045
6 0.208877 0.161310 0.653108 ... 0.158970 0.110375 0.656330
7 0.138183 0.196582 0.368725 ... 0.096098 0.976459 0.468651
8 0.976761 0.604846 0.739264 ... 0.296140 0.118728 0.317983
9 0.414263 0.064147 0.692472 ... 0.093941 0.575946 0.929296
[10 rows x 9 columns]
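I want to apply functions (for example ration_type1, ration_type2, or more in the real implementation) to groups of columns selected by the second level (i.e. E1, E2, E3).
For example, suppose we want to compute ration_type1 and ration_type2 for the second-level value E1. Then we only process the following df: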
a
E1
g1 g2 g3
0 0.548814 0.715189 0.602763
1 0.383442 0.791725 0.528895
.................
8 0.976761 0.604846 0.739264
9 0.414263 0.064147 0.692472
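To generalize over all second-level values, I rely on the list comprehensions below, one for each of ration_type1 and ration_type2:
all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in ['E1', 'E2', 'E3']]
all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in ['E1', 'E2', 'E3']]
and then concatenate the results back onto the original df.
However, I wonder whether there is a more elegant and compact way than the list comprehension approach, because in real life there can be many more ration functions.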
The full code is below:
import numpy as np
import pandas as pd

np.random.seed(0)
arr = np.random.rand(10, 9)
tuples = [('a', 'E1', 'g1'), ('a', 'E1', 'g2'), ('a', 'E1', 'g3'), ('a', 'E2', 'g1'), ('a', 'E2', 'g2'),
          ('a', 'E2', 'g3'), ('a', 'E3', 'g1'), ('a', 'E3', 'g2'), ('a', 'E3', 'g3')]
df = pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples(tuples))
print(df)


def ration_type1(df):
    """
    (g3+g2)/g1
    # Ugly way since have to convert to numpy 1st
    """
    print(df)
    dration = 'ration_type1'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g2')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()
    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))


def ration_type2(df):
    """
    (g3+g2+g1)/g1
    # Ugly way since have to convert to numpy 1st
    """
    dration = 'ration_type2'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g1')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g2')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()
    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))


level1_name = list(set(df.columns.get_level_values(1)))
all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]
all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]
df1 = pd.concat(all_df1, axis=1)
df2 = pd.concat(all_df2, axis=1)
df = pd.concat([df, df1, df2], axis=1)
Expected output:
a ...
E1 ... E2 E3
g1 g2 g3 ... ration_type2 ration_type2 ration_type2
0 0.548814 0.715189 0.602763 ... 3.401458 2.962896 5.240151
1 0.383442 0.791725 0.528895 ... 4.444124 2.754497 10.788191
2 0.778157 0.870012 0.978618 ... 3.375653 2.554145 7.622516
3 0.944669 0.521848 0.414662 ... 1.991363 5.650758 2.119612
4 0.612096 0.616934 0.943748 ... 3.549735 2.168255 2.042087
5 0.670638 0.210383 0.128926 ... 1.505949 3.960760 3.486126
6 0.208877 0.161310 0.653108 ... 4.899035 3.806001 5.822965
7 0.138183 0.196582 0.368725 ... 5.091008 2.138921 16.037821
8 0.976761 0.604846 0.739264 ... 2.376088 11.283905 2.474676
9 0.414263 0.064147 0.692472 ... 2.826423 2.391873 17.023361
[10 rows x 15 columns]
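I was thinking of doing something like this with apply:
# function that multiplies its input by 2
def multiply_by_2(number):
    return 2 * number

# executing the function
df[["Integers", "Float"]] = df[["Integers", "Float"]].apply(multiply_by_2)
However, since my example involves MultiIndex columns, I ran into difficulty (owing to my limited knowledge).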
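For reference, a column-wise apply does carry over to MultiIndex columns when the function works on one column at a time. The sketch below rebuilds the same column layout and doubles only the E1 and E2 groups; the names mask and original are illustrative and not part of the question. The ratio functions are harder because each one combines several columns within a group.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Same layout as in the question: one top level, three groups, three g columns each.
columns = pd.MultiIndex.from_product([['a'], ['E1', 'E2', 'E3'], ['g1', 'g2', 'g3']])
df = pd.DataFrame(np.random.rand(10, 9), columns=columns)
original = df.copy()  # kept only to compare against afterwards


def multiply_by_2(number):
    return 2 * number


# Boolean mask over the columns whose second level is E1 or E2 (illustrative choice).
mask = df.columns.get_level_values(1).isin(['E1', 'E2'])

# apply() passes each selected column to the function as a Series.
df.loc[:, mask] = df.loc[:, mask].apply(multiply_by_2)
```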
Answer by jezrael
This answer was accepted.
A solution that stays in the MultiIndex is not so easy: you have to filter by level and rename the g values to ration_type1 and ration_type2 so that the MultiIndex DataFrames can be divided by each other:
idx = pd.IndexSlice
c = {'g1':'ration_type1','g2':'ration_type1','g3':'ration_type1'}
# Note: groupby(..., axis=1) is deprecated since pandas 2.1.
df1 = df.loc[:, idx[:,:,['g3','g2']]].rename(columns=c).groupby(level=[0,1,2], axis=1).sum()
df11 = df1.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c))

c1 = {'g1':'ration_type2','g2':'ration_type2','g3':'ration_type2'}
df2 = df.rename(columns=c1).groupby(level=[0,1,2], axis=1).sum()
df22 = df2.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c1))
df = pd.concat([df,df11,df22],axis=1)
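As a possible variation on the MultiIndex route (not part of the original answer), each g column can be pulled out with xs as a two-level frame, so ordinary arithmetic aligns on the remaining (level 0, level 1) labels and no renaming or groupby is needed. This is a minimal sketch; the names ratios, new_cols and out are illustrative assumptions.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
columns = pd.MultiIndex.from_product([['a'], ['E1', 'E2', 'E3'], ['g1', 'g2', 'g3']])
df = pd.DataFrame(np.random.rand(10, 9), columns=columns)

# Each xs() call drops the third level, leaving columns (a, E1), (a, E2), (a, E3).
g1 = df.xs('g1', level=2, axis=1)
g2 = df.xs('g2', level=2, axis=1)
g3 = df.xs('g3', level=2, axis=1)

# One entry per ratio; adding another ratio means adding another dict entry.
ratios = {
    'ration_type1': (g2 + g3) / g1,
    'ration_type2': (g1 + g2 + g3) / g1,
}

# concat with a dict puts the keys in the outermost column level;
# reorder_levels moves them to the innermost level to match df's layout.
new_cols = pd.concat(ratios, axis=1).reorder_levels([1, 2, 0], axis=1)
out = pd.concat([df, new_cols], axis=1)
```

The result has the original nine columns followed by three ration_type1 and three ration_type2 columns.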
It is simpler to reshape first:
df1 = df.stack([0,1])
df1['ration_type1'] = df1[['g2','g3']].sum(axis=1).div(df1['g1'])
# Sum only the g columns, so the new ration_type1 column is not included in the total.
df1['ration_type2'] = df1[['g1','g2','g3']].sum(axis=1).div(df1['g1'])
print(df1)
              g1        g2        g3  ration_type1  ration_type2
0 a E1  0.548814  0.715189  0.602763      2.401458      3.401458
    E2  0.544883  0.423655  0.645894      1.962896      2.962896
    E3  0.437587  0.891773  0.963663      4.240151      5.240151
1 a E1  0.383442  0.791725  0.528895      3.444124      4.444124
    E2  0.568045  0.925597  0.071036      1.754497      2.754497
    E3  0.087129  0.020218  0.832620      9.788191     10.788191
2 a E1  0.778157  0.870012  0.978618      2.375653      3.375653
    E2  0.799159  0.461479  0.780529      1.554145      2.554145
    E3  0.118274  0.639921  0.143353      6.622516      7.622516
3 a E1  0.944669  0.521848  0.414662      0.991363      1.991363
    E2  0.264556  0.774234  0.456150      4.650758      5.650758
    E3  0.568434  0.018790  0.617635      1.119612      2.119612
4 a E1  0.612096  0.616934  0.943748      2.549735      3.549735
    E2  0.681820  0.359508  0.437032      1.168255      2.168255
    E3  0.697631  0.060225  0.666767      1.042087      2.042087
5 a E1  0.670638  0.210383  0.128926      0.505949      1.505949
    E2  0.315428  0.363711  0.570197      2.960760      3.960760
    E3  0.438602  0.988374  0.102045      2.486126      3.486126
6 a E1  0.208877  0.161310  0.653108      3.899035      4.899035
    E2  0.253292  0.466311  0.244426      2.806001      3.806001
    E3  0.158970  0.110375  0.656330      4.822965      5.822965
7 a E1  0.138183  0.196582  0.368725      4.091008      5.091008
    E2  0.820993  0.097101  0.837945      1.138921      2.138921
    E3  0.096098  0.976459  0.468651     15.037821     16.037821
8 a E1  0.976761  0.604846  0.739264      1.376088      2.376088
    E2  0.039188  0.282807  0.120197     10.283905     11.283905
    E3  0.296140  0.118728  0.317983      1.474676      2.474676
9 a E1  0.414263  0.064147  0.692472      1.826423      2.826423
    E2  0.566601  0.265389  0.523248      1.391873      2.391873
    E3  0.093941  0.575946  0.929296     16.023361     17.023361
Finally, reshape back to the original MultiIndex layout:
df = df1.unstack([1,2]).reorder_levels([1,2,0], axis=1)
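Since the question mentions that there may be many more ration functions, the stack/unstack idea could be generalized along these lines. This is only a sketch: the funcs dict, the lambdas, and the names long_df and wide are assumptions for illustration, and the final sort_index call is an optional extra that groups each E block together.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
columns = pd.MultiIndex.from_product([['a'], ['E1', 'E2', 'E3'], ['g1', 'g2', 'g3']])
df = pd.DataFrame(np.random.rand(10, 9), columns=columns)

# Hypothetical registry: output column name -> function of the long-format frame.
funcs = {
    'ration_type1': lambda d: (d['g2'] + d['g3']) / d['g1'],
    'ration_type2': lambda d: (d['g1'] + d['g2'] + d['g3']) / d['g1'],
}

# Move the first two column levels into the row index; g1, g2, g3 stay as columns.
long_df = df.stack([0, 1])

# Every function sees the untouched g columns, so new columns cannot leak into each other.
long_df = long_df.assign(**{name: f(long_df) for name, f in funcs.items()})

# Back to wide format with the original (level 0, level 1, level 2) column order.
wide = long_df.unstack([1, 2]).reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
```

Adding a further ratio then only requires one more entry in funcs; the reshaping code stays the same.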
2 years ago