Is there any method for labeling without iteration with pandas?
I have two time-based datasets. One is accelerometer measurement data, and the other is label data. For example:
accelerometer.csv
timestamp,X,Y,Z
1.0,0.5,0.2,0.0
1.1,0.2,0.3,0.0
1.2,-0.1,0.5,0.0
...
2.0,0.9,0.8,0.5
2.1,0.4,0.1,0.0
2.2,0.3,0.2,0.3
...
label.csv
start,end,label
1.0,2.0,"running"
2.0,3.0,"exercising"
The values may be unrealistic, since they are only examples.
In this case, I want to merge the two datasets into the following:
merged.csv
timestamp,X,Y,Z,label
1.0,0.5,0.2,0.0,"running"
1.1,0.2,0.3,0.0,"running"
1.2,-0.1,0.5,0.0,"running"
...
2.0,0.9,0.8,0.5,"exercising"
2.1,0.4,0.1,0.0,"exercising"
2.2,0.3,0.2,0.3,"exercising"
...
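I am currently using pandas' iterrows. However, the real data has more than 10,000 rows, so the program takes a long time to run. I think there must be at least one way to do this without iteration.
My code is as follows: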
import pandas as pd

acc = pd.read_csv("./accelerometer.csv")
labeled = pd.read_csv("./label.csv")

for index, row in labeled.iterrows():
    start = row["start"]
    end = row["end"]
    # Label every sample whose timestamp falls in [start, end)
    acc.loc[(start <= acc["timestamp"]) & (acc["timestamp"] < end), "label"] = row["label"]
How can I modify my code to get rid of the for loop?
Answers
Nick commented:
If the times in accelerometer do not fall outside the time ranges in label, you can use merge_asof:
accmerged = pd.merge_asof(acc, labeled, left_on='timestamp', right_on='start', direction='backward')
Output (for the sample data in your question):
   timestamp    X    Y    Z  start  end       label
0        1.0  0.5  0.2  0.0    1.0  2.0     running
1        1.1  0.2  0.3  0.0    1.0  2.0     running
2        1.2 -0.1  0.5  0.0    1.0  2.0     running
3        2.0  0.9  0.8  0.5    2.0  3.0  exercising
4        2.1  0.4  0.1  0.0    2.0  3.0  exercising
5        2.2  0.3  0.2  0.3    2.0  3.0  exercising
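The condition above (no accelerometer time outside the label ranges) matters because a backward merge_asof attaches the most recent start even when the timestamp is already past that row's end. A minimal sketch of one way to guard against this, assuming the sample data from the question plus one hypothetical extra row at timestamp 3.5 that lies outside every label range. The extra row and the inline CSV text are illustrative only. merge_asof also requires both frames to be sorted by their key columns, so the sketch sorts them explicitly.

```python
import io
import pandas as pd

# Sample rows from the question, plus one hypothetical row (3.5)
# that falls outside every labeled range.
acc = pd.read_csv(io.StringIO(
    "timestamp,X,Y,Z\n"
    "1.0,0.5,0.2,0.0\n"
    "1.1,0.2,0.3,0.0\n"
    "2.0,0.9,0.8,0.5\n"
    "2.1,0.4,0.1,0.0\n"
    "3.5,0.1,0.1,0.1\n"
))
labeled = pd.read_csv(io.StringIO(
    "start,end,label\n"
    '1.0,2.0,"running"\n'
    '2.0,3.0,"exercising"\n'
))

# merge_asof needs both frames sorted by their join keys.
acc = acc.sort_values("timestamp")
labeled = labeled.sort_values("start")

merged = pd.merge_asof(
    acc, labeled,
    left_on="timestamp", right_on="start",
    direction="backward",
)

# Keep a label only where the timestamp is still before the matched end;
# rows past the end (or before the first start) end up with a missing label.
merged["label"] = merged["label"].where(merged["timestamp"] < merged["end"])
merged = merged.drop(columns=["start", "end"])
print(merged)
```

Rows that match no range keep a missing label instead of silently inheriting the previous one, which mirrors the `start <= timestamp < end` test in the original loop.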
Note that you can drop the start and end columns if you want:
accmerged = accmerged.drop(['start', 'end'], axis=1)
Output:
   timestamp    X    Y    Z       label
0        1.0  0.5  0.2  0.0     running
1        1.1  0.2  0.3  0.0     running
2        1.2 -0.1  0.5  0.0     running
3        2.0  0.9  0.8  0.5  exercising
4        2.1  0.4  0.1  0.0  exercising
5        2.2  0.3  0.2  0.3  exercising
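As an alternative that is not part of the answer above, the same labeling can probably also be done with a pandas IntervalIndex, which models the `[start, end)` ranges directly. This is only a sketch under assumptions: the label ranges must not overlap (get_indexer raises an error on overlapping intervals), and the inline data, including the hypothetical out-of-range row at 3.5, is illustrative.

```python
import io
import pandas as pd

acc = pd.read_csv(io.StringIO(
    "timestamp,X,Y,Z\n"
    "1.0,0.5,0.2,0.0\n"
    "1.1,0.2,0.3,0.0\n"
    "2.0,0.9,0.8,0.5\n"
    "2.1,0.4,0.1,0.0\n"
    "3.5,0.1,0.1,0.1\n"   # hypothetical row outside every range
))
labeled = pd.read_csv(io.StringIO(
    "start,end,label\n"
    '1.0,2.0,"running"\n'
    '2.0,3.0,"exercising"\n'
))

# Half-open intervals [start, end), matching `start <= t < end` in the question.
intervals = pd.IntervalIndex.from_arrays(
    labeled["start"], labeled["end"], closed="left"
)

# Position of the interval containing each timestamp, or -1 if none does.
pos = intervals.get_indexer(acc["timestamp"])

# Look up labels by position, then blank out the rows that matched nothing
# (indexing with -1 would otherwise pick the last label).
labels = labeled["label"].to_numpy()
acc["label"] = pd.Series(labels[pos], index=acc.index).where(pos >= 0)
print(acc)
```

Unlike merge_asof, this approach does not need the frames to be sorted, but it does rely on the intervals being non-overlapping.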
2 years ago