How to drop rows that do not fulfill a condition based on two dataframes in Python?
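I have two dataframes. For example, the first has an index running from 3 December 2006 to 20 December 2006. The second has dates ranging from 2000 to 2020. I want to drop rows from the second whenever the date does not fall within the interval of the first.
Consider the following example:
The first one is this: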
index value
'2006-12-03 13:06:21.955000' 3
'2006-12-03 13:14:54.100000' 4
'2006-12-04 13:23:25.929000' 5
'2006-12-05 13:31:58.074000' 6
'2006-12-05 13:40:29.903000' 7
'2006-12-05 13:49:02.048000' 8
'2006-12-06 13:57:33.877000' 9
.
.
.
'2006-12-20 14:06:06.022000' 100
'2006-12-20 14:14:37.851000' 110
The second one is this:
id date name
.
.
.
39 2005-08-22 17:27:00 O
40 2005-09-07 17:40:00 F
41 2006-12-05 10:35:00 X
42 2006-12-13 02:40:00 F
43 2010-08-14 10:05:00 F
44 2011-03-07 20:12:00 M
45 2011-06-07 08:03:00 U
46 2011-08-04 04:12:00 M
47 2011-08-09 08:05:00 P
48 2011-09-22 11:01:00 L
49 2011-11-26 07:10:00 N
50 2012-01-23 03:59:00 M
51 2012-01-27 18:37:00 X
.
.
.
What I want is the second one, edited as follows:
41 2006-12-05 10:35:00 X
42 2006-12-13 02:40:00 F
--> Only the dates that are also present in the first one are kept.
I tried the following command to drop rows based on the condition:
second_df = second_df[(second_df.date < date_start_first) | (second_df.date > date_end_first)]
(I was inspired by this answer.)
Unfortunately, the line of code above does not work...
date_start_first and date_end_first are extracted as follows:
date_start_first = getStartEndDatesOfDataframe(first_df, "start")
date_end_first = getStartEndDatesOfDataframe(first_df, "end")
using this function:
def getStartEndDatesOfDataframe(dataSeriesName, start_or_end):
    if (start_or_end == "start"):
        date = dataSeriesName.index[0]
    else:
        date = dataSeriesName.index[len(dataSeriesName.index)-1]
    return date
Can you help me solve this problem?
P.S.: Both "dates" have the same type, which I verified with the type() function:
print(type(second_df.date[3]), type(first_df.index[3]))
which gives:
<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
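One likely reason the attempted line misbehaves is that its mask selects rows before the start OR after the end, which is exactly the set of rows that should be dropped. Below is a minimal sketch of the opposite mask, which keeps rows inside the interval. The sample frames are shortened versions of the ones in the question, and taking the bounds from index.min() and index.max() rather than from the first and last index entries is an assumption made here so that the sketch also works on an unsorted index.

```python
import pandas as pd

# Shortened stand-ins for the two dataframes from the question.
first_df = pd.DataFrame(
    {"value": [3, 9, 110]},
    index=pd.to_datetime([
        "2006-12-03 13:06:21.955000",
        "2006-12-06 13:57:33.877000",
        "2006-12-20 14:14:37.851000",
    ]),
)
second_df = pd.DataFrame(
    {
        "date": pd.to_datetime([
            "2005-09-07 17:40:00",
            "2006-12-05 10:35:00",
            "2006-12-13 02:40:00",
            "2010-08-14 10:05:00",
        ]),
        "name": ["F", "X", "F", "F"],
    },
    index=[40, 41, 42, 43],
)

# Interval bounds taken from the first dataframe's index.
date_start_first = first_df.index.min()
date_end_first = first_df.index.max()

# Keep rows INSIDE the interval: both conditions must hold, hence `&`.
inside = (second_df["date"] >= date_start_first) & (second_df["date"] <= date_end_first)
filtered = second_df[inside]
print(filtered)
```

The same mask can also be written as second_df["date"].between(date_start_first, date_end_first), which includes both endpoints by default.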
Use string slicing together with apply and lambda:
import pandas as pd
df_1 = pd.DataFrame([
    ['2006-12-03 13:06:21.955000', 3],
    ['2006-12-03 13:14:54.100000', 4],
    ['2006-12-04 13:23:25.929000', 5],
    ['2006-12-05 13:31:58.074000', 6],
    ['2006-12-05 13:40:29.903000', 7],
    ['2006-12-05 13:49:02.048000', 8],
    ['2006-12-06 13:57:33.877000', 9]
    ], columns=["Date", "value"]
)
df_2 = pd.DataFrame([
    ["2005-08-22 17:27:00", "O"],
    ["2005-09-07 17:40:00", "F"],
    ["2006-12-05 10:35:00", "X"],
    ["2006-12-13 02:40:00", "F"],
    ["2010-08-14 10:05:00", "F"],
    ["2011-03-07 20:12:00", "M"],
    ["2011-06-07 08:03:00", "U"],
    ["2011-08-04 04:12:00", "M"],
    ["2011-08-09 08:05:00", "P"],
    ["2011-09-22 11:01:00", "L"],
    ["2011-11-26 07:10:00", "N"],
    ["2012-01-23 03:59:00", "M"],
    ["2012-01-27 18:37:00", "X"]
    ], columns=["Date", "name"]
)
df_1.set_index(["Date"], inplace=True)
dt = [d[:10] for d in df_1.index.values]
filt = df_2.Date.apply(lambda x: x[:10] in dt)
print(df_2[filt])
This produces:
Date name
2 2006-12-05 10:35:00 X
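The string-slicing approach above assumes the dates are stored as strings, whereas the question reports Timestamp values. As a rough sketch of the same calendar-day matching on real datetimes, the dates can be truncated to midnight with normalize() and compared with isin(). The shortened sample data and the names days_in_first and result are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Shortened sample data, parsed to datetimes instead of kept as strings.
df_1 = pd.DataFrame(
    {"value": [3, 5, 6, 9]},
    index=pd.to_datetime([
        "2006-12-03 13:06:21.955000",
        "2006-12-04 13:23:25.929000",
        "2006-12-05 13:31:58.074000",
        "2006-12-06 13:57:33.877000",
    ]),
)
df_2 = pd.DataFrame({
    "Date": pd.to_datetime([
        "2005-09-07 17:40:00",
        "2006-12-05 10:35:00",
        "2006-12-13 02:40:00",
        "2010-08-14 10:05:00",
    ]),
    "name": ["F", "X", "F", "F"],
})

# Calendar days (time set to midnight) that occur in df_1's index.
days_in_first = df_1.index.normalize().unique()

# Keep rows of df_2 whose calendar day also occurs in df_1.
filt = df_2["Date"].dt.normalize().isin(days_in_first)
result = df_2[filt]
print(result)
```

Note that this keeps a row only when its exact calendar day appears in df_1, so 2006-12-13 is dropped with this sample data. If every date between the first and last index entries should be kept, a range comparison against the minimum and maximum of the index is the better fit.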