Given a list with several junk links, how do I extract all the links that end in .pdf?
I have a pandas DataFrame column with several links in each cell:
Name|COL
San Diego|'https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/99/293-_9302SDFS 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/98/019787-S16_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S15_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S14_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf https://foo.com/energy_docs/tyv/96/019787-S12_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S11_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S10_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S9_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S8_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/19-787s007_Amlodipine.cfm https://foo.com/energy_docs/tyv/pre96/019787-S6_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S5_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S4_gas GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S3_gas_toc.cfm https://foo.com/energy_docs/tyv/pre96/019787-S2_gas GAS_TPC.cfm'
Washington|'https://foo.com/energy_docs/a32/2007/022136.cfm'
Texas|'https://foo.com/energy/29380/no_ant/USA/2/2007.pdf'
How can I extract all the links that end in .pdf, like this:
Name|COL
San Diego|https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf
San Diego|https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
San Diego|https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
San Diego|https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf
Washington|NaN
Texas|https://foo.com/energy/29380/no_ant/USA/2/2007.pdf
I tried:
import re

def url_extractor(row):
    url = str(row)
    # raw string so that \s and \. reach the regex engine unchanged
    r = re.compile(r'(http[^\s]+\.pdf)')
    urls = r.findall(url)
    if len(urls) == 0:
        return 'NaN'
    else:
        return ' '.join(urls)
In:
df4['COL'] = df4['COL'].apply(url_extractor)
df4
Out:
Name COL
0 San Diego https://foo.com/energy_docs/tyv/99/19787s022_g...
1 Washington NaN
2 Texas https://foo.com/energy/29380/no_ant/USA/2/2007...
But I don't understand how to do the stacking/splitting of rows so that I get one link/URL per row. For example, let's inspect the first row:
In:
df4['COL'][0]
Out:
'https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf
https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf
https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf
https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf'
Each link should be "mapped" to its name, San Diego.
You should use [^\s], or the shorter \S, instead of [^<]. Then add \.pdf after it:
(http\S+\.pdf)
Edit:
Yes, you can also use word boundaries if you like.
(\bhttp.*?\.pdf\b)
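To see how the two patterns behave, here is a minimal sketch run against a shortened, hypothetical cell built from the links in the question (the variable names are made up for illustration). One thing worth checking on your own data: because `.` can match a space, the lazy `.*?` variant may begin at an earlier non-PDF link and run on to the next `.pdf`, whereas `\S+` cannot cross whitespace.

```python
import re

# Shortened, hypothetical sample in the same shape as one cell of COL.
cell = (
    "https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm "
    "https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf "
    "https://foo.com/energy_docs/tyv/97/019787-S15_gas 2.5 KM GAS_TPC.cfm "
    "https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf"
)

# \S+ cannot cross whitespace, so each match stays inside a single link.
pdf_links = re.findall(r"http\S+\.pdf", cell)

# .*? can cross whitespace, so a match may start at an earlier non-PDF link.
lazy_links = re.findall(r"\bhttp.*?\.pdf\b", cell)

print(pdf_links)
print(lazy_links)
```

If the junk links are mixed in between the PDF links, as they are in the question, the `\S+` form looks like the safer choice.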
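If this is already loaded into a pandas DataFrame, you can use pandas' built-in string methods to split the strings in COL into lists, keep only the elements you want from each list, reshape the column of lists into one long Series, and then merge it back with the original DataFrame: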
import pandas as pd

# split COL on whitespace and keep only the pieces that end in '.pdf'
COL_series = df.COL.str.split().apply(lambda x: [y for y in x if y.endswith('.pdf')])
# create a long-format series from the lists
COL_series = COL_series.apply(pd.Series).stack().reset_index(level=1, drop=True)
COL_series.name = 'COL'
# merge with df; the outer join brings back rows that had no .pdf link as NaN
pd.merge(df.Name.reset_index(),
         COL_series.reset_index(),
         how='outer',
         on='index').drop('index', axis=1)
# returns:
Name COL
0 San Diego https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf
1 San Diego https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf
2 San Diego https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
3 San Diego https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf
4 San Diego https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
5 San Diego https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf
6 Washington NaN
7 Texas https://foo.com/energy/29380/no_ant/USA/2/2007.pdf
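As an alternative sketch (not part of the answer above), the regex and the reshaping can be combined using `Series.str.findall` and `DataFrame.explode`. This assumes pandas 0.25 or later, where `explode` is available, and the small `df` below is a hypothetical, shortened stand-in for the data in the question.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the question's data frame.
df = pd.DataFrame({
    "Name": ["San Diego", "Washington", "Texas"],
    "COL": [
        "https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm "
        "https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf "
        "https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf",
        "https://foo.com/energy_docs/a32/2007/022136.cfm",
        "https://foo.com/energy/29380/no_ant/USA/2/2007.pdf",
    ],
})

# One list of .pdf links per row; rows without a match get an empty list.
links = df["COL"].str.findall(r"http\S+\.pdf")

# explode() gives each list element its own row and turns empty lists into NaN.
result = df.assign(COL=links).explode("COL").reset_index(drop=True)
print(result)
```

Here each San Diego link ends up on its own row, Washington keeps a single row with NaN, and no merge step is needed because `explode` repeats the `Name` value for every link.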