Filtering CSV file of URLs based on String Match
I have a large CSV file with a few thousand rows and no header; each row contains a single URL.
Some sample rows:
http://www.whitehouse.gov/the-press-office/2012/01/27/remarks-president-college-affordability-ann-arbor-michigan
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-palm-beach-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-energy-aurora-colorado
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-sarasota-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-goya-foods-miplato-announcement-tampa-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-made-energy
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-intel-ocotillo-campus-chandler-az
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-first-lady-school-lunch-standards-announcement
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-conveyor-engineering-and-manufacturing-cedar-rapids-io
http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address
http://www.whitehouse.gov/the-press-office/2012/01/23/remarks-president-welcoming-2011-stanley-cup-champion-boston-bruins
http://www.whitehouse.gov/the-press-office/2012/01/21/weekly-address-creating-jobs-boosting-tourism
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-2
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-1
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-0
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-unveiling-strategy-help-boost-travel-and-tourism
http://www.whitehouse.gov/the-press-office/2012/01/17/remarks-president-and-first-lady-honoring-2011-world-champion-st-louis-c
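I want to filter these URLs so that I can write the results to separate CSV files. I have tried several grep and awk options, but I always get too many results that do not match the string I am searching for.
For example, I would like
grep "remarks-president" speechurls.csv >> remarks-president_urls.csv
to return only the URLs that contain "remarks-president". Example: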
http://www.whitehouse.gov/the-press-office/2012/01/27/remarks-president-college-affordability-ann-arbor-michigan
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-energy-aurora-colorado
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-made-energy
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-intel-ocotillo-campus-chandler-az
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-conveyor-engineering-and-manufacturing-cedar-rapids-io
http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address
http://www.whitehouse.gov/the-press-office/2012/01/23/remarks-president-welcoming-2011-stanley-cup-champion-boston-bruins
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-2
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-1
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-0
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-unveiling-strategy-help-boost-travel-and-tourism
http://www.whitehouse.gov/the-press-office/2012/01/17/remarks-president-and-first-lady-honoring-2011-world-champion-st-louis-c
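Similarly,
grep "remarks-first-lady" speechurls.csv >> remarks-first-lady_urls.csv
should return every speech whose URL contains "remarks-first-lady".
I tried other options, but they did not help.
grep -w -l "remarks-president" speechurls.csv >> remarks-president_urls.csv
I also tried the following, with no luck.
awk -F, '$1 ~ /remarks-president|president-obama/ {print}' speechurls.csv
fgrep -w "remarks-vice-president" speechurls.csv
I am not entirely sure how to solve this. Any help would be greatly appreciated. If there is a better way to do this in Python, I am open to that solution as well.
Situations like this are a fun excuse to write a quick-and-dirty Python script. I believe the following should work.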
import csv

with open('speechurls.csv', 'r') as f:
    for row in csv.reader(f):
        if 'remarks-president' in row[0]:
            with open('remarks-president_urls.csv', 'a') as f1:
                f1.write("{}\n".format(row[0]))
        elif 'remarks-first-lady' in row[0]:
            with open('remarks-first-lady_urls.csv', 'a') as f2:
                f2.write("{}\n".format(row[0]))
        else:
            pass
It is not pretty and has no elegant design, but it works and seems to do what you asked.
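The script above reopens an output file for every matching row. Below is a minimal sketch of the same idea that opens each output file once. The helper name `split_urls` and its parameters are hypothetical, and it assumes one URL per row in the first CSV column.

```python
import csv


def split_urls(src_path, president_path, first_lady_path):
    """Copy URLs from src_path into two files by substring match.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns how many URLs were written to each file.
    """
    counts = {"president": 0, "first_lady": 0}
    # newline="" is the mode the csv module documentation recommends for csv.reader.
    with open(src_path, newline="") as src, \
         open(president_path, "w") as pres, \
         open(first_lady_path, "w") as lady:
        for row in csv.reader(src):
            if not row:  # csv.reader yields [] for blank lines
                continue
            url = row[0]
            if "remarks-president" in url:
                pres.write(url + "\n")
                counts["president"] += 1
            elif "remarks-first-lady" in url:
                lady.write(url + "\n")
                counts["first_lady"] += 1
    return counts
```

Calling `split_urls('speechurls.csv', 'remarks-president_urls.csv', 'remarks-first-lady_urls.csv')` would reproduce the two output files; because the outputs are opened with `'w'`, running it again replaces them instead of appending.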
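I don't quite understand the problem. The command grep "remarks-first-lady" speechurls.csv should work fine in this case.
The problem you are seeing may come from ">>". ">>" appends new lines to an existing file; if you want a file that contains only the output of the command, you need to use ">" instead of ">>".
If you could also point out what goes wrong with your code, I might be able to identify your problem better.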
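The difference between ">>" and ">" has a direct counterpart in Python's file modes: `'a'` appends and `'w'` truncates first. The small sketch below simulates running the same filter twice against a temporary directory; `write_matches`, `count_lines`, and the example.com URLs are made up for illustration.

```python
import os
import tempfile


def write_matches(path, lines, mode):
    """Write lines to path; mode 'a' appends (like >>), 'w' truncates (like >)."""
    with open(path, mode) as out:
        for line in lines:
            out.write(line + "\n")


def count_lines(path):
    """Return the number of lines currently in the file."""
    with open(path) as f:
        return sum(1 for _ in f)


# Made-up matches standing in for the output of a filter run.
matches = ["http://example.com/remarks-first-lady-1",
           "http://example.com/remarks-first-lady-2"]

with tempfile.TemporaryDirectory() as tmp:
    appended = os.path.join(tmp, "appended.csv")
    overwritten = os.path.join(tmp, "overwritten.csv")
    for _ in range(2):  # simulate running the same command twice
        write_matches(appended, matches, "a")
        write_matches(overwritten, matches, "w")
    appended_count = count_lines(appended)        # 4: every run adds its rows again
    overwritten_count = count_lines(overwritten)  # 2: only the latest run remains
```

If the output file already held rows from earlier experiments, appending would explain seeing more results than the current pattern matches.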
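I just wanted to post an update to my question. Thanks to @Muttonchop's help, I have been able to solve the CSV filtering problem.
This Python solution works very well. Building on @Muttonchop's initial answer, here is the complete code I ended up with: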
import csv


def filterSpeechURL():
    # The 'rU' mode was removed in Python 3.11; newline='' is what the csv docs recommend.
    with open('speechurls.csv', newline='') as f:
        for row in csv.reader(f):
            # Filter President Obama
            if 'remarks-president' in row[0]:
                with open('__president_urls.csv', 'a') as f1:
                    f1.write("{}\n".format(row[0]))
            elif 'weekly-address' in row[0]:
                with open('__president_urls.csv', 'a') as f1:
                    f1.write("{}\n".format(row[0]))
            elif 'letter' in row[0]:
                with open('__president_urls.csv', 'a') as f1:
                    f1.write("{}\n".format(row[0]))
            elif 'statement-president' in row[0]:
                with open('__president_urls.csv', 'a') as f1:
                    f1.write("{}\n".format(row[0]))
            elif 'president-obama' in row[0]:
                with open('__president_urls.csv', 'a') as f1:
                    f1.write("{}\n".format(row[0]))
            elif 'excerpts-president' in row[0]:
                with open('__president_urls.csv', 'a') as f1:
                    f1.write("{}\n".format(row[0]))
            # Filter First Lady
            elif 'remarks-first-lady' in row[0]:
                with open('__first-lady_urls.csv', 'a') as f2:
                    f2.write("{}\n".format(row[0]))
            # Filter VP
            elif 'vice-president' in row[0]:
                with open('__vice_president_urls.csv', 'a') as f2:
                    f2.write("{}\n".format(row[0]))
            # Filter Jill Biden
            elif 'jill' in row[0]:
                with open('__second-lady_urls.csv', 'a') as f2:
                    f2.write("{}\n".format(row[0]))
            elif 'dr-biden' in row[0]:
                with open('__second-lady_urls.csv', 'a') as f2:
                    f2.write("{}\n".format(row[0]))
            # Filter Everything Else
            else:
                with open('__other_urls.csv', 'a') as f2:
                    f2.write("{}\n".format(row[0]))


filterSpeechURL()
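The long elif chain can also be written as an ordered list of (keywords, output file) rules, so that adding a category becomes a one-line change. This is only a sketch: `RULES`, `DEFAULT_FILE`, `classify`, and `filter_speech_urls` are names introduced here, and the rule order copies the chain above so that the first match still wins.

```python
import csv
from collections import defaultdict

# Ordered rules: the first rule with a matching keyword wins,
# mirroring the if/elif chain. All names here are illustrative.
RULES = [
    (("remarks-president", "weekly-address", "letter", "statement-president",
      "president-obama", "excerpts-president"), "__president_urls.csv"),
    (("remarks-first-lady",), "__first-lady_urls.csv"),
    (("vice-president",), "__vice_president_urls.csv"),
    (("jill", "dr-biden"), "__second-lady_urls.csv"),
]
DEFAULT_FILE = "__other_urls.csv"


def classify(url):
    """Return the output file name for a URL using plain substring tests."""
    for keywords, filename in RULES:
        if any(keyword in url for keyword in keywords):
            return filename
    return DEFAULT_FILE


def filter_speech_urls(src_path="speechurls.csv"):
    """Group URLs by output file, then write each output file once."""
    groups = defaultdict(list)
    with open(src_path, newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                groups[classify(row[0])].append(row[0])
    for filename, urls in groups.items():
        # 'w' rather than 'a', so rerunning does not duplicate rows.
        with open(filename, "w") as out:
            out.writelines(url + "\n" for url in urls)
```

Because the match is a plain substring test, a broad keyword such as 'letter' also catches any URL that merely contains those characters, so it may be worth checking `__president_urls.csv` for unexpected entries.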