Python script to clean .csv file based on array values
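I'm new to Python, so please bear with me. I've cobbled this together from things I found online, but it still isn't working properly.
I want a Python script that reads a given spreadsheet (list.csv), checks it against a list of "key_words", and exports a file named "cleaned.csv" containing only the rows that do NOT contain any of the "key_words". I want it to look only at the first column [0]. If possible, I'd also like it to export a second spreadsheet containing the rows that did match a keyword, just to verify what it is catching.
The current code looks at the whole csv file, and I've found that it leaves some rows out of "cleaned.csv" that technically should be there, unless something is wrong with my array.
Here is my current code...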
key_words = ['Dog', 'Cat', 'Bird', 'Cow']
with open('list.csv') as oldfile, open('cleaned.csv', 'w') as newfile:
    for line in oldfile:
        if not any(key_word in line for key_word in key_words):
            newfile.write(line)
The first few rows of data are...
Dog,Walks,Land,4legs,
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
Cleaned.csv should show:
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Other.csv (the rejects, i.e. rows matching the array) should show:
Dog,Walks,Land,4legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
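Well, the code looks fine and works for me, so there is nothing wrong with it as such.
If you only want to look at the first column, you have to split the line on ",":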
key_words = ['Dog', 'Cat', 'Bird', 'Cow']
with open('list.csv') as oldfile, open('cleaned.csv', 'w') as cleaned, open('matched.csv', 'w') as matched:
    for line in oldfile:
        # Only the text before the first comma (the first column) is tested
        if not any(key_word in line.split(",", 1)[0] for key_word in key_words):
            cleaned.write(line)
        else:
            matched.write(line)
If the first column is always a single word rather than a sentence (such as "Dog is out"), you can tighten the test to an exact match:
    if line.split(",", 1)[0] not in key_words:
Note: string tests are case-sensitive, so 'dog' will not match 'Dog'.
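One way to sidestep the case-sensitivity caveat is to normalize both sides before comparing. This is only a minimal sketch, assuming the first column holds a single word; the helper name `split_rows` and the lower-casing strategy are illustrative choices, not part of the answer above.

```python
def split_rows(lines, key_words):
    """Partition lines by a case-insensitive exact match on the first column."""
    # Lower-case the keywords once; a set gives fast membership tests
    wanted = {word.lower() for word in key_words}
    cleaned, matched = [], []
    for line in lines:
        # Text before the first comma, trimmed and lower-cased for comparison
        first_column = line.split(",", 1)[0].strip().lower()
        if first_column in wanted:
            matched.append(line)
        else:
            cleaned.append(line)
    return cleaned, matched


rows = [
    "dog,Walks,Land,4legs,\n",
    "Fish,Swims,Water,fins,\n",
    "COW,Walks,Land,4legs,\n",
]
cleaned, matched = split_rows(rows, ["Dog", "Cat", "Bird", "Cow"])
print(cleaned)
print(matched)
```

The original lines are written back unchanged; only the comparison is lower-cased, so the output files keep the casing of the input.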
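Note that passing maxsplit=1 here, as in line.split(",", 1), improves string-parsing performance, especially when your lines are long, because it stops parsing after finding the first comma and returns a list of 2 items. The first item is your first column. Read more here: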
https://docs.python.org/2/library/stdtypes.html#str.split
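To make the effect of maxsplit concrete, here is a small comparison on one of the sample rows. The variable names `full` and `head` are just for illustration.

```python
line = "Dog,Walks,Land,4legs,\n"

# Without maxsplit every comma is a split point
full = line.split(",")

# With maxsplit=1 only the first comma is used; the rest of the line stays intact
head = line.split(",", 1)

print(full)     # ['Dog', 'Walks', 'Land', '4legs', '\n']
print(head)     # ['Dog', 'Walks,Land,4legs,\n']
print(head[0])  # Dog
```

In both cases index 0 is the first column, so the limited split gives the same answer with less work.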
Test results:
mac: cat list.csv
Dog,Walks,Land,4legs,
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
mac: cat cleaned.csv
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
mac: cat matched.csv
Dog,Walks,Land,4legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
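Splitting on "," by hand assumes the first field never contains a quoted comma. If that cannot be guaranteed, the standard library csv module can do the parsing. The following is a sketch under that assumption rather than a drop-in replacement; the function name `filter_csv` and its parameters are hypothetical, and it uses an exact match on the first column.

```python
import csv


def filter_csv(source_path, cleaned_path, matched_path, key_words):
    """Send each row to the matched or cleaned file based on column 0."""
    wanted = set(key_words)
    # newline="" lets the csv module manage line endings itself
    with open(source_path, newline="") as src, \
            open(cleaned_path, "w", newline="") as cleaned, \
            open(matched_path, "w", newline="") as matched:
        cleaned_writer = csv.writer(cleaned, lineterminator="\n")
        matched_writer = csv.writer(matched, lineterminator="\n")
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            # Exact match on the first column decides the destination
            target = matched_writer if row[0] in wanted else cleaned_writer
            target.writerow(row)
```

With the files from the question it would be called as filter_csv("list.csv", "cleaned.csv", "matched.csv", ["Dog", "Cat", "Bird", "Cow"]).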
Here is a pure pandas approach:
In [51]:
import io
import pandas as pd

key_words = ['Dog', 'Cat', 'Bird', 'Cow']
t = """Dog,Walks,Land,4legs
Fish,Swims,Water,fins
Kangaroo,Hops,Land,2legs
Cow,Walks,Land,4legs
Bird,Flies,Air,2legs"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[51]:
          0      1      2      3
0       Dog  Walks   Land  4legs
1      Fish  Swims  Water   fins
2  Kangaroo   Hops   Land  2legs
3       Cow  Walks   Land  4legs
4      Bird  Flies    Air  2legs
We can build a regex pattern, pass it to str.contains, and negate the boolean condition to mask the df before calling to_csv:
In [55]:
pat = '|'.join(key_words)
# Negate outside apply so that ~ acts on the boolean Series rather than on a scalar
df[~df.apply(lambda x: x.str.contains(pat).any(), axis=1)]
Out[55]:
          0      1      2      3
1      Fish  Swims  Water   fins
2  Kangaroo   Hops   Land  2legs
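So we use apply with axis=1 to run our lambda on each row. Inside it, any reports whether any column in that row matches the pattern, and negating that result yields True only for rows in which no column contains one of our keywords: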
In [56]:
~df.apply(lambda x: x.str.contains(pat).any(), axis=1)
Out[56]:
0    False
1     True
2     True
3    False
4    False
dtype: bool
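The mask above tests every column, whereas the question asks for the first column only and for a second file with the matching rows. A possible variation, sketched here with an exact match via isin on column 0, produces both halves from one mask. The names `mask`, `cleaned`, and `matched` are illustrative, and the CSV text is returned as strings so the example runs without touching the disk.

```python
import io

import pandas as pd

key_words = ["Dog", "Cat", "Bird", "Cow"]
data = """Dog,Walks,Land,4legs
Fish,Swims,Water,fins
Kangaroo,Hops,Land,2legs
Cow,Walks,Land,4legs
Bird,Flies,Air,2legs"""

df = pd.read_csv(io.StringIO(data), header=None)

# True where the first column is exactly one of the keywords
mask = df[0].isin(key_words)

matched = df[mask]    # rows whose first column is a keyword
cleaned = df[~mask]   # everything else

# With no path argument, to_csv returns the CSV text as a string
cleaned_csv = cleaned.to_csv(index=False, header=False)
matched_csv = matched.to_csv(index=False, header=False)
print(cleaned_csv)
print(matched_csv)
```

Passing a path such as "cleaned.csv" or "matched.csv" as the first argument of to_csv writes the files instead of returning strings.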