Failed to clean a CSV file in Python

Question

I am trying to clean some Twitter data stored in a CSV file using Python in a Jupyter notebook, so I tried the following code:

unwanted_characters = [',', '@', '\n','&','_']
           
with open('facebook_Tweet.csv','r') as f:
   with open('cleaned_facebook_Tweet.csv','w') as ff:
        for unwanted in unwanted_characters:
            ff.write(f.read().replace(unwanted,''))

                    
tweety = pd.read_csv("cleaned_facebook_Tweet.csv", error_bad_lines=False)
tweety.head()

When I run this code, I get this result:

tweet1:Time:Sun Dec 06 09:59:02 +0000 2020 tweet text:RT @_Aaron_Anthony_: Seen this of Facebook and it hit home.\n\nRemember this Christmas if someone pays \u00a320 for a gift for you and they get\u2026
tweet2:Time:Sun Dec 06 09:59:02 +0000 2020 tweet text:RT @TopAchat: Concours \ud83c\udf81 #PetitPapaTopAchat \ud83c\udf84\n\n\ud83d\udd25 + de 60 000 \u20ac de cadeaux \u00e0 gagner !\n\nCa continue avec le #Lot7 de 4333 \u20ac ! \ud83d\udd25\n\nPour partici\u2026

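As you can see, the unwanted characters are still there. My code only removed the first unwanted character in the list, ",", and left the others, such as "@" and "\n", in place.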

How can I fix my code? Thank you very much.

Answer 1

Hi!

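You "cannot" read an open file twice, or at least it will not work the way you expect. When you call f.read(), it returns the file's content from start to end and leaves the read cursor at the end of the file. So when you call f.read() again, it returns nothing (an empty string).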
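A minimal sketch of the read-cursor behavior described above. It uses io.StringIO as a stand-in for the opened file (an assumption made so the example runs without facebook_Tweet.csv); a real text file opened with open() behaves the same way for read() and seek().

```python
import io

# io.StringIO stands in for an opened text file (assumption for this demo)
f = io.StringIO("a,b@c\n")

first = f.read()    # reads everything and leaves the cursor at the end
second = f.read()   # nothing is left to read, so this is an empty string

f.seek(0)           # move the cursor back to the start
third = f.read()    # the full content is available again

print(repr(first), repr(second), repr(third))
```

Rewinding with seek(0) makes a second read possible, but reading once into a variable is usually simpler.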

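Besides, even if it worked the way you imagined, each replacement would write the whole file out again, so the end result would not be what you expect. That is because you call the write() method multiple times.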
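To illustrate both points, here is a small sketch that mimics the shape of the loop from the question on in-memory buffers. The sample text and the shortened character list are made up for the demo, and io.StringIO replaces the two files.

```python
import io

unwanted_characters = [',', '@', '&']   # shortened list, for illustration only
source = io.StringIO("a,b@c&d")         # stands in for the input file

# Same loop shape as in the question: only the first read() returns text
out = io.StringIO()
for unwanted in unwanted_characters:
    out.write(source.read().replace(unwanted, ''))
original_result = out.getvalue()

# Rewinding before each read "fixes" the reads, but every iteration
# then writes a full copy with just one character removed
out = io.StringIO()
for unwanted in unwanted_characters:
    source.seek(0)
    out.write(source.read().replace(unwanted, ''))
rewound_result = out.getvalue()

print(original_result)
print(rewound_result)
```

In the first loop only "," is removed; in the second the output holds three concatenated copies, none of them fully cleaned.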

My suggestion is to use an intermediate variable, like this:

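import pandas as pd

unwanted_characters = [',', '@', '\n', '&', '_']

# Read the whole file once and keep its content in a variable
with open('facebook_Tweet.csv', 'r') as f:
    output_string = f.read()

# Strip each unwanted character from the in-memory string
for unwanted in unwanted_characters:
    output_string = output_string.replace(unwanted, '')

# Write the cleaned text exactly once
with open('cleaned_facebook_Tweet.csv', 'w') as ff:
    ff.write(output_string)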
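As an alternative to chaining replace() calls, the same characters could be removed in a single pass with str.translate. This is only a sketch, not part of the original answer: the helper name clean_text and the sample string are hypothetical.

```python
unwanted_characters = [',', '@', '\n', '&', '_']

# Translation table whose third argument lists the characters to delete
table = str.maketrans('', '', ''.join(unwanted_characters))

def clean_text(text):
    """Remove every unwanted character from text in one pass."""
    return text.translate(table)

# Hypothetical sample line, loosely modeled on the tweets in the question
sample = "RT @_Aaron_Anthony_: a, b & c\n"
cleaned = clean_text(sample)
print(cleaned)
```

Note that '\n' here is a real newline character. If the file actually stores the two-character text of a backslash followed by n, as the printed sample might suggest, that sequence would need its own replace('\\n', '') step.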

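The final code is not what matters most; I think it is more important that you understand the concept I explained. I also suggest reading this doc and maybe this one.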

Also, this question on Stack Overflow may help you understand what I said.

python

csv

twitter

pandas

jupyter-notebook