How to edit a .csv in Python for NLP
Hi, I'm not very familiar with programming, and I found Stack Overflow while researching my task. I want to do some natural language processing on a .csv file like this one, which has about 15,000 rows:
ID | Title | Body
----------------------------------------
1 | Who is Jack? | Jack is a teacher...
2 | Who is Sam? | Sam is a dog....
3 | Who is Sarah?| Sarah is a doctor...
4 | Who is Amy? | Amy is a wrestler...
I want to read the .csv file, run some basic NLP operations on it, and write the results back to a new file or the same one. After some research, Python and NLTK seem to be the technologies I need (I hope that's right). After tokenization, I want my .csv file to look like this:
ID | Title | Body
-----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Sam" "is" "a" "dog"....
3 | "Who" "is" "Sarah" "?"| "Sarah" "is" "a" "doctor"...
4 | "Who" "is" "Amy" "?" | "Amy" "is" "a" "wrestler"...
After a day of research and piecing things together, what I actually achieved looks like this:
ID | Title | Body
----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Jack" "is" "a" "teacher"...
3 | "Who" "is" "Sarah" "?"| "Jack" "is" "a" "teacher"...
4 | "Who" "is" "Amy" "?" | "Jack" "is" "a" "teacher"...
My first idea was to read a specific cell from the .csv, process it, and write it back into the same cell, and then somehow do that automatically for all rows. Obviously I managed to read one cell and tokenize it, but I couldn't manage to write it back into that specific cell, and I'm far from being able to "do that automatically to all rows". Any help would be much appreciated.
My code:
import csv
from nltk.tokenize import word_tokenize

############Read CSV File######################
########## ID , Title, Body####################
line_number = 1    # line to read (need some kind of loop here)
column_number = 2  # column to read (need some kind of loop here)

with open('test10in.csv', 'rb') as f:
    reader = csv.reader(f)
    reader = list(reader)
    text = reader[line_number][column_number]
    stringtext = ''.join(text)  # tokenizing just works on strings
    tokenizedtext = (word_tokenize(stringtext))
    print(tokenizedtext)

#############Write back in same cell in new CSV File######
with open('test11out.csv', 'wb') as g:
    writer = csv.writer(g)
    for row in reader:
        row[2] = tokenizedtext
        writer.writerow(row)
I hope I'm asking the right question, and that someone can help me.
The pandas library will make all of this much easier. pd.read_csv() will handle the input much more easily, and you can apply the same function to a column with pd.DataFrame.apply().
Here's a quick example of how the key part you want works. In the .applymap() method, you can replace my lambda function with word_tokenize() to apply it to all elements.
In [58]: import pandas as pd

In [59]: pd.read_csv("test.csv")
Out[59]:
                      0                           1
0   wrestler Amy dog is          teacher dog dog is
1       is wrestler ? ?   Sarah doctor teacher Jack
2         a ? Sam Sarah            is dog Sam Sarah
3        Amy a a doctor              Amy a Amy Jack

In [60]: df = pd.read_csv("test.csv")

In [61]: df.applymap(lambda x: x.split())
Out[61]:
                           0                               1
0   [wrestler, Amy, dog, is]         [teacher, dog, dog, is]
1       [is, wrestler, ?, ?]  [Sarah, doctor, teacher, Jack]
2         [a, ?, Sam, Sarah]           [is, dog, Sam, Sarah]
3        [Amy, a, a, doctor]             [Amy, a, Amy, Jack]
See also: http://pandas.pydata.org/pandas-docs/stable/basics.html#row-or-column-wise-function-application
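Putting the pieces together for the asker's file, here is a minimal sketch of the full pipeline this answer describes. It assumes test10in.csv is a regular comma-separated file with ID, Title and Body columns (as the code in the question suggests) and that NLTK's tokenizer data has been downloaded; the output filename just follows the question.

import pandas as pd
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt') once

# Read the whole file at once; column names come from the header row.
df = pd.read_csv("test10in.csv")

# Tokenize every cell of the two text columns, leaving ID untouched.
df[["Title", "Body"]] = df[["Title", "Body"]].applymap(word_tokenize)

# Write the tokenized table to a new file, as the question asks.
df.to_csv("test11out.csv", index=False)

Note that to_csv() writes each token list in its Python repr form; if you want a different cell format, turn the list back into a string before writing.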
You first need to parse your file; then you can process (tokenize, etc.) each field separately.
If your file really looks like your example, I wouldn't call it a CSV. You could parse it with the csv module, which is specifically for reading all sorts of CSV files: add delimiter="|" to the arguments of csv.reader() to split each line into cells (and don't open the file in binary mode); a sketch of that route follows the code below. But your file is easy enough to parse directly:
with open('test10in.csv', encoding="utf-8") as fp:  # Or whatever encoding is right
    content = fp.read()

lines = content.splitlines()
allrows = [[fld.strip() for fld in line.split("|")] for line in lines]

# Headers and data:
headers = allrows[0]
rows = allrows[2:]
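For completeness, a minimal sketch of the csv-module route mentioned above (the delimiter="|" variant); it produces the same headers and rows, so everything below applies unchanged. The cell padding from the table layout still has to be stripped, and the dashed ruler line is skipped the same way.

import csv

with open('test10in.csv', encoding="utf-8") as fp:
    # Let csv.reader split each line on "|", then strip the padding spaces.
    allrows = [[fld.strip() for fld in row]
               for row in csv.reader(fp, delimiter="|")]

headers = allrows[0]
rows = allrows[2:]  # skip the dashed ruler line, as above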
Either way, you can then use nltk.word_tokenize() to tokenize each field of rows, and continue from there.
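Continuing from headers and rows above, a minimal sketch of that last step; the quoted-token cell format mirrors the desired output in the question, and the output filename follows the asker's code.

from nltk.tokenize import word_tokenize  # needs nltk.download('punkt') once

def tokenize_field(text):
    # Tokenize one cell and wrap each token in quotes, e.g. "Who" "is" "Jack" "?"
    return " ".join('"%s"' % tok for tok in word_tokenize(text))

# Tokenize the Title and Body fields of every row, keeping the ID as-is.
outrows = [[row[0], tokenize_field(row[1]), tokenize_field(row[2])]
           for row in rows]

# Write back in the same pipe-separated layout.
with open('test11out.csv', 'w', encoding="utf-8") as out:
    out.write(" | ".join(headers) + "\n")
    for row in outrows:
        out.write(" | ".join(row) + "\n")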