Removing a word from the beginning of my text object?
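I have a function that scrapes speeches from millercenter.org and returns the processed speech. However, every one of my speeches has the word "transcript" at the beginning (that's just how it is coded into the HTML). So all of my text files look like this: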
\n <--- there's really just a new line, here, not literally '\n'
transcript
fourscore and seven years ago, blah blah blah
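I have these files saved on my U:/ drive. How can I iterate through them and remove 'transcript'? The files basically look like the sample above.
Edit: here is what I tried: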
speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read()
        filecontent.replace('transcript','',1)
        speech_dict[filename] = filecontent # put the speeches into a dictionary to run through the algorithm
This did not change my speeches. 'transcript' is still there.
I also tried putting it into my text-processing function, but that doesn't work either:
def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub(' ',item_str)
    item_str_processed_final = item_str_processed.replace('—',' ').replace('transcript','',1)
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return filename, item_str_processed_final # giving back filename and the text itself
Here is a sample URL that I run through processURL: http://millercenter.org/president/harding/speeches/speech-3805
You can use Python's built-in replace() string method:
data = data.replace('transcript', '', 1)
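This line replaces 'transcript' with '' (the empty string). The last argument is the number of replacements to make: 1 replaces only the first occurrence of 'transcript', and omitting it replaces all occurrences. Note that replace() returns a new string and does not modify the original, so the result must be assigned back to a variable, as in the line above.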
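To make that behavior concrete, here is a small sketch. The sample string and variable names are made up for illustration. It shows that calling replace() without assigning the result leaves the text untouched, and how the count argument changes the outcome:

```python
# Hypothetical sample text mimicking the start of a scraped speech file.
text = "\ntranscript\nfourscore and seven years ago, see the transcript"

# Calling replace() without assignment has no effect on `text`,
# because Python strings are immutable.
text.replace('transcript', '', 1)
unchanged = text

# Assigning the result keeps the change; a count of 1 removes only the first match.
first_only = text.replace('transcript', '', 1)

# Omitting the count removes every occurrence.
all_removed = text.replace('transcript', '')

print(repr(unchanged))
print(repr(first_only))
print(repr(all_removed))
```

This is most likely why the loop in the question appears to do nothing: the return value of filecontent.replace(...) is discarded.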
If you know that the data you want always starts on line x, then do this:
with open('filename.txt', 'r') as fin:
    # x is the number of leading lines to skip; define it before this block.
    for _ in range(x): # This loop will skip x no. of lines.
        next(fin)
    for line in fin:
        # do something with the line.
        print(line)
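An equivalent way to skip a fixed number of lines is itertools.islice from the standard library. The sketch below is only an illustration: the helper name lines_from and the in-memory sample (io.StringIO standing in for a real file) are assumptions, not part of the original question.

```python
import io
from itertools import islice


def lines_from(fin, x):
    """Return the lines of an open file, skipping the first x lines."""
    # islice(fin, x, None) lazily discards x lines, then yields the rest.
    return [line.rstrip('\n') for line in islice(fin, x, None)]


# io.StringIO stands in for open('filename.txt', 'r') in this sketch.
sample = io.StringIO("\ntranscript\nfourscore and seven years ago\nblah blah blah\n")
body = lines_from(sample, 2)  # skip the blank line and the 'transcript' line
print(body)
```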
Or, say you want to drop every line up to and including the 'transcript' line:
with open('filename.txt', 'r') as fin:
    # Skip lines until the 'transcript' line has been read.
    # Each line keeps its trailing newline, so strip it before comparing.
    for line in fin:
        if line.strip() == 'transcript':
            break
    # if you want to skip the empty line after *transcript*
    next(fin, None) # skips the next line (no error if the file has ended).
    for line in fin:
        # do something with the line.
        print(line)
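Putting this together for the files in the question, here is a minimal sketch, assuming each file starts with optional whitespace followed by the word 'transcript'. The helper names strip_leading_word and load_speeches are hypothetical; the glob pattern is the one from the question. Unlike a bare replace(), this only removes the word when it is the first thing in the text, so a later 'transcript' inside a speech is left alone.

```python
import glob


def strip_leading_word(text, word='transcript'):
    """Remove `word` when the text, ignoring leading whitespace, starts with it."""
    stripped = text.lstrip()
    if stripped.startswith(word):
        # Drop the word and any whitespace that follows it.
        return stripped[len(word):].lstrip()
    return text


def load_speeches(pattern):
    """Read every file matching `pattern` into a {filename: cleaned text} dict."""
    speech_dict = {}
    for filename in glob.glob(pattern):
        with open(filename, 'r') as input_file:
            speech_dict[filename] = strip_leading_word(input_file.read())
    return speech_dict


# Path taken from the question; adjust it to wherever the files live.
speech_dict = load_speeches("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt")
```

If the cleaned text should also be saved back to disk, each file can be reopened in 'w' mode and the cleaned string written out.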