Remove transcript timestamps and join the lines to make a paragraph
- File: plain text document
- Content: YouTube transcript with timestamps
I can remove the timestamps from each line separately:
for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n','')
        print(s)
I can also join the sentences without removing the timestamps:
with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))
But when I try to do both together (remove the timestamps and join the lines), in various forms, I never get the right result:
with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n',' ')
            print(" ".join(sentence.rstrip('\n')))
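I also tried various forms of print(" ".join(sentence.rstrip('\n')))
and print(" ".join(sentence.strip()))
but the result is always one of the following:
How can I remove the timestamps and join the sentences into a single paragraph in one go?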
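Whenever you call .join() with a string as its argument, it inserts the separator between every character of that string. You should also note that print(), by default, adds a newline after printing the string.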
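A quick way to see the behaviour described above is to compare joining a single string with joining a list of strings. This is only a small illustration; the variable names are made up for the example.

```python
# str.join iterates over its argument; a string is an iterable of characters.
word = "merry"
per_character = " ".join(word)                 # separator between every character
per_item = " ".join(["merry", "christmas"])    # separator between list items

print(per_character)  # m e r r y
print(per_item)       # merry christmas
```

This is why the attempts in the question printed each sentence with its letters spaced out, one sentence per line.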
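To get around this, you can save each modified sentence to a list and then output the whole paragraph at once at the end with ' '.join(). This solves the newline issue described above and also lets you do further processing on the paragraph if you need to.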
with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n', '')
            sentences.append(sentence)
    print(' '.join(sentences))
(Made a small change to the code: the old version produced a trailing space after the paragraph.)
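The same every-other-line idea can also be written without a counter. The sketch below is one possible variant, not part of the original answer: the function name every_second_line is hypothetical, and it assumes the transcript strictly alternates between a timestamp line and a text line.

```python
from itertools import islice

def every_second_line(lines):
    """Join lines 2, 4, 6, ... (1-based) into one space-separated string."""
    # islice(lines, 1, None, 2) starts at index 1 and takes every second item,
    # which mirrors the `count % 2 == 0` check with enumerate(..., start=1).
    return " ".join(line.strip() for line in islice(lines, 1, None, 2))

sample = ["00:00\n", "merry christmas\n", "00:03\n", "to you\n"]
print(every_second_line(sample))
```

Because islice accepts any iterable, an open file object can be passed in place of the sample list. If the file contains blank lines or a missing timestamp, the alternation breaks and a content-based filter is the safer choice.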
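TL;DR: a copy-paste solution that uses a list comprehension with an if filter and a regular expression to match the timestamps (re must be imported, and transcript is the iterable of lines):
' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)])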
Explained
Suppose your input text is:
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
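Then you can skip the timestamps with the regular expression \d{2}:\d{2} and append all the remaining lines to a list as phrases. Trim each phrase with strip() to remove leading and trailing whitespace. Finally, join all the phrases into one paragraph using a space as the separator: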
import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # keep only non-empty lines that do not start with a timestamp
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)

t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''
paragraph = to_paragraph(t.splitlines())
print(paragraph)

with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))
Output:
00:00
00:03
00:05
00:07
00:09
merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho
The result is the same as what youtubetranscript.com returned for the given YouTube video.
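One caveat about the \d{2}:\d{2} filter used above: re.match only anchors at the start of the line, so a spoken line that happens to begin with something like "10:30" would be dropped, while a timestamp such as "0:07" or "1:02:03" would not be recognised. The sketch below is one way to tighten this, under the assumption that a timestamp always sits alone on its line. The pattern and the names is_timestamp and to_paragraph_strict are hypothetical and may need adjusting to your transcript's actual format.

```python
import re

# Hypothetical pattern: optional hours, then minutes and two-digit seconds,
# e.g. "0:07", "00:07" or "1:02:03". Adjust it to your transcript's format.
TIMESTAMP = re.compile(r'(?:\d{1,2}:)?\d{1,2}:\d{2}')

def is_timestamp(line):
    """Return True only if the whole stripped line is a timestamp."""
    return TIMESTAMP.fullmatch(line.strip()) is not None

def to_paragraph_strict(lines):
    """Join all non-empty, non-timestamp lines with single spaces."""
    stripped = (line.strip() for line in lines)
    return " ".join(s for s in stripped if s and not is_timestamp(s))

sample = ["00:00", "10:30 is when we start", "1:02:03", "ho ho ho", ""]
print(to_paragraph_strict(sample))
```

Using fullmatch instead of match means a line is discarded only when it consists of nothing but a timestamp, so text that merely starts with a time is kept.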