Remove transcript timestamps and join the lines to make a paragraph

I can remove the timestamps on their own, line by line:

for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n', '')
        print(s)

I can also join the sentences without removing the timestamps:

with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))

But when I try to do both together (remove the timestamps and join the lines), none of the variants I have tried gives the right result:

with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n',' ')
            print(" ".join(sentence.rstrip('\n'))) 

I also tried variants such as print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but the result was never correct.

How can I remove the timestamps and join the sentences to create one paragraph in a single pass?

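Whenever you pass a string to .join(), it treats the string as a sequence of characters and inserts the separator between every character. Also keep in mind that print() appends a newline after the string by default, so printing inside the loop puts each sentence on its own line.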
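A small illustration of that difference may help. This is only a sketch; the sample strings are made up for the demonstration:

```python
sentence = "ho ho ho"

# Passing a string: join() iterates over its characters,
# so the separator lands between every single character.
spaced_out = " ".join(sentence)
print(spaced_out)  # h o   h o   h o

# Passing a list of strings: join() puts one separator between the items.
joined = " ".join(["merry christmas", sentence])
print(joined)  # merry christmas ho ho ho
```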

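To fix this, save each cleaned-up sentence in a list and call ' '.join() once at the end to output the whole paragraph. That avoids the newline problem described above and also leaves you with the paragraph pieces in a list if you need to process them further.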

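with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        # Even-numbered lines hold the text; odd-numbered lines are timestamps.
        if count % 2 == 0:
            sentence = line.replace('\n', '')
            sentences.append(sentence)
    # Join once at the end so there is a single space between sentences.
    print(' '.join(sentences))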

(Small edit to the code: the previous version produced a trailing space at the end of the paragraph.)

TL;DR: a copy-paste solution that uses a list comprehension with an if filter and a regular expression to match the timestamps: ' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)]).
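As a runnable sketch of that one-liner, assuming timestamp lines consist of exactly MM:SS: the sample list `transcript` and the name `TIMESTAMP` are made up for the illustration, and this variant uses fullmatch() instead of match() so that a spoken line that merely begins with something like "10:30" is kept. It also skips blank lines.

```python
import re

# Hypothetical sample input; an open file object would work the same way.
transcript = [
    "00:00\n",
    "merry christmas it's our christmas video\n",
    "00:03\n",
    "to you i already regret this hat but if\n",
]

# Assumption: a timestamp line is exactly two digits, a colon, two digits.
TIMESTAMP = re.compile(r"\d{2}:\d{2}")

paragraph = " ".join(
    stripped
    for stripped in (line.strip() for line in transcript)
    if stripped and not TIMESTAMP.fullmatch(stripped)
)
print(paragraph)
```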

Explained

Suppose your input text is:

00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho

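Then you can skip the timestamps by matching them with the regular expression \d{2}:\d{2} and append all remaining lines to a list as phrases. Trim each phrase with strip() to remove leading and trailing whitespace. Finally, join all the phrases into one paragraph using a space as the separator: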

import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # Keep only non-empty lines that do not start with an MM:SS timestamp.
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)

t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''

paragraph = to_paragraph(t.splitlines())
print(paragraph)

with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))

Output (the blank line and the timestamps come from the debug print):


00:00
00:03
00:05
00:07
00:09
merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho

The result matches what youtubetranscript.com returned for the given YouTube video.