Remove transcript timestamps and join the lines to make a paragraph

Question

File: plain-text document
Content: a YouTube transcript with timestamps

I can remove the timestamps on their own, printing each remaining line separately:

for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n', '')
        print(s)

I can also join the sentences without removing the timestamps:

with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))

But when I try to do both together (remove the timestamps and join the lines), in various forms, I never get the right result:

with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n',' ')
            print(" ".join(sentence.rstrip('\n')))

I also tried various forms of print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but the result was always one of the following:

How can I remove the timestamps and join the sentences to create a single paragraph in one go?

Answer 1

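Whenever you pass a single string to .join(), the separator is inserted between every character of that string. Also note that print(), by default, adds a newline after the string it prints.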
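A minimal sketch of the difference, using made-up sample values (the variable names are only for illustration): joining a string iterates over its characters, while joining a list iterates over whole items.

```python
# str.join() iterates over its argument. A string is an iterable of
# characters, so the separator ends up between every character.
word = "merry"
spaced_out = " ".join(word)

# A list of strings is iterated item by item, so the separator ends up
# between whole lines instead.
lines = ["merry christmas", "to you"]
joined = " ".join(lines)

print(spaced_out)
print(joined)
```

This is why the attempt in the question prints spaced-out letters: each call to join receives one line (a string) rather than a collection of lines.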

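To fix this, you can collect each cleaned-up sentence in a list and then print the whole paragraph once at the end with ' '.join(). That avoids the newline problem described above and also leaves you with the sentences in a list if you want to process the paragraph further.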

with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n', '')
            sentences.append(sentence)
    print(' '.join(sentences))

(Made a small edit to the code; the old version produced a trailing space after the paragraph.)
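As a possible variation on the same idea, the even-numbered lines can be selected with itertools.islice instead of a counter. This is only a sketch: the helper name every_second_line and the sample list are made up for illustration, and it assumes the file strictly alternates timestamp line, text line.

```python
from itertools import islice


def every_second_line(lines):
    """Join lines 2, 4, 6, ... into one space-separated paragraph."""
    # islice(lines, 1, None, 2) starts at index 1 (the second line)
    # and then takes every second line after that.
    return " ".join(line.strip() for line in islice(lines, 1, None, 2))


# Hypothetical sample standing in for the lines of the transcript file.
sample = [
    "00:00\n",
    "merry christmas it's our christmas video\n",
    "00:03\n",
    "to you i already regret this hat but if\n",
]
paragraph = every_second_line(sample)
print(paragraph)
```

Because an open file is itself an iterable of lines, the same helper should also accept the file object from a with open(...) block.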

Answer 2

TL;DR: a copy-paste solution that uses a list comprehension with an if as the filter and a regular expression to match the timestamps: ' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)]).
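A self-contained version of that one-liner might look like the sketch below. The transcript list is a made-up stand-in for the lines read from the file, and the extra line.strip() check (not part of the one-liner above) is an assumption added here to skip blank lines.

```python
import re

# Hypothetical stand-in for the lines of the transcript file.
transcript = [
    "00:00\n",
    "merry christmas it's our christmas video\n",
    "\n",
    "00:03\n",
    "to you i already regret this hat but if\n",
]

paragraph = " ".join(
    line.strip()
    for line in transcript
    # re.match only matches at the start of the line, so text lines that
    # merely contain a time somewhere later are kept.
    if line.strip() and not re.match(r"\d{2}:\d{2}", line)
)
print(paragraph)
```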

Explained

Suppose your input text is:

00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho

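Then you can skip the timestamps by matching them with the regular expression \d{2}:\d{2}, and append every line that passes the filter to a list as a phrase. Trim each phrase with strip() to remove leading and trailing whitespace. Finally, join all the phrases into one paragraph, using a space as the separator: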

import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # keep non-empty lines that do not start with a timestamp such as 00:03
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)

t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''

paragraph = to_paragraph(t.splitlines())
print(paragraph)

with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))

Output (the blank line and the timestamps come from the debug print):


00:00
00:03
00:05
00:07
00:09
merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho

The result is the same as what youtubetranscript.com returned for the given YouTube video.
