Iterating through multiple files and extracting data to another file

Question

OK, I have a source directory containing multiple folders. Each folder has a file named tvshow.nfo that I want to extract data from. I wrote the following:

import sys
import os
import re
from pathlib import Path

L = []
my_dir = "./source/"
for item in Path(my_dir).glob('./*/tvshow.nfo'):
    M = str(item).splitlines()
    for i in M:
        f = open(i, "r")
        for i in f:
            for j in re.findall("<title>(.+)</title>", i):
                L.append(j)
            for j in re.findall("<year>(.+)</year>", i):
                L.append(j)
            for j in re.findall("<status>(.+)</status>", i):
                L.append(j)
            for j in re.findall("<studio>(.+)</studio>", i):
                L.append(j)
        for i in L:
            print (i)
        f.close()

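I use glob to get the exact paths of all the .nfo files, then use splitlines to separate each path, iterate over the file at each path, and use regular expressions to extract the information, which I try to append to an empty list. I get the following output: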

APB
2017
Continuing
FOX (US)
APB
2017
Continuing
FOX (US)
Angie Tribeca
2016
Continuing
TBS
APB
2017
Continuing
FOX (US)
Angie Tribeca
2016
Continuing
TBS
Arrow
2012
Continuing
The CW
['APB', '2017', 'Continuing', 'FOX (US)', 'Angie Tribeca', '2016', 'Continuing', 'TBS', 'Arrow', '2012', 'Continuing', 'The CW']

I would like the output to be exported to a new file like this:

APB 2017 Continuing FOX (US)
Angie Tribeca 2016 Continuing TBS
Arrow 2012 Continuing The CW

Can anyone help me? Also, is there a better way to do this than what I have tried?

Answer 1

Based on what you've shown, you could try this.

import re
from pathlib import Path

info = []  # one finished output line per show
my_dir = "./source/"
for item in Path(my_dir).glob('./*/tvshow.nfo'):
    L = []  # start a fresh list for every file so earlier shows are not repeated
    with open(str(item), "r") as f:  # the file is closed when this block ends
        for line in f:
            for j in re.findall("<title>(.+)</title>", line):
                L.append(j)
            for j in re.findall("<year>(.+)</year>", line):
                L.append(j)
            for j in re.findall("<status>(.+)</status>", line):
                L.append(j)
            for j in re.findall("<studio>(.+)</studio>", line):
                L.append(j)
    # join this show's values into a single line, e.g. "APB 2017 Continuing FOX (US)"
    info.append(' '.join(L))

# write one show per line to the output file
with open("new_file", "w") as w:
    for entry in info:
        w.write(entry + "\n")
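As for whether there is a better way: here is a minimal sketch of the same idea split into small functions. It reads each file in one go and takes the first match for every tag, so it should not matter whether the tags share a line or sit on separate lines. The names extract_fields, collect_shows and write_shows are made up for this example, the utf-8 encoding is an assumption, and I am assuming each tag appears at most once per file.

```python
import re
from pathlib import Path

# Tags to pull out of each tvshow.nfo, in output order.
TAGS = ("title", "year", "status", "studio")


def extract_fields(text, tags=TAGS):
    """Return the first value found for each tag, skipping tags that are absent."""
    fields = []
    for tag in tags:
        # Non-greedy match so several tags on one line do not run together.
        match = re.search(r"<{0}>(.+?)</{0}>".format(tag), text)
        if match:
            fields.append(match.group(1).strip())
    return fields


def collect_shows(source_dir):
    """Build one space-separated line per tvshow.nfo one level below source_dir."""
    lines = []
    for nfo in sorted(Path(source_dir).glob("*/tvshow.nfo")):
        # Read the whole file at once; utf-8 is an assumption about the files.
        with nfo.open(encoding="utf-8") as f:
            text = f.read()
        lines.append(" ".join(extract_fields(text)))
    return lines


def write_shows(source_dir, out_path):
    """Write the collected lines to out_path, one show per line."""
    with open(str(out_path), "w", encoding="utf-8") as out:
        for line in collect_shows(source_dir):
            out.write(line + "\n")
```

With the layout from the question, write_shows("./source/", "new_file") would produce one line per show folder.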

Answer 2

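Rather than building one flat list that holds all the different attributes for every show, structure the data in a way that is easier to read. One option is a list of lists, where the top-level list has one entry per show and each inner list holds that show's title, year, status, and studio. You can easily modify your existing code to do this: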

    show_attributes = []  # one inner list per show, created once per file
    for i in f:
        for j in re.findall("<title>(.+)</title>", i):
            show_attributes.append(j)
        for j in re.findall("<year>(.+)</year>", i):
            show_attributes.append(j)
        for j in re.findall("<status>(.+)</status>", i):
            show_attributes.append(j)
        for j in re.findall("<studio>(.+)</studio>", i):
            show_attributes.append(j)
    f.close()
    L.append(show_attributes)  # add this show's attributes as a single entry

# once every file has been read, print one show per line
for show in L:
    print(' '.join(show))
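To show what the list-of-lists shape gives you, here is a small sketch that turns such a structure into the text you asked for. The data is copied from the output in the question, and format_shows is just a name I picked for the example.

```python
# Sample data in the list-of-lists shape (values taken from the question's output).
shows = [
    ["APB", "2017", "Continuing", "FOX (US)"],
    ["Angie Tribeca", "2016", "Continuing", "TBS"],
    ["Arrow", "2012", "Continuing", "The CW"],
]


def format_shows(rows, sep=" "):
    """Join each show's attributes with sep and put one show on each line."""
    return "\n".join(sep.join(row) for row in rows) + "\n"


# Print the result; the same string could be passed to a file's write() instead.
print(format_shows(shows), end="")
```

Since titles and studios can contain spaces themselves, passing sep="\t" may make the file easier to split apart again later, but that is a design choice rather than something the question asks for.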

Answer 3

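From your example, it looks like all the tags for each show are on a single line.

If all the tags for a show are on one line, I think something like this might help: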

import re
from pathlib import Path


def find_tag(tag, line):
    '''Return the result of re.findall for <tag>...</tag> on one line.'''
    full_tag = "<" + tag + ">(.+)</" + tag + ">"
    return re.findall(full_tag, line)


L = []
my_dir = "./source/"
# changed the file variable to data_file
for data_file in Path(my_dir).glob('./*/tvshow.nfo'):
    # use with to open the file without needing to close it
    with open(str(data_file), "r") as f:
        for line in f:
            title = find_tag("title", line)
            year = find_tag("year", line)
            status = find_tag("status", line)
            studio = find_tag("studio", line)
            found = [str(d[0]) for d in [title, year, status, studio] if d]
            # skip lines that contain none of the tags
            if found:
                L.append(' '.join(found))

# print the data or whatever else you're doing with it
for data in L:
    print(data)

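This uses with to open the file, so there is no need for a try/finally block or for closing the file yourself. More information on with can be found here: file methods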
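For comparison, here is a small sketch of the manual try/finally version next to the with version. It writes a throwaway temporary file so it does not depend on your ./source directory; the file content is made up.

```python
import os
import tempfile

# Create a throwaway file so the example does not need ./source to exist.
fd, path = tempfile.mkstemp(suffix=".nfo")
with os.fdopen(fd, "w") as tmp:
    tmp.write("<title>APB</title>\n")

# Manual version: close() has to be called even if reading raises.
f = open(path, "r")
try:
    manual_lines = f.readlines()
finally:
    f.close()

# with version: the file is closed automatically when the block exits.
with open(path, "r") as g:
    with_lines = g.readlines()

# Both file objects report that they are closed at this point.
print(f.closed, g.closed)
os.remove(path)
```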

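In the loop body, str(d[0]) is needed to pull the first match out of the list returned by re.findall as a string. The if d is there in case a line is missing a tag (I may have misunderstood how the tags are laid out in the file; my apologies if so).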

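You could also build L with a list comprehension instead of appending to a list: L = [(find_tag("title", line), find_tag("year", line), find_tag("status", line), find_tag("studio", line)) for line in f]

When printing the list you can then use the join method: print(' '.join(str(d[0]) for d in data if d)).

Whether you want to do it that way depends on how much you like list comprehensions.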
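Here is a self-contained sketch of the list comprehension variant. The two input lines are hypothetical and assume one show per line with each tag appearing once; the TAGS tuple is something I added to avoid repeating the four find_tag calls.

```python
import re


def find_tag(tag, line):
    """All matches of <tag>...</tag> on one line, as a list."""
    return re.findall("<" + tag + ">(.+)</" + tag + ">", line)


# Hypothetical input: one show per line, each tag present once.
lines = [
    "<title>APB</title><year>2017</year><status>Continuing</status><studio>FOX (US)</studio>",
    "<title>Arrow</title><year>2012</year><status>Continuing</status><studio>The CW</studio>",
]

TAGS = ("title", "year", "status", "studio")

# One tuple of findall results per line, built in a single expression.
L = [tuple(find_tag(tag, line) for tag in TAGS) for line in lines]

# Each element of a tuple is a list of matches, so take the first match when printing.
for data in L:
    print(" ".join(str(d[0]) for d in data if d))
```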

I also created a find_tag function, but that was mostly me trying to figure out what was going on.

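Without knowing what the files look like, it's hard to tell whether you should be looking for each tag on a separate line. It's also hard to tell whether the order matters, or whether you need any error handling.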
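If the files turn out to be well-formed XML, an XML parser would avoid the question of line layout altogether. The sketch below is only that: the question never shows a tvshow.nfo, so the sample content, the tvshow root element, and the assumption that the four tags are direct children of the root are all guesses on my part. show_line is a name made up for this example.

```python
import xml.etree.ElementTree as ET

TAGS = ("title", "year", "status", "studio")

# Hypothetical file content; the real layout of tvshow.nfo is not shown in the question.
SAMPLE = """<tvshow>
    <title>Angie Tribeca</title>
    <year>2016</year>
    <status>Continuing</status>
    <studio>TBS</studio>
</tvshow>"""


def show_line(xml_text, tags=TAGS):
    """Return 'title year status studio' for one show, skipping missing or empty tags."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the tag is not a direct child of the root.
    values = (root.findtext(tag) for tag in tags)
    return " ".join(v.strip() for v in values if v and v.strip())


print(show_line(SAMPLE))
```

If the files are not valid XML, ET.fromstring raises ParseError, in which case the regular expression approach is probably the safer choice.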

python

python-3.4