How to speed up search for a line containing needed text?

Question

I have a first file (about 1-3 KB in size) with lines like these:

Name1
Name2
Name3
Name4
etc

And a second file (1.2 GB in size) with strings like this:

<root><img>url</img><title>Name1</title>(a few more tags there)</root>

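The second file contains all the names from the first file (as well as names from other files like file1), along with additional information.

I need code that goes through each line of file1, takes the name from it, and finds the tag in file2 that contains the same name. Once the tag with the required name is found, the parent root tag and everything inside it must be copied to the output file.

I came up with this code: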

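from lxml import etree as ET

root = ET.parse('file2.xml').getroot()

with open('output.xml', 'a') as x, open('file1.xml', 'r') as f:
    for line in f:
        # Drop the trailing newline, otherwise the text comparison never matches.
        element = line.strip()
        # XPath is case-sensitive; the tag in the data is <title>, not <Title>.
        search = root.xpath('.//root/title[text()="%s"]' % element)
        for i in search:
            # Serialize the parent <root> first, then decode the resulting bytes.
            print(ET.tostring(i.getparent()).decode('utf-8'))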

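It works, but it is very slow, and I need to speed it up.

Question: how can I speed up this code, or is there another fast way to search for elements by text?

Edit

Structure of each line in the big file (pretty-printed):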

<root>
  <Big_Images>
    <Big_Images0>url to img</Big_Images0>
    <Big_Images1>url to img</Big_Images1>
  </Big_Images>
  <Small_Images>
    <Small_Images0>url to small img</Small_Images0> 
    <Small_Images1>url to small img</Small_Images1> 
  </Small_Images>
  <title>Name1</title>
  <Summary/> # can contain some info
  <Price>4.1</Price>
  <Main_Info>
    <item>many html code there</item>
  </Main_Info>
</root>

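The number of Small_Images entries (from 0, i.e. <Small_Images/>, up to 10) always equals the number of Big_Images entries (from 0, i.e. <Big_Images/>, up to 10).

I also removed all duplicate strings from the big file, so there is either no string containing Name1 or exactly one.

root always contains exactly 1 title tag.

Only Summary, Big_Images and Small_Images may have no elements.

The xml file has 1 parent tag, data, in which each line holds one root.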
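
Given that layout (one data parent, one root per record, exactly one title per root), a possible alternative to parsing the whole 1.2 GB tree up front is to stream it. The sketch below is only an illustration under those assumptions: it swaps lxml for the standard library's xml.etree.ElementTree.iterparse, the helper name matching_roots is hypothetical, and the in-memory sample stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET


def matching_roots(xml_source, names):
    """Yield serialized <root> elements whose <title> text is in names."""
    wanted = set(names)
    # iterparse reports each element once its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "root":
            continue
        if elem.findtext("title") in wanted:
            yield ET.tostring(elem, encoding="unicode")
        # Drop the record's children and text to keep memory use low.
        # The emptied <root> shell itself stays attached to <data>.
        elem.clear()


# Tiny in-memory stand-in for the big file (hypothetical sample data).
sample = io.BytesIO(
    b"<data>\n"
    b"<root><img>url</img><title>Name1</title></root>\n"
    b"<root><img>url</img><title>Other</title></root>\n"
    b"<root><img>url</img><title>Name3</title></root>\n"
    b"</data>"
)
found = list(matching_roots(sample, ["Name1", "Name3", "Name4"]))
print(len(found))
```

Looking names up in a set makes each check independent of how many names there are, so the cost is roughly one pass over the big file. For a real file, a path or an open binary file object can be passed in place of the sample.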

Answer 1

Maybe you can try a regular expression approach:

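import re

# Read the names to search for, dropping newlines and blank lines.
with open("small_file", "r") as f:
    names = [line.strip() for line in f if line.strip()]

# One alternation pattern that matches a <title> holding any of the names.
pattern = re.compile(r"<title>(?:%s)</title>" % "|".join(map(re.escape, names)))

with open("big_file", "r") as f:
    # Iterate lazily; readlines() would load the whole 1.2 GB file at once.
    for line in f:
        if pattern.search(line):
            print(line, end="")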

Answer 2

Thanks everyone for the suggestions. For my case, I wrote this working code:

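import re

with open('main_data_file.xml', 'r') as f:
    txt = f.read()

with open('names.txt', 'r') as g, open('output.txt', 'a') as x:
    for element in g:
        name = element.strip()  # drop the trailing newline
        if not name:
            continue
        # Escape the name so regex metacharacters in it are matched literally.
        line_regexp = r'^(.*<title>%s</title>.*)$' % re.escape(name)
        matches = re.search(line_regexp, txt, re.MULTILINE)
        if matches:
            # re.search returns a match object; write the matched line itself.
            x.write(matches.group(1) + "\n")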

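But it is still too slow (about 5 seconds per 1 KB).

I don't know where I went wrong. Is there a faster way to find the lines with the right names?

Edit

After a lot of testing I found code that works for me: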

data_set = set()
with open('main_data_file.xml', 'r') as f:
    data_set.update(f.readlines())
    with open("names.txt", 'r', encoding='utf-8') as g, open("output.txt", 'a') as x:
        for line in g.readlines():
            line_regexp = '<title>%s</title>' % line.strip()
            # print('Searching line:' + line_regexp)
            for element in data_set:
                if line_regexp in element:
                    x.write(element)
                    # print('Element found ' + line.strip() + "\n")

It runs fast enough.
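
The loop above still compares every name against every line, so the work grows with the number of names times the number of lines. A minimal sketch of a single-pass variant follows, assuming each line holds exactly one <title> as described in the question. The helper names title_of and select_lines and the sample data are hypothetical and only serve to illustrate the idea.

```python
def title_of(line):
    """Return the text between <title> and </title>, or None if absent."""
    _before, tag, rest = line.partition("<title>")
    if not tag:
        return None
    text, close, _after = rest.partition("</title>")
    return text if close else None


def select_lines(lines, names):
    """Keep the lines whose title is one of names, reading each line once."""
    wanted = {name.strip() for name in names if name.strip()}
    return [line for line in lines if title_of(line) in wanted]


# Hypothetical sample data standing in for names.txt and main_data_file.xml.
names = ["Name1\n", "Name3\n"]
lines = [
    "<root><title>Name1</title><Price>4.1</Price></root>\n",
    "<root><title>Name10</title><Price>2.0</Price></root>\n",
    "<root><title>Name3</title><Price>7.5</Price></root>\n",
]
selected = select_lines(lines, names)
print(len(selected))
```

Because the whole title is compared against a set, each line costs one lookup regardless of how many names there are. With real files, open file objects can be passed directly as lines and names, since both are only iterated.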

python

regex

xml

lxml
