Extracting data between tags from huge text (XML) files

Question

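Note: I'm on Windows 7, 64-bit, and I've just installed cygwin.

I need to extract a lot of data from many different large (hundreds of MB) XML files. The XML files contain a bunch of line sequences like this: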

<taggie>
lotsolines which include some string that I'm searching for.
</taggie>

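I want to extract everything from the opening tag through the closing tag of each block that contains a search string. (It's a toss-up whether to do this in Python or in cygwin.)

My plan is to write a script that preprocesses one of the XML files to find the start and end tags, and builds a table of line numbers for each start/end pair. Something like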

filename, start line (begin tag), end line (end tag)
bogusname.xml, 50025, 100003

Then I'd run another search to build a list of where my strings occur. It might look like this.

filename, searchstring, line number
bogusname.xml, "foo", 76543

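Then I'd process the second list against the first to extract the information (possibly into a second large file or a set of files; I don't care at this point).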
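
The two-list plan above can also be collapsed into a single streaming pass: buffer the lines from an opening tag to its closing tag, and keep the buffer only if it contains a search string. What follows is a minimal sketch, not a tested tool. It assumes blocks of the given tag do not nest and that no other tag name begins with the same characters; `extract_blocks` and its parameters are hypothetical names.

```python
def extract_blocks(lines, tag, needles):
    """Yield (start_line, end_line, text) for each <tag>...</tag> block
    that contains at least one of the search strings in `needles`.

    Assumes blocks do not nest. Input is consumed line by line, so only
    one block is held in memory at a time.
    """
    open_tag = "<" + tag          # also matches opening tags with attributes
    close_tag = "</" + tag + ">"
    buf, start = None, None       # buf is None while we are outside a block
    for lineno, line in enumerate(lines, 1):
        if buf is None and open_tag in line:
            buf, start = [], lineno
        if buf is not None:
            buf.append(line)
            if close_tag in line:
                text = "".join(buf)
                if any(n in text for n in needles):
                    yield start, lineno, text
                buf = None        # block finished; wait for the next opening tag
```

With a real file this might be called as `for start, end, text in extract_blocks(open("bogusname.xml"), "taggie", ["foo"])`, which yields the same start and end line numbers the two tables would have recorded.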

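Anyway, as I was doing this, it occurred to me that someone has almost certainly done this or something very similar.

So, can anyone point me to code that already does this? Python is preferred, but unix-style scripts for cygwin would be handy. I'd prefer source code over any executable where I can't see what the source is doing.

In the meantime, I'm proceeding on my own. Thanks in advance.

To be precise about the data, I'm downloading this file (for example): http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip
I unzip it, and I want to extract the XML documents that contain any of a large number of search strings. It is a single file containing thousands of concatenated XML documents. I want to extract any XML document that contains one of the search strings.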

I'm now experimenting with BeautifulSoup:

from __future__ import print_function
from bs4 import BeautifulSoup # To get everything
import urllib2

xml_handle = open("t.xml", "r")
soup = BeautifulSoup(xml_handle)

i = 0
for grant in soup('us-patent-grant'):
    i = i + 1
    print (i)
    print (grant)

print (i)

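When I do this, the final value of i is 9. If it were picking up all of the 'us-patent-grant' tags, I would expect i to be over 6000, which suggests it may not be parsing the whole file.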

Answer 1

(Previous answer)

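How about using the Python package BeautifulSoup, plus regular expressions? BeautifulSoup is the best-known tool for handling .html and .xml files.

import re
from bs4 import BeautifulSoup

# read the whole file into memory and parse it
with open("filename.xml") as f:
    xml = f.read()
soup = BeautifulSoup(xml)
find_search = re.compile("[search]+")  # placeholder pattern; replace with your own
# remaining code here....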

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for BeautifulSoup and https://docs.python.org/2/library/re.html for regular-expression syntax.

Once you have gone through those pages, you should be able to do what you want fairly easily.

======================================================================

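The file is too large, so you need some code to split it into separate files. Adapting the code from the linked question "Split diary file into multiple files using Python", you could write: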

<!-- language: python -->
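def files():
    # Yield a fresh output file (xml_1.xml, xml_2.xml, ...) on each next().
    n = 0
    while True:
        n += 1
        yield open('xml_%d.xml' % n, 'w')

pat = '<?xml'  # each concatenated document starts with an XML declaration
fs = files()
outfile = next(fs)
with open("ipg150106.xml") as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            # The line holds one or more declarations: finish the current
            # file, then start a new file at each declaration.
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile.close()  # close the finished file before opening the next
                outfile = next(fs)
                outfile.write(pat + item)
outfile.close()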

Running this code gave me output files up to xml_6527.xml.

def files():
    n = 0
    while True:
        n += 1
        yield open('xml_%d.xml' % n, 'w')

if __name__ == '__main__':
    # split the big file into separate files
    # pat = '<?xml'
    # fs = files()
    # outfile = next(fs) 

    # with open("ipg150106.xml") as infile:
    #     for line in infile:
    #         if pat not in line:
    #             outfile.write(line)
    #         else:
    #             items = line.split(pat)
    #             outfile.write(items[0])
    #             for item in items[1:]:
    #                 outfile = next(fs)
    #                 outfile.write(pat + item)

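    # analyze each split file
    import os
    from bs4 import BeautifulSoup

    pwd = os.path.dirname(os.path.realpath(__file__))
    # keep only the xml_N.xml files that the splitter wrote next to this script
    xml_files = [os.path.join(pwd, name) for name in os.listdir(pwd)
                 if name.startswith('xml_') and name.endswith('.xml')
                 and os.path.isfile(os.path.join(pwd, name))]

    for path in xml_files:
        # os.listdir() returns names, not file objects, so open each file first
        with open(path) as f:
            xml = f.read()
        soup = BeautifulSoup(xml)
        # remaining code here...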

(Sorry about the odd code blocks :( )
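
If the thousands of intermediate files are not needed, the same split can be done in memory, keeping only the documents that contain a search string. This is a rough sketch under the same assumption as the splitter above, namely that every concatenated document begins with an `<?xml` declaration. `iter_documents` and `matching_documents` are hypothetical helper names, not part of any library.

```python
def iter_documents(lines, pat="<?xml"):
    """Yield each concatenated XML document as one string."""
    buf = []
    for line in lines:
        parts = line.split(pat)
        buf.append(parts[0])              # text before any declaration on this line
        for part in parts[1:]:
            doc = "".join(buf)
            if doc.strip():               # skip the empty chunk before the first document
                yield doc
            buf = [pat + part]            # start collecting the next document
    doc = "".join(buf)
    if doc.strip():
        yield doc


def matching_documents(lines, needles):
    """Yield only the documents that contain at least one search string."""
    for doc in iter_documents(lines):
        if any(n in doc for n in needles):
            yield doc
```

Each yielded string is a single document, so it can be handed to BeautifulSoup or another parser one at a time, for example `for doc in matching_documents(open("ipg150106.xml"), ["foo"])`.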

Answer 2

I'm currently working on a similar problem in Python. I know this is a few years late, but I'll share my experience parsing similarly large files.

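I found that Python's built-in xml.etree.ElementTree module works quite well for this (the C implementation, called cElementTree, has the same API and is also built in). I tried every approach in the docs, and iterparse() with clear() was by far the fastest in the library (about 5x faster than any other implementation I wrote in Python). This approach lets you incrementally load XML into memory and clear it again, processing it as a stream (using generators). That is much better than loading the whole file into memory, which may slow your computer down.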
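
The iterparse() with clear() pattern might look roughly like this. It is a minimal sketch, assuming the input is a single well-formed XML document (a file of concatenated documents, like the one in the question, would have to be split first). `iter_matching` is a hypothetical helper name; the ElementTree calls are from the standard library.

```python
import xml.etree.ElementTree as ET


def iter_matching(source, tag, needles):
    """Stream `source` (a filename or binary file object) and yield the
    serialized text of each <tag> element that contains at least one of
    the search strings in `needles`."""
    # "end" events fire once an element and all of its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        text = ET.tostring(elem, encoding="unicode")
        if any(n in text for n in needles):
            yield text
        elem.clear()  # drop the element's children and text to free memory
```

Because this is a generator, matches are produced as the file is read, and only the element currently being built needs its full content in memory.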

References:

The accepted answer here explains basically the best approach that I could find.

This IBM site talks about lxml which is similar to the xml library but has better XPath support.

The lxml website and the cElementTree website compare the execution speeds of the xml and lxml packages.
