Parsing multiple XML files with Python, finding specific text in each file, and tabulating the output

Question

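I am trying to parse multiple XML files for a specific tag and, if a file contains that tag, extract the text associated with it.

I have been learning Python on and off for a little over a year, and this is my first attempt at working with XML.

Here is my code, where changeM is the tag of interest: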

import os
import glob
import xml.etree.ElementTree as ET
import pandas as pd

read_files = glob.glob(os.path.join(path, '*.xml'))

for file in read_files:
    
    new_tree = ET.parse(file)
    root = new_tree.getroot()
    
    changes=[]
    for elm in root.findall('.//para[@changeM="1"]'):
        changes.append(elm.text)

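The list named 'changes' comes out empty. Alternatively, if I drop the list from the code above and add a print statement instead, it does pick up one of the texts, but it prints the same matching text repeatedly.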

Answer 1

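I think I have solved the problem. In the version below, the list is created once before the loop instead of being reset for every file, the attribute is matched as changeMark, and only files whose names contain EN are parsed: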

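import re

changes = []  # created once, before the loop, so matches accumulate across files
for file in read_files:
    if re.match(r'.*EN.*', file):  # only process files whose path contains "EN"
        tree = ET.parse(file)  # gives an element tree
        root = tree.getroot()  # gives an Element, the root element
        for elm in root.findall('.//para[@changeMark="1"]'):
            changes.append(elm.text)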
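
To see the effect of where the list is created, here is a minimal sketch that does not touch the file system. The two in-memory documents and their file names are made up for illustration; only the XPath expression is taken from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents standing in for the files on disk.
docs = {
    "a_EN.xml": '<doc><para changeMark="1">first change</para><para>unchanged</para></doc>',
    "b_EN.xml": '<doc><para>nothing marked here</para></doc>',
}

# Resetting the list inside the loop keeps only the last document's matches.
for name, xml_text in docs.items():
    reset_each_time = []
    root = ET.fromstring(xml_text)
    for elm in root.findall('.//para[@changeMark="1"]'):
        reset_each_time.append(elm.text)

# Creating the list once, before the loop, accumulates across documents.
accumulated = []
for name, xml_text in docs.items():
    root = ET.fromstring(xml_text)
    for elm in root.findall('.//para[@changeMark="1"]'):
        accumulated.append(elm.text)

print(reset_each_time)  # []
print(accumulated)      # ['first change']
```

Because the last document has no marked paragraph, the list that is reset on every pass ends up empty, which may be what happened in the question.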

Answer 2

Consider using list/dict comprehensions together with a user-defined function:

import re

def parse_data(xml_file):
    doc = ET.parse(xml_file)

    # LIST COMPREHENSION
    elem_texts = [elem.text for elem in doc.findall(".//para[@changeMark='1']")]

    return elem_texts


# DICT WITH FILE NAMES FOR KEYS AND PARSED TEXT LISTS FOR VALUES
changes_dict = {f:parse_data(f) for f in read_files if re.match(r'.*EN.*', f)}

# FLAT LIST WITH NO FILE INDICATOR
changes_list = [item for f,lst in changes_dict.items() for item in lst]
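
For the tabulation step, one possible sketch is to turn the dictionary into one row per (file, text) pair and write it out as CSV. The sample dictionary below is a made-up stand-in for changes_dict, and the column names "file" and "text" are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical stand-in for changes_dict: file name -> list of matched texts.
changes_dict = {
    "a_EN.xml": ["first change", "second change"],
    "b_EN.xml": [],
    "c_EN.xml": ["third change"],
}

# One row per (file, text) pair; files with no matches contribute no rows.
rows = [(f, text) for f, texts in changes_dict.items() for text in texts]

# Write the rows as CSV to an in-memory buffer (swap in an open file to save to disk).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["file", "text"])
writer.writerows(rows)
table_csv = buffer.getvalue()
print(table_csv)
```

Since the question already imports pandas, the same rows could instead be passed to pd.DataFrame(rows, columns=["file", "text"]) to get a table that is easier to filter and export.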

python

xml

elementtree