How to get text inside specific tag using tag name with Python

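I'm trying to open an XML file and parse it, looking through its tags and finding the text inside each specific tag. If the text inside a tag matches a string, I want to remove part of the string or substitute it with something else.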

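My problem is that I'm not sure whether start = x.find('start_char').text actually gets the text inside the start_char tag and saves it to the start variable. (Does x.find('tag_name').text really get the text inside the tag?)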

The XML file contains the following data:

<?xml version="1.0" encoding="utf-8"?>
<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex >
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>DFW.+\.$</regex >
        <start_char>3</start_char>
        <end_char>-1</end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex >
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex>\s</regex >
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex> T&amp;R$</regex >
        <start_char></start_char>
        <end_char>-4</end_char>
        <action>remove</action>
    </filter>
</metadata>

The Python code I'm using is:

from xml.etree.ElementTree import ElementTree    

# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")

# Get the data in the XML file 
root = tree.getroot()

# Loop through filters
for x in root.findall('filter'):

    # Find the text inside the regex tag
    regex = x.find('regex').text

    # Find the text inside the start_char tag
    start = x.find('start_char').text

    # Find the text inside the end_char tag
    end = x.find('end_char').text

    # Find the text inside the replacement tag
    #replace = x.find('replacement')

    # Find the text inside the action tag
    action = x.find('action').text

    if action == 'remove':
        if re.match(r'regex', mfn_pn, re.IGNORECASE):
            mfn_pn = mfn_pn[start:end]

    elif action == 'substitute':
        mfn_pn = re.sub(r'regex', '', mfn_pn)

    return mfn_pn

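The code start = x.find('start_char').text works when the filter element has a start_char child element. Otherwise find() returns None, and accessing .text on it raises AttributeError: 'NoneType' object has no attribute 'text'. In your sample document the two substitute filters have no start_char or end_char child, so the loop fails on the third filter.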

This can be avoided, for example, with the following approach:

# find the element (None if the filter has no start_char child)
start_el = x.find('start_char')
# if the element exists, assign its text to the variable; otherwise use None (or another default value)
start = start_el.text if start_el is not None else None

The same applies to the end variable.
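As a possible shortcut, Element.findtext() performs the lookup and the None check in one call. The small sketch below uses a made-up filter element (not your file) to show the one difference to watch for: a missing element gives None, while an element that exists but is empty gives an empty string rather than None.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter element, used only for illustration
flt = ET.fromstring(
    "<filter><start_char>3</start_char><end_char></end_char></filter>"
)

start = flt.findtext("start_char")          # text of an existing element
end = flt.findtext("end_char")              # element exists but has no text
replacement = flt.findtext("replacement")   # no such child element

print(repr(start), repr(end), repr(replacement))  # '3' '' None
```

If you prefer a different fallback for missing elements, findtext() also accepts a default argument, for example flt.findtext("replacement", default="").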

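With the explicit None check, the following start and end values are retrieved for your sample document (one line per filter). Note that an empty element such as <end_char></end_char> also yields None, because its text is None: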

3 None
3 -1
None None
None None
None -4
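Beyond the lookup itself, the loop in the question has a few other things that would stop it from running: re is never imported, r'regex' is the literal string "regex" rather than the regex variable, start and end are strings (slicing needs integers or None), and return is used outside a function. The following is only a sketch of how the pieces might fit together. The names apply_filters, to_index and SAMPLE_XML are hypothetical, and the use of re.search instead of re.match is my assumption, made so that patterns ending in $ can match at the end of the string.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of filters.xml, inlined so the sketch runs on its own
SAMPLE_XML = r"""<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex>
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex> T&amp;R$</regex>
        <start_char></start_char>
        <end_char>-4</end_char>
        <action>remove</action>
    </filter>
</metadata>"""


def to_index(element):
    """Return the element's text as an int, or None if it is missing or empty."""
    if element is None or not (element.text or "").strip():
        return None
    return int(element.text)


def apply_filters(part_number, root):
    """Apply every <filter> under root to part_number, in document order."""
    for flt in root.findall("filter"):
        pattern = flt.findtext("regex")
        action = flt.findtext("action")
        if not pattern:
            continue  # nothing to match against

        if action == "remove":
            # Assumption: search anywhere in the string instead of re.match
            if re.search(pattern, part_number, re.IGNORECASE):
                start = to_index(flt.find("start_char"))
                end = to_index(flt.find("end_char"))
                part_number = part_number[start:end]  # None means "open end"
        elif action == "substitute":
            replacement = flt.findtext("replacement") or ""
            part_number = re.sub(pattern, replacement, part_number)
    return part_number


root = ET.fromstring(SAMPLE_XML)
print(apply_filters("ATL-123-45", root))  # 12345
print(apply_filters("XYZ-9 T&R", root))   # XYZ9
```

Converting to int (or None) before slicing is what makes part_number[start:end] valid. To read from the real file instead of the inlined string, ET.parse("filters.xml").getroot() would give the same kind of root element.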