Regex to get a combination of constant and pattern

Question

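I am working on a regular expression that will help me replace a pattern in a string.

The string in my flow is very long. After applying the regex (find the pattern, then replace it with a constant value), I have to forward the string to my ETL flow.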

To find:
<customer attribute="any number">
for example <customer attribute="1">
and replace with:
<customer>
(basically just keep "customer" and delete everything else inside the tag)

I am new to regular expressions and still learning.

Any help is appreciated!

Answer 1

Input:

<consumer attribute="1"><birth-date>1990-07-23</birth-date> </consumer>

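use strict;
use warnings;

my $element_name = "consumer";

my $str = '<consumer attribute="1"><birth-date>1990-07-23</birth-date> </consumer>';

# Capture the element name in $1 and write it back without any attributes.
$str =~ s/<($element_name)[^>]*attribute="[^"]*"[^>]*>/<$1>/g;

print $str;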

output:

<consumer><birth-date>1990-07-23</birth-date> </consumer>
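For the exact pattern in the question (<customer attribute="any number">), a narrower variant might look like the sketch below. It assumes the value is always digits in double quotes and that the tag carries no other attributes; the helper name strip_customer_attribute and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turns <customer attribute="digits"> into <customer>.
# Assumes double-quoted, digits-only values and no other attributes on the tag.
sub strip_customer_attribute {
    my ($text) = @_;
    $text =~ s/<customer\s+attribute="\d+"\s*>/<customer>/g;
    return $text;
}

# Made-up sample input standing in for the string coming from the flow.
my $in  = '<customer attribute="1"><name>A</name></customer>'
        . '<customer attribute="42"><name>B</name></customer>';
my $out = strip_customer_attribute($in);
print "$out\n";
```

Because the pattern is anchored on the literal name customer followed by whitespace, tags such as <customers ...> are left alone, and tags with a non-numeric value are left unchanged rather than being stripped.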

Answer 2

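Please, please, please. Don't use regular expressions to parse XML.

It's bad news. It's brittle and unreliable and, most importantly, completely unnecessary.

Regular expressions don't handle context. XML is all about context.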
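To make the brittleness concrete, here is a small sketch that runs an attribute-stripping pattern over three spellings of the same start tag. The sample strings are made up for illustration. All three mean the same thing to an XML parser, but the pattern only recognizes the first; the other two pass through unchanged.

```perl
use strict;
use warnings;

# Pattern that expects a double-quoted value with no space around '='.
my $re = qr/<(consumer)[^>]*attribute="[^"]*"[^>]*>/;

my @variants = (
    '<consumer attribute="1">',      # the form the pattern expects
    "<consumer attribute='1'>",      # single quotes: legal XML
    '<consumer attribute = "1">',    # whitespace around '=': legal XML
);

for my $xml (@variants) {
    # Copy the input, then try the substitution on the copy.
    (my $result = $xml) =~ s/$re/<$1>/;
    printf "%-28s => %s\n", $xml, $result;
}
```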

XML already has a query language called XPath, which is a much better fit.

Here is an example of using XPath to find the node:

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> new -> parsefile ('yourfile.xml'); 

print $twig -> get_xpath('//consumer', 0) -> att('attribute'),"\n";

But if you want to transform it and delete the attribute:

$_ -> del_att('attribute') for $twig -> get_xpath('//consumer[@attribute]');
$twig -> set_pretty_print('indented_a');
$twig -> print;
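Since the question describes a string arriving in a flow rather than a file on disk, the same deletion can be sketched against a string, using XML::Twig's parse to read it and sprint to serialize it back. The sample XML, the wrapping root element, and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample input; in the ETL flow this would be the incoming string.
my $xml = '<root><consumer attribute="1"><birth-date>1990-07-23</birth-date></consumer></root>';

my $twig = XML::Twig->new;
$twig->parse($xml);    # parse from a string instead of a file

# Remove the attribute from every consumer element that has it.
$_->del_att('attribute') for $twig->get_xpath('//consumer[@attribute]');

# Serialize the modified document back into a string for the next step.
my $cleaned = $twig->sprint;
print "$cleaned\n";
```

Note that parse expects a well-formed document with a single root element, so a fragment from the flow may need wrapping before it is handed to the parser.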

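Although I would ask: why are you doing this? It sounds like there is another broken process somewhere, perhaps another script trying to regex XML?

Another thing XML::Twig does really well is twig_handlers, which let you process an XML stream more neatly (for example, without having to parse it all into memory).

Something like this: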

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

sub delete_unwanted {
    my ( $twig, $element ) = @_; 
    $element -> del_att('attribute'); 
    #dump progress so far 'out'. 
    $twig -> flush; 
    #free memory already processed. 
    $twig -> purge; 
}

my $twig = XML::Twig -> new ( twig_handlers => { '//consumer[@attribute]' => \&delete_unwanted } );
$twig -> parsefile ( 'your_xml.xml');

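We set up a handler so that every time the parser encounters a consumer element that has an attribute named attribute (a bad name), it deletes that attribute, flushes (prints) the XML parsed so far, and purges it from memory. This makes it very memory efficient, because you are not reading the whole document into memory, and you can still do a lot of inline, regex-style operations.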
