String split with spaces using regex

Question

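I am trying to split a string using a regular expression. I need to use the regex in NiFi to split the string into groups. Can anyone help me with how to split the string below using a regex?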

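Alternatively, how can we give a specific occurrence number of the delimiter at which to split the string? For example, in the string below, how can I specify that I want a string after the 3rd occurrence of a space?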

Suppose I have this string:

"6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"

I want a result like this:

group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)

Can anyone help me? Thanks in advance.

Answer 1

If you really just want to use certain spaces as delimiters, you could do something like this to avoid a fixed-width nightmare:

regex = "(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"

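It is pretty much what it looks like: runs of non-whitespace (\S+) and single whitespace characters (\s), with each desired field wrapped in parens. The .* at the end is just the rest of the line and can be adjusted as needed. If you wanted every group to be a single non-space run, you could do a split instead of a regex, but it looks like that is not what is wanted. I don't have access to NiFi to test, but here is an example in Python.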

import re

text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
# Raw string so the backslashes reach the regex engine unchanged.
regex = r"(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"

match = re.search(regex, text)
# Groups are numbered from 1; group 0 would be the whole match.
for i in range(1, 9):
    print("group " + str(i) + " - " + match.group(i))

Output:

group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
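
As for the second part of the question (getting the string after the 3rd occurrence of a space), here is a minimal sketch in plain Python, not NiFi. The helper name split_after_nth is hypothetical, and the sample text is shortened for readability. It relies on the maxsplit argument of str.split, which stops splitting after the given number of separators.

```python
def split_after_nth(text, sep, n):
    """Split text at the n-th occurrence of sep and return (head, tail)."""
    # maxsplit=n yields at most n + 1 parts; the last part is left unsplit.
    parts = text.split(sep, n)
    if len(parts) <= n:
        # Fewer than n separators were found, so there is nothing to split off.
        return text, ""
    return sep.join(parts[:n]), parts[n]


# Shortened version of the sample line from the question.
text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd"
head, tail = split_after_nth(text, " ", 3)
print(head)
print(tail)
```

Here head holds everything before the 3rd space and tail holds everything after it, so the same helper could be applied repeatedly to peel off fields one at a time.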

Answer 2

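Are you trying to extract each group into a separate attribute? This is certainly possible in "pure" NiFi, but with lines this long, it might make more sense to use the ExecuteScript processor to leverage Groovy's or Python's more sophisticated regex handling in conjunction with String#split() and provide a script like the one sniperd posted.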
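
If you go the scripting route, the parsing logic itself might look something like the sketch below. This is plain Python with no NiFi API calls, so wiring it into ExecuteScript (reading the flowfile content and writing the attributes) is left out. The function name extract_fields and the group names (timestamp, thread, and so on) are my own illustrative labels, not names taken from NiFi or from any log specification.

```python
import re

# Same shape as sniperd's pattern, but with named groups so the result
# can be used directly as a name -> value mapping.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+\s\S+\s\S+)\s"
    r"(?P<thread>\S+)\s"
    r"(?P<packet>\S+\s\S+)\s"
    r"(?P<protocol>\S+)\s"
    r"(?P<direction>\S+)\s"
    r"(?P<address>\S+)\s"
    r"(?P<xid>\S+)\s"
    r"(?P<rest>.*)"
)


def extract_fields(line):
    """Return a dict of named fields, or an empty dict if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else {}


sample = ("6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd "
          "11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR "
          "(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)")
fields = extract_fields(sample)
for name, value in fields.items():
    print(name + ": " + value)
```

Returning an empty dict for non-matching lines is just one possible design choice; a real script would probably route such flowfiles to a failure relationship instead.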

To perform this task with ExtractText, you would configure it as follows:

Copyable patterns:

group 1: (^\S+\s\S+\s\S+)
group 2: (?i)(?<=\s)([a-f0-9]{4})(?=\s)
group 3: (?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)
group 4: (?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)
group 5: (?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)
group 6: (?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)
group 7: (?i)(?<=\d\s)([a-f0-9]{4})(?=\s)
group 8: (?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$
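
If you want to sanity-check these patterns outside NiFi, a rough offline sketch in Python follows. Keep in mind the assumption: NiFi evaluates the patterns with Java's regex engine, while this uses Python's re module, so it is only an approximation (these particular patterns use fixed-width lookbehinds, which both engines accept). Each pattern is searched independently against the sample line, and the first capture group is kept.

```python
import re

line = ("6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd "
        "11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR "
        "(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)")

# Property name -> pattern, copied from the list above.
patterns = {
    "group 1": r"(^\S+\s\S+\s\S+)",
    "group 2": r"(?i)(?<=\s)([a-f0-9]{4})(?=\s)",
    "group 3": r"(?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)",
    "group 4": r"(?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)",
    "group 5": r"(?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)",
    "group 6": r"(?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)",
    "group 7": r"(?i)(?<=\d\s)([a-f0-9]{4})(?=\s)",
    "group 8": r"(?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$",
}

attributes = {}
for name, pattern in patterns.items():
    # re.search returns the leftmost match; keep its first capture group.
    found = re.search(pattern, line)
    attributes[name] = found.group(1) if found else None

for name, value in attributes.items():
    print(name + ": " + str(value))
```

Because each pattern relies on lookbehind context rather than on position in the line, a differently shaped input line could match in unexpected places, so it is worth running a few real samples through a check like this.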

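It is important to note that Include Capture Group 0 is set to false. You will still get duplicate groups (group 1 and group 1.1) because of the way regular expressions are validated in NiFi: currently every regular expression must have at least one capture group. This will be fixed by NIFI-4095 (ExtractText should not require a capture group in every regular expression).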

The resulting flowfile has the attributes properly populated.

Full log output:

2017-06-20 14:45:57,050 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=c6b04310-015c-1000-b21e-c64aec5b035e] logging for flow file StandardFlowFileRecord[uuid=5209cc65-08fe-44a4-be96-9f9f58ed2490,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1497984255809-1, container=default, section=1], offset=444, length=148],offset=0,name=1920315756631364,size=148]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
    Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'lineageStartDate'
    Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'fileSize'
    Value: '148'
FlowFile Attribute Map Content
Key: 'filename'
    Value: '1920315756631364'
Key: 'group 1'
    Value: '6/19/2017 12:14:07 PM'
Key: 'group 1.1'
    Value: '6/19/2017 12:14:07 PM'
Key: 'group 2'
    Value: '0FA0'
Key: 'group 2.1'
    Value: '0FA0'
Key: 'group 3'
    Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 3.1'
    Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 4'
    Value: 'UDP'
Key: 'group 4.1'
    Value: 'UDP'
Key: 'group 5'
    Value: 'Snd'
Key: 'group 5.1'
    Value: 'Snd'
Key: 'group 6'
    Value: '11.222.333.44'
Key: 'group 6.1'
    Value: '11.222.333.44'
Key: 'group 7'
    Value: '93c8'
Key: 'group 7.1'
    Value: '93c8'
Key: 'group 8'
    Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'group 8.1'
    Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'path'
    Value: './'
Key: 'uuid'
    Value: '5209cc65-08fe-44a4-be96-9f9f58ed2490'
--------------------------------------------------
6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)

Another option, as of the NiFi 1.3.0 release, is to use the record processing capabilities. This is a new feature which allows arbitrary input formats (Avro, JSON, CSV, etc.) to be parsed and manipulated in a streaming manner. Mark Payne has written a very good tutorial that introduces the feature and provides some simple walkthroughs.
