Ruby: How to split a string on a regex while keeping the delimiters?
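This has been asked multiple times around here, but it never got a generic answer, so here we go:
Say you have a string, any string, but let's go with "oruh43451rohcs56oweuex59869rsr", and you want to split it with a regular expression. Any regular expression, but let's go with a sequence of digits: /\d+/. Then you'd use split: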
"oruh43451rohcs56oweuex59869rsr".split(/\d+/)
# => ["oruh", "rohcs", "oweuex", "rsr"]
That's lovely, but I want the digits. For that, we have scan:
"oruh43451rohcs56oweuex59869rsr".scan(/\d+/)
# => ["43451", "56", "59869"]
But I want it all! Is there, say, a split_and_scan? Nope.
How about I split and scan and then zip them? Let me stop you right there.
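The post does not spell out why split + scan + zip is a dead end, so here is one plausible reason, as a small sketch. The helper name split_scan_zip is made up for this illustration. It works on the sample string, but because split keeps a leading empty field and drops trailing ones, the two arrays do not always line up:

```ruby
# Hypothetical helper, only for illustration: interleave split and scan results.
def split_scan_zip(str, rx)
  str.split(rx).zip(str.scan(rx)).flatten.compact
end

split_scan_zip("oruh43451rohcs56oweuex59869rsr", /\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]

# A delimiter at the start leaves a spurious empty string in front.
split_scan_zip("1w", /\d+/) # => ["", "1", "w"]

# A string that is nothing but a delimiter: split returns [], so the
# zip has nothing to pair with and the delimiter is lost.
split_scan_zip("1", /\d+/)  # => []
```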
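OK, so how then?
Glad you asked... well, there's String#shatter from Facets. I don't love it, because it's implemented using a trick (look at the source, it's a cute, clever trick, but what if your string actually contains a ""?).
So I rolled my own. Here's what you get: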
"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
And here's the implementation:
class Object
  def unfold(&f)
    (m, n = f[self]).nil? ? [] : n.unfold(&f).unshift(m)
  end
end
class String
  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      case
      when ix.nil?; [s , ""]
      when ix == 0; [$&, $']
      when ix > 0;  [$`, $& + $']
      end
    end
  end
end
(verbose version at the bottom)
And here are some examples of corner cases being handled:
"".unjoin(/\d+/) # => []
"w".unjoin(/\d+/) # => ["w"]
"1".unjoin(/\d+/) # => ["1"]
"w1".unjoin(/\d+/) # => ["w", "1"]
"1w".unjoin(/\d+/) # => ["1", "w"]
"1w1".unjoin(/\d+/) # => ["1", "w", "1"]
"w1w".unjoin(/\d+/) # => ["w", "1", "w"]
And that's it, but there's more...
Or, if you don't like patching the built-in classes... well, you could use Refinements... but if you really don't like that either, here it is as functions:
def unfold(x, &f)
  (m, n = f[x]).nil? ? [] : unfold(n, &f).unshift(m)
end
def unjoin(s, rx)
  unfold(s) do |s|
    next if s.empty?
    ix = s =~ rx
    case
    when ix.nil?; [s , ""]
    when ix == 0; [$&, $']
    when ix > 0;  [$`, $& + $']
    end
  end
end
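Since Refinements are mentioned but not shown, here is a rough sketch of what that route might look like. The module name Unjoin is made up, and the body is an iterative rewrite using MatchData rather than the unfold-based code above, so take it as one possible variant, not the original implementation. The guard against empty matches is an addition of mine.

```ruby
# Hypothetical module name; the refinement is only visible where `using` is called.
module Unjoin
  refine String do
    def unjoin(rx)
      parts = []
      rest = self
      until rest.empty?
        match = rest.match(rx)
        if match.nil?
          parts << rest # no more delimiters: the remainder is the last piece
          break
        end
        # Added guard (assumption): an empty match would never consume input.
        raise ArgumentError, "pattern must not match the empty string" if match.to_s.empty?
        parts << match.pre_match unless match.pre_match.empty?
        parts << match.to_s
        rest = match.post_match
      end
      parts
    end
  end
end

using Unjoin

"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
```

Files that do not call `using Unjoin` never see String#unjoin, which is the whole point of going this way instead of reopening String.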
It also occurred to me that it may not always be clear which pieces are the separators and which are the separated bits, so here's a little addition that lets you query a string with #joint? to find out what role it played before the split:
class String
  def joint?
    false
  end
  class Joint < String
    def joint?
      true
    end
  end
  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      case
      when ix.nil?; [s, ""]
      when ix == 0; [Joint.new($&), $']
      when ix > 0;  [$`, $& + $']
      end
    end
  end
end
And here it is in use:
"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)\
  .map { |s| s.joint? ? "(#{s})" : s }.join(" ")
# => "oruh (43451) rohcs (56) oweuex (59869) rsr"
And now you can easily reimplement split and scan:
class String
  def split2(rx)
    unjoin(rx).reject(&:joint?)
  end
  def scan2(rx)
    unjoin(rx).select(&:joint?)
  end
end
"oruh43451rohcs56oweuex59869rsr".split2(/\d+/)
# => ["oruh", "rohcs", "oweuex", "rsr"]
"oruh43451rohcs56oweuex59869rsr".scan2(/\d+/)
# => ["43451", "56", "59869"]
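One caveat, as far as I can tell: split2 is not a drop-in replacement for the built-in split at the edges, because unjoin never produces empty pieces while split keeps a leading empty field. The sketch below repeats the Joint-aware definitions so that it runs on its own (with unfold written out in its longer form); only the comparison at the end is new.

```ruby
class Object
  # Collect mapped values until the block returns nil.
  def unfold(&step)
    result = step.call(self)
    return [] if result.nil?
    value, rest = result
    [value] + rest.unfold(&step)
  end
end

class String
  def joint?
    false
  end

  class Joint < String
    def joint?
      true
    end
  end

  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      if ix.nil?
        [s, ""]
      elsif ix.zero?
        [Joint.new($&), $']
      else
        [$`, $& + $']
      end
    end
  end

  def split2(rx)
    unjoin(rx).reject(&:joint?)
  end

  def scan2(rx)
    unjoin(rx).select(&:joint?)
  end
end

"1w".split(/\d+/)  # => ["", "w"]  (built-in keeps the leading empty field)
"1w".split2(/\d+/) # => ["w"]
"1w".scan2(/\d+/)  # => ["1"]
```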
And if you hate match globals and general brevity...
class Object
  def unfold(&map_and_next)
    result = map_and_next.call(self)
    return [] if result.nil?
    mapped_value, next_value = result
    [mapped_value] + next_value.unfold(&map_and_next)
  end
end
class String
  def unjoin(regex)
    unfold do |tail_string|
      next if tail_string.empty?
      match = tail_string.match(regex)
      # match is nil when the regex is not found, so guard before calling begin
      index = match && match.begin(0)
      case
      when index.nil?; [tail_string, ""]
      when index == 0; [match.to_s, match.post_match]
      when index > 0;  [match.pre_match, match.to_s + match.post_match]
      end
    end
  end
end
If split's pattern contains a capture group, the captured text is included in the resulting array as well.
str = "oruh43451rohcs56oweuex59869rsr"
str.split(/(\d+)/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
If you want it zipped,
str.split(/(\d+)/).each_slice(2).to_a
# => [["oruh", "43451"], ["rohcs", "56"], ["oweuex", "59869"], ["rsr"]]
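A small sketch of how the capture-group approach behaves at the edges, since the sample string starts and ends with non-digits. The helper name split_keeping is made up for this example. When the string starts with a delimiter, split returns a leading empty string, which can be dropped with reject if you want output in the same shape as unjoin:

```ruby
# Hypothetical helper: expects a regex whose delimiter is wrapped in a capture group.
def split_keeping(str, grouped_rx)
  str.split(grouped_rx).reject(&:empty?)
end

"1w1".split(/(\d+)/)          # => ["", "1", "w", "1"]
split_keeping("1w1", /(\d+)/) # => ["1", "w", "1"]
split_keeping("1w", /(\d+)/)  # => ["1", "w"]
split_keeping("1", /(\d+)/)   # => ["1"]
split_keeping("", /(\d+)/)    # => []
```

Note that the each_slice(2) pairing assumes the array starts with a non-delimiter piece, so it is best applied to the raw split result rather than to the filtered one.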