Regex to split up message_txt over 160 characters

Question

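I am trying to split the message text for a messaging system into sequences of at most 160 characters, each ending in a space, unless it is the very last sequence, which can end in anything as long as it is 160 characters or fewer.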

The regex '.{1,160}\s' almost works, but it cuts off the last word of the message, because the last character of a message is usually not a space.

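I also tried '.{1,160}\s|.{1,160}', but that doesn't work either, because the final sequence is then just the text remaining after the last space. Does anyone have an idea how to do this?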

Example:

import re

two_cities = ("It was the best of times, it was the worst of times, it was " +
         "the age of wisdom, it was the age of foolishness, it was the " +
         "epoch of belief, it was the epoch of incredulity, it was the " +
         "season of Light, it was the season of Darkness, it was the " +
         "spring of hope, it was the winter of despair, we had " +
         "everything before us, we had nothing before us, we were all " +
         "going direct to Heaven, we were all going direct the other " +
         "way-- in short, the period was so far like the present period," +
         " that some of its noisiest authorities insisted on its being " +
         "received, for good or for evil, in the superlative degree of " +
         "comparison only.")


chunks = re.findall(r'.{1,160}\s|.{1,160}', two_cities)
print(chunks)

returns

['It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of ', 'incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we ', 'had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, ', 'that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison ', 'only.']

The last element of the list should be

'that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.'

not 'only.'

Answer 1

Try this - .{1,160}(?:(?<=[ ])|$)

 .{1,160}                      # 1 - 160 chars
 (?:
      (?<= [ ] )                    # Lookbehind, must end with a space
   |  $                             # or, be at End of String
 )

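Info -

By default, the engine tries to match 160 characters (greedy).
Then it checks the next part of the expression.

The lookbehind forces the last character matched by .{1,160} to be a space.
Or, if the match is at the end of the string, no space is required.

If the lookbehind fails and the match is not at the end of the string, the engine backtracks to 159 characters and checks again. This repeats until one of the assertions passes.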
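
For reference, here is a minimal sketch of how this pattern could be used from Python with the sample text from the question. The names pattern and chunks are just illustrative. Two things to keep in mind: the group is non-capturing, so re.findall returns the whole match, and '.' does not match newlines unless re.DOTALL is passed, so this assumes a single-line message.

```python
import re

two_cities = ("It was the best of times, it was the worst of times, it was " +
              "the age of wisdom, it was the age of foolishness, it was the " +
              "epoch of belief, it was the epoch of incredulity, it was the " +
              "season of Light, it was the season of Darkness, it was the " +
              "spring of hope, it was the winter of despair, we had " +
              "everything before us, we had nothing before us, we were all " +
              "going direct to Heaven, we were all going direct the other " +
              "way-- in short, the period was so far like the present period," +
              " that some of its noisiest authorities insisted on its being " +
              "received, for good or for evil, in the superlative degree of " +
              "comparison only.")

# 1 to 160 characters, ending either right after a space or at the end of the string.
pattern = re.compile(r'.{1,160}(?:(?<=[ ])|$)')
chunks = pattern.findall(two_cities)

# Print each chunk with its length to check the 160-character limit.
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Note that the trailing space counts toward the 160 characters here, whereas '.{1,160}\s' allows 160 characters plus the space, so the chunk boundaries can differ slightly from the output shown in the question.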

Answer 2

You should avoid using regular expressions here, since they can be inefficient.

I would recommend something like this:

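chunks = []
current = ""

for word in two_cities.split(" "):
    # Start a new chunk if this word plus its trailing space would not fit.
    if current and len(current) + len(word) + 1 > 160:
        chunks.append(current)
        current = ""
    current += word + " "

if current:
    # Drop the space that was appended after the final word.
    chunks.append(current[:-1])

print(chunks)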

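This splits the text into a list of words, separated by spaces.

If a word fits into the current string, it is added to it. If it does not, the current string is added to the list and a new string is started. At the end, you have a list of results.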
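
Whether a regex really is slower here is probably better measured than assumed. The following is a rough sketch of such a measurement with timeit. The sample text, the function names split_regex and split_words, and the repeat count are all made up for illustration, and the numbers will vary by machine and Python version.

```python
import re
import timeit

LIMIT = 160  # maximum chunk length, taken from the question

# Hypothetical sample text: a long single-line message that does not end in a space.
text = ("it was the best of times, it was the worst of times, " * 200).strip()

PATTERN = re.compile(r'.{1,%d}(?:(?<=[ ])|$)' % LIMIT)


def split_regex(message):
    """Split using the lookbehind pattern from the other answer."""
    return PATTERN.findall(message)


def split_words(message, limit=LIMIT):
    """Split by accumulating words until the next one would not fit."""
    chunks, current = [], ""
    for word in message.split(" "):
        # Start a new chunk if this word plus its trailing space would not fit.
        if current and len(current) + len(word) + 1 > limit:
            chunks.append(current)
            current = ""
        current += word + " "
    if current:
        # Drop the space that was appended after the final word.
        chunks.append(current[:-1])
    return chunks


# Time 200 runs of each approach on the same input.
regex_time = timeit.timeit(lambda: split_regex(text), number=200)
words_time = timeit.timeit(lambda: split_words(text), number=200)

print("regex: %.4f s" % regex_time)
print("words: %.4f s" % words_time)
```

For a one-off message of a few hundred characters, either approach is likely fast enough that readability matters more than speed.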

python

regex