Strings analysis: splitting strings into n parts by percentage of words

Question

I need to calculate the length of each string contained in a list:

list_strings=["I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best","So many books, so little time.","In three words I can sum up everything I've learned about life: it goes on.","if you tell the truth, you don't have to remember anything.","Always forgive your enemies; nothing annoys them so much."]

and split each of them into three parts:

30% (first part)
30% (second part)
40% (third part)

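I can compute the length of each string in the list, but I don't know how to split each string into three parts and save them. For example, the first sentence "I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best" has a length of 201 (tokenized), so I need:

30% of 201, saving those words into an array (roughly the first 60 words);
the next 30% (i.e. the following 60 words);
the last 40%, i.e. the last 80 words.

I have read about using chunks, but I don't know how to apply them. I also need a condition that ensures I only take whole words (a word cannot be counted as half an element) and that I never go beyond the length.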
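The whole-number requirement can be sketched on its own, before any text is involved. The helper below is only an illustration (`part_sizes` is a hypothetical name, not something from the thread): it truncates each cumulative cut point to an integer and gives whatever is left to the last part, so the sizes always add up to the total and never exceed it.

```python
def part_sizes(total, fractions=(0.3, 0.3, 0.4)):
    """Return whole-number part sizes that add up to `total`."""
    sizes = []
    start = 0
    cumulative = 0.0
    for fraction in fractions[:-1]:
        cumulative += fraction
        # Truncate the cut point to an integer and never go past the end
        end = min(total, int(cumulative * total))
        sizes.append(end - start)
        start = end
    # The last part takes everything that remains
    sizes.append(total - start)
    return sizes


print(part_sizes(201))  # [60, 60, 81]
```

For a length of 201 this gives 60, 60 and 81 rather than 60, 60 and 80, because the leftover element has to go somewhere.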

Answer 1

Split the text by percentage, moving each cut to a non-letter boundary

def split_text(s):
  """Partition text into three parts
     in proportion 30%, 40%, 30%."""

  # Note: the question asks for 30/30/40; use 0.6 instead of 0.7 for that split
  i1 = int(0.3*len(s))  # first part runs from 0 to i1
  i2 = int(0.7*len(s))  # second part from i1 to i2, third from i2 onward

  # isalpha() is False on spaces and punctuation (. ; , ? " ' etc.),
  # so back up while the cut index sits on a letter to avoid
  # cutting inside a word. The bounds check comes first so that an
  # empty string does not raise IndexError.
  while i1 > 0 and s[i1].isalpha():
    i1 -= 1

  # Same for the second cut, never backing up past the first one
  while i2 > i1 and s[i2].isalpha():
    i2 -= 1

  # Return the three parts
  return s[:i1], s[i1:i2], s[i2:]


for s in list_strings:
  # Loop over list reporting lengths of parts
  # Three parts are a, b, c
  a, b, c = split_text(s)
  print(f'{s}\nLengths: {len(a)}, {len(b)}, {len(c)}')
  print()

Output

I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best
Lengths: 52, 86, 63

So many books, so little time.
Lengths: 7, 10, 13

In three words I can sum up everything I've learned about life: it goes on.
Lengths: 20, 31, 24

if you tell the truth, you don't have to remember anything.
Lengths: 15, 25, 19

Always forgive your enemies; nothing annoys them so much.
Lengths: 14, 22, 21

Output of split_text

Code

for s in list_strings:
    a, b, c = split_text(s)
    print(a)
    print(b)
    print(c)
    print()

Result

I'm selfish, impatient and a little insecure. I make
 mistakes, I am out of control and at times hard to handle. But if you can't handle me
 at my worst, then you sure as hell don't deserve me at my best

So many
 books, so
 little time.

In three words I can
 sum up everything I've learned
 about life: it goes on.

if you tell the
 truth, you don't have to
 remember anything.

Always forgive
 your enemies; nothing
 annoys them so much.

Capturing the partitions

result_a, result_b, result_c = [], [], []
for s in list_strings:
    # Collect the three parts of every string in separate lists
    a, b, c = split_text(s)
    result_a.append(a)
    result_b.append(b)
    result_c.append(c)
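Since the title asks for n parts, the same idea can be generalized to any list of fractions. The following is a rough sketch rather than the answerer's code: `split_text_n` is a hypothetical name, and it assumes the fractions sum to 1. Each cut is placed at the cumulative fraction of the character length and then moved left while it sits on a letter, without ever crossing the previous cut.

```python
def split_text_n(s, fractions):
    """Split s into len(fractions) parts, never cutting inside a run of letters."""
    cuts = []
    previous = 0
    cumulative = 0.0
    for fraction in fractions[:-1]:
        cumulative += fraction
        cut = min(int(cumulative * len(s)), len(s))
        # Back up while the cut sits on a letter, but not past the previous cut
        while previous < cut < len(s) and s[cut].isalpha():
            cut -= 1
        cut = max(cut, previous)
        cuts.append(cut)
        previous = cut
    bounds = [0] + cuts + [len(s)]
    # Consecutive pairs of bounds delimit the parts
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


parts = split_text_n("So many books, so little time.", [0.3, 0.3, 0.4])
print(parts)  # ['So many', ' books, so', ' little time.']
```

Joining the parts always gives back the original string, so no character is lost or duplicated.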

Answer 2

In this solution, a word is a run of word characters that may contain one apostrophe, as defined by this regular expression:

[\w]+[']?[\w]*

It splits the text at punctuation. For example, applying it to "I'm selfish, impatient and a" produces this:

["I'm", "selfish", "impatient", "and", "a"]
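A quick way to check what the pattern does is to run it through `re.findall`. This is only a small demonstration; `WORD_PATTERN` is a name chosen here for illustration. Note that `\w` also matches digits and underscores, not just letters.

```python
import re

# Raw string so that the backslash in \w is passed to the regex engine as-is
WORD_PATTERN = r"[\w]+[']?[\w]*"

words = re.findall(WORD_PATTERN, "I'm selfish, impatient and a")
print(words)  # ["I'm", 'selfish', 'impatient', 'and', 'a']
```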

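Then we compute the split indices from the percentages in perc_list, defined at the top, and save the words into an array with one entry per part.

The code is as follows: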

import re

perc_list = [0.3, 0.3, 0.4]  # the fractions must sum to 1
list_strings=["I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best","So many books, so little time.","In three words I can sum up everything I've learned about life: it goes on.","if you tell the truth, you don't have to remember anything.","Always forgive your enemies; nothing annoys them so much."]

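for string in list_strings:
    # Extract the words; the raw string avoids an invalid-escape warning for \w
    ls = re.findall(r"[\w]+[']?[\w]*", string)
    # End index of the first part, then end index of the second part
    idxl = [round(perc_list[0] * len(ls))]
    idxl.append(idxl[0] + round(perc_list[1] * len(ls)))
    # Slice the word list into the three parts
    arr_str = [ls[0:idxl[0]], ls[idxl[0]:idxl[1]], ls[idxl[1]:]]
    print(string, '\n ', idxl[0], idxl[1], len(ls), '\n ', "\n  ".join(str(i) for i in arr_str), '\n')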

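For each string, the loop prints the original text, then the two split indices and the total word count, and finally the three word lists.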
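The word-based split can also be wrapped in a function that accepts any number of fractions. This is a sketch, not part of the original answer: `split_words` is a hypothetical name, it assumes the fractions sum to 1, and it rounds the cumulative fraction instead of rounding each part separately, so the cut points stay in order and within the word list.

```python
import re


def split_words(text, fractions=(0.3, 0.3, 0.4)):
    """Split the words of `text` into len(fractions) lists of whole words."""
    words = re.findall(r"[\w]+[']?[\w]*", text)
    bounds = [0]
    cumulative = 0.0
    for fraction in fractions[:-1]:
        cumulative += fraction
        cut = round(cumulative * len(words))
        # Keep cuts ordered and inside the list
        cut = min(max(cut, bounds[-1]), len(words))
        bounds.append(cut)
    bounds.append(len(words))
    # Consecutive pairs of bounds delimit the parts
    return [words[a:b] for a, b in zip(bounds, bounds[1:])]


parts = split_words("So many books, so little time.")
print(parts)  # [['So', 'many'], ['books', 'so'], ['little', 'time']]
```

Because the function slices a list of words, every word lands in exactly one part, which covers the "whole words only" requirement in the question.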
