How to find the most common words without removing duplicates?

Question

I have a list like this:

group = [
#Group 1 ('aaaa bbbb' the most common words = two words)
['aaaa bbbb nnnn',   #<-- row 1
 'aaaa bbbb oooo',   #<-- row 2
 'aaaa bbbb pppp'],   #<-- row 3

#Group 2 ('hello' the most common word = one word)
['hello Jack   T.',  #<-- row 1
 'hello Ramona D.',  #<-- row 2
 'hello Robert G.'], #<-- row 3

#Group 3 ('yes! go go' the most common words = the whole string)
['yes! go go',      #<-- row 1
 'yes! go go',      #<-- row 2
 'yes! go go',      #<-- row 3     
 'yes! go go'],     #<-- row 4    

#Group 4 (only one word  = it's an invalid group)
['python'],          #<-- row 1 

#Group 5 (only one word = it's an invalid group)
['java']            #<-- row 1

]

I need to find the most common words for each group and save them in a new list.

Like this:

OUT : ['aaaa','hello','yes! go']

But the third group has a repeated word -> 'go go', and I need both occurrences, so the real result is:

OUT : ['aaaa','hello','yes! go go']

Here is the code I have so far:

import collections

#Try to count words for each group
for groups in group:
    #how many rows are in this group?
    nGroup = len(groups)
    #join the rows and split them into single words
    words = " ".join(groups).split()
    print("WORDS", words)

I get:

WORDS ['aaaa', 'bbbb', 'nnnn', 'aaaa', 'bbbb', 'oooo', 'aaaa', 'bbbb', 'pppp']
WORDS ['hello', 'Jack', 'T.', 'hello', 'Ramona', 'D.', 'hello', 'Robert', 'G.']
WORDS ['yes!', 'go', 'go', 'yes!', 'go', 'go', 'yes!', 'go', 'go', 'yes!', 'go', 'go']
WORDS ['python']
WORDS ['java']

    #how often does each word occur?
    rows = collections.Counter(words)
    #all words, sorted from most to least common
    wCommon = rows.most_common()
    #how often does the most common word occur?
    mCommon = rows.most_common(1)[0][1]
    print (f"wCommon :{wCommon}  rows :{rows}  mCommon :{mCommon}")

I get:

#Group 1
wCommon :[('aaaa', 3), ('bbbb', 3),
      ('nnnn', 1), ('oooo', 1),
      ('pppp', 1)]
rows :Counter({'aaaa': 3, 'bbbb': 3,
           'nnnn': 1, 'oooo': 1,
           'pppp': 1})
mCommon :3        


#Group 2
wCommon :[('hello', 3), ('Jack', 1), ('T.', 1),
      ('Ramona', 1), ('D.', 1),
      ('Robert', 1), ('G.', 1)]
rows:Counter({'hello': 3, 'Jack': 1, 'T.': 1,
          'Ramona': 1, 'D.': 1,
          'Robert': 1, 'G.': 1})
mCommon:3


#Group 3
wCommon :[('go', 8), ('yes!', 4)]  rows :Counter({'go': 8, 'yes!': 4})
mCommon :8
#Group 4
wCommon :[('python', 1)]  rows :Counter({'python': 1})  mCommon :1
#Group 5
wCommon :[('java', 1)]  rows :Counter({'java': 1})  mCommon :1

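Below is the original list, although it can be changed. I tried to split it into groups and count the common words in each row... for example: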

aaaa, hello , yes! go go

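But sometimes there is more than one common word, for example 'aaaa bbbb'. How can I get both? And when a word is repeated, like 'go', counting single words does not work.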

list_1 = [

 "aaaa bbbb nnnn",
 "aaaa bbbb oooo",
 "aaaa bbbb pppp",
 "hello Ramona D.",
 "hello Jack   T.",
 "hello Robert G.",
 "yes! go go",
 "yes! go go",
 "yes! go go",
 "yes! go go",
 "python",
 "java"

]
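For reference, here is a minimal sketch of one way the flat list could be split into groups like the ones shown at the top. It assumes that rows belong together when they share their first word, which is not stated in the question; the helper name `first_word` is made up for illustration.

```python
from itertools import groupby

list_1 = [
    "aaaa bbbb nnnn", "aaaa bbbb oooo", "aaaa bbbb pppp",
    "hello Ramona D.", "hello Jack   T.", "hello Robert G.",
    "yes! go go", "yes! go go", "yes! go go", "yes! go go",
    "python", "java",
]


def first_word(row):
    # Assumption: rows that share their first word form one group.
    return row.split()[0]


# groupby only merges *adjacent* rows with the same key, which fits
# list_1 because related rows already sit next to each other.
group = [list(rows) for _, rows in groupby(list_1, key=first_word)]
print(group)
```

If related rows were not adjacent, the list would need to be sorted by the same key first, or collected into a dict keyed by the first word.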

Edit: thanks, everyone.

Answer 1

You can use collections.defaultdict:

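import collections, re

def most_common(d):
    if len(d) < 2:
        return  # invalid group
    # split every row into words (raw string avoids an invalid-escape warning)
    groups = [re.split(r'\s+', i) for i in d]
    _d = collections.defaultdict(int)
    for i in groups:
        # count every run of consecutive words in the row
        for b in [i[k:j] for j in range(len(i)+1) for k in range(j)]:
            _d[' '.join(b)] += 1
    # prefer repeated phrases, then longer ones, then more frequent ones
    return max(_d, key=lambda x: (_d[x] > 1, len(x.split()), _d[x]))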

group = [['aaaa bbbb nnnn', 'aaaa bbbb oooo', 'aaaa bbbb pppp'], ['hello Jack   T.', 'hello Ramona D.', 'hello Robert G.'], ['yes! go go', 'yes! go go', 'yes! go go', 'yes! go go'], ['python'], ['java']]
print(list(filter(None, map(most_common, group))))

Output:

['aaaa bbbb', 'hello', 'yes! go go']
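To make the idea easier to follow, here is a small sketch of the two steps the function above relies on: listing every run of consecutive words in a row, then ranking the counted phrases. The helper name `contiguous_phrases` and the use of `Counter` instead of `defaultdict` are choices made for this illustration, not part of the answer's code.

```python
from collections import Counter


def contiguous_phrases(row):
    """Return every run of consecutive words in `row`, joined by single spaces."""
    words = row.split()
    return [
        " ".join(words[start:end])
        for end in range(1, len(words) + 1)
        for start in range(end)
    ]


phrases = contiguous_phrases("aaaa bbbb nnnn")
print(phrases)

rows = ["aaaa bbbb nnnn", "aaaa bbbb oooo", "aaaa bbbb pppp"]
counts = Counter(p for row in rows for p in contiguous_phrases(row))

# Ranking key: repeated phrases first, then more words, then higher count.
best = max(counts, key=lambda p: (counts[p] > 1, len(p.split()), counts[p]))
print(best)
```

Because the phrase length comes before the count in the key, a two-word phrase that occurs twice is preferred over a single word that occurs three times. Note also that a phrase is counted once per occurrence, so 'go' is counted twice for every 'yes! go go' row.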

Answer 2

You can simply check whether several words occur the same number of times and whether they appear consecutively:

import collections

groups = [
    #Group 1 ('aaaa bbbb' the most common words = two words)
    [
        'aaaa bbbb nnnn',  #<-- row 1
        'aaaa bbbb oooo',  #<-- row 2
        'aaaa bbbb pppp'
    ],
    # Group 2 (one word 'aaaa' or 'bbbb', lets take the first)
    ['aaaa nnnn bbbb', 'aaaa oooo bbbb', 'aaaa pppp bbbb'],
    #Group 3 (two words 'oooo bbbb')
    ['aaa1 oooo bbbb', 'aaa2 oooo bbbb', 'aaa3 oooo bbbb'],

    #Group 4 ('hello' the most common word = one word)
    [
        'hello Jack   T.',  #<-- row 1
        'hello Ramona D.',  #<-- row 2
        'hello Robert G.'
    ],  #<-- row 3

    #Group 5 ('yes! go go' the most common words = the whole string)
    [
        'yes! go go',  #<-- row 1
        'yes! go go',  #<-- row 2
        'yes! go go',  #<-- row 3
        'yes! go go'
    ],  #<-- row 4

    #Group 6 (only one word  = it's an invalid group)
    ['python'],  #<-- row 1

    #Group 7 (only one word = it's an invalid group)
    ['java'],
    [
        "yu yu hakusho co dell'altro mondo", "yu yu hakusho re dell'inferno jr",
        'yu yu hakusho un amico per la pelle'
    ],
    [
        "yu yu yu hakusho co dell'altro mondo",
        "yu yu hakusho re dell'inferno jr yu yu",
        'yu yu yu hakusho un amico per la pelle'
    ]
]


def mostCommon(group):
    # skip invalid
    if len(group) < 2:
        return

    # all identical!
    if len(set(group)) == 1:
        return group[0]

    words = " ".join(group).split()
    c = collections.Counter(words)
    _maxCounts = max(c.values())

    # normalize maxCounts, in case maxCounts > length of group
    _maxItems = []
    for k, v in c.items():
        if v >= len(group) or v >= _maxCounts:
            _maxItems.extend([k] * divmod(v, len(group))[0])

    # One word appears most often.
    if len(_maxItems) == 1:
        return _maxItems[0]

    # Multiple words having same max. occurrences, do the words appear consecutively ?
    # Lookup reverse, starting with longest
    _combinations = [_maxItems[:x] for x in range(1, len(_maxItems) + 1)]
    for combo in _combinations[::-1]:
        phrase = ' '.join(combo)
        if len(set(item for item in group if phrase in item)) == len(group):
            return phrase


for i, group in enumerate(groups):
    result = mostCommon(group)
    print(f"Group {i+1}: {result}")

Output:

Group 1: aaaa bbbb
Group 2: aaaa
Group 3: oooo bbbb
Group 4: hello
Group 5: yes! go go
Group 6: None
Group 7: None
Group 8: yu yu hakusho
Group 9: yu yu
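One thing to keep in mind: the consecutive-words check in mostCommon uses Python's `in` operator on strings, which is a plain substring test. It can match inside longer words, and it can miss a phrase when the row contains extra spaces. Here is a hedged sketch of a word-based check that could be used instead; `contains_phrase` is a hypothetical helper, not part of the code above.

```python
def contains_phrase(row, phrase):
    """True if the words of `phrase` occur consecutively, as whole words, in `row`."""
    row_words = row.split()
    phrase_words = phrase.split()
    n = len(phrase_words)
    return any(
        row_words[i:i + n] == phrase_words
        for i in range(len(row_words) - n + 1)
    )


# The substring test matches inside longer words, the word-based check does not.
print("aaaa bbbb" in "xaaaa bbbbx", contains_phrase("xaaaa bbbbx", "aaaa bbbb"))

# The substring test is sensitive to repeated spaces, the word-based check is not.
print("Jack T." in "hello Jack   T.", contains_phrase("hello Jack   T.", "Jack T."))
```

For the sample groups in this answer both checks agree, so swapping it in should only matter for rows with irregular spacing or words that contain one another.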
