Find the most common length of text line in a file with various line sizes, using Python

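I'm a bit stuck thinking about this problem.

I have a file containing many lines of characters, one after another. They are not in paragraphs, but in this form: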

xxxxxxx
xxx
xxxxxxxxxxxx
xxx
xxxxxxx
xxxx
xxxxxxxx
xxx
xxxxxx
xxxx
xxx

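The idea is to find the number of lines that have the most common size (character count). In the example above, the answer is 4 lines.

I'm trying to do this in Python, since the rest of the code is already written in it. Any help would be appreciated.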

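Build a list of line lengths, then take the maximum occurrence count:

with open('file.txt') as data:
    length = [len(i.rstrip('\n')) for i in data]  # line lengths, newline excluded
    common = max(length.count(i) for i in length)  # count of the most common length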

You can use a Counter, then call its most_common method:

from collections import Counter
with open("a.txt") as f:
    c = Counter(len(line.rstrip("\n")) for line in f)
print(c.most_common(1))

Result:

[(3, 4)]

This means that length 3 is the most common, occurring 4 times.
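To check that result without touching the disk, here is a minimal sketch that feeds the sample lines from the question through the same Counter idea using an in-memory stream. The names SAMPLE and most_common_length are made up for this illustration, and it assumes the input has at least one line (an empty input would raise IndexError).

```python
import io
from collections import Counter

# Sample lines from the question, held in memory instead of a file.
SAMPLE = "\n".join([
    "xxxxxxx", "xxx", "xxxxxxxxxxxx", "xxx", "xxxxxxx", "xxxx",
    "xxxxxxxx", "xxx", "xxxxxx", "xxxx", "xxx",
]) + "\n"

def most_common_length(stream):
    """Return (length, count) for the most frequent line length in stream."""
    counts = Counter(len(line.rstrip("\n")) for line in stream)
    return counts.most_common(1)[0]

length, count = most_common_length(io.StringIO(SAMPLE))
print(length, count)  # 3 4
```

Because the function only iterates over its argument, passing an open file object instead of the StringIO should work the same way.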

Here is a way to get the most common length:

with open('file.txt') as fin:
    # strip() removes the newline plus any leading/trailing whitespace
    lst = [len(line.strip()) for line in fin]

print(max(set(lst), key=lst.count))
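In case the max(set(lst), key=lst.count) idiom is not obvious, here is a small sketch of what it does, using the line lengths of the sample data from the question as a hard-coded list (typed in by hand for illustration, not read from a file).

```python
# Line lengths of the sample data from the question.
lst = [7, 3, 12, 3, 7, 4, 8, 3, 6, 4, 3]

# set(lst) gives each distinct length once; key=lst.count ranks those
# lengths by how many times each one occurs in the full list.
most_common = max(set(lst), key=lst.count)

print(most_common)             # 3
print(lst.count(most_common))  # 4
```

Note that lst.count rescans the whole list for every distinct length, so this is fine for small files but can get slow when there are many different lengths.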

Well, starting with reading the lines, there are a few approaches you can take:

myFile = open(path)
for line in myFile:
    pass  # do something with 'line'

Or maybe

lines = myFile.readlines()
for i in range(len(lines)):
    pass  # do something with lines[i]

Then you need to store the length of each line somehow

lengths.append(len(line))  # with lengths = [] created before the loop

Now you just need to find the length that occurs most often

frequencies = {}
for length in lengths:
    if length in frequencies: #Check if we already had this length before
        frequencies[length] += 1 #Increment it
    else:
        frequencies[length] = 1 #Add it to the dict

Finding the maximum in the collection should be trivial, but just in case:

maximum = 0
for i in frequencies:
    if frequencies[i] > maximum:
        maximum = frequencies[i]
#after this completes, no entry on frequencies is greater than maximum
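Putting the steps above together, a rough end-to-end sketch might look like the following. The function name most_common_line_length and the choice to return both the length and its count are my own assumptions for illustration; it also strips only the trailing newline, so other whitespace counts toward the length.

```python
def most_common_line_length(path):
    """Return (length, count) for the most frequent line length in a file."""
    frequencies = {}
    with open(path) as my_file:
        for line in my_file:
            length = len(line.rstrip("\n"))  # ignore the trailing newline
            # Start at 0 for a length we have not seen, then increment.
            frequencies[length] = frequencies.get(length, 0) + 1

    # Track the length alongside its count, not just the maximum count.
    best_length, maximum = None, 0
    for length, count in frequencies.items():
        if count > maximum:
            best_length, maximum = length, count
    return best_length, maximum
```

For the sample data in the question this should give (3, 4); for an empty file it returns (None, 0).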

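The collections module has a dictionary subclass named Counter that can be used to keep track of the length of each line encountered.

This makes the problem very easy to solve. If the file isn't very large, you can use it like this: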

from collections import Counter

def most_common_line_len(filename):
    with open(filename) as f:
        return Counter(map(len, f.read().splitlines())).most_common(1)[0][0]

print(most_common_line_len('somefile.txt'))  # --> 3 for your sample data

Otherwise, you can use a generator expression in conjunction with a lambda function to avoid reading it all into memory at once:

def most_common_line_len(filename):
    with open(filename) as f:
        return Counter(map(lambda line: len(line.rstrip()),
                           (line for line in f))).most_common(1)[0][0]
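One thing to keep in mind: most_common(1) returns only a single entry, so if two lengths are tied for first place you will see just one of them. If ties matter for your data, something along these lines might help. The function most_common_lengths is a hypothetical helper, not part of the collections module, and it accepts any iterable of lines, including an open file.

```python
from collections import Counter

def most_common_lengths(lines):
    """Return (lengths, count): every length tied for the highest count."""
    counts = Counter(len(line.rstrip("\n")) for line in lines)
    if not counts:
        return [], 0  # no lines at all
    top = max(counts.values())
    return sorted(length for length, n in counts.items() if n == top), top

print(most_common_lengths(["aa\n", "bbb\n", "aa\n", "ccc\n"]))  # ([2, 3], 2)
```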