Find the most common text line length in a file with lines of various sizes using Python
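I'm a bit stuck thinking about this problem.
I have a file containing many lines of characters, one after another. They are not in paragraphs, but in this form: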
xxxxxxx
xxx
xxxxxxxxxxxx
xxx
xxxxxxx
xxxx
xxxxxxxx
xxx
xxxxxx
xxxx
xxx
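The idea is to find the number of lines that have the most common size (or character count). In the example above, the answer is 4 lines.
I'm trying to do this in Python because the rest of the code is written in it.
Any help would be appreciated.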
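Use a list of line lengths, then take the maximum occurrence count:
with open('file.txt') as data:
    length = [len(i.rstrip('\n')) for i in data]  # line lengths, without the trailing newline
    common = max(length.count(i) for i in length)  # how many lines share the most common length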
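Lines read from a file keep their trailing newline, which matters when the last line has none. The sketch below uses an in-memory copy of the sample data (the names `SAMPLE` and `max_count` are made up for illustration) to show how the count changes when the newline is left in.

```python
import io

# The sample from the question, with no newline after the final line.
SAMPLE = "\n".join("x" * n for n in [7, 3, 12, 3, 7, 4, 8, 3, 6, 4, 3])


def max_count(lengths):
    """Return how many lines share the most common length."""
    return max(lengths.count(n) for n in lengths)


# Raw lengths include the "\n" on every line except the last one.
raw = [len(line) for line in io.StringIO(SAMPLE)]
# Stripping the newline makes every line comparable.
stripped = [len(line.rstrip("\n")) for line in io.StringIO(SAMPLE)]

print(max_count(raw))       # 3
print(max_count(stripped))  # 4
```

Note that `list.count` rescans the list for every element, so this approach is best kept for small files.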
You can use a Counter and then its most_common method:
from collections import Counter
with open("a.txt") as f:
    c = Counter(len(line.rstrip("\n")) for line in f)
print(c.most_common(1))
Result:
[(3, 4)]
This means length 3 is the most common, occurring 4 times.
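One thing `most_common(1)` does not show is a tie: if two lengths occur equally often, only one of them is returned. A minimal sketch of one way to report every tied length, assuming a hypothetical helper named `most_common_lengths` that takes a list of line lengths:

```python
from collections import Counter


def most_common_lengths(lengths):
    """Return (lengths_tied_for_first, count); ([], 0) if there are no lines."""
    counts = Counter(lengths)
    if not counts:
        return [], 0
    top = max(counts.values())
    # Collect every length whose count equals the highest count.
    return sorted(n for n, c in counts.items() if c == top), top


print(most_common_lengths([7, 3, 12, 3, 7, 4, 8, 3, 6, 4, 3]))  # ([3], 4)
print(most_common_lengths([5, 5, 2, 2, 9]))                     # ([2, 5], 2)
```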
Here is a way to get the most common length:
with open('file.txt', 'rb') as fin:
    lst = [len(line.strip()) for line in fin]
print(max(set(lst), key=lst.count))
Well, starting with reading the lines, there are a few approaches you could take:
myFile = open(path)
for line in myFile:
    pass  # do something with 'line'
Or maybe
lines = myFile.readlines()
for i in range(len(lines)):
    pass  # do something with lines[i]
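Then you need to store the length of each line somehow
lengths = []  # created once, before the loop
lengths.append(len(line))  # inside the loop, for each line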
Now you just need to find the length that occurs most often
frequencies = {}
for length in lengths:
    if length in frequencies:  # Check if we already had this length before
        frequencies[length] += 1  # Increment it
    else:
        frequencies[length] = 1  # Add it to the dictionary
Finding the maximum of the collection should be trivial, but just in case:
maximum = 0
for i in frequencies:
    if frequencies[i] > maximum:
        maximum = frequencies[i]
# after this completes, no entry in frequencies is greater than maximum
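The steps above can be put together roughly as follows. This is only a sketch: the function name `most_common_length` is made up here, and it also remembers which length produced the maximum, since the loop above only tracks the count.

```python
def most_common_length(path):
    """Return (length, count) for the most frequent line length in the file."""
    frequencies = {}
    with open(path) as handle:
        for line in handle:
            length = len(line.rstrip("\n"))  # ignore the trailing newline
            frequencies[length] = frequencies.get(length, 0) + 1

    best_length, maximum = None, 0
    for length, count in frequencies.items():
        if count > maximum:
            best_length, maximum = length, count
    return best_length, maximum
```

For a file holding the sample data from the question, `most_common_length('file.txt')` should give `(3, 4)`; an empty file gives `(None, 0)`.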
The collections module has a dictionary subclass named Counter that can be used to keep track of the length of each line encountered.
This makes solving the problem very easy. If the file is not very large, you could use it like this:
from collections import Counter
def most_common_line_len(filename):
    with open(filename) as f:
        return Counter(map(len, f.read().splitlines())).most_common(1)[0][0]
print(most_common_line_len('somefile.txt'))  # --> 3 for your sample data
Otherwise, you could use a generator expression in conjunction with a lambda function to avoid reading it all into memory at once:
def most_common_line_len(filename):
    with open(filename) as f:
        return Counter(map(lambda line: len(line.rstrip()),
                           (line for line in f))).most_common(1)[0][0]
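One edge case with indexing `most_common(1)[0][0]` is an empty file: `most_common(1)` returns an empty list, so the index raises `IndexError`. A possible guard is sketched below; the name `most_common_line_len_or_none` and the choice of returning `None` are assumptions for illustration, not part of the answer above.

```python
from collections import Counter


def most_common_line_len_or_none(filename):
    """Streaming count of line lengths; returns None for a file with no lines."""
    with open(filename) as f:
        # The generator expression feeds Counter one length at a time.
        counts = Counter(len(line.rstrip("\n")) for line in f)
    top = counts.most_common(1)  # [] when the file is empty
    return top[0][0] if top else None
```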