Counting the number of occurrences of certain words in an entire CSV file as well as per row in Python
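I am processing data from multiple servers and generating one CSV file per server. I have managed to compile the data from all servers into a single file; the merged file looks like this: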
Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01
1.1 Database Placement,PASSED,PASSED,PASSED
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED
1.3 Diable MySQL history,PASSED,PASSED,FAILED
2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA
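Each server column in the file above can hold one of the following result values:
["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]
From the CSV file above, I want to count the results and create a dataset like the one below: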
Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01,PASSED,FAILED,EXCEPTION,NA,DEPRECATED
1.1 Database Placement,PASSED,PASSED,PASSED,3,0,0,0,0
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED,3,0,0,0,0
1.3 Diable MySQL history,PASSED,PASSED,FAILED,2,1,0,0,0
2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA,1,0,0,1,1
Here is a suggestion (deliberately verbose, to highlight what is happening):
import csv

events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]

# Open files (newline='' is what the csv module documentation recommends)
with open('data.csv', 'r', newline='') as csv_in, open('data_out.csv', 'w', newline='') as csv_out:
    # Initialize csv-reader and -writer
    csv_reader, csv_writer = csv.reader(csv_in), csv.writer(csv_out)
    # Process header
    line_in = next(csv_reader)
    line_out = line_in + events
    csv_writer.writerow(line_out)
    # Process data
    for line_in in csv_reader:
        line_out = list(line_in)  # copy, so line_in is not modified while counting
        for event in events:
            # Count how often this event occurs in the server columns
            line_out += [sum(1 if event == entry else 0
                             for entry in line_in[1:])]
        csv_writer.writerow(line_out)
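I assume your data is in a file named data.csv; you will need to adjust that. I hope it works...
PS: Your sample data contains a typo: DEPRICATED should be DEPRECATED. As written, that value is not counted, so the output differs from what you expect.
A more compact version without the unnecessary helper variables looks like this: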
import csv

events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]

with open('data.csv', 'r', newline='') as fin, open('data_out.csv', 'w', newline='') as fout:
    in_, out = csv.reader(fin), csv.writer(fout)
    # Header row, then every data row with its per-event counts appended
    out.writerow(next(in_) + events)
    out.writerows(line + [sum(1 if event == entry else 0 for entry in line[1:])
                          for event in events]
                  for line in in_)
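Regarding the DEPRICATED typo mentioned in the PS: if the source files cannot be corrected, one option is to normalize the values before counting. The following is only a minimal sketch. The ALIASES table and the count_events helper are hypothetical names introduced for illustration, and the sketch assumes that upper-casing and stripping whitespace is acceptable for your data.

```python
EVENTS = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]

# Hypothetical alias table: known misspellings mapped to the canonical value.
ALIASES = {"DEPRICATED": "DEPRECATED"}


def count_events(row):
    """Return one count per entry in EVENTS for the server columns of a row."""
    # row[0] is the description, so only row[1:] holds result values.
    cleaned = [value.strip().upper() for value in row[1:]]
    results = [ALIASES.get(value, value) for value in cleaned]
    return [results.count(event) for event in EVENTS]


row = ["2.1 Ensure old passwords is set to 1", "PASSED", "DEPRICATED", "NA"]
print(row + count_events(row))
```

With this normalization the last sample row yields the counts 1,0,0,1,1 that the question expects. Inside the loops above, `line + count_events(line)` could replace the inline sum expression.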
You can use Counter to count the occurrences of specific words. Assuming you have already read the .csv file into a string named input, you can do this:
from collections import Counter

res_values = ("PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED")
# Note: the name `input` shadows the built-in input() function
input = ("Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01\n"
         "1.1 Database Placement,PASSED,PASSED,PASSED\n"
         "1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED\n"
         "1.3 Diable MySQL history,PASSED,PASSED,FAILED\n"
         "2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA")

print('\n'.join(
    [line + ',' + ','.join(
        [str(Counter(line.split(','))[res])
         if i != 0
         else res
         for res in res_values]
    )
     for i, line in enumerate(input.split('\n'))]))
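I used a list comprehension to optimize the process (since the file may be very large), but here is another, clearer version that does exactly the same thing: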
split = input.split('\n')  # Split the input line by line
for i, line in enumerate(split):  # For each line of the input
    if i == 0:  # Write full result name (for the first line)
        split[i] += ',' + ','.join(res_values)
    else:  # Count and write result occurrences
        counts = Counter(line.split(','))
        for res in res_values:
            split[i] += ',' + str(counts[res])
print('\n'.join(split))  # Join the full string
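I have presented a ready-to-execute solution, but for performance it is of course better to read the file line by line than to store it in a string variable as done here.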
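To illustrate the line-by-line approach, here is a minimal sketch that combines Counter with the csv module so that only one row is held in memory at a time. The function names add_counts and process_file and the file paths are assumptions made for illustration. Unlike the string-based version, it counts only the server columns and skips the description.

```python
import csv
from collections import Counter

RES_VALUES = ("PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED")


def add_counts(rows):
    """Yield each row with per-row counts appended; the first row is the header."""
    rows = iter(rows)
    header = next(rows)
    yield header + list(RES_VALUES)
    for row in rows:
        counts = Counter(row[1:])  # skip the description column
        yield row + [counts[res] for res in RES_VALUES]


def process_file(in_path, out_path):
    """Stream in_path to out_path, one row at a time."""
    # newline="" follows the csv module documentation for file objects.
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        csv.writer(fout).writerows(add_counts(csv.reader(fin)))
```

Calling `process_file("data.csv", "data_out.csv")` would then write the extended file. Because add_counts is a generator, rows are written as they are read rather than collected first.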