CSV counting duplicates

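I have a csv file with one column of dates that contain the year, month, day, and hour. I'm trying to create a new csv file in which one column holds every date between the minimum and maximum in the first file, and a second column holds the number of times that date occurs. For example: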

file 1:
2016-02-18-23:19
2016-02-18-23:45
2016-01-03-05:12
2016-01-03-07:57

would become

file2:
2016-01-03-05    1
2016-01-03-06    0
2016-01-03-07    1
...
2016-02-18-22    0
2016-02-18-23    2

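I can extract the dates and use a Counter to build a dictionary of dates and their occurrences. I'm guessing I'll have to use datetime to create a list, by hour, from the minimum to the maximum, and then somehow assign the counts to that second list. This will be used on very large datasets.

Any help would be greatly appreciated.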

Here is a pandas solution.

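import pandas as pd

# Splitting on ":" leaves "2016-02-18-23" in the first field and the minutes in the second.
df = pd.read_csv("file1", sep=":", header=None, names=["hour", "minute"], index_col="hour")
df.index = pd.to_datetime(df.index, format="%Y-%m-%d-%H")
# pd.TimeGrouper has been removed from pandas; resample does the same hourly binning.
# Older pandas releases spell the hourly alias "H".
df.resample("h").size().to_csv("file2", header=False)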

The output file will look like this:

2016-01-03 05:00:00,1
2016-01-03 06:00:00,0
2016-01-03 07:00:00,1
2016-01-03 08:00:00,0
...
2016-02-18 19:00:00,0
2016-02-18 20:00:00,0
2016-02-18 21:00:00,0
2016-02-18 22:00:00,0
2016-02-18 23:00:00,2

I think you could use a regular expression:

import re

regex = re.compile(r'^\d{4}-\d{2}-\d{2}-\d{2}:\d{2}$')
stamps = {}

with open('file1.csv', 'r') as input_file:
    lines = input_file.read().splitlines()

for line in lines:
    if regex.search(line):
        elements = line.split('-')
        elements.extend(elements.pop().split(':'))
        key = elements[0] + '-' + elements[1] + '-' + elements[2] + '-' + elements[3]
        stamps.setdefault(key, 0)
        stamps[key] += 1

with open('file2.csv','w') as output_file:
    for key, value in sorted(stamps.items()):
        output_file.write(key + '\t' + str(value) + '\n')

file1.csv

2016-02-18-23:19
2016-02-18-23:45
2016-01-03-05:12
2016-01-03-07:57

file2.csv

2016-01-03-05 1
2016-01-03-07 1
2016-02-18-23 2
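
The output above lists only the hours that occur in the input, whereas the question also asks for the empty hours in between. One way to fill them is sketched below. The helper name `fill_missing_hours` is made up for illustration, and it assumes a dict shaped like `stamps` whose keys all follow the `YYYY-MM-DD-HH` pattern:

```python
from datetime import datetime, timedelta

HOUR_FORMAT = '%Y-%m-%d-%H'


def fill_missing_hours(stamps):
    """Return (key, count) pairs for every hour from the earliest to the latest key."""
    if not stamps:
        return []
    # Keys are zero-padded, so their string order is also their chronological order.
    current = datetime.strptime(min(stamps), HOUR_FORMAT)
    last = datetime.strptime(max(stamps), HOUR_FORMAT)
    rows = []
    while current <= last:
        key = current.strftime(HOUR_FORMAT)
        # Hours that never appeared in the input get a count of 0.
        rows.append((key, stamps.get(key, 0)))
        current += timedelta(hours=1)
    return rows


example = {'2016-01-03-05': 1, '2016-01-03-07': 1}
for key, value in fill_missing_hours(example):
    print(key + '\t' + str(value))
```

In the script above, the final loop could then iterate over `fill_missing_hours(stamps)` instead of `sorted(stamps.items())`.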

Based on the tags you attached to the question, here is a solution using Counter, datetime, and good ol' csv:

from collections import Counter
from datetime import datetime
import csv


with open('file2.txt','w') as outfile:
    csv_writer = csv.writer(outfile, delimiter = "\t", lineterminator = "\n")
    data = Counter([datetime.strptime(x.strip(),'%Y-%m-%d-%H:%M').strftime('%Y-%m-%d-%H') for x in open('file1.txt')]).items()
    data = sorted(data, key = lambda x: x[0])
    csv_writer.writerows(data)

This produces a file containing:

2016-01-03-05   1
2016-01-03-07   1
2016-02-18-23   2

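Edit:

On second thought, I think I slightly misunderstood the question. It looks like you do want the hours that are missing from the original file to appear in the output with a count of zero. I think the following should be more complete: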

from collections import Counter
from datetime import datetime, timedelta
import csv


with open('file1.txt') as infile, open('file2.txt', 'w') as outfile:
    csv_writer = csv.writer(outfile, delimiter = "\t", lineterminator = "\n")

    # Read each row and convert it to a datetime truncated to the hour,
    # so that e.g. 05:50 and 07:10 still have the 06 hour between them
    datetimes = [datetime.strptime(x.strip(), '%Y-%m-%d-%H:%M').replace(minute=0) for x in infile]
    min_date = min(datetimes)

    # Get the number of whole hours between the min and max dates
    num_hours = int((max(datetimes) - min_date).total_seconds() // 3600)

    # Count the values, keyed by the desired date format
    counts = Counter(x.strftime('%Y-%m-%d-%H') for x in datetimes)

    # Add the hours missing from the original file with a count of zero
    for i in range(num_hours + 1):
        counts.setdefault((min_date + timedelta(hours = i)).strftime('%Y-%m-%d-%H'), 0)

    # Sort by dates (the zero-padded keys sort chronologically as strings)
    data = sorted(counts.items())

    # Output the data into file2.txt
    csv_writer.writerows(data)

This should produce:

2016-01-03-05   1
2016-01-03-06   0
2016-01-03-07   1
2016-01-03-08   0
2016-01-03-09   0
2016-01-03-10   0
...
2016-02-18-21   0
2016-02-18-22   0
2016-02-18-23   2
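
Since the question mentions very large datasets, it may be worth not keeping a list of every parsed timestamp in memory. Here is a rough sketch of that idea, which reads the input lazily and stores only one counter entry per distinct hour. The names `count_by_hour` and `write_counts` are illustrative, and the input is assumed to have one `YYYY-MM-DD-HH:MM` stamp per line:

```python
from collections import Counter
from datetime import datetime, timedelta
import csv


def count_by_hour(lines):
    """Yield (hour, count) for each hour from the earliest to the latest, zeros included."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            # Truncate each stamp to the hour before counting it.
            stamp = datetime.strptime(line, '%Y-%m-%d-%H:%M').replace(minute=0)
            counts[stamp] += 1
    if not counts:
        return
    current, last = min(counts), max(counts)
    while current <= last:
        # Looking up a missing key in a Counter returns 0 without storing it.
        yield current.strftime('%Y-%m-%d-%H'), counts[current]
        current += timedelta(hours=1)


def write_counts(in_path, out_path):
    """Stream the stamps from in_path and write tab-separated hourly counts to out_path."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')
        writer.writerows(count_by_hour(infile))
```

Calling `write_counts('file1.txt', 'file2.txt')` would mirror the file names used above. Memory use then grows with the number of distinct hours rather than the number of input rows.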

Hope this helps.