CSV counting duplicates
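I have a csv file with a column of dates that contain the year, month, day, and hour. I am trying to create a new csv file in which one column lists every date between the minimum and maximum of the first file, and a second column gives the number of times that date occurs. For example: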
file 1:
2016-02-18-23:19
2016-02-18-23:45
2016-01-03-05:12
2016-01-03-07:57
would become
file2:
2016-01-03-05 1
2016-01-03-06 0
2016-01-03-07 1
...
2016-02-18-22 0
2016-02-18-23 2
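I can extract the dates and use a Counter to build a dictionary of dates and their occurrences. I'm guessing I will have to use datetime to create a list, by hour, from the minimum to the maximum, and then somehow assign the counts to that second list. This will be applied to very large datasets.
Any help would be appreciated.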
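Here is a pandas solution.
import pandas as pd

# Splitting on ":" leaves "YYYY-MM-DD-HH" as the index and the minutes in column 'v'
df = pd.read_csv("file1", sep=":", names=['v'])
# State the format explicitly so "2016-02-18-23" is parsed as date plus hour
df.index = pd.to_datetime(df.index, format='%Y-%m-%d-%H')
# pd.TimeGrouper has been removed from pandas; pd.Grouper(freq=...) replaces it
# (recent pandas versions prefer the lowercase alias 'h')
df.groupby(pd.Grouper(freq='H')).size().to_csv("file2")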
The output file will look like this:
2016-01-03 05:00:00,1
2016-01-03 06:00:00,0
2016-01-03 07:00:00,1
2016-01-03 08:00:00,0
...
2016-02-18 19:00:00,0
2016-02-18 20:00:00,0
2016-02-18 21:00:00,0
2016-02-18 22:00:00,0
2016-02-18 23:00:00,2
I think you can use a regular expression:
import re

# Only lines that look exactly like YYYY-MM-DD-HH:MM are counted
regex = re.compile(r'^\d{4}-\d{2}-\d{2}-\d{2}:\d{2}$')
stamps = {}
with open('file1.csv', 'r') as input_file:
    lines = input_file.read().splitlines()
for line in lines:
    if regex.search(line):
        # Split into year, month, day, hour, minute and drop the minute
        elements = line.split('-')
        elements.extend(elements.pop().split(':'))
        key = elements[0] + '-' + elements[1] + '-' + elements[2] + '-' + elements[3]
        stamps.setdefault(key, 0)
        stamps[key] += 1
with open('file2.csv', 'w') as output_file:
    for key, value in sorted(stamps.items()):
        output_file.write(key + '\t' + str(value) + '\n')
file1.csv
2016-02-18-23:19
2016-02-18-23:45
2016-01-03-05:12
2016-01-03-07:57
file2.csv
2016-01-03-05 1
2016-01-03-07 1
2016-02-18-23 2
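The split/pop/extend steps above can probably be shortened by letting the regular expression capture the hour key directly. The following is only a sketch of that idea, not part of the original answer: the names `STAMP` and `count_by_hour` are made up for illustration, and like the code above it counts only the hours that actually occur, without adding zero-count rows.

```python
import re
from collections import Counter

# The parentheses capture the "YYYY-MM-DD-HH" part of a valid stamp.
STAMP = re.compile(r'^(\d{4}-\d{2}-\d{2}-\d{2}):\d{2}$')


def count_by_hour(lines):
    """Count valid stamps per hour; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = STAMP.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object as `lines` should work too, since the function only iterates over it once.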
Based on the tags on your question, here is a solution that uses Counter, datetime, and good ol' csv:
from collections import Counter
from datetime import datetime
import csv

with open('file2.txt', 'w') as outfile:
    csv_writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    # Truncate each stamp to the hour and count the occurrences
    data = Counter([datetime.strptime(x.strip(), '%Y-%m-%d-%H:%M').strftime('%Y-%m-%d-%H') for x in open('file1.txt')]).items()
    data = sorted(data, key=lambda x: x[0])
    csv_writer.writerows(data)
This produces a file with the following contents:
2016-01-03-05 1
2016-01-03-07 1
2016-02-18-23 2
Edit:
On second thought, I think I slightly misunderstood the question. It seems you also want the hours that are missing from the original file to appear in the output with a count of zero. I think the following should be more complete:
from collections import Counter
from datetime import datetime, timedelta
import csv

# Get each row and convert it to datetime
with open('file1.txt') as infile:
    datetimes = [datetime.strptime(x.strip(), '%Y-%m-%d-%H:%M') for x in infile]

# Get the minimum, truncated to the start of its hour so that no hour in the range is skipped
min_date = min(datetimes).replace(minute=0)
# Get the number of whole hours between the min and max dates
delta = max(datetimes) - min_date
num_hours = delta.seconds // 3600 + 24 * delta.days

# Convert to the desired date format and count the values
counts = Counter(x.strftime('%Y-%m-%d-%H') for x in datetimes)

# Add the hours missing from the original file with a count of zero
for i in range(num_hours):
    counts.setdefault((min_date + timedelta(hours=i)).strftime('%Y-%m-%d-%H'), 0)

# Sort by date and output the data into file2.txt
with open('file2.txt', 'w') as outfile:
    csv_writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    csv_writer.writerows(sorted(counts.items()))
This should produce:
2016-01-03-05 1
2016-01-03-06 0
2016-01-03-07 1
2016-01-03-08 0
2016-01-03-09 0
2016-01-03-10 0
...
2016-02-18-21 0
2016-02-18-22 0
2016-02-18-23 2
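Since the question mentions very large datasets, it may be worth avoiding a list of every parsed datetime. The sketch below is one possible variant under a few assumptions: the generator name `hourly_rows` is hypothetical, every non-blank line is a zero-padded `YYYY-MM-DD-HH:MM` stamp (so the first 13 characters are the hour key and string order matches chronological order), and only the per-hour counts are kept in memory.

```python
from collections import Counter
from datetime import datetime, timedelta

HOUR_FORMAT = '%Y-%m-%d-%H'


def hourly_rows(lines):
    """Yield (hour, count) for every hour from the earliest to the latest stamp."""
    # One pass over the input; the first 13 characters are "YYYY-MM-DD-HH".
    counts = Counter(line.strip()[:13] for line in lines if line.strip())
    if not counts:
        return
    # Zero-padded keys sort chronologically, so min/max on strings is enough.
    current = datetime.strptime(min(counts), HOUR_FORMAT)
    last = datetime.strptime(max(counts), HOUR_FORMAT)
    while current <= last:
        key = current.strftime(HOUR_FORMAT)
        # A Counter returns 0 for keys it has never seen.
        yield key, counts[key]
        current += timedelta(hours=1)
```

The rows come out already sorted, so they could be handed straight to `csv_writer.writerows(hourly_rows(infile))` with `infile` being the open input file.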
Hope this helps.