How to create dictionary from multiple dataframes?
I have a folder with multiple csv files. Example dataframes from the csv files in the directory:
d1 = {'timestamp': ['2013-01-30', '2015-02-29', '2014-03-25', '2016-01-01', '2018-02-20',
                    '2012-05-05', '2018-02-04'],
      'site': ['plus.google.com', 'vk.com', 'yandex.ru', 'plus.google.com', 'vk.com', 'oracle.com', 'oracle.com']}
df1 = pd.DataFrame(data=d1)

d2 = {'timestamp': ['2013-01-30', '2015-02-29', '2014-03-25', '2016-01-01', '2018-02-20'],
      'site': ['plus.google.com', 'meduza.ru', 'yandex.ru', 'google.com', 'meduza.ru']}
df2 = pd.DataFrame(data=d2)
I need to write a function that takes the path to the directory of files and returns a dictionary of site frequencies (a single one for all sites in the directory), keyed by unique site name, of the form {'site_string': (site_id, site_freq)}. For our example it would be: {'vk.com': (1, 2), 'plus.google.com': (2, 3), 'yandex.ru': (3, 2), 'meduza.ru': (4, 2), 'oracle.com': (5, 2), 'google.com': (6, 1)}
I tried applying value_counts() to each dataframe, turning each result into a dict, and merging the dicts, but in that case the counts for sites that occur in more than one file get overwritten rather than summed. How can I fix this? What should I do?
from glob import glob
import pandas as pd

def prepare_train_set(path_to_csv_files):
    frequency = {}
    for filename in glob(f'{path_to_csv_files}/*'):
        sub_iterationed_df = pd.read_csv(filename)
        value_counts_dict = dict(sub_iterationed_df["site"].value_counts())
        # dict.update() replaces the count for a site already seen in an
        # earlier file instead of adding to it
        frequency.update(value_counts_dict)
    return frequency
I also tried making lists from the keys and values of the value_counts() dicts and then building a dictionary with the zip function, but I get the error "list assignment index out of range". Why does this error occur, and how can I get around it?
def CheckForDuplicates(keys_list, values_list):
    keys_list = list(value_counts_dict.keys())
    values_list = list(value_counts_dict.values())
    keys_list_constant = keys_list[:]
    values_list_constant = values_list[:]
    for i in range(len(keys_list_constant)):
        checking_dup_keys_list = keys_list[:i]
        checking_dup_values_list = values_list[:i]
        key_value = keys_list_constant[i]
        if key_value in checking_dup_keys_list:
            duplicate_index = checking_dup_keys_list.index(key_value)
            values_list[duplicate_index] = values_list[duplicate_index] + values_list_constant[i]
            # each del shortens the lists, but i still runs over the
            # original length, so it eventually points past the end:
            # "list assignment index out of range"
            del values_list[i]
            del keys_list[i]
    return (keys_list, values_list)

CheckForDuplicates(keys_list, values_list)
You can use a Counter instead of a plain dict:
from collections import Counter
from glob import glob
import pandas as pd

def prepare_train_set(path_to_csv_files):
    frequency = Counter()
    for filename in glob(f'{path_to_csv_files}/*'):
        sub_iterationed_df = pd.read_csv(filename)
        value_counts_dict = sub_iterationed_df['site'].value_counts().to_dict()
        # Counter.update() adds the new counts to the existing ones
        frequency.update(value_counts_dict)
    return frequency
From the docs:

update([iterable-or-mapping])
Elements are counted from an iterable or added-in from another mapping (or counter). Like dict.update() but adds counts instead of replacing them.
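A quick illustration of the difference, with made-up counts:

from collections import Counter

plain, counted = {}, Counter()
for chunk in ({'vk.com': 2, 'oracle.com': 2}, {'vk.com': 1}):
    plain.update(chunk)    # dict.update(): replaces the old count
    counted.update(chunk)  # Counter.update(): adds to the old count

print(plain['vk.com'], counted['vk.com'])  # 1 3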
Or concatenate all the dataframes and then take .value_counts().
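A minimal sketch of that variant, which also produces the {'site_string': (site_id, site_freq)} shape you asked for. Note that assigning site_id by descending frequency is an assumption here; your expected output seems to use a different ordering, so adjust the enumeration to whatever your task requires:

import pandas as pd
from glob import glob

def prepare_train_set(path_to_csv_files):
    # stack the 'site' columns from every csv and count once
    all_sites = pd.concat(
        pd.read_csv(filename)['site']
        for filename in glob(f'{path_to_csv_files}/*')
    )
    freqs = all_sites.value_counts()  # one Series, sorted by frequency
    # enumerate from 1 to attach a site_id to each unique site
    return {site: (site_id, freq)
            for site_id, (site, freq) in enumerate(freqs.items(), start=1)}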