Create dict from a string or list
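Background
I want to build a hash table from a given string or list. The hash table uses each element as a key and the number of times it occurs as the value. For example: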
s = 'ababcd'
s = ['a', 'b', 'a', 'b', 'c', 'd']
dict_I_want = {'a':2,'b':2, 'c':1, 'd':1}
My attempts
# method 1
from collections import Counter
s = 'ababcd'
hash_table1 = Counter(s)
# method 2
s = 'ababdc'
hash_table2 = dict()
for i in s:
    if hash_table2.get(i) is None:
        hash_table2[i] = 1
    else:
        hash_table2[i] += 1
hash_table1 == hash_table2
True
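I usually use one of the two methods above. The first comes from the standard library, which some coding-practice sites do not allow. The second is written from scratch, but I find it too long. Using a dict comprehension, I came up with two more: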
{i:s.count(i) for i in set(s)}
{i:s.count(i) for i in s}
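The two comprehensions return the same dictionary but do different amounts of work. A minimal sketch using the example string from the question; the names by_set and by_string are illustrative only:

```python
s = 'ababcd'

# Iterating over set(s) calls s.count once per distinct element.
by_set = {ch: s.count(ch) for ch in set(s)}

# Iterating over s calls s.count once per element, repeats included;
# a repeated element just rewrites the same key with the same value.
by_string = {ch: s.count(ch) for ch in s}

print(by_set == by_string)   # both versions hold the same counts
print(len(set(s)), len(s))   # number of count() calls made by each version
```

The same code works unchanged when s is a list, since list also has a count method.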
Question
Is there any other way to initialize the hash table from a string or list clearly or efficiently?
Speed comparison of the 4 methods I mentioned
from collections import Counter
import random, string, numpy, perfplot
def from_set(s):
    return {i: s.count(i) for i in set(s)}
def from_string(s):
    return {i: s.count(i) for i in s}
def handy(s):
    hash_table2 = dict()
    for i in s:
        if hash_table2.get(i) is None:
            hash_table2[i] = 1
        else:
            hash_table2[i] += 1
    return hash_table2
def counter(s):
    return Counter(s)
perfplot.show(
    # random string of length n drawn from uppercase letters and digits
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),
    kernels=[from_set, from_string, handy, counter],
    labels=['set', 'string', 'handy', 'counter'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check=None,
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
The best way is to use the built-in Counter. Otherwise, you can use a defaultdict, which is very similar to your second attempt:
from collections import defaultdict
d = defaultdict(int)  # missing keys start at int() == 0
for letter in s:  # s is the string or list from the question
    d[letter] += 1
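Where neither Counter nor defaultdict is allowed, the loop from the second attempt can be shortened with the two-argument form of dict.get, which returns a default when the key is missing. This is an alternative that does not appear in the answer above, and the function name count_with_get is hypothetical:

```python
def count_with_get(items):
    """Count occurrences using a plain dict and dict.get with a default."""
    counts = {}
    for item in items:
        # get() returns 0 when item has not been seen yet
        counts[item] = counts.get(item, 0) + 1
    return counts

print(count_with_get('ababcd'))                        # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
print(count_with_get(['a', 'b', 'a', 'b', 'c', 'd']))  # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
```

This keeps the single pass of the hand-written loop while removing the if/else branch.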
I usually use Counter or defaultdict to build frequency counts.
I was surprised to find that the poster's from_set method outperforms both in most cases.
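Observations
- from_set (labelled 'set') performs best overall
- The various dictionary methods are faster only for short strings (i.e. < 100)
- The Counter method is faster only for a narrow range of string lengths
- For large strings, from_set is 2.3x faster than defaultdict and 1.5x faster than Counter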
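One likely reason for this result, offered as an assumption rather than something measured in this thread: the benchmark strings draw from only 36 symbols (uppercase letters and digits), so from_set makes at most 36 calls to str.count, each a single pass over the string. The cost grows with the number of distinct elements, so input with many distinct values may reverse the ranking. A rough sketch using the standard timeit module; the helper time_once and the two test inputs are made up for illustration:

```python
import timeit
from collections import Counter

def from_set(s):
    return {i: s.count(i) for i in set(s)}

def time_once(func, data, repeat=3):
    """Best-of-`repeat` wall time in seconds for a single call func(data)."""
    return min(timeit.repeat(lambda: func(data), number=1, repeat=repeat))

few_distinct = 'ab' * 1000           # 2000 items, 2 distinct values
many_distinct = list(range(2000))    # 2000 items, 2000 distinct values

for name, data in [('few distinct', few_distinct), ('many distinct', many_distinct)]:
    # from_set makes len(set(data)) passes over data; Counter makes one
    print(name, len(set(data)), time_once(from_set, data), time_once(Counter, data))
```

The printed timings depend on the machine and Python version, so they are best read as a relative comparison only.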
Algorithms
from collections import Counter
from collections import defaultdict
import random, string, numpy, perfplot
def from_set(s):
    """Use the builtin count method for each item in the set."""
    return {i: s.count(i) for i in set(s)}
def counter(s):
    """Use Counter from the collections module."""
    return Counter(s)
def normal_dic(s):
    """Update the dictionary by checking whether the item is already in it."""
    d = {}
    for i in s:
        if i in d:
            d[i] += 1
        else:
            d[i] = 1
    return d
def setdefault_dic(s):
    """Use setdefault to preset unknown keys."""
    d = {}
    for i in s:
        d.setdefault(i, 0)
        d[i] += 1
    return d
def default_dic(s):
    """Use defaultdict from the collections module."""
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d
def try_dic(s):
    """Use try/except to check whether the item is in the dictionary."""
    d = {}
    for i in s:
        try:
            d[i] += 1
        except KeyError:  # first occurrence of this item
            d[i] = 1
    return d
Test code
out = perfplot.bench(
    # random string of length n drawn from uppercase letters and digits
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),
    kernels=[from_set, counter, setdefault_dic, default_dic, try_dic],
    labels=['set', 'counter', 'setdefault', 'defaultdict', 'try_dic'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check=None,
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
out.show()
#out.save("perf.png")
out
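Because equality_check is set to None, the benchmark does not verify that the kernels agree with each other. A small self-contained check, with the kernels restated compactly so that it runs on its own; the sample length of 1000 is an arbitrary choice:

```python
import random
import string
from collections import Counter, defaultdict

def from_set(s):
    return {i: s.count(i) for i in set(s)}

def setdefault_dic(s):
    d = {}
    for i in s:
        d.setdefault(i, 0)
        d[i] += 1
    return d

def default_dic(s):
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d

def try_dic(s):
    d = {}
    for i in s:
        try:
            d[i] += 1
        except KeyError:
            d[i] = 1
    return d

# Same kind of input as the benchmark: uppercase letters and digits
sample = ''.join(random.choices(string.ascii_uppercase + string.digits, k=1000))
expected = dict(Counter(sample))

# Convert each result to a plain dict before comparing
results = [dict(f(sample)) for f in (from_set, setdefault_dic, default_dic, try_dic)]
all_agree = all(r == expected for r in results)
print(all_agree)
```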
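Plots
Absolute values
from_set is labelled 'set' in the plot. Relative performance is easier to read from the relative plot below than from this absolute plot.
Relative values
from_set is labelled 'set' in the plot.
The from_set method is the horizontal line. For larger string lengths, all other methods, including Counter and defaultdict, lie above it (they take more time).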
Table
Actual timings
n setdefault try_dic defaultdict counter from_set
1.0 799.0 899.0 1299.0 6099.0 1399.0
2.0 1099.0 1199.0 1599.0 6299.0 1699.0
4.0 1699.0 1699.0 2199.0 6299.0 2399.0
8.0 3199.0 3099.0 3499.0 6899.0 3699.0
16.0 6099.0 5499.0 5899.0 7899.0 5900.0
32.0 10899.0 9299.0 9899.0 8999.0 10299.0
64.0 20799.0 15599.0 15999.0 11899.0 15099.0
128.0 38499.0 25499.0 25899.0 16599.0 21899.0
256.0 73100.0 44099.0 42700.0 26299.0 30299.0
512.0 137999.0 77099.0 72699.0 43199.0 46699.0
1024.0 286599.0 154500.0 144099.0 85700.0 79699.0
2048.0 549700.0 289999.0 266799.0 157499.0 145699.0
4096.0 1103899.0 577399.0 535499.0 309399.0 278999.0
8192.0 2200099.0 1151500.0 1051799.0 606999.0 542499.0
16384.0 4658199.0 2534399.0 2295300.0 1414199.0 1087799.0
32768.0 9509200.0 5270200.0 4838000.0 3066600.0 2177200.0
65536.0 19539500.0 10806300.0 9942100.0 6503299.0 4337599.0