如何压缩这个字符串压缩代码以使其更高效?
How to condense this string compression code to make it more efficient?
我刚刚为 Cracking the Coding Interview 中的问题 1.6 字符串压缩编写了代码。我想知道如何压缩这段代码以使其更有效率。另外,我想确保此代码为 O(n),因为我没有连接到新字符串。
问题状态:
Implement a method to perform basic string compression using the counts of repeated characters. For example, the string 'aabcccccaaa'
would become a2b1c5a3
. If the "compressed" string would not become smaller than the original string, your method should return the original string. You can assume the string has only uppercase and lowercase letters (a - z).
我的代码有效。我在 else
之后的第一个 if
语句检查字符的计数是否为 1,如果是则仅附加字符。我在检查最终结果的长度和原始字符串以决定要 return.
时这样做
import string
def stringcompress(str1):
res = []
d = dict.fromkeys(string.ascii_letters, 0)
main = str1[0]
for char in range(len(str1)):
if str1[char] == main:
d[main] += 1
else:
if d[main] == 1:
res.append(main)
d[main] = 0
main = str1[char]
d[main] += 1
else:
res.append(main + str(d[main]))
d[main] = 0
main = str1[char]
d[main] += 1
res.append(main + str(d[main]))
return min(''.join(res), str1)
同样,我的代码按预期工作,并按照问题的要求进行操作。我只是想看看有没有哪几行代码我可以拿出来让程序更高效
我用 timeit 模块测试了不同的变体。当我生成不经常重复的测试数据时,您的变体非常有效,但对于短字符串,我的 stringcompress_using_string
是最快的方法。随着字符串变长,一切都颠倒了,你做事的方法变得最快,而 stringcompress_using_string
是最慢的。
这恰恰表明了在不同情况下进行测试的重要性。我的初步结论不完整,并且有更多的测试数据显示了关于这三种方法有效性的真实故事。
import string
import timeit
import random
def stringcompress_original(str1):
res = []
d = dict.fromkeys(string.ascii_letters, 0)
main = str1[0]
for char in range(len(str1)):
if str1[char] == main:
d[main] += 1
else:
if d[main] == 1:
res.append(main)
d[main] = 0
main = str1[char]
d[main] += 1
else:
res.append(main + str(d[main]))
d[main] = 0
main = str1[char]
d[main] += 1
res.append(main + str(d[main]))
return min(''.join(res), str1, key=len)
def stringcompress_using_list(str1):
res = []
count = 0
for i in range(1, len(str1)):
count += 1
if str1[i] is str1[i-1]:
continue
res.append(str1[i-1])
res.append(str(count))
count = 0
res.append(str1[i] + str(count+1))
return min(''.join(res), str1, key=len)
def stringcompress_using_string(str1):
res = ''
count = 0
# we can start at 1 because we already know the first letter is not a repition of any previous letters
for i in range(1, len(str1)):
count += 1
# we keep going through the for loop, until a character does not repeat with the previous one
if str1[i] is str1[i-1]:
continue
# add the character along with the number of times it repeated to the final string
# reset the count
# and we start all over with the next character
res += str1[i-1] + str(count)
count = 0
# add the final character + count
res += str1[i] + str(count+1)
return min(res, str1, key=len)
def generate_test_data(min_length=3, max_length=300, iterations=3000, repeat_chance=.66):
assert repeat_chance > 0 and repeat_chance < 1
data = []
chr = 'a'
for i in range(iterations):
the_str = ''
# create a random string with a random length between min_length and max_length
for j in range( random.randrange(min_length, max_length+1) ):
# if we've decided to not repeat by randomization, then grab a new character,
# otherwise we will continue to use (repeat) the character that was chosen last time
if random.random() > repeat_chance:
chr = random.choice(string.ascii_letters)
the_str += chr
data.append(the_str)
return data
# generate test data beforehand to make sure all of our tests use the same test data
test_data = generate_test_data()
#make sure all of our test functions are doing the algorithm correctly
print('showing that the algorithms all produce the correct output')
print('stringcompress_original: ', stringcompress_original('aabcccccaaa'))
print('stringcompress_using_list: ', stringcompress_using_list('aabcccccaaa'))
print('stringcompress_using_string: ', stringcompress_using_string('aabcccccaaa'))
print()
print('stringcompress_original took', timeit.timeit("[stringcompress_original(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_list took', timeit.timeit("[stringcompress_using_list(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_string took', timeit.timeit("[stringcompress_using_string(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
以下结果均采用英特尔 i7-5700HQ CPU @ 2.70GHz 四核处理器。比较每个 blockquote 中的不同函数,但不要尝试将一个 blockquote 的结果交叉比较到另一个 blockquote,因为测试数据的大小会不同。
使用长字符串
用generate_test_data(10000, 50000, 100, .66)
生成的测试数据
stringcompress_original took 7.346990528497378 seconds
stringcompress_using_list took 7.589927956366313 seconds
stringcompress_using_string took 7.713812443264496 seconds
使用短字符串
用generate_test_data(2, 5, 10000, .66)
生成的测试数据
stringcompress_original took 0.40272931026355685 seconds
stringcompress_using_list took 0.1525574881739265 seconds
stringcompress_using_string took 0.13842854253813164 seconds
10% 的重复字符几率
使用generate_test_data(10, 300, 10000, .10)
生成的测试数据
stringcompress_original took 4.675965586924492 seconds
stringcompress_using_list took 6.081609410376534 seconds
stringcompress_using_string took 5.887430301813865 seconds
90% 的重复字符几率
用generate_test_data(10, 300, 10000, .90)
生成的测试数据
stringcompress_original took 2.6049783549783547 seconds
stringcompress_using_list took 1.9739111725413099 seconds
stringcompress_using_string took 1.9460854974553605 seconds
创建一个像这样的小框架很重要,您可以使用它来测试对算法的更改。通常看似无用的更改会使您的代码运行得更快,因此优化性能时游戏的关键是尝试不同的东西,并对结果进行计时。我敢肯定,如果您尝试进行不同的更改,可以找到更多发现,但这对您要优化的数据类型非常重要——压缩短字符串、长字符串和不重复的字符串与那些经常做的人一样。
我刚刚为 Cracking the Coding Interview 中的问题 1.6 字符串压缩编写了代码。我想知道如何压缩这段代码以使其更有效率。另外,我想确保此代码为 O(n),因为我没有连接到新字符串。
问题状态:
Implement a method to perform basic string compression using the counts of repeated characters. For example, the string
'aabcccccaaa'
would becomea2b1c5a3
. If the "compressed" string would not become smaller than the original string, your method should return the original string. You can assume the string has only uppercase and lowercase letters (a - z).
我的代码有效。我在 else
之后的第一个 if
语句检查字符的计数是否为 1,如果是则仅附加字符。我在检查最终结果的长度和原始字符串以决定要 return.
import string
def stringcompress(str1):
res = []
d = dict.fromkeys(string.ascii_letters, 0)
main = str1[0]
for char in range(len(str1)):
if str1[char] == main:
d[main] += 1
else:
if d[main] == 1:
res.append(main)
d[main] = 0
main = str1[char]
d[main] += 1
else:
res.append(main + str(d[main]))
d[main] = 0
main = str1[char]
d[main] += 1
res.append(main + str(d[main]))
return min(''.join(res), str1)
同样,我的代码按预期工作,并按照问题的要求进行操作。我只是想看看有没有哪几行代码我可以拿出来让程序更高效
我用 timeit 模块测试了不同的变体。当我生成不经常重复的测试数据时,您的变体非常有效,但对于短字符串,我的 stringcompress_using_string
是最快的方法。随着字符串变长,一切都颠倒了,你做事的方法变得最快,而 stringcompress_using_string
是最慢的。
这恰恰表明了在不同情况下进行测试的重要性。我的初步结论不完整,并且有更多的测试数据显示了关于这三种方法有效性的真实故事。
import string
import timeit
import random
def stringcompress_original(str1):
res = []
d = dict.fromkeys(string.ascii_letters, 0)
main = str1[0]
for char in range(len(str1)):
if str1[char] == main:
d[main] += 1
else:
if d[main] == 1:
res.append(main)
d[main] = 0
main = str1[char]
d[main] += 1
else:
res.append(main + str(d[main]))
d[main] = 0
main = str1[char]
d[main] += 1
res.append(main + str(d[main]))
return min(''.join(res), str1, key=len)
def stringcompress_using_list(str1):
res = []
count = 0
for i in range(1, len(str1)):
count += 1
if str1[i] is str1[i-1]:
continue
res.append(str1[i-1])
res.append(str(count))
count = 0
res.append(str1[i] + str(count+1))
return min(''.join(res), str1, key=len)
def stringcompress_using_string(str1):
res = ''
count = 0
# we can start at 1 because we already know the first letter is not a repition of any previous letters
for i in range(1, len(str1)):
count += 1
# we keep going through the for loop, until a character does not repeat with the previous one
if str1[i] is str1[i-1]:
continue
# add the character along with the number of times it repeated to the final string
# reset the count
# and we start all over with the next character
res += str1[i-1] + str(count)
count = 0
# add the final character + count
res += str1[i] + str(count+1)
return min(res, str1, key=len)
def generate_test_data(min_length=3, max_length=300, iterations=3000, repeat_chance=.66):
assert repeat_chance > 0 and repeat_chance < 1
data = []
chr = 'a'
for i in range(iterations):
the_str = ''
# create a random string with a random length between min_length and max_length
for j in range( random.randrange(min_length, max_length+1) ):
# if we've decided to not repeat by randomization, then grab a new character,
# otherwise we will continue to use (repeat) the character that was chosen last time
if random.random() > repeat_chance:
chr = random.choice(string.ascii_letters)
the_str += chr
data.append(the_str)
return data
# generate test data beforehand to make sure all of our tests use the same test data
test_data = generate_test_data()
#make sure all of our test functions are doing the algorithm correctly
print('showing that the algorithms all produce the correct output')
print('stringcompress_original: ', stringcompress_original('aabcccccaaa'))
print('stringcompress_using_list: ', stringcompress_using_list('aabcccccaaa'))
print('stringcompress_using_string: ', stringcompress_using_string('aabcccccaaa'))
print()
print('stringcompress_original took', timeit.timeit("[stringcompress_original(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_list took', timeit.timeit("[stringcompress_using_list(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_string took', timeit.timeit("[stringcompress_using_string(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
以下结果均采用英特尔 i7-5700HQ CPU @ 2.70GHz 四核处理器。比较每个 blockquote 中的不同函数,但不要尝试将一个 blockquote 的结果交叉比较到另一个 blockquote,因为测试数据的大小会不同。
使用长字符串
用generate_test_data(10000, 50000, 100, .66)
stringcompress_original took 7.346990528497378 seconds
stringcompress_using_list took 7.589927956366313 seconds
stringcompress_using_string took 7.713812443264496 seconds
使用短字符串
用generate_test_data(2, 5, 10000, .66)
stringcompress_original took 0.40272931026355685 seconds
stringcompress_using_list took 0.1525574881739265 seconds
stringcompress_using_string took 0.13842854253813164 seconds
10% 的重复字符几率
使用generate_test_data(10, 300, 10000, .10)
stringcompress_original took 4.675965586924492 seconds
stringcompress_using_list took 6.081609410376534 seconds
stringcompress_using_string took 5.887430301813865 seconds
90% 的重复字符几率
用generate_test_data(10, 300, 10000, .90)
stringcompress_original took 2.6049783549783547 seconds
stringcompress_using_list took 1.9739111725413099 seconds
stringcompress_using_string took 1.9460854974553605 seconds
创建一个像这样的小框架很重要,您可以使用它来测试对算法的更改。通常看似无用的更改会使您的代码运行得更快,因此优化性能时游戏的关键是尝试不同的东西,并对结果进行计时。我敢肯定,如果您尝试进行不同的更改,可以找到更多发现,但这对您要优化的数据类型非常重要——压缩短字符串、长字符串和不重复的字符串与那些经常做的人一样。