如何压缩这个字符串压缩代码以使其更高效?

How to condense this string compression code to make it more efficient?

我刚刚为 Cracking the Coding Interview 中的问题 1.6 字符串压缩编写了代码。我想知道如何压缩这段代码以使其更有效率。另外,我想确保此代码为 O(n),因为我没有连接到新字符串。

问题状态:

Implement a method to perform basic string compression using the counts of repeated characters. For example, the string 'aabcccccaaa' would become a2b1c5a3. If the "compressed" string would not become smaller than the original string, your method should return the original string. You can assume the string has only uppercase and lowercase letters (a - z).

我的代码有效。我在 else 之后的第一个 if 语句检查字符的计数是否为 1,如果是则仅附加字符。我在检查最终结果的长度和原始字符串以决定要 return.

时这样做
import string


def stringcompress(str1):
    res = []
    d = dict.fromkeys(string.ascii_letters, 0)
    main = str1[0]
    for char in range(len(str1)):
        if str1[char] == main:
            d[main] += 1
        else:
            if d[main] == 1:
                res.append(main)
                d[main] = 0
                main = str1[char]
                d[main] += 1
            else:
                res.append(main + str(d[main]))
                d[main] = 0
                main = str1[char]
                d[main] += 1

    res.append(main + str(d[main]))
    return min(''.join(res), str1)

同样,我的代码按预期工作,并按照问题的要求进行操作。我只是想看看有没有哪几行代码我可以拿出来让程序更高效

我用 timeit 模块测试了不同的变体。当我生成不经常重复的测试数据时,您的变体非常有效,但对于短字符串,我的 stringcompress_using_string 是最快的方法。随着字符串变长,一切都颠倒了,你做事的方法变得最快,而 stringcompress_using_string 是最慢的。

这恰恰表明了在不同情况下进行测试的重要性。我的初步结论不完整,并且有更多的测试数据显示了关于这三种方法有效性的真实故事。

import string
import timeit
import random

def stringcompress_original(str1):
    res = []
    d = dict.fromkeys(string.ascii_letters, 0)
    main = str1[0]
    for char in range(len(str1)):
        if str1[char] == main:
            d[main] += 1
        else:
            if d[main] == 1:
                res.append(main)
                d[main] = 0
                main = str1[char]
                d[main] += 1
            else:
                res.append(main + str(d[main]))
                d[main] = 0
                main = str1[char]
                d[main] += 1

    res.append(main + str(d[main]))
    return min(''.join(res), str1, key=len)

def stringcompress_using_list(str1):
    res = []

    count = 0
    for i in range(1, len(str1)):
        count += 1

        if str1[i] is str1[i-1]:
            continue

        res.append(str1[i-1])
        res.append(str(count))
        count = 0
    res.append(str1[i] + str(count+1))

    return min(''.join(res), str1, key=len)

def stringcompress_using_string(str1):
    res = ''

    count = 0
    # we can start at 1 because we already know the first letter is not a repition of any previous letters
    for i in range(1, len(str1)):
        count += 1

        # we keep going through the for loop, until a character does not repeat with the previous one
        if str1[i] is str1[i-1]:
            continue

        # add the character along with the number of times it repeated to the final string
        # reset the count
        # and we start all over with the next character
        res += str1[i-1] + str(count)
        count = 0
    # add the final character + count
    res += str1[i] + str(count+1)

    return min(res, str1, key=len)

def generate_test_data(min_length=3, max_length=300, iterations=3000, repeat_chance=.66):
    assert repeat_chance > 0 and repeat_chance < 1
    data = []
    chr = 'a'
    for i in range(iterations):
        the_str = ''
        # create a random string with a random length between min_length and max_length
        for j in range( random.randrange(min_length, max_length+1) ):
            # if we've decided to not repeat by randomization, then grab a new character,
            # otherwise we will continue to use (repeat) the character that was chosen last time
            if random.random() > repeat_chance:
                chr = random.choice(string.ascii_letters)
            the_str += chr
        data.append(the_str)
    return data


# generate test data beforehand to make sure all of our tests use the same test data
test_data = generate_test_data()

#make sure all of our test functions are doing the algorithm correctly
print('showing that the algorithms all produce the correct output')
print('stringcompress_original: ', stringcompress_original('aabcccccaaa'))
print('stringcompress_using_list: ', stringcompress_using_list('aabcccccaaa'))
print('stringcompress_using_string: ', stringcompress_using_string('aabcccccaaa'))
print()

print('stringcompress_original took', timeit.timeit("[stringcompress_original(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_list took', timeit.timeit("[stringcompress_using_list(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_string took', timeit.timeit("[stringcompress_using_string(x) for x in test_data]", number=10, globals=globals()), ' seconds' )

以下结果均采用英特尔 i7-5700HQ CPU @ 2.70GHz 四核处理器。比较每个 blockquote 中的不同函数,但不要尝试将一个 blockquote 的结果交叉比较到另一个 blockquote,因为测试数据的大小会不同。


使用长字符串

generate_test_data(10000, 50000, 100, .66)

生成的测试数据

stringcompress_original took 7.346990528497378 seconds
stringcompress_using_list took 7.589927956366313 seconds
stringcompress_using_string took 7.713812443264496 seconds


使用短字符串

generate_test_data(2, 5, 10000, .66)

生成的测试数据

stringcompress_original took 0.40272931026355685 seconds
stringcompress_using_list took 0.1525574881739265 seconds
stringcompress_using_string took 0.13842854253813164 seconds


10% 的重复字符几率

使用generate_test_data(10, 300, 10000, .10)

生成的测试数据

stringcompress_original took 4.675965586924492 seconds
stringcompress_using_list took 6.081609410376534 seconds
stringcompress_using_string took 5.887430301813865 seconds


90% 的重复字符几率

generate_test_data(10, 300, 10000, .90)

生成的测试数据

stringcompress_original took 2.6049783549783547 seconds
stringcompress_using_list took 1.9739111725413099 seconds
stringcompress_using_string took 1.9460854974553605 seconds


创建一个像这样的小框架很重要,您可以使用它来测试对算法的更改。通常看似无用的更改会使您的代码运行得更快,因此优化性能时游戏的关键是尝试不同的东西,并对结果进行计时。我敢肯定,如果您尝试进行不同的更改,可以找到更多发现,但这对您要优化的数据类型非常重要——压缩短字符串、长字符串和不重复的字符串与那些经常做的人一样。