由 40 个不同字符组成的 400 个字符可以压缩多少?

How much can I compress 400 characters made up of 40 different characters?

我有一个 400 个字符的字符串,由 40 个可能的字符组成。这 40 个字符是 a-z A-N。在这个例子中我有:

nnnnnnnnnnnnnnnnnnnnnnnnunnnnnnnnnnnnuuunnnuxunnnnnnnnnnuxddnnnuxduuuuuunnnnuxddnnnuxddddddunnnnuxddnuuuxdddddduuuuuuxddnnnuxduuudducuccuxddnnnuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduiuiiuxddnnnuxdddddducuccuxddnnnuxdddddduiuiiuxddnnnuxduuudducuccuxddnuuuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduuuuuuxddnnnuxxxxxxxunnnnuxxxnnnnuuuuuuuunnnnnuuunnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

我需要用 URL 安全字符制作压缩版本。我使用 a-z A-N0-9 - . _ ~ ( ) ' ! * : @ . ; 来表示原始字符串中的字符最多重复了 23 次。如果字符重复超过 23 次,则字符和数字表示将添加到之前的压缩字符串中。

压缩后得到的字符串为246个字符:

n;un-u1n1uxun8uxd0n1uxdu4n2uxd0n1uxd4un2uxd0nu1xd4u4xd0n1uxdu1d0ucuc0uxd0n1uxdunod0uiui0uxd0n1uxduo0d0ucuc0uxd0n1uxd4uiui0uxd0n1uxd4ucuc0uxd0n1uxd4uiui0uxd0n1uxdu1d0ucuc0uxd0nu1xdunod0uiui0uxd0n1uxduo0d0ucuc0uxd0n1uxd4u4xd0n1ux5un2ux1n2u6n3u1n;n(

压缩原字符串的Javascript函数为:

// 40 single URL-safe characters 
let singleCharList = [
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h",
"i",
"j",
"k",
"l",
"m",
"n",
"o",
"p",
"q",
"r",
"s",
"t",
"u",
"v",
"w",
"x",
"y",
"z",
"A",
"B",
"C",
"D",
"E",
"F",
"G",
"H",
"I",
"J",
"K",
"L",
"M",
"N",
]

// 23 single URL-safe characters for counting (0 is actually 1)
let singleCharCountList = [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"-",
".",
"_",
"~",
"(",
")",
"'",
"!",
"*",
":",
"@",
",",
";",
]

function compressString(stringToCompress) {
  var compressedString = ""
  var lastChar = ""
  var inARowCount = 0
  var maxInARowCount = singleCharCountList.length
  for (i = 0; i < stringToCompress.length + 1; ++i) {
let currentChar = stringToCompress.charAt(i)
if (currentChar === lastChar && inARowCount < maxInARowCount) {
  inARowCount += 1
} else {
  if (inARowCount > 0) {
    singleCharCount = singleCharCountList[inARowCount - 1]
    compressedString = compressedString + lastChar + singleCharCount
  } else {
    compressedString = compressedString + lastChar
  }
  inARowCount = 0
}
lastChar = currentChar
  }
  return compressedString
}

// To inflate the string back to its original 400 characters, I have: 

function inflateString(stringToInflate) {
  var inflatedString = ""
  var lastChar = ""
  for (i = 0; i < stringToInflate.length; ++i) {
let currentChar = stringToInflate.charAt(i)
// currentChar is part of singleCharCountList 
// and lastChar is part of singleCharList
if (singleCharCountList.indexOf(currentChar) > -1 &&
  singleCharList.indexOf(lastChar) > -1) {
  var loop = singleCharCountList.indexOf(currentChar) + 1
  for (var j = 0; j < loop; ++j) {
    inflatedString = inflatedString + lastChar
  }
}
// currentChar is part of singleCharList
// and lastChar is part of singleCharCountList
else if (singleCharList.indexOf(currentChar) > -1 &&
  singleCharCountList.indexOf(lastChar) > -1) {
  inflatedString = inflatedString + currentChar
}
// currentChar is part of singleCharList
// and lastChar is part of singleCharList
else if (singleCharList.indexOf(currentChar) > -1 &&
  singleCharList.indexOf(lastChar) > -1) {
  inflatedString = inflatedString + currentChar
}
// current char is part of singleCharList
// and lastChar is not part of singleCharList or singleCharCountList
else if (singleCharList.indexOf(currentChar) > -1 &&
  singleCharList.indexOf(lastChar) == -1 &&
  singleCharCountList.indexOf(lastChar) == -1) {
  inflatedString = inflatedString + currentChar
}
lastChar = currentChar
  }
  return inflatedString
}


const str = 'nnnnnnnnnnnnnnnnnnnnnnnnunnnnnnnnnnnnuuunnnuxunnnnnnnnnnuxddnnnuxduuuuuunnnnuxddnnnuxddddddunnnnuxddnuuuxdddddduuuuuuxddnnnuxduuudducuccuxddnnnuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduiuiiuxddnnnuxdddddducuccuxddnnnuxdddddduiuiiuxddnnnuxduuudducuccuxddnuuuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduuuuuuxddnnnuxxxxxxxunnnnuxxxnnnnuuuuuuuunnnnnuuunnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn';

const compressed = compressString(str);
console.log(compressed.length);
const inflate = inflateString(compressed)
console.log(str.length,inflateString.length)

我可以做得更好吗?我对我的示例中的 246 比较满意,但这可能会因原始内容而有很大差异。

首先使用标准压缩工具压缩字符串,例如zlib。您的示例使用 zlib 从 400 字节压缩到 93。然后将二进制结果编码为 75 个字符的限制集,这会将其扩展约 30%,从而产生约 120 个字节。仍然比原来的 400 显着压缩。

该编码可以通过多种方式完成,但最快的可能是反向霍夫曼编码,其中您使用二进制表示 75 个符号的平坦概率分布的霍夫曼代码序列。您最终得到 53 个 6 位代码和 22 个 7 位代码。这给出了每个符号 6.29 位,接近最佳的每个符号 6.23 位。

在 运行 长度编码上尝试开发自己的、不合标准的压缩方法是浪费时间。