设计像 TinyURL 这样的 URL 缩短服务
Designing a URL Shortening service like TinyURL
我正在阅读解释如何设计 url 缩短服务的在线文档。该网站是 https://www.educative.io/courses/grokking-the-system-design-interview 。
在实际编码部分 URL,他们说 ->
"We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for displaying. This encoding could be base36 ([a-z ,0-9]) or base62 ([A-Z, a-z, 0-9]) and if we add ‘+’ and ‘/’ we can use Base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8, or 10 characters."
"If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value).Since we only have space for 8 characters per short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication, to resolve that, we can choose some other characters out of the encoding string or swap some characters."
我使用在线 MD5 哈希生成器 (http://onlinemd5.com/) and Base64 encoder (https://www.base64encode.org/) 来验证以上内容。我使用 "www.yahoo.com" 作为 MD5 哈希的输入字符串,输出为 1B03577ED104F16AADC00A639D33CB44 。然后我对其进行了 Base64 编码并得到了 MUIwMzU3N0VEMTA0RjE2QUFEQzAwQTYzOUQzM0NCNDQ= 以及 UTF-8 目标字符集和 Unix 换行符。
任何人都可以解释一下我做的是否正确吗?我看到字符数远远超过 21.
问题是您使用 MD5 的输出作为十六进制数字的字符串,然后对该字符串进行 base64 编码。没有理由对该字符串进行 base64 编码 - base64 编码适用于二进制数据。您可能想要做的是 base64 MD5 哈希的实际 128 位二进制值。这是一些 Python 代码,可以执行我认为您正在尝试执行的操作:
import hashlib, base64
text = "www.yahoo.com"
text_utf8 = text.encode('utf8')
md5 = hashlib.md5(text_utf8).digest()
b64 = base64.b64encode(md5)
print(b64)
这会得到结果 GwNXftEE8WqtwApjnTPLRA
,其长度符合您的预期。
这篇post帮助我真正理解了TinyUrl design背后的算法
这是示例 java 程序,如果有人正在寻找的话。
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
public class TinyUrlAlgorithm {
public static void main(String[] args) throws NoSuchAlgorithmException, UnsupportedEncodingException {
MessageDigest messageDigest = MessageDigest.getInstance("MD5");
byte[] input = "https://www.educative.io/courses/grokking-the-system-design-interview".getBytes("UTF-8");//args[0].getBytes("UTF-16");//"https://www.educative.io/courses/grokking-the-system-design-interview".getBytes("UTF-8");
byte[] md5hash = messageDigest.digest(input);
Base64.Encoder encoder = Base64.getEncoder();
String encodeToString = encoder.encodeToString(md5hash);
int shortKeyType = 3;//Integer.parseInt(args[1]);
String tinyUrlKey = shortKeyType ==1 ? encodeToString.substring(0,6) : shortKeyType == 2 ? encodeToString.substring(0,8) : randomlySelect8Chars(encodeToString);
System.out.println("ShortKey --> " + tinyUrlKey);
}
//Fisher yates algorithm
private static String randomlySelect8Chars(String encodeToString) {
Random random = ThreadLocalRandom.current();
char[] encodedChars = encodeToString.toCharArray();
assert encodedChars.length == 21;
for(int i=20; i >=0; i--) {
int randomIndex = random.nextInt(i+1);
swap(encodedChars,randomIndex,i);
}
return new String(encodedChars,0,8);
}
private static void swap(char[] chars, int i, int j) {
char temp = chars[i];
chars[i] = chars[j];
chars[j] = temp;
}
}
我正在阅读解释如何设计 url 缩短服务的在线文档。该网站是 https://www.educative.io/courses/grokking-the-system-design-interview 。
在实际编码部分 URL,他们说 -> "We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for displaying. This encoding could be base36 ([a-z ,0-9]) or base62 ([A-Z, a-z, 0-9]) and if we add ‘+’ and ‘/’ we can use Base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8, or 10 characters."
"If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value).Since we only have space for 8 characters per short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication, to resolve that, we can choose some other characters out of the encoding string or swap some characters."
我使用在线 MD5 哈希生成器 (http://onlinemd5.com/) and Base64 encoder (https://www.base64encode.org/) 来验证以上内容。我使用 "www.yahoo.com" 作为 MD5 哈希的输入字符串,输出为 1B03577ED104F16AADC00A639D33CB44 。然后我对其进行了 Base64 编码并得到了 MUIwMzU3N0VEMTA0RjE2QUFEQzAwQTYzOUQzM0NCNDQ= 以及 UTF-8 目标字符集和 Unix 换行符。
任何人都可以解释一下我做的是否正确吗?我看到字符数远远超过 21.
问题是您使用 MD5 的输出作为十六进制数字的字符串,然后对该字符串进行 base64 编码。没有理由对该字符串进行 base64 编码 - base64 编码适用于二进制数据。您可能想要做的是 base64 MD5 哈希的实际 128 位二进制值。这是一些 Python 代码,可以执行我认为您正在尝试执行的操作:
import hashlib, base64
text = "www.yahoo.com"
text_utf8 = text.encode('utf8')
md5 = hashlib.md5(text_utf8).digest()
b64 = base64.b64encode(md5)
print(b64)
这会得到结果 GwNXftEE8WqtwApjnTPLRA
,其长度符合您的预期。
这篇post帮助我真正理解了TinyUrl design背后的算法 这是示例 java 程序,如果有人正在寻找的话。
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
public class TinyUrlAlgorithm {
public static void main(String[] args) throws NoSuchAlgorithmException, UnsupportedEncodingException {
MessageDigest messageDigest = MessageDigest.getInstance("MD5");
byte[] input = "https://www.educative.io/courses/grokking-the-system-design-interview".getBytes("UTF-8");//args[0].getBytes("UTF-16");//"https://www.educative.io/courses/grokking-the-system-design-interview".getBytes("UTF-8");
byte[] md5hash = messageDigest.digest(input);
Base64.Encoder encoder = Base64.getEncoder();
String encodeToString = encoder.encodeToString(md5hash);
int shortKeyType = 3;//Integer.parseInt(args[1]);
String tinyUrlKey = shortKeyType ==1 ? encodeToString.substring(0,6) : shortKeyType == 2 ? encodeToString.substring(0,8) : randomlySelect8Chars(encodeToString);
System.out.println("ShortKey --> " + tinyUrlKey);
}
//Fisher yates algorithm
private static String randomlySelect8Chars(String encodeToString) {
Random random = ThreadLocalRandom.current();
char[] encodedChars = encodeToString.toCharArray();
assert encodedChars.length == 21;
for(int i=20; i >=0; i--) {
int randomIndex = random.nextInt(i+1);
swap(encodedChars,randomIndex,i);
}
return new String(encodedChars,0,8);
}
private static void swap(char[] chars, int i, int j) {
char temp = chars[i];
chars[i] = chars[j];
chars[j] = temp;
}
}