Java serialization with empty strings and substrings
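I have looked through the implementation and have not come up with an explanation for this yet, but maybe someone here will know.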
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SubstringTest {
    public static void main(String[] args) throws Exception {
        List<String> emptyStrings = new ArrayList<String>();
        List<String> emptySubStrings = new ArrayList<String>();
        for (int i = 0; i < 20000; i++) {
            String actuallyEmpty = "";
            String subStringedEmpty = " ";
            subStringedEmpty = subStringedEmpty.substring(0, 0);
            emptyStrings.add(actuallyEmpty);
            emptySubStrings.add(subStringedEmpty);
        }
        System.out.println("Substring test");

        // Write to files
        long time = System.currentTimeMillis();
        writeObjectToFile(emptyStrings, "empty.list");
        System.out.println("Time taken to write empty list " + (System.currentTimeMillis() - time));
        time = System.currentTimeMillis();
        writeObjectToFile(emptySubStrings, "substring.list");
        System.out.println("Time taken to write substring list " + (System.currentTimeMillis() - time));

        // Read from files
        time = System.currentTimeMillis();
        List<String> readEmptyString = readObjectFromFile("empty.list");
        System.out.println("Time taken to read empty list " + (System.currentTimeMillis() - time));
        time = System.currentTimeMillis();
        List<String> readEmptySubStrings = readObjectFromFile("substring.list");
        System.out.println("Time taken to read substring list " + (System.currentTimeMillis() - time));
    }

    private static void writeObjectToFile(Object o, String file) throws Exception {
        // try-with-resources flushes and closes the stream even if writeObject throws
        try (ObjectOutputStream oout = new ObjectOutputStream(new FileOutputStream(file))) {
            oout.writeObject(o);
        }
    }

    @SuppressWarnings("unchecked")
    private static <T> T readObjectFromFile(String file) throws Exception {
        // try-with-resources avoids a NullPointerException in cleanup if the stream fails to open
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(file))) {
            return (T) ois.readObject();
        }
    }
}
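In the end both lists contain 20,000 empty strings (one list holds the "" literal, the other holds empty strings produced by substring(0,0)). However, if you check the sizes of the resulting serialized files (empty.list and substring.list), you will notice that empty.list contains more data.
I have also noticed that callers of a remote EJB that deserialize these substring objects seem to have serious performance problems.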
empty.list contains one String object and a large number of references to it.
substring.list contains 20,000 String objects whose contents are all equal.
You can "fix" this by intern()ing the strings.
private void verify(String name, Supplier<String> stringSupplier) throws IOException, ClassNotFoundException {
    List<String> inputStrings = new ArrayList<String>();
    inputStrings.add(stringSupplier.get());
    inputStrings.add(stringSupplier.get());

    ByteArrayOutputStream boas = new ByteArrayOutputStream();
    ObjectOutputStream emptyOut = new ObjectOutputStream(boas);
    emptyOut.writeObject(inputStrings);
    emptyOut.flush();

    ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(boas.toByteArray()));
    List<String> returnedStrings = (List<String>) ois.readObject();

    // Identity comparison on purpose: we want to know whether both slots hold the same object
    if (returnedStrings.get(0) == returnedStrings.get(1)) {
        System.out.println(name + " contains the same object");
    } else {
        System.out.println(name + " contains DIFFERENT objects");
    }
}

@Test
public void test() throws IOException, ClassNotFoundException {
    verify("empty string", new Supplier<String>() {
        @Override
        public String get() {
            return "";
        }
    });
    verify("sub string", new Supplier<String>() {
        @Override
        public String get() {
            String data = " ";
            return data.substring(0, 0);
        }
    });
    verify("intern()ed substring", new Supplier<String>() {
        @Override
        public String get() {
            String data = " ";
            return data.substring(0, 0).intern();
        }
    });
}
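To check the claim that one list holds a single String object while the other holds many, the number of distinct object identities in each list can be counted before serializing. This is only a sketch: the class and method names are made up for illustration, and it uses new String("") as a stand-in for the substring call, on the assumption that substring(0, 0) may return the shared "" constant on some newer JDKs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class IdentityCount {
    // Counts distinct object identities (==), ignoring equals().
    static int distinctObjects(List<String> strings) {
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.addAll(strings);
        return seen.size();
    }

    // shared = true adds the "" literal every time (one object);
    // shared = false adds a fresh empty String each time (assumed stand-in
    // for an empty substring that is its own object).
    static List<String> build(boolean shared, int n) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            list.add(shared ? "" : new String(""));
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println("literal: " + distinctObjects(build(true, 20000)));
        System.out.println("copies:  " + distinctObjects(build(false, 20000)));
    }
}
```

The identity-based set is used because a regular HashSet would collapse all empty strings into one entry regardless of how many objects exist.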
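The lists differ in size because Java uses a mechanism to store multiple references to the same object, as described in the ObjectOutputStream documentation: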
References to other objects (except in transient or static fields)
cause those objects to be written also. Multiple references to a
single object are encoded using a reference sharing mechanism so that
graphs of objects can be restored to the same shape as when the
original was written.
If you look at the generated serialized files, you will see the following.
With 1 empty string in each list:
empty.list:
ac ed 00 05 73 72 00 13 6a 61 76 61 2e 75 74 69
6c 2e 41 72 72 61 79 4c 69 73 74 78 81 d2 1d 99
c7 61 9d 03 00 01 49 00 04 73 69 7a 65 78 70 00
00 00 01 77 04 00 00 00 01 74 00 00 78
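The string "" corresponds to the three bytes 74 00 00 near the end (a string tag followed by a two-byte length of 0); the final 78 marks the end of the list's data.
substring.list: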
ac ed 00 05 73 72 00 13 6a 61 76 61 2e 75 74 69
6c 2e 41 72 72 61 79 4c 69 73 74 78 81 d2 1d 99
c7 61 9d 03 00 01 49 00 04 73 69 7a 65 78 70 00
00 00 01 77 04 00 00 00 01 74 00 00 78
Note that the files generated with a single element are identical.
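Dumps like these can be reproduced without writing to disk. The following is a minimal sketch, with hypothetical class and helper names, that serializes a one-element list into memory and prints the bytes sixteen per line; the exact bytes may vary with the JDK's ArrayList implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class HexDump {
    // Serializes an object graph into a byte array.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Formats bytes as lowercase hex, 16 per line, separated by spaces.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(i % 16 == 0 ? '\n' : ' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> one = new ArrayList<>();
        one.add("");
        System.out.println(hex(serialize(one)));
    }
}
```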
But if we add the same object more than once, we see different behavior.
Look at the corresponding files with the string added 2 times.
empty.list:
ac ed 00 05 73 72 00 13 6a 61 76 61 2e 75 74 69
6c 2e 41 72 72 61 79 4c 69 73 74 78 81 d2 1d 99
c7 61 9d 03 00 01 49 00 04 73 69 7a 65 78 70 00
00 00 02 77 04 00 00 00 02 74 00 00 71 00 7e 00
02 78
substring.list
ac ed 00 05 73 72 00 13 6a 61 76 61 2e 75 74 69
6c 2e 41 72 72 61 79 4c 69 73 74 78 81 d2 1d 99
c7 61 9d 03 00 01 49 00 04 73 69 7a 65 78 70 00
00 00 02 77 04 00 00 00 02 74 00 00 74 00 00 78
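Note that the substring list stays "normal": two unrelated strings are written as two separate objects. The empty list instead carries extra bytes to encode a reference back to the first string.
The tail of the substring list is six bytes (00 00 74 00 00 78), while the tail of the empty list is eight bytes (00 00 71 00 7e 00 02 78). The second, independent string costs three bytes (74 00 00), whereas the back-reference costs five (the tag 71 followed by the four-byte handle 00 7e 00 02).
This is unfortunate for size, because every repeated string you add costs those extra bytes. By the time you fill the ArrayList there are a lot of extra bytes, all so that the object graph can be rebuilt in its original shape.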
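The per-element cost can be measured directly. The sketch below, with hypothetical names, serializes lists of increasing length in memory and compares sizes; it uses new String("") to force distinct objects, on the assumption that substring(0, 0) may return the shared "" constant on some newer JDKs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SizePerElement {
    // Returns the number of bytes the serialized list occupies.
    static int serializedSize(List<String> list) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(list);
        }
        return bytes.size();
    }

    // shared = true reuses the "" literal; shared = false creates a new
    // empty String per element.
    static List<String> build(boolean shared, int n) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            list.add(shared ? "" : new String(""));
        }
        return list;
    }

    public static void main(String[] args) throws IOException {
        for (int n : new int[] {1, 2, 20000}) {
            System.out.println(n + " elements: shared=" + serializedSize(build(true, n))
                    + " bytes, distinct=" + serializedSize(build(false, n)) + " bytes");
        }
    }
}
```

The gap only favors distinct objects for very short strings: a new string costs three bytes plus its encoded characters, so for anything longer than a character or two the five-byte back-reference is the smaller encoding.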
If you want to know why that sharing mechanism exists, I suggest you look at this question:
What is the meaning of reference sharing in Serialization? How Enums are Serialized?