Python 3: read/write compressed JSON objects from/to a gzip file
For Python 3, following @Martijn Pieters's code, I ended up with this:
import gzip
import json

# writing
with gzip.GzipFile(jsonfilename, 'w') as fout:
    for i in range(N):
        uid = "whatever%i" % i
        dv = [1, 2, 3]
        data = json.dumps({
            'what': uid,
            'where': dv})
        fout.write(data + '\n')
But this results in an error:
Traceback (most recent call last):
  ...
  File "C:\Users\Think\my_json.py", line 118, in write_json
    fout.write(data + '\n')
  File "C:\Users\Think\Anaconda3\lib\gzip.py", line 258, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
Any thoughts about what is going on?
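You have four steps of transformation here.
- a Python data structure (nested dicts, lists, strings, numbers, booleans)
- a Python string containing a serialized representation of that data structure ("JSON")
- a sequence of bytes containing a representation of that string ("UTF-8")
- a sequence of bytes containing a shorter representation of the previous byte sequence ("gzip")
So let's take these steps one by one.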
import gzip
import json

data = []
for i in range(N):
    uid = "whatever%i" % i
    dv = [1, 2, 3]
    data.append({
        'what': uid,
        'where': dv
    })                                           # 1. data

json_str = json.dumps(data) + "\n"               # 2. string (i.e. JSON)
json_bytes = json_str.encode('utf-8')            # 3. bytes (i.e. UTF-8)

with gzip.open(jsonfilename, 'w') as fout:       # 4. fewer bytes (i.e. gzip)
    fout.write(json_bytes)
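Note that adding "\n" is completely superfluous here. It does not break anything, but it serves no purpose either. I added it only because you have it in your code sample.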
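If the newline in the question was meant to separate records, that is, one JSON object per line, the original loop can be kept by encoding each line before writing it. Below is a minimal sketch under that assumption. The helper names `write_json_lines` and `read_json_lines`, the file name, and the sample data are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def write_json_lines(path, records):
    """Write one JSON object per line to a gzip file opened in binary mode."""
    with gzip.open(path, 'wb') as fout:
        for record in records:
            line = json.dumps(record) + '\n'   # str
            fout.write(line.encode('utf-8'))   # bytes, which is what gzip expects


def read_json_lines(path):
    """Read the file back, decoding and parsing one line at a time."""
    with gzip.open(path, 'rb') as fin:
        return [json.loads(line.decode('utf-8')) for line in fin]


# Hypothetical sample data mirroring the question's loop.
records = [{'what': 'whatever%i' % i, 'where': [1, 2, 3]} for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), 'records.jsonl.gz')
write_json_lines(path, records)
restored = read_json_lines(path)
```

In this layout the newline is no longer superfluous: it is the record separator that lets the reader parse the file line by line instead of loading one large document.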
Reading works exactly the other way around:
with gzip.open(jsonfilename, 'r') as fin:        # 4. gzip
    json_bytes = fin.read()                      # 3. bytes (i.e. UTF-8)

json_str = json_bytes.decode('utf-8')            # 2. string (i.e. JSON)
data = json.loads(json_str)                      # 1. data

print(data)
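To see the four stages as concrete objects, here is a small in-memory sketch. It uses `gzip.compress` and `gzip.decompress` instead of a file, and the sample data is made up for illustration.

```python
import gzip
import json

# 1. data (hypothetical sample, shaped like the question's records)
data = [{'what': 'whatever%i' % i, 'where': [1, 2, 3]} for i in range(1000)]

json_str = json.dumps(data)               # 2. str (JSON)
json_bytes = json_str.encode('utf-8')     # 3. bytes (UTF-8)
gz_bytes = gzip.compress(json_bytes)      # 4. fewer bytes (gzip)

# Print the type and length of each intermediate representation.
print(type(json_str).__name__, len(json_str))
print(type(json_bytes).__name__, len(json_bytes))
print(type(gz_bytes).__name__, len(gz_bytes))

# Going back reverses each step in turn.
round_trip = json.loads(gzip.decompress(gz_bytes).decode('utf-8'))
```

The error in the question comes from handing the stage 2 object (a `str`) to a writer that expects stage 3 (`bytes`).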
Of course the steps can be combined:
with gzip.open(jsonfilename, 'w') as fout:
    fout.write(json.dumps(data).encode('utf-8'))
and
with gzip.open(jsonfilename, 'r') as fin:
    data = json.loads(fin.read().decode('utf-8'))
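The solution mentioned here (thanks, @Rafe) has one big advantage: because the encoding is done on the fly, you do not create two complete intermediate string objects for the generated JSON. For large objects, this saves memory.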
with gzip.open(jsonfilename, 'wt', encoding='UTF-8') as zipfile:
    json.dump(data, zipfile)
Reading and decoding is just as simple:
with gzip.open(jsonfilename, 'rt', encoding='UTF-8') as zipfile:
    my_object = json.load(zipfile)
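Here is a self-contained round trip of the text-mode variant, as a sketch. The temporary path and the sample data are assumptions for illustration. It also reads the same file in binary mode, since both approaches should produce gzip-compressed UTF-8 JSON.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample data, including one non-ASCII character.
data = {'what': 'whatever0', 'where': [1, 2, 3], 'note': 'caf\u00e9'}
jsonfilename = os.path.join(tempfile.mkdtemp(), 'data.json.gz')

# Text mode: gzip.open wraps the compressed stream in a text layer
# that encodes str to UTF-8 bytes as json.dump writes.
with gzip.open(jsonfilename, 'wt', encoding='UTF-8') as zipfile:
    json.dump(data, zipfile)

# Text mode read: decoding happens inside the text layer.
with gzip.open(jsonfilename, 'rt', encoding='UTF-8') as zipfile:
    from_text_mode = json.load(zipfile)

# Binary mode read of the same file: decode explicitly.
with gzip.open(jsonfilename, 'rb') as fin:
    from_binary_mode = json.loads(fin.read().decode('utf-8'))
```

Because the two modes differ only in where the encoding and decoding happen, files written one way can be read the other way.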