Encoding/decoding troubleshooting for Python CSV and JSON files
I originally dumped a file containing a particular sentence using:
with open(labelFile, "wb") as out:
    json.dump(result, out, indent=4)
This sentence looks like this in the JSON:
"-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth \u00c3 cents \u00c2 $ \u00c2 `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .",
I then went on to load it via:
with open(sys.argv[1]) as sentenceFile:
    sentenceFile = json.loads(sentenceFile.read())
processed it, and then wrote it to a CSV using:
with open(sys.argv[2], 'wb') as csvfile:
    fieldnames = ['x', 'y', 'z']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for sentence in sentence2locations2values:
        sentence = unicode(sentence['parsedSentence']).encode("utf-8")
        writer.writerow({'x': sentence})
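For comparison, here is a rough Python 3 sketch of the same write step. It is not the code used in the question (which is Python 2); the `rows` list and the temporary file path are made-up stand-ins. In Python 3 the file is opened in text mode with an explicit encoding, so no manual `.encode("utf-8")` is needed.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the processed sentences.
rows = [{"parsedSentence": "growth \u00c3 cents \u00c2 $"}]

# Hypothetical output path; a real script would use sys.argv[2] instead.
path = os.path.join(tempfile.mkdtemp(), "sentences.csv")

# Text mode with an explicit encoding; newline="" is what the csv docs recommend.
with open(path, "w", encoding="utf-8", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["x", "y", "z"])
    writer.writeheader()
    for row in rows:
        # The file object encodes to UTF-8 on write; missing keys become "".
        writer.writerow({"x": row["parsedSentence"]})

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

Stating the encoding at `open()` keeps the choice in one visible place, which helps when a spreadsheet program later guesses a different one.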
This made the sentence in the CSV file, when opened in Excel for Mac, appear as:
-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth à cents  $  `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .
I then moved this from Excel for Mac to Google Sheets, where it is:
-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth à cents  $  `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .
Note the slight difference: Â has replaced Ã.
I then labeled it and brought it back into Excel for Mac, at which point it turned back into:
-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth à cents  $  `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .
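One common way stray Ã and Â characters show up is when UTF-8 bytes are decoded as a single-byte encoding such as Latin-1. Whether that is what Excel or Google Sheets did here is only an assumption. The Python 3 sketch below just illustrates the general mechanism with a made-up sample character.

```python
# Hypothetical sample character, not taken from the data in the question.
original = "\u00e9"  # é

# Encode as UTF-8, then (wrongly) decode those bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))  # two characters, the first of which is Ã

# The damage is reversible as long as no bytes were dropped or replaced.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Each wrong decode and re-encode pass adds another layer of such characters, so text that has been through several programs can need more than one repair pass.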
How do I read in the CSV, which contains a sentence like:
-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth à cents  $  `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .
as the following value:
"-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating 45,000 per year , is a significant contributor to its population growth \u00c3 cents \u00c2 $ \u00c2 `` a daily quota of 150 Mainland Chinese with family ties in Hong Kong are granted a `` one way permit '' .",
so that it matches what was in the original JSON dump at the start of this question?
EDIT
I checked, and the encoding that maps \u00c3 to Ã, the form shown in Google Sheets, is actually Latin 8.
EDIT
I ran enca and saw that the original dumped file is 7-bit ASCII, while my CSV is Unicode. So I need to load it as Unicode and convert it to 7-bit ASCII?
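The enca result is what one would expect if the dump was written with `json.dump`'s default `ensure_ascii=True`, which writes every non-ASCII character as a `\uXXXX` escape. The Python 3 sketch below shows this; the sample string and the `is_seven_bit` helper are made up for illustration.

```python
import json

# Hypothetical sample holding the two characters seen in the dump.
sentence = "growth \u00c3 cents \u00c2 $"

escaped = json.dumps(sentence)  # default ensure_ascii=True
print(escaped)                  # non-ASCII characters are written as \uXXXX escapes


def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte is below 0x80."""
    return all(byte < 0x80 for byte in data)


json_bytes = escaped.encode("utf-8")  # what would end up in the JSON file
csv_bytes = sentence.encode("utf-8")  # what would end up in the CSV cell

print(is_seven_bit(json_bytes))  # the JSON text contains only ASCII bytes
print(is_seven_bit(csv_bytes))   # the CSV text contains bytes >= 0x80

# The escapes are only a serialisation detail: loading restores the characters.
assert json.loads(escaped) == sentence
```

Read this way, no conversion to ASCII is needed. Both files can hold the same characters, and they differ only in how those characters are written out.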
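I figured out a solution. The solution was to decode the CSV file from its original format (identified as UTF-8), after which the sentences return to their original form. So:
csvfile = open(sys.argv[1], 'r')
fieldnames = ("x", "y", "z")
reader = csv.DictReader(csvfile, fieldnames)
next(reader)  # skip the header row
for i, row in enumerate(reader):
    row['x'] = row['x'].decode("utf-8")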
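The snippet above is Python 2, where `csv` yields byte strings that have to be decoded by hand. As a rough Python 3 equivalent, assuming the file really is UTF-8, the decoding can be pushed into the file object. The in-memory `raw` bytes below are a made-up stand-in for the CSV on disk.

```python
import csv
import io
import json

# Hypothetical CSV content standing in for the file on disk, as UTF-8 bytes.
raw = "x,y,z\r\ngrowth \u00c3 cents \u00c2 $,,\r\n".encode("utf-8")

# In a real script this would be open(path, encoding="utf-8", newline="").
csvfile = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
reader = csv.DictReader(csvfile)  # the first row supplies the field names
sentences = [row["x"] for row in reader]

# Re-serialising with json.dumps gives the escaped form used in the dump.
print(json.dumps(sentences[0]))
```

Because the header row supplies the field names, there is no separate `next(reader)` call in this version.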
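One very strange thing that happens is that whenever I edit the CSV file in Excel for Mac and save it, it seems to be converted to a different encoding each time. I am warning other users about this, because it is a huge headache.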