Python: parsing a file that has single and double quotes, as well as contractions
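I'm trying to parse a file in which some lines may contain a combination of single quotes, double quotes, and contractions. Each observation includes a string like the ones below. While parsing the data, I ran into problems with the reviews. For example: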
\'text\' : \'This is the first time I've tried really "fancy food" at a...\'
or
\'text\' : \'I' be happy to go back "next hollidy"\'
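Preprocess your string with a simple double replacement: first escape all the double quotes, then replace all the escaped apostrophes with double quotes. This simply inverts the escaping, e.g.: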
# we'll define it as an object to keep it valid
src = "{\\'text\\' : \\'This is the first time I've tried really \"fancy food\" at a...\\'}"
# The double escapes are just so we can type it properly in Python.
# It's still the same underneath:
# {\'text\' : \'This is the first time I've tried really "fancy food" at a...\'}
preprocessed = src.replace("\"", "\\\"").replace("\\'", "\"")
# Now it looks like:
# {"text" : "This is the first time I've tried really \"fancy food\" at a..."}
This is now valid JSON (and, incidentally, a valid Python dict literal), so you can go ahead and parse it:
import json
parsed = json.loads(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}
or:
import ast
parsed = ast.literal_eval(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}
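As a minimal sketch, the double replacement can be wrapped in a small helper and tried on both example lines from the question. The name `parse_escaped_record` is hypothetical, and the sketch assumes each record is wrapped in braces and that the text contains no other backslashes, which would need extra handling:

```python
import json


def parse_escaped_record(line: str) -> dict:
    """Parse a record whose strings are delimited by backslash-apostrophe pairs."""
    # Step 1: escape every literal double quote so it survives as string content.
    escaped = line.replace('"', '\\"')
    # Step 2: turn each backslash-apostrophe delimiter into a JSON double quote.
    as_json = escaped.replace("\\'", '"')
    return json.loads(as_json)


# Raw strings keep the backslashes exactly as they appear in the file.
first = r"""{\'text\' : \'This is the first time I've tried really "fancy food" at a...\'}"""
second = r"""{\'text\' : \'I' be happy to go back "next hollidy"\'}"""

for raw in (first, second):
    print(parse_escaped_record(raw)["text"])
```

Plain apostrophes inside the text are left untouched because only the backslash-apostrophe pairs are treated as delimiters.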
UPDATE:
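Based on the posted line, you actually have a (valid) representation of a 7-element tuple whose third element is the string representation of a dict, so you don't need to preprocess the string at all. What you need is to evaluate the tuple first and then post-process the inner dict with another level of evaluation, i.e.: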
import ast
# let's first read the data from an 'input.txt' file so we don't have to manually escape it
with open("input.txt", "r") as f:
    data = f.read()
data = ast.literal_eval(data) # first evaluate the main structure
data = data[:2] + (ast.literal_eval(data[2]), ) + data[3:] # .. and then the inner dict
# this gives you `data` containing your 'serialized' tuple, i.e.:
print(data[4]) # 31.328237,-85.811893
# and you can access the children of the inner dict as well, i.e.:
print(data[2]["types"]) # ['restaurant', 'food', 'point_of_interest', 'establishment']
print(data[2]["opening_hours"]["weekday_text"][3]) # Thursday: 7:00 AM – 9:00 PM
# etc.
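The posted line itself is not reproduced here, so the following is only a sketch of the same two-level evaluation on a made-up record. Every value in `inner` and `record` is a hypothetical placeholder; only the shape (a 7-element tuple whose third element is the string form of a dict) follows the description above:

```python
import ast

# Hypothetical inner dict, including a value with both quote styles.
inner = {"types": ["restaurant", "food"], "text": 'I\'ve tried "fancy food"'}

# Hypothetical 7-element tuple; the third element is the dict stored as a string.
record = ("id-1", "some name", repr(inner), "n/a", "0.0,0.0", 5, None)
serialized = repr(record)  # stands in for one line read from the file

outer = ast.literal_eval(serialized)  # first evaluate the main structure
assert isinstance(outer[2], str)      # the inner dict is still just a string

# Rebuild the tuple with the third element evaluated into a real dict.
data = outer[:2] + (ast.literal_eval(outer[2]),) + outer[3:]

print(data[2]["types"])
print(data[2]["text"])
```

`ast.literal_eval` only accepts Python literals, so unlike `eval` it will not run arbitrary code found in the file.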
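That said, I'd suggest tracking down whoever generates data like this and convincing them to use some proper form of serialization; even the most basic JSON would be better than this.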
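For comparison, here is a small sketch of what the producing side could do with plain JSON, assuming it can hand over a dict like the hypothetical `review` below. The quoting is then handled by the `json` module and no preprocessing is needed on the reading side:

```python
import json

# Hypothetical record containing an apostrophe and double quotes.
review = {"text": 'This is the first time I\'ve tried really "fancy food" at a...'}

line = json.dumps(review)    # what the producer would write to the file
restored = json.loads(line)  # what the consumer reads back

print(line)
print(restored == review)
```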