How to parse Python code while keeping string literals exactly as-is?

Question

I'm trying to walk over all the string literals in Python source code while still being able to tell which kind of literal each one is.

Unfortunately, as you can see in this example, ast.parse doesn't work:

[node.value.s for node in ast.parse('\'x\'; u\'x\'; b\'x\'; "x"; u"x"; b"x"').body]

The output is:

['x', 'x', b'x', 'x', 'x', b'x']

This means I can't distinguish between '' and u'' literals, or between '' and "", and so on.

How can I parse Python source code while keeping the original literals exactly as they were written?

Is there a built-in way?

Answer 1

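The information you're looking for isn't AST-level information. The appropriate level for examining this kind of thing is the token level, and you can use the tokenize module for that.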

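The tokenize API is pretty awkward. It wants input that behaves like the readline method of a binary file-like object, so you'll need to open files in binary mode, and if you have a string, you'll need to convert it with encode and io.BytesIO.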

import tokenize
token_stream = tokenize.tokenize(input_file.readline)
for token in token_stream:
    if token.type == tokenize.STRING:
        do_whatever_with(token.string)
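
If the source is already a string rather than an open file, the encode and io.BytesIO conversion mentioned above can be sketched as follows. This is a minimal Python 3 example, not a definitive implementation; the helper name string_literals is made up for illustration and is not part of tokenize.

```python
import io
import tokenize

def string_literals(source):
    """Return the raw text of every STRING token in `source` (hypothetical helper)."""
    # tokenize.tokenize expects a readline callable that returns bytes,
    # so encode the string and wrap it in a binary in-memory file.
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [tok.string
            for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.STRING]

print(string_literals('\'x\'; u\'x\'; b\'x\'; "x"; u"x"; b"x"'))
# prints ["'x'", "u'x'", "b'x'", '"x"', 'u"x"', 'b"x"']
```

Each entry keeps its prefix and quote characters exactly as written. Note that on Python 3.12 and later, f-strings are reported as separate FSTRING_* tokens rather than STRING, so this sketch only covers non-f-string literals there.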

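Here is the Python 2 version. The function name is different, and you have to access the token information positionally, because you get regular tuples instead of namedtuples: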

import tokenize
token_stream = tokenize.generate_tokens(input_file.readline)
for token_type, token_string, _, _, _ in token_stream:
    if token_type == tokenize.STRING:
        do_whatever_with(token_string)
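
Because the token string is the literal exactly as written, the prefix and quote style asked about in the question can be read directly from it. The following is a rough sketch under that assumption; split_literal is a hypothetical helper, not something tokenize provides, and it only inspects the start of the token text.

```python
def split_literal(token_string):
    """Split raw literal text into (prefix, quote), e.g. 'u"x"' -> ('u', '"')."""
    for i, ch in enumerate(token_string):
        if ch in "'\"":
            # Everything before the first quote character is the prefix (u, b, r, ...).
            prefix = token_string[:i]
            rest = token_string[i:]
            # Triple-quoted literals start with three identical quote characters.
            quote = rest[:3] if rest[:3] in ("'''", '"""') else ch
            return prefix, quote
    raise ValueError("not a string literal: %r" % (token_string,))

print(split_literal("u'x'"))       # prints ('u', "'")
print(split_literal('b"x"'))       # prints ('b', '"')
print(split_literal('r"""x"""'))   # prints ('r', '"""')
```

The function does no validation beyond finding the first quote character, so it should only be given strings that came from STRING tokens.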

python

parsing

literals

abstract-syntax-tree

string-length