Python: split bytes with a hexadecimal delimiter

Question

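I'm working with a couple of binary files and I want to parse the UTF-8 strings they contain.

I currently have a function that takes the starting location in a file and returns the string found there: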

def str_extract(file, start, size, delimiter = None, index = None):
   file.seek(start)
   if (delimiter != None and index != None):
       return file.read(size).explode('0x00000000')[index] #incorrect
   else:
       return file.read(size)

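Some of the strings in the file are separated by 0x00 00 00 00. Is it possible to split them the way PHP's explode does? I'm new to Python, so any pointers on improving the code are welcome.

Sample file:

48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 | 00 00 00 00 | 31 00 32 00 33 00, i.e. Hello World123. I've marked the 00 00 00 00 delimiter by enclosing it in | bars.

So: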

str_extract(file, 0x00, 0x20, 0x00000000, 0) => 'Hello World'

Likewise:

str_extract(file, 0x00, 0x20, 0x00000000, 1) => '123'

Answer 1

First you need to open the file in binary mode.

Then you split the str (or bytes, depending on the Python version) on the delimiter of four zero bytes, b'\0\0\0\0':

def str_extract(file, start, size, delimiter = None, index = None):
   file.seek(start)
   if (delimiter is not None and index is not None):
       return file.read(size).split(delimiter)[index]
   else:
       return file.read(size)

In addition, you need to handle the encoding, because str_extract only returns binary data, and your test data is UTF-16 little endian, as Martijn Pieters notes:

>>> str_extract(file, 0x00, 0x20, b'\0\0\0\0', 0).decode('utf-16-le')
u'Hello World'

Also: test that a variable is not None with is not None.
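
As a minimal sketch of why the identity test is the safer habit: != goes through an object's comparison methods, which a class can override, while is not compares identity only. The class AlwaysEqual below is hypothetical and exists purely to illustrate the difference.

```python
class AlwaysEqual(object):
    """Hypothetical class whose comparison methods give misleading answers."""

    def __eq__(self, other):
        return True

    def __ne__(self, other):
        return False


value = AlwaysEqual()

# The overridden __ne__ claims the object is "not different" from None.
print(value != None)      # False

# The identity test cannot be overridden and gives the expected answer.
print(value is not None)  # True
```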

Answer 2

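I'm assuming you are using Python 2 here, but the code is written to work on both Python 2 and Python 3.

You have UTF-16 data, not UTF-8. You can read it as binary data and split on the four NUL bytes with the str.split() method: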

file.read(size).split(b'\x00' * 4)[index]

The resulting data is encoded as UTF-16 little-endian (you may or may not have omitted the UTF-16 BOM at the start); you could decode the data with:

result.decode('utf-16-le')

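However, this would fail, because the text gets cut off at its final NUL byte: Python splits on the first four NULs it finds, and the first of those is the high byte of the last character (the 00 in 64 00 for 'd'), so it does not skip the NUL byte that is part of the text.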
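
The failure can be reproduced with a short Python 3 sketch built from the sample bytes in the question. The names data, parts and failed are only for illustration.

```python
# Sample data from the question: 'Hello World', four NUL bytes, then '123',
# with the text encoded as UTF-16 little-endian.
data = ('Hello World'.encode('utf-16-le')
        + b'\x00' * 4
        + '123'.encode('utf-16-le'))

parts = data.split(b'\x00' * 4)

# The high byte of 'd' (a NUL) is matched as the start of the delimiter,
# so the first chunk is left with an odd number of bytes.
print(len(parts[0]))  # 21

failed = False
try:
    parts[0].decode('utf-16-le')
except UnicodeDecodeError as exc:
    # An odd-length chunk is not valid UTF-16.
    failed = True
    print('decode failed:', exc)
```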

A better approach is to decode to Unicode first, then split on two consecutive Unicode NUL code points:

file.read(size).decode('utf-16-le').split(u'\x00' * 2)[index]

Putting this together as a function:

def str_extract(file, start, size, delimiter = None, index = None):
   file.seek(start)
   if (delimiter is not None and index is not None):
       delimiter = delimiter.decode('utf-16-le')  # or pass in Unicode
       return file.read(size).decode('utf-16-le').split(delimiter)[index]
   else:
       return file.read(size).decode('utf-16-le')

with open('filename', 'rb') as fobj:
    result = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)

If the file starts with a BOM, consider opening it as UTF-16 text from the start instead:

import io

with io.open('filename', 'r', encoding='utf16') as fobj:
    # ....

and remove the explicit decoding.
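
A minimal sketch of that text-mode approach, assuming the file really does begin with a BOM. It writes a throwaway temporary file first so that it can run on its own; the names payload, path and parts are illustrative. Note that in text mode read() counts characters rather than bytes, so byte offsets such as 0x20 no longer apply directly.

```python
import io
import os
import tempfile

# Build a throwaway file; encoding with 'utf-16' prepends a BOM.
payload = u'Hello World\x00\x00123'.encode('utf-16')
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # The 'utf-16' codec reads the BOM, picks the byte order and strips it.
    # newline='' turns off newline translation so the text is left untouched.
    with io.open(path, 'r', encoding='utf-16', newline='') as fobj:
        parts = fobj.read().split(u'\x00' * 2)
finally:
    os.remove(path)

print(parts)
```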

Python 2 demo:

>>> from io import BytesIO
>>> data = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\x00\x00\x00\x001\x002\x003\x00'
>>> fobj = BytesIO(data)
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 0)
u'Hello World'
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 1)
u'123'
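
On Python 3 the delimiter in the demo has to be a bytes literal, since str objects there have no decode() method. The following is a sketch only: it repeats the logic of the function above so that it runs on its own, and first and second are illustrative names.

```python
from io import BytesIO


def str_extract(file, start, size, delimiter=None, index=None):
    # Same logic as the function above, repeated so this sketch is self-contained.
    file.seek(start)
    text = file.read(size).decode('utf-16-le')
    if delimiter is not None and index is not None:
        # Four NUL bytes decode to two U+0000 code points in UTF-16-LE.
        return text.split(delimiter.decode('utf-16-le'))[index]
    return text


data = ('Hello World'.encode('utf-16-le')
        + b'\x00' * 4
        + '123'.encode('utf-16-le'))
fobj = BytesIO(data)

first = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)
second = str_extract(fobj, 0, 0x20, b'\x00' * 4, 1)
print(first)   # Hello World
print(second)  # 123
```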
