Can I bypass the hard coding in Pandas/Python and set a line-terminator of my choice?

Question

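I have a super dirty text dataset. While the various column values are tab-separated, there are many line breaks inside what should be single rows of data. All data entries are separated by a hard '\n' notation.

I tried setting the lineterminator argument to '\n', but it still reads the line breaks as new rows. Running any kind of regex or similar operation will most likely lose the tab separation, which I need in order to load my data into a dataframe. Because of the size of the dataset, word-by-word or line-by-line manipulation is not really feasible either.

Is there a way to make Pandas not treat line breaks as new rows, and move to a new row only when it sees '\n'?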

A snapshot of my data: The unprocessed dataset

Here is a quick look at the current state: current output

The highlighted red box should be a single entry.

Answer 1

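You can preprocess the file into a proper TSV and then read it from there. Use itertools.groupby to find the lines ending in "\N". If the file has other problems, such as unescaped internal tabs, all bets are off.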

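import itertools
import re

# A record ends with a literal backslash-N (the two characters "\N"),
# optionally surrounded by whitespace, at the end of a physical line.
separator_re = re.compile(r"\s*\\N\s*$", re.MULTILINE)

with open('other.csv') as infp:
    with open('other-conv.csv', 'w') as outfp:
        # Group consecutive physical lines by whether they end a record.
        for hassep, subiter in itertools.groupby(infp, separator_re.search):
            if hassep:
                # Last line of a record: drop the marker and end the output row.
                outfp.writelines("{}\n".format(separator_re.sub("", line))
                    for line in subiter)
            else:
                # Continuation lines: append them to the current output row.
                for line in subiter:
                    # Remove the real line break so the check below can match.
                    line = line.rstrip("\r\n")
                    if line.endswith("\\n"):
                        # A literal "\n" left in the text becomes a space.
                        line = line[:-2] + " "
                    else:
                        line = line.strip()
                    outfp.write(line)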
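
Since the question mentions the size of the file, here is a rough sketch of a variant that skips the intermediate file and reads the input in fixed-size chunks instead of line by line. It rests on a few assumptions: each record ends with the literal marker \N immediately followed by a real line break, fields contain no quoting or embedded tabs, and the names iter_records, load_frame, marker and chunk_size are made up for illustration, not part of pandas.

```python
def iter_records(fp, marker="\\N", chunk_size=1 << 20):
    """Yield logical records from a text stream.

    Assumption: a record ends with `marker` followed by a real newline.
    Line breaks inside a record are replaced by a single space.
    """
    terminator = marker + "\n"
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last terminator is complete; the remainder
        # (possibly a partial record or half a terminator) stays buffered.
        *complete, buffer = buffer.split(terminator)
        for record in complete:
            yield record.replace("\n", " ")
    # Leftover text, e.g. when the file ends right after the marker
    # without a trailing line break.
    tail = buffer.rstrip("\n")
    if tail.endswith(marker):
        tail = tail[: -len(marker)]
    if tail.strip():
        yield tail.replace("\n", " ")


def load_frame(path):
    """Build a DataFrame with one row per record (all values as strings)."""
    import pandas as pd  # imported here so iter_records needs only the stdlib

    with open(path) as fp:
        rows = [record.split("\t") for record in iter_records(fp)]
    return pd.DataFrame(rows)
```

Something like load_frame('other.csv') would then give one row per logical entry. Memory use while splitting is bounded by chunk_size plus the longest record, although the resulting dataframe still has to fit in memory.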

python

csv

data-analysis

pandas

data-cleaning