Python: Reading a file by using \n as the newline character. File also contains \r\n

Question

I am looking at a .CSV file that looks like this:

Hello\r\n
my name is Alex\n
Hello\r\n
my name is John?\n

I am trying to open the file with the newline character defined as '\n':

with open(outputfile, encoding="ISO-8859-15", newline='\n') as csvfile:

I get:

line1 = 'Hello'
line2 = 'my name is Alex'
line3 = 'Hello'
line4 = 'my name is John'

The result I want is:

line1 = 'Hello\r\nmy name is Alex'
line2 = 'Hello\r\nmy name is John'

Do you have any suggestions on how to solve this? Thanks in advance!

Answer 1

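Python's line-splitting algorithm can't do what you want; lines that end in \r\n also end in \n. At most you can set the newline argument to '\n' or '' and re-join the lines that end in \r\n rather than in a bare \n. You can use a generator function to do this for you: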

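def collapse_CRLF(fileobject):
    # Lines ending in \r\n are held back and glued to the line that follows.
    buffer = []
    for line in fileobject:
        if line.endswith('\r\n'):
            buffer.append(line)
        else:
            # A bare \n (or end of file) closes the logical line.
            yield ''.join(buffer) + line
            buffer = []
    # Flush whatever is left if the file ended with \r\n.
    if buffer:
        yield ''.join(buffer)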

Then use it as:

with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
    lines = list(collapse_CRLF(csvfile))
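
To check the behaviour end to end, here is a self-contained sketch of the generator approach above. The sample bytes and the temporary file are assumptions made for illustration; they only mimic the layout described in the question.

```python
import os
import tempfile


def collapse_CRLF(fileobject):
    # Hold back lines that end in \r\n and glue them to the next line.
    buffer = []
    for line in fileobject:
        if line.endswith('\r\n'):
            buffer.append(line)
        else:
            yield ''.join(buffer) + line
            buffer = []
    if buffer:
        yield ''.join(buffer)


# Hypothetical sample data, written as raw bytes so no newline translation occurs.
sample = b"Hello\r\nmy name is Alex\nHello\r\nmy name is John\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # newline='' keeps \r\n and \n untranslated, so the generator can tell them apart.
    with open(path, encoding="ISO-8859-15", newline='') as f:
        records = list(collapse_CRLF(f))
finally:
    os.remove(path)

print(records)
```

Each yielded record still carries its trailing '\n'; strip it afterwards if you do not need it.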

However, if this is a CSV file, you really want to use the csv module. It already handles files with a mix of \r\n and \n endings for you, as it knows how to preserve bare newlines (inside quoted fields) in RFC 4180 CSV files:

import csv

with open(outputfile, encoding="ISO-8859-15", newline='') as inputfile:
    reader = csv.reader(inputfile)

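Note that in a valid CSV file, \r\n is the delimiter between rows, and \n is valid in column values. So if you did not want to use the csv module here for whatever reason, you would still want to use newline='\r\n'.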
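
As a rough illustration of that point, the sketch below feeds the csv module a small RFC 4180 style sample in which rows end in \r\n and a quoted field contains a bare \n. The sample text and the column names are made up for the example, and it assumes the embedded newlines are inside quoted fields, which may not hold for the file in the question.

```python
import csv
import io

# Hypothetical RFC 4180 style data: \r\n ends each row, \n lives inside quoted fields.
data = (
    'id,comment\r\n'
    '1,"Hello\nmy name is Alex"\r\n'
    '2,"Hello\nmy name is John"\r\n'
)

# newline='' keeps the line endings untranslated, as the csv documentation recommends.
rows = list(csv.reader(io.StringIO(data, newline='')))

for row in rows:
    print(row)
```

The reader returns one row per record and keeps the embedded \n inside the field value.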

Answer 2

From the standard library documentation for the built-in function open:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
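
The modes described in that passage can be compared side by side. This is a minimal sketch that reads the same made-up byte string through io.TextIOWrapper with different newline values; the helper name read_lines and the sample bytes are assumptions for illustration.

```python
import io

# Hypothetical sample containing all three kinds of line ending.
data = b"a\r\nb\nc\rd\n"


def read_lines(raw, newline):
    # Wrap the bytes in a text stream that uses the requested newline mode.
    with io.TextIOWrapper(io.BytesIO(raw), encoding="ISO-8859-15", newline=newline) as stream:
        return stream.readlines()


for mode in (None, '', '\n', '\r\n'):
    print(f"newline={mode!r}: {read_lines(data, mode)}")
```

None of these modes treats \r\n as data while splitting on a bare \n, which is why the answers here post-process the lines instead.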

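The file object itself cannot explicitly distinguish the data bytes (in your case) '\r\n' from the separator '\n'; that is the job of the byte decoder. So, as one option, you could write your own decoder and use the associated encoding as the encoding of your text file. But this is a bit tedious, and for small files it is much easier to use a more straightforward approach with the re module. For iterating over large files, use the solution proposed by @Martijn Pieters.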

import re

with open('data.csv', 'tr', encoding="ISO-8859-15", newline='') as f:
    file_data = f.read()

# Approach 1:
lines1 = re.split(r'(?<!\r)\n', file_data)
if not lines1[-1]:
    lines1.pop()
# Approach 2:
lines2 = re.findall(r'(?:.+?(?:\r\n)?)+', file_data)
# Approach 3:
iterator_lines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+', file_data))

assert lines1 == lines2 == list(iterator_lines3)
print(lines1)

If we need to keep the '\n' at the end of each line:

# Approach 1:
nlines1 = re.split(r'(?<!\r\n)(?<=\n)', file_data)
if not nlines1[-1]:
    nlines1.pop()
# Approach 2:
nlines2 = re.findall(r'(?:.+?(?:\r\n)?)+\n?', file_data)
# Approach 3:
iterator_nlines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+\n', file_data))

assert nlines1 == nlines2 == list(iterator_nlines3)
print(nlines1)

Result:

['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
['Hello\r\nmy name is Alex\n', 'Hello\r\nmy name is John\n']

Answer 3

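I believe your answers are completely correct and technically advanced. Unfortunately, the CSV file does not comply with the RFC 4180 standard at all.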

So I will go with the following solution and convert my temporary characters "||" back afterwards:

with open(outputfile_corrected, 'w') as correctedfile_handle:
    with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
        csvfile_content = csvfile.read()
        csvfile_content_new = csvfile_content.replace('\r\n', '||')
    correctedfile_handle.write(csvfile_content_new)
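
The later correction step is not shown above. One possible way to do the whole round trip in memory is sketched here; the function name split_on_bare_newlines and the sample text are hypothetical, and it assumes the placeholder "||" never occurs in the real data.

```python
PLACEHOLDER = "||"  # assumed never to occur in the real data


def split_on_bare_newlines(text, placeholder=PLACEHOLDER):
    # Refuse to run if the placeholder would collide with existing content.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the data")
    # Hide every \r\n, split on the remaining bare \n, then restore the \r\n.
    protected = text.replace("\r\n", placeholder)
    lines = protected.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing \n
    return [line.replace(placeholder, "\r\n") for line in lines]


# Hypothetical sample mirroring the layout in the question.
sample = "Hello\r\nmy name is Alex\nHello\r\nmy name is John\n"
restored = split_on_bare_newlines(sample)
print(restored)
```

The explicit collision check is a design choice: a silent clash with the placeholder would corrupt data, so failing early seems safer.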

(Someone had commented, but the reply has since been deleted.)
