Is there a way to tell if a newline character is splitting two distinct words in Python?

Question

Using the code below, I import several .csv files into Python that contain sentences like the one shown further down:

df = pd.concat((pd.read_csv(f) for f in path), ignore_index=True)

Example sentence:

I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS.      \n

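While I can remove the newlines that are surrounded by spaces, that fall in the middle of a word, or that sit at the end of the string without any problem, I don't know what to do with the newlines that separate two words.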
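To make the difficulty concrete, here is a minimal sketch of the two naive clean-ups. It assumes the `\n` in the example sentence is a real newline character, and the variable names are illustrative only. Deleting every newline repairs the split word but glues `ARE` and `SOME` together; turning every newline into a space keeps those two words apart but leaves the split word broken.

```python
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS.      \n'

# Option 1: delete every newline, then collapse runs of whitespace.
# The word split across lines is repaired, but ARE and SOME are merged.
deleted = ' '.join(sample.replace('\n', '').split())

# Option 2: treat every newline as whitespace.
# ARE and SOME stay separate, but the split word stays broken.
spaced = ' '.join(sample.split())

print(deleted)
print(spaced)
```

Neither result matches the target sentence, which is why the newline between two runs of letters needs its own decision.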

The output I want is as follows:

Target sentence:

I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.

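Is there a way for me to indicate in my code that the newline character is surrounded by two distinct words? Or is this a classic case of garbage in, garbage out?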

Answer 1

df = df[~df['Sentence'].str.contains("\n")]
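A small sketch of what this line does, assuming a column named `Sentence` as in the answer and two made-up rows: it keeps only the rows whose text contains no newline, so affected sentences are dropped rather than repaired.

```python
import pandas as pd

# Two illustrative rows; only the second contains newline characters
df = pd.DataFrame({'Sentence': ['NO LINE BREAKS HERE.',
                                'WHERE TH\nERE ARE\nSOME']})

# Boolean mask: True where the sentence contains a newline, negated with ~
kept = df[~df['Sentence'].str.contains("\n")]

print(len(df), len(kept))
```

Whether discarding those rows is acceptable depends on how much of the data contains newlines.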

Answer 2

After some digging, I came up with two solutions.

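1. The textwrap package: Although textwrap seems to be used mostly for visual formatting (that is, telling a UI when to show "..." to indicate a long string), it successfully identified the \n patterns I was having trouble with. Other kinds of extra whitespace still need to be removed, but this package got me 90% of the way there.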

import textwrap
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS.      \n'
sample_wrap = textwrap.wrap(sample)
print(sample_wrap)
'I WANT TO UNDERSTAND WHERE THERE ARE SOME  NEW RESTAURANTS.      '
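It may be worth checking this result in your own environment before relying on it. According to the standard library documentation, `textwrap.wrap` returns a list of lines and, with its default `replace_whitespace=True`, replaces each whitespace character (newlines included) with a single space. The sketch below, using only the sample string from above, joins the returned lines and collapses the leftover spaces; under those documented defaults the split word comes back as `TH ERE` rather than `THERE`, which differs from the output shown above.

```python
import textwrap

sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS.      \n'

# wrap() returns a list of lines, not a single string
lines = textwrap.wrap(sample)

# Join the lines back together and collapse runs of spaces
cleaned = ' '.join(' '.join(lines).split())

print(lines)
print(cleaned)
```

If that is what you see as well, textwrap handles the newlines that sit between two words but still leaves the mid-word break to be dealt with separately.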

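2. A function that identifies the different patterns in which \n appears: This is the "boil the ocean" solution I came up with before learning about textwrap, and it does not work as well. The function looks for matches defined as a newline surrounded by two runs of word (alphanumeric) characters. For every match, it looks up each of the strings on either side of the newline in NLTK's words.words() list. If at least one of the two strings is a word in that list, they are treated as two separate words.

This does not account for domain-specific words, which would have to be added to the word list, or for words like "about", which this function would misclassify if the newline fell inside the word in a way that leaves a fragment that is itself in the list. For that reason I'd recommend textwrap, but I still thought I'd share.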

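import re
from nltk.corpus import words  # needs a one-time nltk.download('words')

# Lowercased lookup set built from NLTK's word list
wordlist = set(w.lower() for w in words.words())

carriage = re.compile(r'(\n+)')
wordword = re.compile(r'((\w+)\n+(\w+))')

def carriage_return(sentence):
    if carriage.search(sentence):
        # Newlines directly between two runs of word characters need a decision
        for match in wordword.findall(sentence):
            word1, word2 = match[1], match[2]
            # Lowercase only for the lookup so the original casing is preserved
            if (word1.lower() in wordlist or word2.lower() in wordlist
                    or word1.isdigit() or word2.isdigit()):
                # At least one side is a known word or a number: keep them apart
                sentence = sentence.replace(match[0], word1 + ' ' + word2)
            else:
                # Neither side is known: assume one word was split in two
                sentence = sentence.replace(match[0], word1 + word2)
        # Any remaining newlines touch whitespace or the string ends; drop them
        sentence = carriage.sub('', sentence)
    return sentence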
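To try the idea without downloading the NLTK corpus, here is a compact variant that runs on its own. It is only a sketch: the tiny `wordlist` is a hypothetical stand-in for words.words(), the helper names `join_or_split` and `fix_newlines` are made up for this example, and the `AB\nOUT` input is an assumed illustration of the "about" limitation described above.

```python
import re

# Hypothetical stand-in for NLTK's words.words(); a real run would use the full list
wordlist = {"i", "want", "to", "understand", "where", "there",
            "are", "some", "new", "about", "out"}

wordword = re.compile(r'(\w+)\n+(\w+)')

def join_or_split(match):
    word1, word2 = match.group(1), match.group(2)
    known = word1.lower() in wordlist or word2.lower() in wordlist
    if known or word1.isdigit() or word2.isdigit():
        return word1 + ' ' + word2   # treat as two separate words
    return word1 + word2             # treat as one word broken by the newline

def fix_newlines(sentence):
    sentence = wordword.sub(join_or_split, sentence)
    sentence = sentence.replace('\n', '')   # remaining newlines touch whitespace
    return ' '.join(sentence.split())       # collapse leftover runs of spaces

sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS.      \n'
print(fix_newlines(sample))
print(fix_newlines('TALK AB\nOUT IT'))
```

With this stand-in list the sample comes out as the target sentence, while the second call yields `TALK AB OUT IT` because the fragment `OUT` is itself a listed word. With the full NLTK list the results may differ, since short fragments are more likely to appear in it as words.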

python

newline