Split multiple joined words with upper and lower case

Question

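I have found some questions related to this topic. However, I have not found a solution that shows specifically how to use a regular expression to split joined words (in Spanish) at the boundary between uppercase and lowercase.

I am using PyPDF2 to extract text from several PDF files. The information always appears in the same order.

After running my PyPDF2 code, I get items like these: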

'MASCULINOFecha de NacimientoLugar de Nacimiento'
'CASADONivel Educativo'

In both cases, the items are keywords from the PDF content. The output I am trying to get should look like this (using the examples above):

'MASCULINO'
'Fecha de Nacimiento'
'Lugar de Nacimiento'
'CASADO'
'Nivel Educativo'

I have tried the re module to split on specific patterns. This is my code so far:

import re
import PyPDF2

# Compile the pattern once, outside the loops.
pattern = re.compile(r'([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)')

# PdfFileReader, getNumPages, getPage and extractText are the legacy PyPDF2 names.
with open('example.pdf', 'rb') as pdfFile:
    pdfReader = PyPDF2.PdfFileReader(pdfFile)
    for page in range(pdfReader.getNumPages()):
        text = pdfReader.getPage(page).extractText()
        for line in text.split(':'):
            result = pattern.findall(line)
            print(result)

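It splits some of the items, but not all of them.

Is there a better regex pattern for splitting these words?

Any suggestions for solving this are welcome. Thanks.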

Answer 1

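Try (?<=[A-Za-z])(?=[A-Z][a-z]) and either replace each match with \n or split on it.

This matches the zero-width position that has a letter (upper or lower case) on its left and an uppercase letter followed by a lowercase letter on its right. That seems to be the logical separator here.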

Input

MASCULINO|Fecha de Nacimiento|Lugar de Nacimiento
CASADO|Nivel Educativo

| marks each matched zero-width position.

Output

MASCULINO
Fecha de Nacimiento
Lugar de Nacimiento
CASADO
Nivel Educativo

Regex101 Demo
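
As a minimal sketch of the "replace with \n" variant, the snippet below applies the pattern to the two sample items from the question. The names BOUNDARY and items are illustrative only and do not come from the original post.

```python
import re

# Zero-width boundary: a letter on the left, and on the right an
# uppercase letter that is immediately followed by a lowercase letter.
BOUNDARY = re.compile(r'(?<=[A-Za-z])(?=[A-Z][a-z])')

# Sample items copied from the question.
items = [
    'MASCULINOFecha de NacimientoLugar de Nacimiento',
    'CASADONivel Educativo',
]

for item in items:
    # Insert a newline at every boundary; nothing is consumed.
    print(BOUNDARY.sub('\n', item))
```

The space before "Nacimiento" and "Educativo" is not a letter, so the lookbehind fails there and multi-word labels stay together.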

As Wiktor mentioned in the comments:

You cannot use re.split with an empty string matching regex. Use the PyPi regex module if you need split.

There is no bug of this kind in re.sub, it is used as a workaround: you insert unused characters into the string with re.sub, and then re.split with this character. Just choose some char that is sure to be absent from the input (usually a control character, or a character from the unused Unicode range).

Replacing each matched zero-width position with ~ and then splitting on ~ gives you the resulting array.

Python code:

import re

line = 'MASCULINOFecha de NacimientoLugar de Nacimiento CASADONivel Educativo'
# Mark every zero-width boundary with '~', then split on that marker.
result = re.sub(r'(?<=[A-Za-z])(?=[A-Z][a-z])', '~', line)
result = re.split('~', result)
print(result)

Ideone Demo
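
If ~ could appear in the extracted text, the comment's suggestion of a control character can be sketched as follows. The helper name split_joined and the choice of NUL as separator are assumptions made for illustration; any character known to be absent from the input would do.

```python
import re

BOUNDARY = re.compile(r'(?<=[A-Za-z])(?=[A-Z][a-z])')

# Assumption: the NUL character never occurs in the extracted PDF text.
SEPARATOR = '\x00'


def split_joined(text):
    """Mark each case boundary with SEPARATOR, then split on it."""
    return BOUNDARY.sub(SEPARATOR, text).split(SEPARATOR)


print(split_joined('MASCULINOFecha de NacimientoLugar de Nacimiento'))
print(split_joined('CASADONivel Educativo'))
```

A plain str.split is enough for the second step because the separator is a literal character, not a pattern.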

Answer 2

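Split at \B(?=[A-Z][a-z]). It finds an uppercase letter followed by a lowercase letter that is not preceded by a word boundary.

It completes in 222 steps on the test case; see it here.

Regards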
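
A rough sketch of this pattern in Python, using the same substitute-then-split workaround, might look like the following. The function name split_word_interior and the input 'Edad30Sexo' are made up for illustration, and the sketch assumes the items contain no newlines. It also shows one difference from a letter-only lookbehind: \B treats digits and underscores as word characters too.

```python
import re

# \B: a position that is NOT a word boundary, so the preceding
# character is a word character (letter, digit or underscore).
WORD_INTERIOR = re.compile(r'\B(?=[A-Z][a-z])')
# Letter-only lookbehind, shown here for comparison.
LETTER_BEFORE = re.compile(r'(?<=[A-Za-z])(?=[A-Z][a-z])')


def split_word_interior(text):
    """Insert a newline at each match, then split on newlines."""
    return WORD_INTERIOR.sub('\n', text).split('\n')


print(split_word_interior('MASCULINOFecha de NacimientoLugar de Nacimiento'))
print(split_word_interior('CASADONivel Educativo'))

# Hypothetical input with a digit before the capital letter:
print(split_word_interior('Edad30Sexo'))                # \B splits after the digit
print(LETTER_BEFORE.sub('\n', 'Edad30Sexo').split('\n'))  # lookbehind does not
```

On the question's sample items the two patterns give the same result; they differ only when a digit or underscore sits directly before the capital letter.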

Tags: python, regex, pdf, text-mining, python-2.7