Use regex to find page numbers in two different formats

Question

I have many strings that show page numbers in one of two possible formats: (pp. 4500-4503) or just 4500-4503. There can also be cases with only a single page, so (pp. 113) or just 113.

Some example strings:

- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). 

- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.

I use this regex for the first format:

r"pp\. \d+-\d+"

And this one for the second:

r"\d+-\d+"

Neither of them works. I would also like to know: is there a way to use a single regex instead of creating two? Thanks.

Answer 1

This pattern matches all of your different formats:

(\(pp\.)? \d+(-\d+)?\)?

https://regex101.com/r/HV7rlJ/2
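
As a quick check in Python, the sketch below runs this pattern over the two sample citations from the question. It uses `re.finditer` with `m.group(0)` because the pattern contains capturing groups, in which case `re.findall` would return the groups rather than the whole match. The variable names are illustrative only.

```python
import re

pattern = r"(\(pp\.)? \d+(-\d+)?\)?"

citations = [
    "- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). ",
    "- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.",
]

# group(0) is the full match; strip() drops the leading space that the
# pattern requires in front of the digits when "(pp." is absent.
matches = [[m.group(0).strip() for m in re.finditer(pattern, c)] for c in citations]
print(matches)
# [['67', '(pp. 81-95)'], ['41', '483-497']]
```

On these samples the pattern also picks up the volume numbers 67 and 41, since any number preceded by a space qualifies, so some post-filtering (for example, keeping only the last match per citation) may be needed.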

Answer 2

You could use:

\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)

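Explanation

\(pp\.\s+\d+(?:-\d+)?\) Match the parenthesized form: (pp. followed by 1+ whitespace characters, 1+ digits, an optional - and 1+ digits, then )
| Or
\b A word boundary
\d+(?:-\d+)? Match 1+ digits, optionally followed by - and 1+ digits
(?= Positive lookahead, assert that what is to the right is
- (?: Non-capturing group, repeated as a whole
  - \s*,\s* Match a comma between optional whitespace characters
  - \d+(?:-\d+)? Match 1+ digits, optionally followed by - and 1+ digits
- )* Close the non-capturing group and repeat it 0+ times
- \. Match a literal dot
) Close the lookahead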
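
The breakdown above can also be written directly into the pattern using `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch of the same regex in a more readable layout; the name `PAGES` and the two shortened sample strings are made up for illustration.

```python
import re

PAGES = re.compile(
    r"""
    \(pp\.\s+ \d+ (?:-\d+)? \)           # parenthesized form: (pp. 81-95) or (pp. 113)
    |                                    # or
    \b \d+ (?:-\d+)?                     # bare number or range after a word boundary
    (?=                                  # lookahead: to the right there must be
        (?: \s*,\s* \d+ (?:-\d+)? )*     #   zero or more ", number-or-range" items
        \.                               #   and then a literal dot
    )
    """,
    re.VERBOSE,
)

print(PAGES.findall("Journal of Learning Disabilities, 41, 483-497."))
# ['41', '483-497']
print(PAGES.findall("Mendeley Journal, 67(2), (pp. 113)."))
# ['(pp. 113)']
```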

See a regex demo and a Python demo.

Example

import re

pattern = r"\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)"

s = ("- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). \n\n"
            "- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.")

print(re.findall(pattern, s))

Output

['(pp. 81-95)', '41', '483-497']
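
Note that '41' in this output is the volume number of the second citation, matched because it is followed by ", 483-497." and so satisfies the lookahead. If only the pages are wanted, one possible alternative is to anchor the match at the end of each citation. The sketch below assumes the page number or range is always the last item in the citation; `TRAILING_PAGES` and `trailing_pages` are hypothetical names, not part of the answer above.

```python
import re

# Assumption: the page number or range is the last item in the citation,
# optionally wrapped as "(pp. ...)" and optionally followed by a final dot.
TRAILING_PAGES = re.compile(r"\(?(?:pp\.\s*)?(\d+(?:-\d+)?)\)?\.?\s*$")

def trailing_pages(citation):
    """Return the page number or range at the end of citation, or None."""
    m = TRAILING_PAGES.search(citation)
    return m.group(1) if m else None

print(trailing_pages("Mendeley Journal, 67(2), (pp. 81-95). "))          # 81-95
print(trailing_pages("Journal of Learning Disabilities, 41, 483-497."))  # 483-497
```

This handles both formats with a single regex and returns just the digits, without the surrounding "(pp. ...)" wrapper.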
