Regex to save certain parts (possibly multiple paragraphs) of a text using Python
I am trying to build a dataset made up of specific parts of documents. For example, the document format looks like this:
According to A :
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
According to B:
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
According to A :
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
According to C:
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).
Here are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text.
According to B:
The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.
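The "According to A", "According to B", and "According to C" sections can appear in any order (for example, an "According to B" section may come first, or an "According to C" section may be first), and each section can appear multiple times. I only want to put the "According to B" sections into a dataset/dataframe.
My first idea was to delete all the "According to A" and "According to C" sections (replacing them with '' or whitespace).
So I tried this regex pattern: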
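import re

# document holds the contents of the text file
# \s* allows the optional space before the colon, as in "According to A :"
pattern = re.compile(r"According to A\s*:(.*?)According to B:", flags=re.DOTALL)
found = pattern.findall(document)
# Remove the body of every "According to A" section that precedes a B section
for section in found:
    document = document.replace(section, '')
# ...and so on for the "According to C" sections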
Is there a simpler way to save only the "According to B" sections?
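To get all the "According to B:" sections, you can use the following pattern (with the re.MULTILINE flag, so that ^ matches at the start of every line):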
^[^\S\r\n]*According to B:(?:\r?\n(?![^\S\r\n]*According to [A-Z][^\S\r\n]*:).*)*
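Explanation
^
Start of the line (with re.MULTILINE; without it, only the start of the string)
[^\S\r\n]*According to B:
Match 0+ whitespace characters other than newlines, then According to B:
(?:
Non-capturing group
\r?\n
Match a newline
(?!
Negative lookahead, assert that the line does not start with
[^\S\r\n]*According to [A-Z][^\S\r\n]*:
Match 0+ whitespace characters other than newlines, then According to
and a single character [A-Z],
then again 0+ whitespace characters other than newlines and :
)
Close the lookahead
.*
Match the whole line
)*
Close the group and repeat it 0+ times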
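As a minimal sketch of how the pattern above might be used, the snippet below compiles it with re.MULTILINE and collects the body of every "According to B:" section into a list. The function name extract_b_sections and the shortened sample text are made up for illustration and are not part of the original answer.

```python
import re

# The pattern from the answer; re.MULTILINE lets ^ match at the start of each line.
SECTION_B = re.compile(
    r"^[^\S\r\n]*According to B:"
    r"(?:\r?\n(?![^\S\r\n]*According to [A-Z][^\S\r\n]*:).*)*",
    flags=re.MULTILINE,
)


def extract_b_sections(document):
    """Return the text of every 'According to B:' section, without its header.

    Hypothetical helper name, used here only for illustration.
    """
    bodies = []
    for match in SECTION_B.finditer(document):
        # The first colon in the match belongs to the header line,
        # so everything after it is the section body.
        _header, _, body = match.group(0).partition(":")
        bodies.append(body.strip())
    return bodies


# Shortened stand-in for the document shown in the question.
sample = (
    "According to A :\n"
    "First A paragraph.\n"
    "According to B:\n"
    "First B paragraph.\n"
    "Second B paragraph.\n"
    "According to A :\n"
    "Another A paragraph.\n"
    "According to C:\n"
    "A C paragraph.\n"
    "According to B:\n"
    "Last B paragraph.\n"
)

b_sections = extract_b_sections(sample)
print(b_sections)
# ['First B paragraph.\nSecond B paragraph.', 'Last B paragraph.']
```

If pandas is installed, something like pandas.DataFrame({"text": b_sections}) should then give one row per section.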