使用正则表达式作为分词器?

Using regular expression as a tokenizer?

我正在尝试将我的语料库标记为句子。我尝试使用 spacy 和 nltk,但它们效果不佳,因为我的文本有点棘手。下面是我制作的一个人工样本,涵盖了我知道的所有边缘情况:

It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

我希望如何标记句子:

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised.
2) However, the High Court while enhancing the same from life to death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a rarest of rare case where extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4)15. In the light of the above discussion, while
 maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

这是我使用的正则表达式:

sent = re.split('(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)(?<![v]\.)(?<![vs]\.)(?<=\.|\?) ',j)

我并不真正精通正则表达式,但我会手动输入条件,例如 vvs。我也忽略了在 te period 之前是否有一个数字,例如 15.

我面临的问题:

  1. 如果两个句子之间没有空隙,则无法正确拆分。
  2. 如果前面的单词大写,我也希望它忽略句号。例如 No.Mr.

In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English. Reference

遵循此准则,以下函数使用多个正则表达式来解析您的句子 Modification of D Greenberg answer

代码

import re

def split_into_sentences(text):
    # Regex pattern
    alphabets= "([A-Za-z])"
    prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = "(Inc|Ltd|Jr|Sr|Co)"
    starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    digits = "([0-9])"
    section = "(Section \d+)([.])(?= \w)"
    item_number = "(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    abbreviations = "(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    parenthesized = "\((.*?)\)"
    bracketed = "\[(.*?)\]"
    curly_bracketed = "\{(.*?)\}"
    enclosed = '|'.join([parenthesized, bracketed, curly_bracketed])
    # text replacement
    # replace unwanted stop period with <prd>
    # actual stop periods with <stop>
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\1<prd>",text)
    text = re.sub(websites, lambda m: m.group().replace('.', '<prd>'), text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...","<prd><prd><prd>")
    text = re.sub("\s" + alphabets + "[.] "," \1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\1<stop> \2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\1<prd>\2<prd>\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\1<prd>\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \1<stop> \2",text)
    text = re.sub(" "+suffixes+"[.]"," \1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \1<prd>",text)
    text = re.sub(section,"\1<prd>",text)
    text = re.sub(item_number,"\1<prd>",text)
    text = re.sub(abbreviations, "\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\1<prd>\2",text)
    text = re.sub(enclosed, lambda m: m.group().replace('.', '<prd>'), text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")

    # Tokenize sentence based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # remove last since only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]

    return sentences

用法

for index, token in enumerate(split_into_sentences(s), start = 1):
    print(f'{index}) {token}')

测试

1.输入

s='''It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.
'''

输出

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death  to one cannot be generalised.
2) However, the High Court while enhancing the same from life to  death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a  rarest of rare case where extreme penalty of death is called for instead sentence of  imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the  above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,  award of extreme penalty of death by the High Court is set aside and we restore the sentence of  life imprisonment as directed by the trial Court.

2。输入

s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.He's arriving on flight No. 48213 out of Denver.He'll take the No. 2 bus from the airport.However, he may grab a taxi instead.'''

输出

1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However, he may grab a taxi instead.

3。输入

s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''

输出

1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.

4.输入

s = '''He went to New York. He is 10 years old.'''

输出

1) He went to New York.
2) He is 10 years old.

5.输入

s = '''15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''

输出

1) 15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.
2) The appeal is allowed in part to the extent mentioned above.

您是否在寻找以下正则表达式:

'(?<=[^A-Z][a-z]\w)[/.] '

解释:

  • [^A-Z][a-z]\w)[/.] --> 这将匹配所有不是以大写开头,后跟 '.' 的单词和 space.
  • (?<=....) --> 这将重置已经 selected 的内容,并且只是 select 接下来的内容,即 select '。 ' 仅。

现在可以拆分使用了:

sent=re.split('(?<=[^A-Z][a-z]\w)[/.] ',j)