How to extract multiple capitalized words per cell with a Google Sheets REGEXEXTRACT formula, ideally ignoring the first word of each sentence?

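I'm trying to extract all words that start with an uppercase letter from a text, using the REGEXEXTRACT formula in Google Sheets.

Ideally, the first word of each sentence should be ignored, and only the subsequent capitalized words should be extracted.

Other close questions and formulas:

I found these two other questions and answers: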

=ARRAYFORMULA(TRIM(IFERROR(REGEXREPLACE(IFERROR(REGEXEXTRACT(IFERROR(SPLIT(A2:A, CHAR(10))), "(.*) .*@")), "Mr. |Mrs. ", ""))))

=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))

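They come close, but I could not apply them successfully to my project.

The regex pattern I use:

I also found that the regex pattern [A-ZÖ][a-zö]+ works well for getting all capitalized words.

The problem is that it does not ignore the first word of a sentence.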
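To make the problem concrete, here is a minimal Python sketch (not a Sheets formula) that applies the same pattern to a made-up sentence pair. The function name and the sample text are only illustrative. Note also that, as far as I know, REGEXEXTRACT returns only the first match, whereas findall below returns all of them.

```python
import re

# Pattern quoted in the question: one uppercase letter followed by lowercase letters.
CAPITALIZED_WORD = re.compile(r"[A-ZÖ][a-zö]+")

def capitalized_words(text):
    """Return every capitalized word, sentence-initial words included."""
    return CAPITALIZED_WORD.findall(text)

sample = "There was nothing remarkable. Then Alice saw the White Rabbit."
print(capitalized_words(sample))
# ['There', 'Then', 'Alice', 'White', 'Rabbit']
```

"There" and "Then" are sentence-initial words, and the pattern alone has no way to tell them apart from "Alice" or "Rabbit".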

Other Python solutions vs. Google Sheets formula:

I also found this Python tutorial and script:

Proper Noun Extraction in Python using NLP in Python

# Importing the required libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Note: the tokenizer, stopword, and POS-tagger data must be downloaded once
# with nltk.download(...); the exact resource names depend on the NLTK version.

# Function to extract the proper nouns

def ProperNounExtractor(text):

    print('PROPER NOUNS EXTRACTED :')

    # Build the stopword set once instead of once per word
    stop_words = set(stopwords.words('english'))

    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        words = [word for word in words if word not in stop_words]
        tagged = nltk.pos_tag(words)
        for (word, tag) in tagged:
            if tag == 'NNP': # If the word is a proper noun
                print(word)

text =  """Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."""

# Calling the ProperNounExtractor function to extract all the proper nouns from the given text. 
ProperNounExtractor(text)

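It works well, but my reason for doing this in Google Sheets is to have the capitalized words next to the text in a table format, for more convenient reference.

Question summary:

In the sample sheet below, how would you adjust my formula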

=ARRAYFORMULA(IF(A1:A="","",REGEXEXTRACT(A1:A,"[A-ZÖ][a-zö]+")))

to add these features:

Sample sheet:

Here is my test Sample Sheet

Thank you very much for your help!

You can use

=ARRAYFORMULA(SPLIT(REGEXREPLACE(REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+", CHAR(10)), "(?m)^\s*[[:upper:]][[:alpha:]]*|.*?([[:upper:]][[:alpha:]]*|$)", "$1" & char(10)), CHAR(10)))

Or, to make sure the ? / ! / . / ... matched as a sentence boundary is followed by an uppercase letter:

=ARRAYFORMULA(SPLIT(REGEXREPLACE(REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+([[:upper:]])", CHAR(10) & "$1"), "(?m)^\s*[[:upper:]][[:alpha:]]*|.*?([[:upper:]][[:alpha:]]*|$)", "$1" & char(10)), CHAR(10)))

See the demo screenshot:

See the regex demo

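First, the text in each cell is split into sentences with REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+", CHAR(10)). In effect, this just replaces the sentence-final punctuation (and the whitespace after it) with a newline.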
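As a rough illustration of this first step outside Sheets, here is a small Python sketch using the same regex. The function name and the sample sentence are my own, and Python's re engine is not RE2, so treat it as an approximation of what the formula does.

```python
import re

# Sentence-final ? or !, a single dot, or three or more dots, followed by whitespace.
SENTENCE_END = re.compile(r"(?:[?!]|\.(?:\.\.+)?)\s+")

def split_into_lines(text):
    """Replace sentence-final punctuation plus trailing whitespace with a newline."""
    return SENTENCE_END.sub("\n", text)

print(split_into_lines("Oh dear! I shall be late. Alice ran... Then she stopped."))
```

The final dot is kept because no whitespace follows it; every other sentence ends up on its own line without its punctuation.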

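The second REGEXREPLACE is used with another regex that matches

  • (?m)^\s*[[:upper:]][[:alpha:]]* - a capitalized word ([[:upper:]][[:alpha:]]*) at the start of the string/line (^), preceded by optional whitespace (\s*)
  • | - or
  • .*? - any zero or more characters other than line break characters, as few as possible
  • ([[:upper:]][[:alpha:]]*|$) - Group 1 ($1): an uppercase letter ([[:upper:]]) followed by any zero or more letters ([[:alpha:]]*), or the end of the string/line ($)

and replaces each match with the Group 1 value and a newline (LF) character. The result is then SPLIT on the newline character.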
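Both steps together can be sketched in Python as follows. This is only an approximation under a few assumptions: Python's re has no POSIX classes, so [A-Z] and [A-Za-z] stand in for [[:upper:]] and [[:alpha:]]; the function name is made up; and empty pieces are filtered out by hand, which SPLIT does by default in Sheets.

```python
import re

SENTENCE_END = re.compile(r"(?:[?!]|\.(?:\.\.+)?)\s+")
# Either a capitalized word at the start of a line (dropped), or any text
# up to the next capitalized word or the end of the line (word kept in group 1).
WORD_OR_REST = re.compile(r"(?m)^\s*[A-Z][A-Za-z]*|.*?([A-Z][A-Za-z]*|$)")

def capitalized_after_first(text):
    """Return capitalized words, skipping the first word of each sentence."""
    lines = SENTENCE_END.sub("\n", text)
    # Keep only group 1 (if it took part in the match), followed by a newline.
    marked = WORD_OR_REST.sub(lambda m: (m.group(1) or "") + "\n", lines)
    return [word for word in marked.split("\n") if word]

sample = "Suddenly Alice saw the White Rabbit. Then the Rabbit ran away."
print(capitalized_after_first(sample))
# ['Alice', 'White', 'Rabbit', 'Rabbit']
```

"Suddenly" and "Then" are consumed by the first alternative, which has no capturing group, so they vanish from the output.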

My two cents:

Formula in B1:

=INDEX(IF(A1:A<>"",SPLIT(REGEXREPLACE(A1:A,"(?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?)","$1|"),"|",1),""))

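The pattern (?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?) means:

  • (?: - open a non-capture group to allow for alternation:
    • (?:^|[.?!]+)\s*\S+ - a nested non-capture group that allows either the start-of-line anchor or 1+ literal dots or question/exclamation marks, followed by 0+ whitespace characters and 1+ non-whitespace characters;
    • | - or;
    • \b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b - the first capture group, which captures capitalized words (optionally hyphenated) between word boundaries;
    • | - or;
    • .+? - any 1+ characters (lazy);
    • ) - close the non-capture group.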

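The idea here is to use REGEXREPLACE() to replace every match with a backreference to the first capture group plus a pipe symbol (or any symbol that will not appear in your input), and then use SPLIT() to separate the words. Note that it is important that SPLIT() ignores empty strings: this is its default behavior (the optional 4th parameter, remove_empty_text), while the 3rd parameter (split_by_each) is set to 1 here.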

INDEX() triggers the array functionality and spills the results. I used a nested IF() statement to check for empty cells and skip them.
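For anyone who wants to try the pattern outside Sheets, here is a hedged Python sketch of the same replace-then-split idea. The function name and sample text are illustrative, the cell is assumed to hold a single line of text, and the list comprehension plays the role of SPLIT() dropping empty strings.

```python
import re

PATTERN = re.compile(
    r"(?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?)"
)

def extract_words(cell):
    """Replace every match with group 1 plus a pipe, then split on the pipe."""
    marked = PATTERN.sub(lambda m: (m.group(1) or "") + "|", cell)
    return [word for word in marked.split("|") if word]

sample = "Suddenly Alice saw the White Rabbit. Then the Rabbit-Hole appeared."
print(extract_words(sample))
# ['Alice', 'White', 'Rabbit', 'Rabbit-Hole']
```

An empty string gives an empty list here, which corresponds to the IF() check for empty cells in the formula.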