Extract sentences containing specific words/patterns

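I am trying to extract the sentences that contain the word "privacy" or "Privacy". These sentences are found in a text column of my data frame. The text is stored as a list of character strings, because I am working with a number of different files. I could not get this to work with grep, but I did get it working with gsub. The problem I have now is that it only extracts the first sentence from the text and not the ones that follow. This is the code I am using at the moment:

csv_edgar$privacy_1A <- gsub(".*?([^\\.]*(privacy|Privacy)[^\\.]*).*", "\\1", csv_edgar$item_1A, ignore.case=TRUE)

Text: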

The Company employs information technology systems to support its business, including ongoing phased implementation of an ERP system as part of business transformation on a worldwide basis over the next several years. Security breaches and other disruptions to the Company’s information technology infrastructure could interfere with the Company’s operations, compromise information belonging to the Company and its customers, suppliers, and employees, exposing the Company to liability which could adversely impact the Company’s business and reputation. In the ordinary course of business, the Company relies on information technology networks and systems, some of which are managed by third parties, to process, transmit and store electronic information, and to manage or support a variety of business processes and activities. Additionally, the Company collects and stores certain data, including proprietary business information, and may have access to confidential or personal information in certain of our businesses that is subject to privacy and security laws, regulations and customer-imposed controls. Despite our cybersecurity measures (including employee and third-party training, monitoring of networks and systems, and maintenance of backup and protective systems) which are continuously reviewed and upgraded, the Company’s information technology networks and infrastructure may still be vulnerable to damage, disruptions or shutdowns due to attack by hackers or breaches, employee error or malfeasance, power outages, computer viruses, telecommunication or utility failures, systems failures, service providers including cloud services, natural disasters or other catastrophic events. It is possible for such vulnerabilities to remain undetected for an extended period, up to and including several years. 
While we have experienced, and expect to continue to experience, these types of threats to the Company’s information technology networks and infrastructure, none of them to date has had a material impact to the Company. There may be other challenges and risks as the Company upgrades and standardizes its ERP system on a worldwide basis. Any such events could result in legal claims or proceedings, liability or penalties under privacy laws, disruption in operations, and damage to the Company’s reputation, which could adversely affect the Company’s business. Although the Company maintains insurance coverage for various cybersecurity risks, there can be no guarantee that all costs or losses incurred will be fully insured.

You can use str_extract_all with an alternation:

library(stringr)
# Backslashes must be doubled inside an R string literal
regex <- "[A-Z][^.]+\\b(?:Privacy|privacy)\\b[^.]+\\."
sentences <- str_extract_all(input, regex)[[1]]

[1] "Additionally, the Company collects and stores certain data, including proprietary business information, and may have access to confidential or personal information in certain of our businesses that is subject to privacy and security laws, regulations and customer-imposed controls."
[2] "Any such events could result in legal claims or proceedings, liability or penalties under privacy laws, disruption in operations, and damage to the Company’s reputation, which could adversely affect the Company’s business."

In the snippet above, input is the sample text you provided in the question.
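
Since the question works on a whole data frame column rather than a single string, the same pattern can probably be applied row by row. The following is only a minimal sketch: the two-row csv_edgar built here is a hypothetical stand-in for the real data, and it assumes item_1A is a plain character column and that stringr is installed.

```r
library(stringr)

# Hypothetical stand-in for the real csv_edgar data frame: one text per row.
csv_edgar <- data.frame(
  item_1A = c(
    "We store data. It is subject to privacy laws. Costs may rise. Penalties under Privacy rules apply.",
    "No relevant sentence here."
  ),
  stringsAsFactors = FALSE
)

# Same pattern as above, with backslashes doubled for the R string literal.
regex <- "[A-Z][^.]+\\b(?:Privacy|privacy)\\b[^.]+\\."

# str_extract_all() is vectorised: it returns a list with one character
# vector of matching sentences per row.
matches <- str_extract_all(csv_edgar$item_1A, regex)

# Collapse each row's sentences into a single string; rows without a match get NA.
csv_edgar$privacy_1A <- vapply(
  matches,
  function(s) if (length(s) > 0) paste(s, collapse = " ") else NA_character_,
  character(1)
)
```

If the individual sentences should stay separate, the list in matches can be kept as a list column instead of being collapsed. Two limitations of the pattern are worth keeping in mind: a sentence that begins with the word "Privacy" is not matched, because [A-Z] consumes the leading P and no later occurrence of the word remains, and any period inside a sentence (for example in an abbreviation) is treated as a sentence end.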

Suggested awk command:

awk '/[pP]rivacy/{print}' RS="." input.txt

Result for the provided sample:

 Additionally, the Company collects and stores certain data, including proprietary business information, and may have access to confidential or personal information in certain of our businesses that is subject to privacy and security laws, regulations and customer-imposed controls
 Any such events could result in legal claims or proceedings, liability or penalties under privacy laws, disruption in operations, and damage to the Company’s reputation, which could adversely affect the Company’s business