How can I use regular expressions in R to make a list of the findings in a text file?

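I have to convert all the arguments in a character vector of text into an easy-to-reference format: a list with 3 columns (presenter, time, and text), using R (sorry, I should have been clearer).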

For example, the presenter should be

# HARPER'S

the time should be

# [Day 1, 9:00 A.M.]

and the text should be the rest of the argument.

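I need to count the number of arguments in the text. Each line that begins like

# HARPER'S [Day 1, 9:00 A.M.]

starts one argument. I want to create a new list object named 'arguments', where each element of the list is a sublist containing three elements ('presenter', 'time' and 'text').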

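Then I need to extract the presenter names and times into two character vectors (removing the indentation as well), and keep them as the 'presenter' and 'time' elements in the sublist for that argument.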

This is the text: 
 [1] "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was"  
  [2] "used to describe the work of brilliant students who explored and expanded the"    
  [3] "uses to which this new technology might be employed.  There was even talk of a"   
  [4] "\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark"  
  [5] "connotations, suggestion the actions of a criminal.  What is the hacker ethic,"   
  [6] "and does it survive?"                                                             
  [7] ""                                                                                 
  [8] "ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It"  
  [9] "survives in anyone excited by technology's power to turn many small,"             
 [10] "insignificant things into one vast, beautiful thing.  It is a fraud because"      
 [11] "there is nothing magical about computers that causes a user to undergo"           
 [12] "religious conversion and devote himself to the public good.  Early automobile"    
 [13] "inventors were hackers too.  At first the elite drove in luxury.  Later"          
 [14] "practically everyone had a car.  Now we have traffic jams, drunk drivers, air"    
 [15] "pollution, and suburban sprawl.  The old magic of an automobile occasionally"     
 [16] "surfaces, but we possess no delusions that it automatically invades the"          
 [17] "consciousness of anyone who sits behind the wheel.  Computers are power, and"     
 [18] "direct contact with power can bring out the best or worst in a person.  It's"     
 [19] "tempting to think that everyone exposed to the technology will be grandly"        
 [20] "inspired, but, alas, it just ain't so."                                           
 [21] ""                                                                                 
 [22] "BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is"     
 [23] "avoiding waste; insisting on using idle computer power -- often hacking into a"   
 [24] "system to do so, while taking the greatest precautions not to damage the"         
 [25] "system.  A second goal of many hackers is the free exchange of  technical"        
 [26] "information.  These hackers feel that patent and copyright restrictions slow"     
 [27] "down technological advances.  A third goal is the advancement of human"           
 [28] "knowledge for its own sake.  Often this approach is unconventional.  People we"   
 [29] "call crackers often explore systems and do mischief.  The are called hackers by"  
 [30] "the press, which doesn't understand the issues."                                  
 [31] ""                                                                                 
 [32] "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the"    
 [33] "explorations of basement tinkerers were very local.  Once we all became"          
 [34] "connected, the work of these investigations rippled through the world.  today"    
 [35] "the hacking spirit is alive and kicking in video, satellite TV, and radio.  In"   
 [36] "some fields they are called chippers, because the modify and peddle altered"      
 [37] "chips.  Everything that was once said about \"phone phreaks\" can be said about"  
 [38] "them too."

I have tried counting the number of arguments:

length(grep("^([A-Z]+'*[A-Z]*)", text_data))
arguments = list(presenters = regmatches(text_data, regexpr("^([A-Z]+'*[A-Z]*)", text_data)), time = regmatches(text_data, regexpr("(\\[.*\\])", text_data)), text = regmatches(paste(unlist(text_data), collapse = " "), regexpr("(:\\s.*)", paste(unlist(text_data), collapse = " "))))
text_data

The length of the list 'arguments' should be 55.

An example of the output: (image: example data output format)

Thank you very much for your help.

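Given the way you want to capture the text, this regex should do the job. It captures the presenter, time, and text in three groups, and re.findall finds all the matches and puts them in a list, where each match is a tuple holding those three pieces of information as separate elements. Here is the regex: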

(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)
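
One way to read this pattern piece by piece is to rewrite it with Python's re.VERBOSE flag, which allows whitespace and comments inside the pattern. The sketch below is illustrative only: the name ARGUMENT_RE and the short sample text are made up for the example and do not come from the post.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
# ARGUMENT_RE and the sample text are illustrative names, not from the original post.
ARGUMENT_RE = re.compile(r"""
    (.*?)            # group 1: presenter, as few characters as possible
    \s+              # whitespace between presenter and time
    (\[[^[\]]*\])    # group 2: time, a bracketed span with no nested brackets
    :\s*             # colon after the time, then optional whitespace
    ([\w\W]*?)       # group 3: text, any characters including newlines (lazy)
    (?=\n\n|\Z)      # stop before a blank line or at the end of the input
""", re.VERBOSE)

sample = "AA [Day 1, 9:00 A.M.]:  first line\nsecond line\n\nBB [Day 1, 9:25 A.M.]:  other text"
matches = ARGUMENT_RE.findall(sample)
print(len(matches))
print(matches[0])
```

The lookahead at the end is what lets one argument span several lines: the lazy third group keeps growing until the next blank line or the end of the string.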


Sample Python code:

import re

s = """HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed.  There was even talk of a
\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal.  What is the hacker ethic,
and does it survive? 

ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing.  It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good.  Early automobile
inventors were hackers too.  At first the elite drove in luxury.  Later
practically everyone had a car.  Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl.  The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel.  Computers are power, and
direct contact with power can bring out the best or worst in a person.  It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system.  A second goal of many hackers is the free exchange of  technical
information.  These hackers feel that patent and copyright restrictions slow
down technological advances.  A third goal is the advancement of human
knowledge for its own sake.  Often this approach is unconventional.  People we
call crackers often explore systems and do mischief.  The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local.  Once we all became
connected, the work of these investigations rippled through the world.  today
the hacking spirit is alive and kicking in video, satellite TV, and radio.  In
some fields they are called chippers, because the modify and peddle altered
chips.  Everything that was once said about \"phone phreaks\" can be said about
them too."""

argument = re.findall(r'(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)', s)
print(argument)

This prints a list of tuples, each with three items (presenter, time, text):
[("HARPER'S", '[Day 1, 9:00 A.M.]', 'When the computer was young, the word hacking was\nused to describe the work of brilliant students who explored and expanded the\nuses to which this new technology might be employed.  There was even talk of a\n"hacker ethic."  Somehow, in the succeeding years, the word has taken on dark\nconnotations, suggestion the actions of a criminal.  What is the hacker ethic,\nand does it survive? '), ('ADELAIDE', '[Day 1, 9:25 A.M.]', "the hacker ethic survives, and it is a fraud.  It\nsurvives in anyone excited by technology's power to turn many small,\ninsignificant things into one vast, beautiful thing.  It is a fraud because\nthere is nothing magical about computers that causes a user to undergo\nreligious conversion and devote himself to the public good.  Early automobile\ninventors were hackers too.  At first the elite drove in luxury.  Later\npractically everyone had a car.  Now we have traffic jams, drunk drivers, air\npollution, and suburban sprawl.  The old magic of an automobile occasionally\nsurfaces, but we possess no delusions that it automatically invades the\nconsciousness of anyone who sits behind the wheel.  Computers are power, and\ndirect contact with power can bring out the best or worst in a person.  It's\ntempting to think that everyone exposed to the technology will be grandly\ninspired, but, alas, it just ain't so."), ('BRAND', '[Day 1, 9:54 A.M.]', "The hacker ethic involves several things.  One is\navoiding waste; insisting on using idle computer power -- often hacking into a\nsystem to do so, while taking the greatest precautions not to damage the\nsystem.  A second goal of many hackers is the free exchange of  technical\ninformation.  These hackers feel that patent and copyright restrictions slow\ndown technological advances.  A third goal is the advancement of human\nknowledge for its own sake.  Often this approach is unconventional.  People we\ncall crackers often explore systems and do mischief.  The are called hackers by\nthe press, which doesn't understand the issues."), ('KK', '[Day 1, 11:19 A.M.]', 'The hacker ethic went unnoticed early on because the\nexplorations of basement tinkerers were very local.  Once we all became\nconnected, the work of these investigations rippled through the world.  today\nthe hacking spirit is alive and kicking in video, satellite TV, and radio.  In\nsome fields they are called chippers, because the modify and peddle altered\nchips.  Everything that was once said about "phone phreaks" can be said about\nthem too.')]
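
Since the question asks for a list named arguments whose elements each hold a presenter, a time and a text, the tuples returned by re.findall can be reshaped into that form. What follows is a minimal sketch, assuming a short made-up sample in place of the full transcript. The helper name to_arguments is hypothetical, and collapsing the line breaks inside each text is an optional choice that the answer above does not make.

```python
import re

PATTERN = r'(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)'

def to_arguments(s):
    """Return a list of dicts with 'presenter', 'time' and 'text' keys (hypothetical helper)."""
    arguments = []
    for presenter, time, text in re.findall(PATTERN, s):
        arguments.append({
            'presenter': presenter.strip(),
            'time': time,
            # Collapse line breaks and repeated spaces into single spaces.
            'text': ' '.join(text.split()),
        })
    return arguments

# Made-up two-argument sample standing in for the full transcript.
sample = ("KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed\n"
          "early on.\n"
          "\n"
          "BRAND [Day 1, 9:54 A.M.]:  One is avoiding waste.")
arguments = to_arguments(sample)
print(len(arguments))          # number of arguments found
print(arguments[0]['text'])
```

The length of the resulting list is the argument count the question asks about.
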
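
import re

# Assumed sample input: the first line of one argument, including its newline
line = "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was\n"
matchObj = re.search(r'(.*?)\[(.*?)\](.*\s)', line)
print(matchObj.group(1))  # presenter (with a trailing space)
print(matchObj.group(2))  # time, without the brackets
print(matchObj.group(3))  # rest of the line, starting with the colon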

This may help. If you want to change the logic, you can extract characters with groups, which you define inside "()" parentheses.
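
To illustrate the point about groups: each pair of parentheses captures one piece of the line, and naming the groups makes the result easier to read. The following is a sketch only. The pattern HEADER_RE and the group names presenter, time and text are assumptions chosen for this example, applied to the header line of one argument.

```python
import re

# Named groups: (?P<name>...) captures the part of the line matched inside the parentheses.
HEADER_RE = re.compile(r"^(?P<presenter>[^\[]*?)\s*\[(?P<time>[^\]]*)\]:\s*(?P<text>.*)$")

line = "ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It"
match = HEADER_RE.match(line)
if match:
    parts = match.groupdict()
    print(parts["presenter"])
    print(parts["time"])
    print(parts["text"])
```

Changing what is captured then only means moving or editing the parentheses, for example placing the brackets inside the time group to keep them in the result.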

library(magrittr)
library(data.table)

text2df <- function(text) {
    # Block boundaries: the first line, every blank separator line, and the last line
    idx <- c(1, which(text == ""), length(text))
    apply(matrix(c(idx[-length(idx)], idx[-1]), ncol = 2), 1, function(id1_id2) {
        presenter_text <- text[id1_id2[1]:id1_id2[2]]
        # presenter_text[1] can be "" (the separator), so join two rows and trim
        first_row <- trimws(paste(presenter_text[1:2], collapse = " "))
        presenter_name <- strsplit(first_row, split = " [", fixed = TRUE)[[1]][1]
        presentation_time <- strsplit(first_row, split = "]: ", fixed = TRUE)[[1]][1] %>% 
            gsub(paste0(presenter_name, " ["), "", ., fixed = TRUE)
        presentation_text <- paste(c(
            gsub(paste0(presenter_name, " [", presentation_time, "]:"), "", first_row, fixed = TRUE) %>% 
                stringi::stri_trim_left() # remove leading spaces
            , presenter_text[3:length(presenter_text)] %>% .[!is.na(.)] # filter NA if only one row of text
        ), collapse = " ") %>% trimws() # join rows with spaces so words do not run together
        data.table(presenter = presenter_name, time = presentation_time, text = presentation_text)
    }) %>% rbindlist
}
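
The same block-splitting idea (cut the character vector at the empty strings, then parse the first row of each block) can be sketched in Python for comparison. This is not a translation of the R function above. The function name split_arguments and the shortened sample lines are assumptions made for the example.

```python
import re

def split_arguments(lines):
    """Group a list of lines into arguments, using empty strings as separators."""
    blocks, current = [], []
    for line in lines:
        if line.strip() == "":
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line.strip())
    if current:
        blocks.append(current)

    rows = []
    for block in blocks:
        joined = " ".join(block)  # join rows with spaces so words do not run together
        m = re.match(r"(.+?)\s*\[(.+?)\]:\s*(.*)", joined)
        if m:  # skip blocks that do not start with "NAME [time]:"
            rows.append({"presenter": m.group(1), "time": m.group(2), "text": m.group(3)})
    return rows

# Shortened sample in the same shape as the character vector in the question.
text_data = [
    "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young,",
    "and does it survive?",
    "",
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed",
    "them too.",
]
rows = split_arguments(text_data)
for row in rows:
    print(row)
```

Working on the vector of lines directly avoids having to paste the whole transcript into one string first, which matches how the data is stored in the question.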

This is your input:

text_data = """HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed.  There was even talk of a
\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal.  What is the hacker ethic,
and does it survive? 

ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing.  It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good.  Early automobile
inventors were hackers too.  At first the elite drove in luxury.  Later
practically everyone had a car.  Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl.  The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel.  Computers are power, and
direct contact with power can bring out the best or worst in a person.  It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system.  A second goal of many hackers is the free exchange of  technical
information.  These hackers feel that patent and copyright restrictions slow
down technological advances.  A third goal is the advancement of human
knowledge for its own sake.  Often this approach is unconventional.  People we
call crackers often explore systems and do mischief.  The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local.  Once we all became
connected, the work of these investigations rippled through the world.  today
the hacking spirit is alive and kicking in video, satellite TV, and radio.  In
some fields they are called chippers, because the modify and peddle altered
chips.  Everything that was once said about \"phone phreaks\" can be said about
them too."""

Using regex:

Extract the three variables:
import re
argument = re.findall(r"(?P<presenter>[A-Z']+).\[(?P<time>\w.+)\].\s+(?P<text>[\w\W]*?)(?=\n\n|\Z)", text_data)

In case you want to turn them into a dictionary:

mydict = {'presenter':[],'time':[],'text':[]}
for i in argument:
    mydict['presenter'].append(i[0])
    mydict['time'].append(i[1])
    mydict['text'].append(i[2])

Or if you want to save them in a csv file:

import csv
with open("filename.csv", "w", newline="") as mycsv:  # newline="" is recommended by the csv module docs
    writers = csv.writer(mycsv)
    header = ['presenter','time','text']
    writers.writerow(header)
    for item in argument:
        writers.writerow(item)
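
Because each text field contains line breaks, it may be worth confirming that the CSV round-trips cleanly. The sketch below uses only the standard library and an in-memory buffer (io.StringIO) instead of a real file. The two sample tuples are made up and stand in for the argument list.

```python
import csv
import io

# Made-up stand-in for the list of (presenter, time, text) tuples.
argument = [
    ("HARPER'S", "Day 1, 9:00 A.M.", "When the computer was young,\nthe word hacking was used."),
    ("KK", "Day 1, 11:19 A.M.", 'Said about "phone phreaks" too.'),
]

buffer = io.StringIO(newline="")  # in-memory file; newline="" leaves line endings untouched
writer = csv.writer(buffer)
writer.writerow(["presenter", "time", "text"])
writer.writerows(argument)

# Read the data back; fields with commas, quotes or newlines were quoted by the writer.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows))
print(rows[0]["text"])
```

The csv module quotes such fields on its own, so the embedded newlines and double quotes come back unchanged when the file is read.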

To load your csv file:

import pandas as pd
df = pd.read_csv("filename.csv")
df

Output:

   presenter |  time              | text
--------------------------------------------------------------------------------------
0   HARPER'S |  Day 1, 9:00 A.M.  | When the computer was young, the word hacking ...
1   ADELAIDE |  Day 1, 9:25 A.M.  | the hacker ethic survives, and it is a fraud. ...
2   BRAND    |  Day 1, 9:54 A.M.  | The hacker ethic involves several things. One...
3   KK       |  Day 1, 11:19 A.M. | The hacker ethic went unnoticed early on becau...