Python extract text with line cuts
I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$ and
$ per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
“ ”.
</FONT>
I need to extract everything from "be between" (line 4) through "per share" (line 7). Here is the code I ran:
price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
print(price)
['of our common stock is expected to be between']
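I first look for "be between" and append that line, but everything that follows is lost because it sits on the next lines.
My desired output is:
['of our common stock is expected to be between $ and $ per share']
How can I do this?
Thank you very much.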
You need to decide when to add a line to price:
price = []
is_capturing = False
is_inside_per_share = False
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line and "per share" in line:
            price.append(line)
            is_capturing = False
        elif "be between" in line:
            is_capturing = True
        elif "per share" in line:
            # keep the text up to and including 'per share'
            price.append(line[:line.find('per share') + len('per share')].rstrip().replace('&nbsp;', ''))
            is_capturing = False
            is_inside_per_share = False
        elif line.strip().endswith("per"):
            is_inside_per_share = True
        elif line.strip().startswith("share") and is_inside_per_share:
            # keep only the leading 'share'
            price.append(line[:line.find('share') + len('share')].rstrip().replace('&nbsp;', ''))
            is_inside_per_share = False
            is_capturing = False
        if is_capturing:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
This is only a sketch, so you will probably need to tweak it a little.
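A more compact take on the same line-by-line idea, offered only as a sketch: keep collecting lines once the start marker has been seen, and stop as soon as the joined text contains the end marker, which also covers "per" and "share" falling on different lines. The function name `capture_phrase`, the `sample_lines` data and the assumption that the gaps after each `$` are runs of the literal entity `&nbsp;` are all illustrative, not taken from the original file.

```python
import re

def capture_phrase(lines, start="be between", end="per share"):
    """Join lines from the one containing `start` up to and including `end`."""
    captured = []
    for line in lines:
        # Assumption: gaps in the file are runs of the literal entity "&nbsp;".
        piece = re.sub(r"(?:&nbsp;)+", " ", line.strip())
        if not captured and start not in piece:
            continue  # still before the start marker
        captured.append(piece)
        joined = " ".join(captured)
        # Look for the end marker only after the start marker.
        stop = joined.find(end, joined.find(start))
        if stop != -1:
            return joined[:stop + len(end)]
    return None  # start or end marker never found

# Hypothetical lines standing in for test.txt.
sample_lines = [
    "market for our common stock. The initial public offering price\n",
    "of our common stock is expected to be between\n",
    "$&nbsp;&nbsp;&nbsp;and\n",
    "$&nbsp;&nbsp;&nbsp;per\n",
    "share. We intend to list our common stock on the Nasdaq National\n",
]
print([capture_phrase(sample_lines)])
# ['of our common stock is expected to be between $ and $ per share']
```

With a real file, the same function could be called as `capture_phrase(f)` inside a `with open("test.txt") as f:` block, since a file object iterates over its lines.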
The right way, using html.unescape and re.search:
import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
    m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
    if m:
        price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))
print(price_texts)
Output:
[' of our common stock is expected to be between $ and $ per share']
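If the phrase should start at "be between" rather than right after "price", a variation on the same idea might look like the sketch below. It works on an in-memory string so it can be tried without a file; `extract_price_phrase` and `SAMPLE` are hypothetical names, and the `&nbsp;` runs in `SAMPLE` are an assumption about what the raw file contains. The non-greedy `.*?` stops at the first "per share" instead of the last one.

```python
import re
from html import unescape

# Hypothetical stand-in for test.txt; the &nbsp; runs are an assumption.
SAMPLE = (
    '<P align="left">\n'
    '<FONT size="2">Prior to this offering, there has been no public\n'
    'market for our common stock. The initial public offering price\n'
    'of our common stock is expected to be between\n'
    '$&nbsp;&nbsp;&nbsp;and\n'
    '$&nbsp;&nbsp;&nbsp;per\n'
    'share. We intend to list our common stock on the Nasdaq National\n'
    'Market under the symbol\n'
    '</FONT>\n'
)

def extract_price_phrase(raw):
    """Return the text from 'be between' through 'per share', or None."""
    text = unescape(raw)  # &nbsp; becomes U+00A0, which \s matches
    m = re.search(r'be\s+between\b.*?\bper\s+share\b', text, re.DOTALL)
    if m is None:
        return None
    # Collapse every whitespace run (newlines, non-breaking spaces) to one space.
    return re.sub(r'\s+', ' ', m.group(0))

print(extract_price_phrase(SAMPLE))  # be between $ and $ per share
```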
This also works:
import re
with open('test.txt', 'r') as f:
    txt = f.read()
start = re.search('\n(.*?)be between\n', txt)
end = re.search('per(.*?)share', txt, re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace('&nbsp;', '').replace('\n', '').replace('and', ' and ')
print(['{} {} {}'.format(start.group().replace('\n', ''), output, end.group().replace('\n', ' '))])
Output:
['of our common stock is expected to be between $ and $ per share']
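One thing to keep in mind with this kind of code: `re.search` returns `None` when nothing matches, so calling `.span()` or `.group()` on the result raises `AttributeError` for a file that lacks either marker. A minimal sketch of a guarded variant, using `str.find` on text that has already been joined into one line; `slice_between` and the sample list are hypothetical names and data used only for illustration.

```python
def slice_between(text, start="be between", end="per share"):
    """Return text from `start` through `end`, or None if either is missing."""
    begin = text.find(start)
    if begin == -1:
        return None
    # Search for the end marker only after the start marker.
    stop = text.find(end, begin + len(start))
    if stop == -1:
        return None
    return text[begin:stop + len(end)]

# Hypothetical, already-cleaned lines joined with single spaces.
lines = ["is expected to be between", "$and", "$per", "share. We intend"]
one_line = " ".join(lines)

print(slice_between(one_line))           # be between $and $per share
print(slice_between("no markers here"))  # None
```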
A dirty way of doing it:
price = []
with open("test.txt", 'r') as f:
    for i, line in enumerate(f):
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;', ''))  # remove '\n' and '&nbsp;'
        if i > 3 and i <= 6:
            price.append(line.rstrip().replace('&nbsp;', ''))
print(str(price).split('.')[0] + "]")
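Here is another simple solution:
It collects all lines into one long string, finds the start index of 'be between' and the end index of 'per share', then takes the corresponding substring.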
from re import search

price = []
with open("test.txt", 'r') as f:
    one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace('&nbsp;', '')
start_index = search('be between', one_line_txt).span()[0]
end_index = search('per share', one_line_txt).span()[1]
price.append(one_line_txt[start_index:end_index])  # append returns None, so print separately
print(price)
Output:
['be between $and $per share']
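A detail worth checking in code like this: `list.append` modifies the list in place and returns `None`, so wrapping the call in `print` shows `None` rather than the list. A tiny illustration, using a literal string in place of the extracted text:

```python
price = []
result = price.append("be between $and $per share")  # append mutates in place

print(result)  # None
print(price)   # ['be between $and $per share']
```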
This also works:
import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('&nbsp;', ''))
text_file = " ".join(price)
be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)
Output:
"be between $and $per share"