Python: extract text that is split across lines

I am using Python 3.7 and have a test.txt file that looks like this:

<P align="left">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
&#147;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#148;.
</FONT>

I need to extract everything from "be between" (line 4) up to and including "per share" (which ends on line 7). This is the code I ran:

price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'
print(price)

Output:

['of our common stock is expected to be between']

I first look for "be between" and append that line, but everything after it is lost because it sits on the following lines.

The output I want is:

['of our common stock is expected to be between $ and $ per share']

How can I do this? Thank you very much.

You need to decide when to append a line to price:

price = []
is_capturing = False          # True once "be between" has been seen
is_inside_per_share = False   # True when the previous captured line ended with "per"
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line and "per share" in line:
            # Both markers are on the same line: take it and stop
            price.append(line.rstrip().replace('&nbsp;',''))
            is_capturing = False
        elif "be between" in line:
            is_capturing = True
        elif is_capturing and "per share" in line:
            # Keep the text up to and including "per share"
            price.append(line[:line.find('per share') + len('per share')].rstrip().replace('&nbsp;',''))
            is_capturing = False
            is_inside_per_share = False
        elif is_capturing and line.strip().endswith("per"):
            is_inside_per_share = True
        elif line.strip().startswith("share") and is_inside_per_share:
            # "per" ended the previous line: keep only the word "share"
            price.append(line[:line.find('share') + len('share')].rstrip().replace('&nbsp;',''))
            is_inside_per_share = False
            is_capturing = False

        if is_capturing:
            price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'

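This is only a sketch, so you may need to adjust it a bit. Note that price ends up holding one entry per captured line; use ' '.join(price) to get a single string.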
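The flag-based loop above can be condensed if the captured lines are joined as they arrive and the end marker is searched for in the joined text, which also covers "per" and "share" landing on different lines. The following is a minimal sketch under that assumption, not the answerer's code; the function name capture_between and the shortened sample lines are made up for illustration.

```python
def capture_between(lines, start="be between", end="per share"):
    """Return the text from the line containing `start` through `end`, or None."""
    buffer = []
    capturing = False
    for line in lines:
        if not capturing and start in line:
            capturing = True
        if capturing:
            # Turn each &nbsp; into a space, then collapse runs of whitespace.
            buffer.append(" ".join(line.replace("&nbsp;", " ").split()))
            joined = " ".join(buffer)
            # Look for the end marker only after the start marker.
            pos = joined.find(end, joined.find(start))
            if pos != -1:
                return joined[:pos + len(end)]
    return None


# Shortened stand-in for the lines of test.txt (hypothetical sample data).
sample = [
    "market for our common stock. The initial public offering price\n",
    "of our common stock is expected to be between\n",
    "$&nbsp;&nbsp;&nbsp;and\n",
    "$&nbsp;&nbsp;per\n",
    "share. We intend to list our common stock on the Nasdaq National\n",
]
print([capture_between(sample)])
# ['of our common stock is expected to be between $ and $ per share']
```

With the real file, passing the open file object (for example capture_between(f) inside a with open("test.txt") block) should behave the same way, since a file object iterates line by line.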

The right way, using html.unescape and re.search:

import re
from html import unescape

price_texts = []
with open("test.txt", 'r') as f:
    content = unescape(f.read())
    m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
    if m:
        price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))

print(price_texts)

Output:

[' of our common stock is expected to be between $ and $ per share']
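The spacing in this output comes out right because html.unescape turns each &nbsp; into a non-breaking space character (U+00A0), which \s matches in a str pattern. A small sketch of that step in isolation, using a shortened, made-up fragment of the file:

```python
import re
from html import unescape

# Hypothetical, shortened fragment of test.txt.
raw = "be between\n$&nbsp;&nbsp;&nbsp;and\n$&nbsp;&nbsp;per\nshare"

decoded = unescape(raw)
# Every &nbsp; entity is now the single character '\xa0'.
print(repr(decoded))

# \s matches '\xa0' as well as '\n', so each run collapses to one space.
normalized = re.sub(r'\s+', ' ', decoded)
print(normalized)
# be between $ and $ per share
```

By contrast, .replace('&nbsp;', '') removes the entities entirely, which is why some of the other answers print "$and" and "$per".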

This also works:

import re

with open('test.txt','r') as f:
   txt = f.read()

start = re.search('\n(.*?)be between\n',txt)
end = re.search('per(.*?)share',txt,re.DOTALL)
output = txt[start.span()[1]:end.span()[0]].replace('&nbsp;','').replace('\n','').replace('and',' and ')
print(['{} {} {}'.format(start.group().replace('\n',''),output,end.group().replace('\n', ' '))])

Output:

['of our common stock is expected to be between $ and $ per share']

A quick and dirty approach:

    price = []
    with open("test.txt", 'r') as f:
        for i, line in enumerate(f):
            if "be between" in line:
                price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'
            if i > 3 and i <= 6:  # hard-coded: the three lines after "be between"
                price.append(line.rstrip().replace('&nbsp;',''))
    # Join the pieces and cut at the first '.', i.e. right after "per share"
    print([' '.join(price).split('.')[0]])
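If hard-coded line numbers are acceptable, itertools.islice can select the line range directly instead of comparing the index by hand. This is only a sketch of that variation: it assumes the sentence always occupies lines 4 to 7, and it uses io.StringIO with shortened entity runs as a stand-in for the real file.

```python
from io import StringIO
from itertools import islice

# Stand-in for open("test.txt"); the &nbsp; runs are shortened for readability.
f = StringIO(
    '<P align="left">&nbsp;\n'
    '<FONT size="2">Prior to this offering, there has been no public\n'
    'market for our common stock. The initial public offering price\n'
    'of our common stock is expected to be between\n'
    '$&nbsp;&nbsp;and\n'
    '$&nbsp;per\n'
    'share. We intend to list our common stock on the Nasdaq National\n'
)

# islice(f, 3, 7) yields the lines with zero-based index 3, 4, 5 and 6.
pieces = [line.rstrip().replace('&nbsp;', '') for line in islice(f, 3, 7)]
# Join the pieces and keep only what precedes the first '.'.
result = ' '.join(pieces).split('.')[0]
print([result])
# ['of our common stock is expected to be between $and $per share']
```

Like the answer above, this breaks as soon as the layout of the file changes, so it is only suitable for a one-off extraction.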

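Here is another simple solution: it collects all lines into one long string, finds the start index of 'be between' and the end index of 'per share', then takes the corresponding substring.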

    from re import search
    price = []
    with open("test.txt", 'r') as f:
        one_line_txt = ''.join(f.readlines()).replace('\n', ' ').replace('&nbsp;','')
    start_index = search('be between', one_line_txt).span()[0]
    end_index = search('per share', one_line_txt).span()[1]
    # list.append returns None, so append first and print the list afterwards
    price.append(one_line_txt[start_index:end_index])
    print(price)

Output:

['be between $and $per share']

This also works:

import re

price = []
with open("test.txt", 'r') as f:
    for line in f:
        price.append(line.rstrip().replace('&nbsp;',''))
text_file = " ".join(price)

be_start = re.search("be between", text_file).span()[0]
share_end = re.search("per share", text_file).span()[1]
final_file = text_file[be_start:share_end]
print(final_file)

Output:

be between $and $per share
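The last two answers print "$and" and "$per" because the &nbsp; entities are deleted rather than turned into spaces. One possible adjustment, sketched here with a hypothetical helper name (extract_price_sentence) and a shortened sample, is to collapse every run of &nbsp; entities and whitespace into a single space before searching:

```python
import re


def extract_price_sentence(text):
    """Return the text from 'be between' through 'per share', or None."""
    # Treat each run of &nbsp; entities and whitespace (newlines included) as one space.
    flat = re.sub(r'(?:&nbsp;|\s)+', ' ', text)
    # Non-greedy match so the result stops at the first "per share".
    m = re.search(r'be between.*?per share', flat)
    return m.group(0) if m else None


# Shortened stand-in for the contents of test.txt (hypothetical sample data).
sample = (
    "of our common stock is expected to be between\n"
    "$&nbsp;&nbsp;&nbsp;&nbsp;and\n"
    "$&nbsp;&nbsp;per\n"
    "share. We intend to list our common stock\n"
)
print(extract_price_sentence(sample))
# be between $ and $ per share
```

For the real file, the same function can be called on f.read() inside a with open("test.txt") block. To include the beginning of the sentence as in the desired output, the pattern could start at an earlier marker instead of 'be between'.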