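Python: pulling bold text and the text that follows
Using the HTML below, I want to extract two pieces of data and add them to a list in Python. Each piece of bold text is a horse's name, and the text that follows it is the comment on that horse.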
<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
<br>
<br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.
She saw it out well and it´ll be interesting to see how she copes with a rise.
<br>
<br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.
<br>
<br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.
<br>
<br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]
<br>
<br>
<div id="resultRaceReport" class="hide"></div>
</div>
Given the HTML above, I would like the output to look like this:
[LADY MAKFI, showed vastly improved form to shed her maiden tag on
this seasonal debut for a new yard. The filly offered little for Tony
Martin last year, but did show some ability on her debut and is
evidently capable when fresh. She saw it out well and it´ll be
interesting to see how she copes with a rise.]
[Weardiditallgorong, went down fighting over this longer trip and
probably improved again on her last-time-out second at Bath. This was
her best effort yet on the AW.]
[Chauvelin, in second-time blinkers, turned in his most encouraging
effort for some time and is certainly well treated on his best form.]
[Happy Jack, not for the first time travelled easily until making
heavy weather of it when asked for his effort. [David Orton]]
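But I'm not sure how to get the desired output (more specifically, the logic behind it).
I currently use lxml to scrape the content, and I need to match the bold text (the horse name) against my table so that I can add the comment (the text after the bold) to my database.
Using lxml: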
h = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.<br><br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.<br><br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.<br><br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.<br><br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]<br><br> <div id="resultRaceReport" class="hide"></div></div>"""
from lxml import html
x = html.fromstring(h)
div = x.xpath("//*[@id='ANALYSIS']")[0]
# find bold tags by class name
for b in div.xpath(".//b[@class='black']"):
    # get bold text (the horse name)
    print(b.text)
    # get the first text node after the bold tag, i.e. the comment up to the next <br>
    print(b.xpath("./following::text()[1]"))
This gives you:
LADY MAKFI
[u' showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.']
Weardiditallgorong
[' went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.']
Chauvelin
[', in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.']
Happy Jack
[' not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
If you want everything in a single list, formatted exactly as posted:
from lxml import html
x = html.fromstring(h)
div = x.xpath("//*[@id='ANALYSIS']")[0]
out = [b.text + "," + b.xpath("./following::text()[1]")[0].lstrip(",") for b in div.xpath(".//b[@class='black']")]
This gives you:
[u'LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.',
'Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
'Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
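Since the goal is to match each horse name against a table, a dict keyed by name may be easier to work with than joined strings. The following is only a sketch under assumptions: `comments_by_horse` is a hypothetical helper, the sample markup is shortened, and it reads lxml's `tail` attribute (the text directly after an element's closing tag) instead of the `following::text()` XPath used above.

```python
from lxml import html

# Shortened sample markup, assumed to have the same shape as the page in the question.
h = (
    '<div id="ANALYSIS">Intro.<br><br> '
    '<b class="black">LADY MAKFI</b> showed form.<br><br> '
    '<b class="black">Chauvelin</b>, in blinkers, ran well.<br><br></div>'
)

def comments_by_horse(markup):
    """Return {horse name: comment} for every <b class="black"> in the markup."""
    root = html.fromstring(markup)
    result = {}
    for b in root.xpath(".//b[@class='black']"):
        name = (b.text or "").strip()
        # b.tail is the text between </b> and the next tag (here, the next <br>).
        comment = (b.tail or "").strip().lstrip(",").strip()
        result[name] = comment
    return result

print(comments_by_horse(h))
```

Like the XPath version, this only captures text up to the next tag, so a comment containing inline markup would be cut short.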
I prefer Beautiful Soup's API to using lxml directly. I can avoid XPath entirely and just write Python.
import bs4
# `document` is the HTML string from the question
soup = bs4.BeautifulSoup(document, 'lxml')
[b.text + b.next_sibling.rstrip() for b in soup.find_all('b')]
Output:
['LADY MAKFI showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.\n She saw it out well and it´ll be interesting to see how she copes with a rise.',
'Weardiditallgorong went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
'Happy Jack not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
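`b.next_sibling` returns only the single node right after the bold tag, so it would stop early if a comment ever contained inline markup such as `<i>`. A rough sketch of one way around that, assuming comments run until the next `<b>` or `<div>`: `name_comment_pairs` is a hypothetical helper, the sample markup (including the `<i>` tag) is made up for illustration, and the stdlib `html.parser` backend is used here rather than `lxml`.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Made-up sample: same layout as the question, plus an inline <i> inside one comment.
document = (
    '<div id="ANALYSIS">Intro.<br><br> '
    '<b class="black">LADY MAKFI</b> showed <i>vastly</i> improved form.<br><br> '
    '<b class="black">Chauvelin</b>, in blinkers, ran well.<br><br>'
    '<div id="resultRaceReport" class="hide"></div></div>'
)

def name_comment_pairs(markup):
    """Return a list of (horse name, comment) tuples."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for b in soup.find_all("b", class_="black"):
        parts = []
        for node in b.next_siblings:
            # Stop at the next horse name or at the trailing report <div>.
            if isinstance(node, Tag) and node.name in ("b", "div"):
                break
            if isinstance(node, NavigableString):
                parts.append(str(node))
            elif node.name != "br":
                # Keep the text of inline tags; skip <br> separators.
                parts.append(node.get_text())
        # Collapse runs of whitespace and drop a leading comma, if any.
        comment = " ".join("".join(parts).split()).lstrip(", ")
        pairs.append((b.get_text(strip=True), comment))
    return pairs

print(name_comment_pairs(document))
```

Returning tuples keeps the name and comment separate, which should make the database lookup simpler than splitting a joined string later.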