Scraping an HTML table into a CSV file in a specific format using Python
I want to extract the following HTML table content from a link in a specific format.
HTML page source:
<table>
<thead>
<tr>
<th>name</th>
<th>brand</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://abcd.com"><span style="color: #000; min-width: 160px;">abcd</span></a></td>
<td><a href="http://abcd.com" target="_blank"><span style="color: #000;">abcd123</span></a></td>
<td><a href="http://abcd.com" target="_blank"><span style="color: #000;">abcd 123 (1g)</span></a><br/></td>
</tr>
<tr>
<td><a href="http://efgh.com" target="_blank"><span style="color: #000; min-width: 160px;">efgh</span></a></td>
<td><a href="http://efgh.com" target="_blank"><span style="color: #000;">efgh456</span></a></td>
<td><a href="http://efgh.com" target="_blank"><span style="color: #000;">efgh 456 (2g)</span></a><br/></td>
</tr>
<tr>
<td><a href="http://ijkl.com" target="_blank"><span style="color: #000; min-width: 160px;">ijkl</span></a></td>
<td><a href="http://ijkl.com" target="_blank"><span style="color: #000;">ijkl789</span></a></td>
<td><a href="http://ijkl.com" target="_blank"><span style="color: #000;">ijkl 789 (3g)</span></a><br/></td>
</tr>
</tbody>
</table>
The required output format in the CSV file is:
Link,name,brand,description
http://abcd.com,abcd,abcd123,abcd 123 (1g)
http://efgh.com,efgh,efgh456,efgh 456 (2g)
http://ijkl.com,ijkl,ijkl789,ijkl 789 (3g)
Below is my code:
rows = doc.xpath("//table")
for tr in rows:
    tds = tr.xpath("//td")
    for td in tds:
        Link = td.xpath("//td[1]/a/@href")
        name = td.xpath("//td[1]//text()")
        brand = td.xpath("//td[2]//text()")
        description = td.xpath("//td[3]//text()")
    results = []
    results.append(Link)
    results.append(name)
    results.append(brand)
    results.append(description)
    for result in results:
        writer.writerow(result)
I don't know how to get the data into the CSV in the specific format shown above.
Try the following. Each xpath call returns a list, so you can concatenate the lists to build each row:
import csv
from lxml import html

# html_text is assumed to hold the HTML source shown in the question
doc = html.fromstring(html_text)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    # Visit only rows that contain <td> cells, which skips the <th> header row
    for tr in doc.xpath("//table//tr[td]"):
        # No leading "//": these paths are relative to the current row.
        # A path starting with "//" would match cells from every row at once.
        Link = tr.xpath("td[1]/a/@href")
        name = tr.xpath("td[1]//text()")
        brand = tr.xpath("td[2]//text()")
        description = tr.xpath("td[3]//text()")
        csv_output.writerow(Link + name + brand + description)
This gives you a CSV file like:
http://abcd.com,abcd,abcd123,abcd 123 (1g)
http://efgh.com,efgh,efgh456,efgh 456 (2g)
http://ijkl.com,ijkl,ijkl789,ijkl 789 (3g)
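For anyone who wants to try the per-row idea without installing anything, here is a minimal sketch using only the standard library. It swaps lxml for xml.etree.ElementTree, which only works because the sample table happens to be well-formed XML; real-world HTML usually is not, so lxml or BeautifulSoup remain the safer choice there. The names SAMPLE and table_to_csv are made up for this illustration, and the sample is a trimmed copy of the question's table.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table (style/target attributes dropped).
# Assumption: the markup is well-formed XML, otherwise ET.fromstring raises.
SAMPLE = """
<table>
<thead><tr><th>name</th><th>brand</th><th>description</th></tr></thead>
<tbody>
<tr>
<td><a href="http://abcd.com"><span>abcd</span></a></td>
<td><a href="http://abcd.com"><span>abcd123</span></a></td>
<td><a href="http://abcd.com"><span>abcd 123 (1g)</span></a><br/></td>
</tr>
<tr>
<td><a href="http://efgh.com"><span>efgh</span></a></td>
<td><a href="http://efgh.com"><span>efgh456</span></a></td>
<td><a href="http://efgh.com"><span>efgh 456 (2g)</span></a><br/></td>
</tr>
</tbody>
</table>
"""


def table_to_csv(markup):
    """Return CSV text: a header line, then one line per body row."""
    table = ET.fromstring(markup.strip())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Link", "name", "brand", "description"])
    for tr in table.findall("./tbody/tr"):
        # "td" is looked up relative to this row, so each pass sees 3 cells
        cells = tr.findall("td")
        link = cells[0].find("a").get("href")
        texts = ["".join(td.itertext()).strip() for td in cells]
        # One flat list per row: the link followed by each cell's text
        writer.writerow([link] + texts)
    return out.getvalue()


print(table_to_csv(SAMPLE))
```

The point to take away is that every lookup inside the loop starts from the current row, and that writerow receives one flat list per row rather than a list of lists.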
You can use BeautifulSoup:
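from bs4 import BeautifulSoup as soup
import csv

# data is assumed to hold the HTML source shown in the question
parsed = soup(data, 'html.parser')
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    header = ['Link'] + [th.text for th in parsed.find_all('th')]
    # Collect [href, text] for every cell; [1:] drops the header row, which has no <td>
    final_results = [[[td.find('a')['href'], td.text] for td in tr.find_all('td')] for tr in parsed.find_all('tr')][1:]
    # Each CSV row is the first cell's link followed by the text of every cell
    write.writerows([header] + [[row[0][0], *[cell[-1] for cell in row]] for row in final_results])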
Output:
Link,name,brand,description
http://abcd.com,abcd,abcd123,abcd 123 (1g)
http://efgh.com,efgh,efgh456,efgh 456 (2g)
http://ijkl.com,ijkl,ijkl789,ijkl 789 (3g)