Get HTML inside div tag Python Scrapy
I have this spider in Scrapy:
import scrapy


class ProductDetailSpider(scrapy.Spider):
    name = 'productdetail'
    allowed_domains = ['bisabeli.id']
    start_urls = [
        "https://bisabeli.id/index.php?route=product/product&product_id=7874"
    ]

    def parse(self, response):
        description = response.css('div.producttab div#tab-description div#collapse-description').get().strip()
        yield {
            'description': description,
        }
When I run this spider in the console:
scrapy crawl productdetail
the result is:
{
"description": "<div id=\"co
llapse-description\" class=\"desc-collapse showup\">\n\t\t\t\t\t\t\t\t\t\t<p>Deskripsi Realme X2 Pro 12GB/256GB - Pre Order - Biru</p><p>barang repack</p><p>Apa itu Repack :</p><p>Kartu garansi Konsumen dpt lembar pembeli lalu penjual ambil bagian KG l
embar penjual, utk kami bantu daftarkan klaim warranty nya nanti 1 tahun</p><p><br></p><p><br></p><p>Garansi resmi 1 tahun</p><p><br></p><p>OS Android</p><p>OS ver Android 9.0 (Pie); ColorOS 6.1</p><p>SIM Nano SIM , Dual SIM , Dual Standby</p><p>CPU
Qualcomm SDM855 Snapdragon 855+ (7nm)</p><p>Octa-core</p><p>Kecepatan CPU 2.96 GHz (1x2.96 GHz Kryo 485 & 3x2.42 GHz Kryo 485 & 4x1.8 GHz Kryo 485)</p><p>Storage 64GB , 128GB , 256GB</p><p>RAM 8GB , 12GB , 6GB</p><p>External Storage No</p><
p>Battery 4000mAh</p><p>50W SuperVOOC Flash Charge</p><p>Ukuran Layar 6.5 inches</p><p>Resolusi FHD+ 2400 x 1080 pixels at 402 ppi</p><p>Super AMOLED, 90Hz display</p><p>Network Tipe 2G , 3G , 4G (LTE)</p><p>2G GSM: 850/900/1800/1900</p><p>3G WCDMA:
B1/B2/B4/B5/B6/B8/B19</p><p>4G (LTE) LTE FDD: B1/B2/B3/B4/B5/B7/B8/B12/B17/B18/B19/B20/B26/B28</p><p>TD-LTE: B34/B38/B39/B40/B41</p><p>Speed HSPA 42.2/11.5 Mbps, LTE-A</p><p>Kamera Utama 64MP + 13MP + 8MP + 2MP</p><p>Kamera Depan 16MP</p><p>Fitur W
i-Fi , Hotspot/Tethering , GPS , Bluetooth , Flash , Fingerprint Scanner , NFC , 3.5mm Headphone Jack , Quad Cameras</p><p>Ukuran Dimensi 161 x 75.7 x 8.7 mm</p>\n\t\t\t\t\t\t\t\t\t</div>"}
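What I want is everything inside the div tag ('<div id="collapse-description" class="desc-collapse showup">' ... '</div>'), without the div tag itself.
How should I write the code?
Update
The result I want looks like this: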
{
"description": "\n\t\t\t\t\t\t\t\t\t\t<p>Deskripsi Realme X2 Pro 12GB/256GB - Pre Order - Biru</p><p>barang repack</p><p>Apa itu Repack :</p><p>Kartu garansi Konsumen dpt lembar pembeli lalu penjual ambil bagian KG l
embar penjual, utk kami bantu daftarkan klaim warranty nya nanti 1 tahun</p><p><br></p><p><br></p><p>Garansi resmi 1 tahun</p><p><br></p><p>OS Android</p><p>OS ver Android 9.0 (Pie); ColorOS 6.1</p><p>SIM Nano SIM , Dual SIM , Dual Standby</p><p>CPU
Qualcomm SDM855 Snapdragon 855+ (7nm)</p><p>Octa-core</p><p>Kecepatan CPU 2.96 GHz (1x2.96 GHz Kryo 485 & 3x2.42 GHz Kryo 485 & 4x1.8 GHz Kryo 485)</p><p>Storage 64GB , 128GB , 256GB</p><p>RAM 8GB , 12GB , 6GB</p><p>External Storage No</p><
p>Battery 4000mAh</p><p>50W SuperVOOC Flash Charge</p><p>Ukuran Layar 6.5 inches</p><p>Resolusi FHD+ 2400 x 1080 pixels at 402 ppi</p><p>Super AMOLED, 90Hz display</p><p>Network Tipe 2G , 3G , 4G (LTE)</p><p>2G GSM: 850/900/1800/1900</p><p>3G WCDMA:
B1/B2/B4/B5/B6/B8/B19</p><p>4G (LTE) LTE FDD: B1/B2/B3/B4/B5/B7/B8/B12/B17/B18/B19/B20/B26/B28</p><p>TD-LTE: B34/B38/B39/B40/B41</p><p>Speed HSPA 42.2/11.5 Mbps, LTE-A</p><p>Kamera Utama 64MP + 13MP + 8MP + 2MP</p><p>Kamera Depan 16MP</p><p>Fitur W
i-Fi , Hotspot/Tethering , GPS , Bluetooth , Flash , Fingerprint Scanner , NFC , 3.5mm Headphone Jack , Quad Cameras</p><p>Ukuran Dimensi 161 x 75.7 x 8.7 mm</p>\n\t\t\t\t\t\t\t\t\t"}
Update v2
I tried some code:
description = response.xpath("//div[@class='producttab']//div[@id='tab-description']//div[@id='collapse-description']").get()
Result:
{'description': '<div id="co
llapse-description" class="desc-collapse showup">\n\t\t\t\t\t\t\t\t\t\t<p>Deskripsi Realme X2 Pro 12GB/256GB - Pre Order - Biru</p><p>barang repack</p><p>Apa itu Repack :</p><p>Kartu garansi Konsumen dpt lembar pembeli lalu penjual ambil bagian KG l
embar penjual, utk kami bantu daftarkan klaim warranty nya nanti 1 tahun</p><p><br></p><p><br></p><p>Garansi resmi 1 tahun</p><p><br></p><p>OS Android</p><p>OS ver Android 9.0 (Pie); ColorOS 6.1</p><p>SIM Nano SIM , Dual SIM , Dual Standby</p><p>CPU
Qualcomm SDM855 Snapdragon 855+ (7nm)</p><p>Octa-core</p><p>Kecepatan CPU 2.96 GHz (1x2.96 GHz Kryo 485 & 3x2.42 GHz Kryo 485 & 4x1.8 GHz Kryo 485)</p><p>Storage 64GB , 128GB , 256GB</p><p>RAM 8GB , 12GB , 6GB</p><p>External Storage No</p><
p>Battery 4000mAh</p><p>50W SuperVOOC Flash Charge</p><p>Ukuran Layar 6.5 inches</p><p>Resolusi FHD+ 2400 x 1080 pixels at 402 ppi</p><p>Super AMOLED, 90Hz display</p><p>Network Tipe 2G , 3G , 4G (LTE)</p><p>2G GSM: 850/900/1800/1900</p><p>3G WCDMA:
B1/B2/B4/B5/B6/B8/B19</p><p>4G (LTE) LTE FDD: B1/B2/B3/B4/B5/B7/B8/B12/B17/B18/B19/B20/B26/B28</p><p>TD-LTE: B34/B38/B39/B40/B41</p><p>Speed HSPA 42.2/11.5 Mbps, LTE-A</p><p>Kamera Utama 64MP + 13MP + 8MP + 2MP</p><p>Kamera Depan 16MP</p><p>Fitur W
i-Fi , Hotspot/Tethering , GPS , Bluetooth , Flash , Fingerprint Scanner , NFC , 3.5mm Headphone Jack , Quad Cameras</p><p>Ukuran Dimensi 161 x 75.7 x 8.7 mm</p>\n\t\t\t\t\t\t\t\t\t</div>'}
description = response.xpath("//div[@class='producttab']//div[@id='tab-description']//div[@id='collapse-description']/text()").get()
Result:
{'description': '\n\t\t\t\t\
t\t\t\t\t\t'}
description = response.xpath("//div[@class='producttab']//div[@id='tab-description']//div[@id='collapse-description']//text()").get()
Result:
{'description': '\n\t\t\t\t\
t\t\t\t\t\t'}
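Still... not the result I want.
Old version:
So far I have found only .re(".+"), which returns everything as (more or less) a list
[opening_tag, item, item, ..., closing_tag]
If I skip the first and last elements and use "".join(), I get the inner HTML.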
# without `get()`
description = response.xpath("//div[@class='producttab']//div[@id='tab-description']//div[@id='collapse-description']")
description = description.re('.+')
description = "".join(description[1:-1]).strip()
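EDIT:
The old version works only because the HTML contains \n: .re('.+') splits the HTML at every \n into a list
["<opening tag>text", "children", "<closing_tag>"]
When there is no \n, the old version produces a single string
[ "<opening tag> text children <closing_tag>" ]
similar to get(), so skipping the first and last elements leaves nothing.
Besides, every element has the structure
<tag> text children_tags </tag> tail
When \n sits between text and children_tags, the old version drops `text` from the final result.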
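The three cases described above can be reproduced without Scrapy. This is a minimal sketch that uses plain re.findall as a stand-in for Selector.re() on the serialized HTML; the sample strings and the helper name inner_html_by_lines are made up for illustration.

```python
import re


def inner_html_by_lines(html):
    """Mimic the old approach: split with '.+', then drop the first and last match."""
    # '.' does not match '\n' by default, so every match is one line of the HTML
    lines = re.findall(".+", html)
    return "".join(lines[1:-1]).strip()


# Hypothetical sample markup, not taken from the real page
multi_line = '<div id="d">\n\t<p>a</p><p>b</p>\n</div>'
single_line = '<div id="d"><p>a</p><p>b</p></div>'
text_first = '<div id="d">hello\n<span>good</span>\n</div>'

print(re.findall(".+", multi_line))            # three items: opening tag, children, closing tag
print(repr(inner_html_by_lines(multi_line)))   # inner HTML is recovered
print(repr(inner_html_by_lines(single_line)))  # empty string: only one match, nothing left
print(repr(inner_html_by_lines(text_first)))   # 'hello' is lost with the opening-tag line
```

The approach depends entirely on where the page author happened to put line breaks, which is why a parser-based version is more reliable.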
Code that works with different HTML:
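import lxml.html

html = '<body>hello<span>good</span>world</body>'
tree = lxml.html.fromstring(html)
# text that sits directly after the opening tag (None when there is none)
text = tree.text or ''
# serialize every child element; tostring() also includes the child's tail text
children = [lxml.html.tostring(x).decode() for x in tree.getchildren()]
inner_html = text + "".join(children).strip()
print(inner_html)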
Result:
hello<span>good</span>world
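Why this works comes down to the text/tail model mentioned earlier. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in, since lxml follows the same element model; it only handles well-formed markup, so it illustrates the model and does not replace lxml.html for real pages.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<body>hello<span>good</span>world</body>")
span = root[0]

print(repr(root.text))  # text directly after <body>
print(repr(span.text))  # text inside <span>
print(repr(span.tail))  # text after </span>, stored on the span itself

# Serializing a child also writes its tail, so text + children covers everything
inner = (root.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in root
)
print(inner)
```

Text that follows a child element belongs to that child's tail, so only the leading text of the parent has to be added by hand.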
It seems Scrapy already uses lxml in its Selectors, so lxml should already be installed.
Minimal working code:
import scrapy
import lxml.html


def get_inner_html(html):
    tree = lxml.html.fromstring(html)
    text = tree.text or ''  # to skip `None`
    children = [lxml.html.tostring(x).decode() for x in tree.getchildren()]
    inner_html = text + "".join(children).strip()
    return inner_html


class ProductDetailSpider(scrapy.Spider):
    name = 'productdetail'
    allowed_domains = ['bisabeli.id']
    start_urls = [
        "https://bisabeli.id/index.php?route=product/product&product_id=7874"
    ]

    def parse(self, response):
        print('--- example 1 ---')
        html = response.xpath("//div[@class='producttab']//div[@id='tab-description']//div[@id='collapse-description']").get()
        results = get_inner_html(html)
        print(results.strip())

        print('--- example 2 ---')
        html = '<body>hello<span>good</span>world</body>'
        results = get_inner_html(html)
        print(results.strip())