How to remove \n and \t from Scrapy output while keeping the HTML tags
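I'm new to Scrapy and Python.
Nevertheless, I've created a spider that extracts the information I need.
The only problem is that I can't remove the \n and \t characters from the output while keeping the HTML tags in place.
For example:
My current output is: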
{'specification': ['<div class="col-lg-5 model__spec">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<ul class="offer__spec">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li class="offer__spec-elem">\n\t\t\t\t\t\t\t\t\t\t\t<div class="offer__spec-elem--left muted">\n\t\t\t\t\t\t\t\t\t\t\t\t<span>Бренд</span>\n\t\t\t\t\t\t\t\t\t\t\t</div>\n\t\t\t\t\t\t\t\t\t\t\t<div class="offer__spec-elem--right">\n\t\t\t\t\t\t\t\t\t\t\t\t<span>Huawei</span>\n\t\t\t\t\t\t\t\t\t\t\t</div> ...']}
Desired output:
{'specification': ['<div class="col-lg-5 model__spec"><ul class="offer__spec"><li class="offer__spec-elem"><div class="offer__spec-elem--left muted"><span>Бренд</span></div><div class="offer__spec-elem--right"><span>Huawei</span></div> ...']}
My script:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://boo.ua/catalog/smartfony/huawei-p30-lite-4-128gb--mar-lx1a-/',
        'https://boo.ua/catalog/smartfony/huawei-p20-2018-4-128gb-black-eml-l29/',
    ]

    def parse(self, response):
        for quote in response.xpath('descendant::div[@class="col-lg-5 model__spec"]'):
            yield {
                'specification': quote.getall()
            }
I tried using 'normalize-space', but it removed the \t and \n along with all the HTML tags, leaving me with plain text:
    def parse(self, response):
        for quote in response.xpath('normalize-space(descendant::div[@class="col-lg-5 model__spec"])'):
            yield {
                'specification': quote.getall()
            }
Output:
{'specification': ['Бренд Huawei Емкость аккумулятора 3340 мАч Диагональ экрана 6.1 Процессор HiSilicon Kirin 710 Количество ядер процессора 8 Частота процессора 2.2 ГГц Встроенная память 128 ГБ Оперативная память 4 ГБ Беспроводные коммуникации 3G, 4G(LTE), Bluetooth, GPS, NFC, Wi-Fi, ГЛОНАСС Стандарт связи 3G (WCDMA/UMTS), 4G (LTE), GSM Все характеристики']}
Thanks in advance.
You can use strip():
    def parse(self, response):
        for quote in response.xpath('descendant::div[@class="col-lg-5 model__spec"]'):
            yield {
                'specification': quote.get().strip()
            }
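One thing to keep in mind: str.strip() only trims whitespace at the start and end of a string, so the \n and \t runs between tags would stay in place. Here is a minimal sketch of the difference, using a plain string as a shortened, made-up stand-in for what quote.get() returns (the html sample and variable names are assumptions for illustration):

```python
import re

# Shortened, hypothetical stand-in for the string returned by quote.get()
html = '\n\t<div class="model__spec">\n\t\t<span>Huawei</span>\n\t</div>\n'

# strip() removes only leading and trailing whitespace
stripped = html.strip()

# A regex removes every run of newlines, tabs and carriage returns
cleaned = re.sub(r'[\n\t\r]+', '', html)

print(repr(stripped))
print(repr(cleaned))
```

The first print still shows \n and \t between the tags; the second shows the markup on a single line with the tags intact.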
Try this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://boo.ua/catalog/smartfony/huawei-p30-lite-4-128gb--mar-lx1a-/',
        'https://boo.ua/catalog/smartfony/huawei-p20-2018-4-128gb-black-eml-l29/',
    ]

    def parse(self, response):
        for quote in response.xpath('(descendant::div[@class="col-lg-5 model__spec"])'):
            quote = quote.getall()
            # Drop every tab and newline; the HTML tags are left untouched
            quote = [i.replace("\t", "").replace("\n", "") for i in quote]
            yield {
                'specification': quote
            }
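The chained replace calls delete every tab and newline, including any that happen to sit inside text content. If only the indentation between tags should go, one possible variation is a regular expression that targets whitespace between a ">" and the next "<". This is a sketch and not part of the answer above; strip_between_tags is a hypothetical helper name and the sample string is made up:

```python
import re


def strip_between_tags(html: str) -> str:
    """Remove whitespace runs that sit between a '>' and the following '<'."""
    return re.sub(r'>\s+<', '><', html).strip()


# Made-up sample with the same kind of indentation as the scraped markup
sample = '<ul>\n\t\t<li>\n\t\t\t<span>4G (LTE)</span>\n\t\t</li>\n\t</ul>\n'
result = strip_between_tags(sample)
print(result)
```

In the spider this could be applied as [strip_between_tags(s) for s in quote.getall()]. Note that it also removes a single space between adjacent inline elements, which may or may not matter for the page in question.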