How can I get text in Python Scrapy?
import scrapy

class WanikaniSpider(scrapy.Spider):
    name = 'japandict'
    allowed_domains = ['www.japandict.com']
    start_urls = ['https://www.japandict.com/lists/jlpt5k']

    def parse(self, response):
        kanjiler = response.xpath("//div[@class='row']/div/div/div")
        for kanji in kanjiler:
            kanjiicon = kanji.xpath("//div[@class='row']/div/div/div/a/div/span")
            yield {
                'kanjiicon': kanjiicon
            }
I created a spider like this. I want to get kanjiicon as text, but when I use the .get or .extract methods, it returns empty.
How can I fix this?
I am getting output with the code below.
Code:
import scrapy

class WanikaniSpider(scrapy.Spider):
    name = 'japandict'
    allowed_domains = ['www.japandict.com']
    start_urls = ['https://www.japandict.com/lists/jlpt5k']

    def parse(self, response):
        # Select one element per list entry.
        kanjiler = response.xpath('//*[@class="d-inline-block w-100 text-muted"]')
        for kanji in kanjiler:
            # The leading "." keeps the query relative to the current entry;
            # text() selects the text node instead of the whole element.
            kanjiicon = kanji.xpath('.//*[@class="xlarge text-normal me-4"]/text()').get().replace('\n', '').strip()
            yield {
                'kanjiicon': kanjiicon
            }
Output:
{'kanjiicon': '右'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '雨'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '円'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '下'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '何'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '火'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '外'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '学'}
2021-08-22 05:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.japandict.com/lists/jlpt5k>
{'kanjiicon': '間'}
You need to decode the string as UTF-8; ASCII does not include Japanese characters.
Try something like this:
kanjiicon = kanjiicon.decode('utf-8')
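Note that .decode only exists on bytes objects. On Python 3, the selector's .get() already returns a str, which has no .decode method, so the line above would only apply if the value arrives as raw bytes. A minimal sketch, where to_text is a hypothetical helper and not part of Scrapy:

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


if __name__ == '__main__':
    raw = '右'.encode('utf-8')   # bytes, as they might come off the wire
    print(to_text(raw))          # decoded to str
    print(to_text('右'))         # already str, returned unchanged
```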