What are the correct tags and attributes to select?
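I want to scrape this page (http://theschoolofkyiv.org/participants/220/dan-acostioaei) to extract only the artist's name and biography. When I define the tags and attributes, the result contains none of the text I want to see.
I am using Scrapy to scrape the website. It works fine for other websites. I have tested my code, but I can't seem to define the correct tags or attributes. Could you take a look at my code?
This is the code I am using to scrape the website. (I don't understand why Stack Overflow always forces me to enter irrelevant text. I have already explained what I wanted to say.)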
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem


class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        titles = response.xpath("//div[@id='participants']")
        for titles in titles:
            item = ArtistlistItem()
            item['artist'] = response.css('.ng-binding::text').extract()
            item['biography'] = response.css('p::text').extract()
            yield item
This is the output I get:
{'artist': [],
'biography': ['\n ',
'\n ',
'\n ',
'\n ',
'\n ',
'\n ']}
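A short explanation (assuming you already know about the AJAX request Tony Montana mentioned): the name and biography are not in the page's HTML but are loaded from a JSON endpoint, so take the participant ID from the URL and request that endpoint directly: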
import json
import re

import scrapy
from artistlist.items import ArtistlistItem


class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        # Check the match before calling .group(); re.search returns None
        # when the URL contains no participant ID.
        match = re.search(r'participants/(\d+)', response.url)
        if match:
            # Request the JSON endpoint that the page loads via AJAX.
            yield scrapy.Request(
                url="http://theschoolofkyiv.org/wordpress/wp-json/posts/{participant_id}".format(participant_id=match.group(1)),
                callback=self.parse_participant,
            )

    def parse_participant(self, response):
        # The body is JSON, so no CSS or XPath selectors are needed here.
        data = json.loads(response.body)
        item = ArtistlistItem()
        item['artist'] = data["title"]
        item['biography'] = data["acf"]["en_participant_bio"]
        yield item
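If you want to check the two steps without hitting the network, they can be sketched as plain functions. This is only an illustration: the function names and `SAMPLE_BODY` are made up for the example, and the payload values are placeholders. Only the `title` and `acf.en_participant_bio` keys come from the spider above.

```python
import json
import re
from typing import Optional

# Same pattern the spider uses to pull the numeric ID out of the page URL.
PARTICIPANT_ID_RE = re.compile(r"participants/(\d+)")


def extract_participant_id(url: str) -> Optional[str]:
    """Return the numeric participant ID from a page URL, or None if absent."""
    match = PARTICIPANT_ID_RE.search(url)
    return match.group(1) if match else None


def parse_participant_body(body: str) -> dict:
    """Pick the artist name and biography out of a JSON response body."""
    data = json.loads(body)
    return {
        "artist": data["title"],
        "biography": data["acf"]["en_participant_bio"],
    }


# Hypothetical payload: the keys mirror the ones the spider reads,
# the values are invented placeholders, not real site data.
SAMPLE_BODY = json.dumps({
    "title": "Example Artist",
    "acf": {"en_participant_bio": "<p>Example biography.</p>"},
})

if __name__ == "__main__":
    page_url = "http://theschoolofkyiv.org/participants/220/dan-acostioaei"
    print(extract_participant_id(page_url))
    print(parse_participant_body(SAMPLE_BODY))
```

Note that the biography value in the sample is HTML, which is an assumption about the real field. If the real field also contains markup, you would still need to strip the tags before storing it.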