How to call the correct class from a URL domain
I have been working on creating a web scraper, and I would like it to call the correct class to scrape web elements from a given URL.

So far I have:
```python
import sys
import tldextract
import requests

class Scraper:
    scrapers = {}

    def __init_subclass__(scraper_class):
        Scraper.scrapers[scraper_class.url] = scraper_class

    @classmethod
    def for_url(cls, url):
        k = tldextract.extract(url)
        # return Scraper.scrapers[k.domain]()
        # or
        return cls.scrapers[k.domain]()

class BBCScraper(Scraper):
    url = 'bbc.co.uk'

    def scrape(s):
        print(s)
        # FIXME Scrape the correct values for BBC
        return "Scraped BBC News"

url = 'https://www.bbc.co.uk/'
scraper = Scraper.for_url(url)
scraper.scrape(requests.get(url))
```
What I would now like is: if BBC is the domain, it should go into `class BBCScraper(Scraper):`, and since we call `scraper.scrape(requests.get(url))`, it should scrape the web elements inside BBCScraper -> scrape -> return web elements.

However, when I try to run this script it fails with:

    return cls.scrapers[k.domain]() KeyError: 'bbc'
I am wondering how I can call the correct class based on the domain given to the `for_url` class method.
The problem is that `k.domain` returns `bbc`, while you wrote `url = 'bbc.co.uk'`. So pick one of these solutions:

- keep `url = 'bbc.co.uk'` and look up `k.registered_domain`
- use `url = 'bbc'` and look up `k.domain`
Also, add a parameter to the `scrape` method to receive the response:
```python
from abc import abstractmethod
import requests
import tldextract

class Scraper:
    scrapers = {}

    def __init_subclass__(scraper_class):
        Scraper.scrapers[scraper_class.url] = scraper_class

    @classmethod
    def for_url(cls, url):
        k = tldextract.extract(url)
        return cls.scrapers[k.registered_domain]()

    @abstractmethod
    def scrape(self, content: requests.Response):
        pass

class BBCScraper(Scraper):
    url = 'bbc.co.uk'

    def scrape(self, content: requests.Response):
        return "Scraped BBC News"

if __name__ == "__main__":
    url = 'https://www.bbc.co.uk/'
    scraper = Scraper.for_url(url)
    r = scraper.scrape(requests.get(url))
    print(r)  # Scraped BBC News
```
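The `__init_subclass__` hook is what makes this registry work: Python calls it each time a subclass is defined, so every `Scraper` subclass registers itself automatically. A minimal stdlib-only sketch of the same pattern (the names `Handler`, `key`, and `FooHandler` are illustrative, not from the original); note the conventional first parameter name is `cls`, and calling `super().__init_subclass__(**kwargs)` keeps cooperative subclassing working:

```python
class Handler:
    handlers = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every subclass definition adds itself to the shared registry.
        Handler.handlers[cls.key] = cls

    @classmethod
    def for_key(cls, key):
        return cls.handlers[key]()

class FooHandler(Handler):
    key = 'foo'

    def handle(self):
        return 'handled foo'

print(Handler.for_key('foo').handle())  # handled foo
```

Defining `FooHandler` is all it takes; no explicit registration call is needed anywhere else.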
Improvement

I would suggest storing the `url` in an attribute and moving the `requests.get` call into `scrape`, so there is less code in `main`:
```python
from abc import abstractmethod
import requests
import tldextract

class Scraper:
    scrapers = {}

    def __init_subclass__(scraper_class):
        Scraper.scrapers[scraper_class.domain] = scraper_class

    @classmethod
    def for_url(cls, url):
        k = tldextract.extract(url)
        return cls.scrapers[k.registered_domain](url)

    @abstractmethod
    def scrape(self):
        pass

class BBCScraper(Scraper):
    domain = 'bbc.co.uk'

    def __init__(self, url):
        self.url = url

    def scrape(self):
        rep: requests.Response = requests.get(self.url)
        content = rep.text  # ALL HTML CONTENT
        return "Scraped BBC News" + content[:20]

if __name__ == "__main__":
    url = 'https://www.bbc.co.uk/'
    scraper = Scraper.for_url(url)
    r = scraper.scrape()
    print(r)  # Scraped BBC News<!DOCTYPE html><html
```
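As written, `for_url` raises a bare `KeyError` for any domain without a registered scraper. A small stdlib-only sketch of a friendlier lookup that names the missing domain (the names `lookup`, `registry`, and `UnknownDomainError` are illustrative, not part of the original code):

```python
class UnknownDomainError(LookupError):
    """Raised when no scraper is registered for a domain."""

def lookup(registry, registered_domain):
    # Turn a raw KeyError into an error that says which domain is missing
    # and which domains are actually available.
    try:
        return registry[registered_domain]
    except KeyError:
        raise UnknownDomainError(
            f"no scraper registered for {registered_domain!r}; "
            f"known domains: {sorted(registry)}"
        ) from None

registry = {'bbc.co.uk': object}      # stand-in for Scraper.scrapers
lookup(registry, 'bbc.co.uk')         # found
# lookup(registry, 'example.com')    -> UnknownDomainError
```

Calling this from `for_url` instead of indexing the dict directly would make typos in a subclass's `domain` attribute much easier to diagnose.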