如何使用 python 从网页中抓取作者姓名和作者 url
how to scrape author name and author url from a webpage using python
我正在尝试从以下网页中抓取作者姓名和作者url。
https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive
我正在使用以下代码;
author_flag = 0
divs = soup.find_all('h2')
for div in divs:
author = div.find('a')
if(author is not None):
author_art.append(author.text)
author_url.append('https://medium.com'+ author.get('href'))
aurhor_flag = 1
break
if(author_flag==0):
author_art.append('Author information missing')
author_url.append('Author Url information missing')
谁能看看我做错了什么?因为这段代码没有选择任何东西。
它只是返回空白列表。
完整代码:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
data = pd.read_csv('url_technology.csv')
author_art = []
author_url = []
for i in range(1):
try:
author_flag = 0
divs = soup.find_all('meta')
for div in divs:
author = div.find('span')
if(author is not None):
author_art.append(author.text)
author_url.append('https://medium.com'+author.get('href'))
aurhor_flag = 1
break
if(author_flag==0):
author_art.append('Author information missing')
author_url.append('Author Url information missing')
except:
print('no data found')
author_art = pd.DataFrame(title)
author_url = pd.DataFrame(url)
res = pd.concat([author_art, author_art] , axis=1)
res.columns = ['Author_Art', 'Author_url']
res.to_csv('combined1.csv')
print('File created successfully')
https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive---------0-----------------------
https://medium.com/job-advice-for-software-engineers/what-i-want-and-dont-want-to-see-on-your-software-engineering-resume-cbc07913f7f6?source=tag_archive---------1-----------------------
https://itnext.io/load-testing-using-apache-jmeter-af189dd6f805?source=tag_archive---------2-----------------------
https://medium.com/s/story/black-mirror-bandersnatch-a-study-guide-c46dfe9156d?source=tag_archive---------3-----------------------
https://medium.com/fast-company/the-worst-design-crimes-of-2018-56f32b027bb7?source=tag_archive---------4-----------------------
https://towardsdatascience.com/make-your-pictures-beautiful-with-a-touch-of-machine-learning-magic-31672daa3032?source=tag_archive---------5-----------------------
https://medium.com/hackernoon/the-state-of-ruby-2019-is-it-dying-509160a4fb92?source=tag_archive---------6-----------------------
如何获取作者姓名和作者 URL 的一种可能性是解析页面中嵌入的 Ld+Json 数据:
import json
import requests
from bs4 import BeautifulSoup
url = "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one('[type="application/ld+json"]').contents[0])
# uncomment this to print all LD+JSON data:
# print(json.dumps(data, indent=4))
print("Author:", data["author"]["name"])
print("URL:", data["author"]["url"])
打印:
Author: Eric Elliott
URL: https://medium.com/@_ericelliott
编辑:returns 作者 Name/URL:
的函数
import json
import requests
from bs4 import BeautifulSoup
def get_author_name_url(medium_url):
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(
soup.select_one('[type="application/ld+json"]').contents[0]
)
return data["author"]["name"], data["author"]["url"]
url_list = [
"https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20",
]
for url in url_list:
name, url = get_author_name_url(url)
print("Author:", name)
print("URL:", url)
我已经启动了一个名为 medium-apis
的 python 包来完成此类任务。
- 安装
medium-apis
pip install medium-apis
获取 RapidAPI 密钥。 See how
运行代码:
from medium_apis import Medium
medium = Medium('YOUR_RAPIDAPI_KEY')
def get_author(url):
url_without_parameters = url.split('?')[0]
article_id = url_without_parameters.split('-')[-1]
article = medium.article(article_id=article_id)
author = article.author
author.save_info()
return author
urls = [
"https://nishu-jain.medium.com/medium-apis-documentation-3384e2d08667",
]
for url in urls:
author = get_author(url)
print('Author: ', author.fullname)
print('Profile URL: ', f'https://medium.com/@{author.username}')
我正在尝试从以下网页中抓取作者姓名和作者url。
https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive
我正在使用以下代码;
author_flag = 0
divs = soup.find_all('h2')
for div in divs:
author = div.find('a')
if(author is not None):
author_art.append(author.text)
author_url.append('https://medium.com'+ author.get('href'))
aurhor_flag = 1
break
if(author_flag==0):
author_art.append('Author information missing')
author_url.append('Author Url information missing')
谁能看看我做错了什么?因为这段代码没有选择任何东西。 它只是返回空白列表。
完整代码:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
data = pd.read_csv('url_technology.csv')
author_art = []
author_url = []
for i in range(1):
try:
author_flag = 0
divs = soup.find_all('meta')
for div in divs:
author = div.find('span')
if(author is not None):
author_art.append(author.text)
author_url.append('https://medium.com'+author.get('href'))
aurhor_flag = 1
break
if(author_flag==0):
author_art.append('Author information missing')
author_url.append('Author Url information missing')
except:
print('no data found')
author_art = pd.DataFrame(title)
author_url = pd.DataFrame(url)
res = pd.concat([author_art, author_art] , axis=1)
res.columns = ['Author_Art', 'Author_url']
res.to_csv('combined1.csv')
print('File created successfully')
https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive---------0----------------------- https://medium.com/job-advice-for-software-engineers/what-i-want-and-dont-want-to-see-on-your-software-engineering-resume-cbc07913f7f6?source=tag_archive---------1----------------------- https://itnext.io/load-testing-using-apache-jmeter-af189dd6f805?source=tag_archive---------2----------------------- https://medium.com/s/story/black-mirror-bandersnatch-a-study-guide-c46dfe9156d?source=tag_archive---------3----------------------- https://medium.com/fast-company/the-worst-design-crimes-of-2018-56f32b027bb7?source=tag_archive---------4----------------------- https://towardsdatascience.com/make-your-pictures-beautiful-with-a-touch-of-machine-learning-magic-31672daa3032?source=tag_archive---------5----------------------- https://medium.com/hackernoon/the-state-of-ruby-2019-is-it-dying-509160a4fb92?source=tag_archive---------6-----------------------
如何获取作者姓名和作者 URL 的一种可能性是解析页面中嵌入的 Ld+Json 数据:
import json
import requests
from bs4 import BeautifulSoup
url = "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one('[type="application/ld+json"]').contents[0])
# uncomment this to print all LD+JSON data:
# print(json.dumps(data, indent=4))
print("Author:", data["author"]["name"])
print("URL:", data["author"]["url"])
打印:
Author: Eric Elliott
URL: https://medium.com/@_ericelliott
编辑:returns 作者 Name/URL:
的函数import json
import requests
from bs4 import BeautifulSoup
def get_author_name_url(medium_url):
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(
soup.select_one('[type="application/ld+json"]').contents[0]
)
return data["author"]["name"], data["author"]["url"]
url_list = [
"https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20",
]
for url in url_list:
name, url = get_author_name_url(url)
print("Author:", name)
print("URL:", url)
我已经启动了一个名为 medium-apis
的 python 包来完成此类任务。
- 安装
medium-apis
pip install medium-apis
获取 RapidAPI 密钥。 See how
运行代码:
from medium_apis import Medium
medium = Medium('YOUR_RAPIDAPI_KEY')
def get_author(url):
url_without_parameters = url.split('?')[0]
article_id = url_without_parameters.split('-')[-1]
article = medium.article(article_id=article_id)
author = article.author
author.save_info()
return author
urls = [
"https://nishu-jain.medium.com/medium-apis-documentation-3384e2d08667",
]
for url in urls:
author = get_author(url)
print('Author: ', author.fullname)
print('Profile URL: ', f'https://medium.com/@{author.username}')