How to scrape author name and author URL from a webpage using Python

Question

I am trying to scrape the author name and author URL from the following webpage.

https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive

I am using the following code:

    author_flag = 0
    divs = soup.find_all('h2')
    for div in divs:
        author = div.find('a')
        if(author is not None):
            author_art.append(author.text)
            author_url.append('https://medium.com'+ author.get('href'))
            author_flag = 1
            break
        if(author_flag==0):
            author_art.append('Author information missing')
            author_url.append('Author Url information missing')

Can anyone see what I am doing wrong? This code does not select anything; it just returns empty lists.

Full code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re 

data = pd.read_csv('url_technology.csv')


author_art = []
author_url = []


for i in range(1): 
    try:   
    
        author_flag = 0
        divs = soup.find_all('meta')
        for div in divs:
            author = div.find('span')
            if(author is not None):
                author_art.append(author.text)
                author_url.append('https://medium.com'+author.get('href'))
                author_flag = 1
                break
            if(author_flag==0):
                author_art.append('Author information missing')
                author_url.append('Author Url information missing')


    except:  
        print('no data found')
    
author_art = pd.DataFrame(author_art)
author_url = pd.DataFrame(author_url)


res = pd.concat([author_art, author_url] , axis=1)
res.columns = ['Author_Art', 'Author_url']
res.to_csv('combined1.csv')
print('File created successfully')

https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive---------0-----------------------
https://medium.com/job-advice-for-software-engineers/what-i-want-and-dont-want-to-see-on-your-software-engineering-resume-cbc07913f7f6?source=tag_archive---------1-----------------------
https://itnext.io/load-testing-using-apache-jmeter-af189dd6f805?source=tag_archive---------2-----------------------
https://medium.com/s/story/black-mirror-bandersnatch-a-study-guide-c46dfe9156d?source=tag_archive---------3-----------------------
https://medium.com/fast-company/the-worst-design-crimes-of-2018-56f32b027bb7?source=tag_archive---------4-----------------------
https://towardsdatascience.com/make-your-pictures-beautiful-with-a-touch-of-machine-learning-magic-31672daa3032?source=tag_archive---------5-----------------------
https://medium.com/hackernoon/the-state-of-ruby-2019-is-it-dying-509160a4fb92?source=tag_archive---------6-----------------------

Answer 1

One way to get the author name and author URL is to parse the LD+JSON data embedded in the page:

import json
import requests
from bs4 import BeautifulSoup

url = "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one('[type="application/ld+json"]').contents[0])

# uncomment this to print all LD+JSON data:
# print(json.dumps(data, indent=4))

print("Author:", data["author"]["name"])
print("URL:", data["author"]["url"])

Prints:

Author: Eric Elliott
URL: https://medium.com/@_ericelliott
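
The lookup `data["author"]["name"]` assumes `author` is a single object. JSON-LD also allows a property to hold a list of values, and a page may omit the field entirely, so a slightly more defensive version may be useful when running over many URLs. The helper below is only a sketch: the name `authors_from_ld_json` is made up for illustration, and it assumes `data` is the dict produced by `json.loads` above.

```python
def authors_from_ld_json(data):
    """Return a list of (name, url) pairs from a parsed LD+JSON dict."""
    author = data.get("author")
    if author is None:
        return []
    # JSON-LD allows either a single value or a list of values here.
    entries = author if isinstance(author, list) else [author]
    pairs = []
    for entry in entries:
        if isinstance(entry, dict):
            pairs.append((entry.get("name"), entry.get("url")))
        else:
            # A bare string is treated as a name without a URL.
            pairs.append((str(entry), None))
    return pairs


sample = {"author": {"name": "Eric Elliott", "url": "https://medium.com/@_ericelliott"}}
print(authors_from_ld_json(sample))
# [('Eric Elliott', 'https://medium.com/@_ericelliott')]
```

An empty list then signals "no author found" instead of raising a `KeyError`.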

Edit: a function that returns the author name/URL:

import json
import requests
from bs4 import BeautifulSoup


def get_author_name_url(medium_url):
    soup = BeautifulSoup(requests.get(medium_url).content, "html.parser")
    data = json.loads(
        soup.select_one('[type="application/ld+json"]').contents[0]
    )
    return data["author"]["name"], data["author"]["url"]


url_list = [
    "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20",
]

for url in url_list:
    name, author_url = get_author_name_url(url)
    print("Author:", name)
    print("URL:", author_url)
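
To get from here to the CSV file the question was building, the results can be collected into rows, with the question's placeholder strings used when a page cannot be parsed. This is a rough sketch using the standard `csv` module rather than pandas. The helper names `collect_authors` and `write_authors_csv` are invented for this example, and `fetch` stands for any callable that returns a `(name, url)` pair, such as `get_author_name_url`.

```python
import csv

MISSING_NAME = "Author information missing"
MISSING_URL = "Author Url information missing"


def collect_authors(urls, fetch):
    """Call fetch(url) for each URL; use placeholders when it fails."""
    rows = []
    for url in urls:
        try:
            name, author_url = fetch(url)
        except Exception:
            # Network errors, a missing LD+JSON tag, or a missing key all end up here.
            name, author_url = MISSING_NAME, MISSING_URL
        rows.append((name, author_url))
    return rows


def write_authors_csv(rows, path):
    """Write the rows under the column names used in the question."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Author_Art", "Author_url"])
        writer.writerows(rows)
```

With the function above this would be called as `write_authors_csv(collect_authors(url_list, get_author_name_url), "combined1.csv")`. Keeping one row per input URL means the two columns always stay the same length.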

Answer 2

I have started a Python package named medium-apis for this kind of task.

Install medium-apis:

pip install medium-apis

Get a RapidAPI key.
Run the code:

from medium_apis import Medium

medium = Medium('YOUR_RAPIDAPI_KEY')

def get_author(url):
  url_without_parameters = url.split('?')[0]
  article_id = url_without_parameters.split('-')[-1]

  article = medium.article(article_id=article_id)
  author = article.author

  author.save_info()

  return author

urls = [
  "https://nishu-jain.medium.com/medium-apis-documentation-3384e2d08667",
]

for url in urls:
  author = get_author(url)
  print('Author: ', author.fullname)
  print('Profile URL: ', f'https://medium.com/@{author.username}')

GitHub repo: https://github.com/weeping-angel/medium-apis
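
The article ID step in `get_author` relies on the ID being the last hyphen-separated token of the URL once the query string is removed. The same idea can be sketched with `urllib.parse`, which also ignores fragments and trailing slashes. The function name `article_id_from_url` is hypothetical, and the sketch assumes every article URL ends its slug with the ID, as the URLs in this thread do.

```python
from urllib.parse import urlsplit


def article_id_from_url(url):
    """Return the text after the last hyphen in the final path segment."""
    path = urlsplit(url).path  # drops the ?query and #fragment parts
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return slug.rsplit("-", 1)[-1]


print(article_id_from_url(
    "https://medium.com/javascript-scene/top-javascript-frameworks-and-topics-to-learn-in-2019-b4142f38df20?source=tag_archive"
))
# b4142f38df20
```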

python

web-scraping

python-2.7