Web scraping of coreyms.com

When I scrape the posts on coreyms.com with BeautifulSoup (each post's heading, date, content and YouTube link), I run into this problem: every post except one contains a YouTube link. So after scraping, len(videolink) = 9 while len(heading), len(date) and len(content) = 10. How can I make len(videolink) = 10 by inserting NaN for the post that has no YouTube link?

Code for reference:

from bs4 import BeautifulSoup
import requests

page7 = requests.get('https://coreyms.com/')
soup7 = BeautifulSoup(page7.content, 'html.parser')

# post headings
heading = []
for i in soup7.find_all('h2', class_='entry-title'):
    heading.append(i.text)

# post dates
date = []
for i in soup7.find_all('time', class_='entry-time'):
    date.append(i.text)

# post bodies
content = []
for i in soup7.find_all('div', class_='entry-content'):
    content.append(i.text)

# YouTube embed links - one element short, because one post has no video
videolink = []
for i in soup7.find_all('iframe', class_='youtube-player'):
    videolink.append(i['src'])

print(len(heading), len(date), len(content), len(videolink))

Rethink the way you are handling the data and move away from all these separate lists. Instead, persist the data in a structured way, e.g. a dict or a list of dicts (a structure that can also easily be converted to a DataFrame).

Simply iterate over all the articles and check whether the information you need is available; if not, set its value to None or whatever you want:

data = []

for a in soup7.find_all('article'):     
    data.append({
        'heading':a.h2.text,
        'date':a.find('time',class_='entry-time').text,
        'content':a.find('div',class_='entry-content').text,
        'videolink':vl['src'] if (vl := a.find('iframe',class_='youtube-player')) else None
    })
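Note that the videolink line uses an assignment expression (the := "walrus" operator, Python 3.8+). A minimal stdlib-only sketch of the pattern, with a hypothetical find_src function standing in for a.find(...), which likewise returns None when the post has no iframe:

```python
def find_src(article):
    # stand-in for a.find('iframe', class_='youtube-player'):
    # returns the iframe data or None when the post has no video
    return article.get('iframe')

posts = [
    {'iframe': {'src': 'https://www.youtube.com/embed/abc'}},
    {},  # post without a YouTube link
]

# vl is assigned inside the expression and reused in the True branch
videolinks = [vl['src'] if (vl := find_src(a)) else None for a in posts]
print(videolinks)  # ['https://www.youtube.com/embed/abc', None]
```

On Python versions before 3.8, the same logic can be written in two steps inside the loop: first `vl = a.find(...)`, then `vl['src'] if vl else None`.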

Example

from bs4 import BeautifulSoup
import requests

page7 = requests.get('https://coreyms.com/')
soup7 = BeautifulSoup(page7.content, 'html.parser')

data = []

for a in soup7.find_all('article'):
    data.append({
        'heading': a.h2.text,
        'date': a.find('time', class_='entry-time').text,
        'content': a.find('div', class_='entry-content').text,
        'videolink': vl['src'] if (vl := a.find('iframe', class_='youtube-player')) else None
    })

print(data)

Output

[{'heading': 'Python Tutorial: Zip Files – Creating and Extracting Zip Archives', 'date': 'November 19, 2019', 'content': '\nIn this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…\n\n', 'videolink': 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'}, {'heading': 'Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey', 'date': 'October 17, 2019', 'content': '\nIn this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…\n\n\n\n', 'videolink': 'https://www.youtube.com/embed/_P7X8tMplsw?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'}, {'heading': 'Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module', 'date': 'September 21, 2019', 'content': '\nIn this Python Programming video, we will be learning how to run code in parallel using the multiprocessing module. We will also look at how to process multiple high-resolution images at the same time using a ProcessPoolExecutor from the concurrent.futures module. Let’s get started…\n\n\n\n', 'videolink': 'https://www.youtube.com/embed/fKl2JW_qrso?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'},...]
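As mentioned above, a list of dicts converts to a DataFrame in one step (assuming pandas is installed); posts without a video simply end up with a missing value in the videolink column. The sample records below are illustrative placeholders, not the real scraped output:

```python
import pandas as pd

# illustrative stand-in for the scraped list of dicts
data = [
    {'heading': 'Post A', 'date': 'November 19, 2019',
     'content': '...', 'videolink': 'https://www.youtube.com/embed/abc'},
    {'heading': 'Post B', 'date': 'October 17, 2019',
     'content': '...', 'videolink': None},  # post without a YouTube link
]

df = pd.DataFrame(data)
print(df[['heading', 'videolink']])
```

All four columns now have the same length, and the missing link shows up as None/NaN, which pandas treats as a missing value.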