How to select a specific tag within this HTML?

How do I select all the titles on this page?

http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext

For example, I'm trying to get all lines similar to this one:

AFAS C1001 Introduction to African-American Studies. 3 points.

main_page iterates over all of the school's classes starting from here, so that I can get every title like the one above:

http://bulletin.columbia.edu/columbia-college/departments-instruction/  

for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})

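I have this code, but I can't figure out how to select all of the ('strong') tags of the first child. I'm using the latest Python and Beautiful Soup 4 to web-scrape. Let me know if anything else is needed. Thanks.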

Iterate over the elements with the courseblock class, then, for each course, get the element with the courseblocktitle class. A working example using the select() and select_one() methods:

import requests
from bs4 import BeautifulSoup


url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)

Prints:

AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
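
The loop above assumes every course block contains a title paragraph. select_one() returns None when nothing matches, so a block without a title would raise an AttributeError on get_text(). A minimal sketch of a more defensive variant follows; the function name extract_course_titles and the sample HTML are hypothetical, made up for illustration rather than taken from the bulletin page:

```python
from bs4 import BeautifulSoup


def extract_course_titles(html):
    """Return the title text of every .courseblock element found in html."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for course in soup.select(".courseblock"):
        title_tag = course.select_one("p.courseblocktitle")
        if title_tag is None:
            # select_one() returns None when no element matches; skip the block
            continue
        titles.append(title_tag.get_text(strip=True))
    return titles


# Hypothetical markup: the second block deliberately has no title paragraph.
sample_html = """
<div class="courseblock">
  <p class="courseblocktitle"><strong>AFAS W4039 Afro-Latin America.</strong></p>
</div>
<div class="courseblock">
  <p class="courseblockdesc">A block without a title.</p>
</div>
"""

titles = extract_course_titles(sample_html)
print(titles)
```

Because the function takes an HTML string, it can be fed response.content from the requests call above or a saved copy of the page.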

A good follow-up question from @double_j:

In the OP's example, he has a space between the points. How would you keep that? That's how the data shows on the site, even though it's not really in the source code.

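I thought about using the separator argument of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the texts of the strong elements via str.join():
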
for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)
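
To make the difference between the three approaches concrete, here is a small offline comparison. The markup in SAMPLE_HTML is an assumption about how a title paragraph might be structured (two strong elements followed by a bare period), not a copy of the real page source:

```python
from bs4 import BeautifulSoup

# Assumed structure of one title paragraph; the real page may differ.
SAMPLE_HTML = (
    '<div class="courseblock">'
    '<p class="courseblocktitle">'
    "<strong>AFAS C1001 Introduction to African-American Studies.</strong> "
    "<strong>3 points</strong>."
    "</p></div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tag = soup.select_one(".courseblock p.courseblocktitle")

# strip=True drops the whitespace-only string between the two strong elements
stripped = title_tag.get_text(strip=True)

# a separator restores the space, but also puts one before the final period
with_separator = title_tag.get_text(separator=" ", strip=True)

# joining only the strong texts keeps a single space and omits the bare period
joined = " ".join(s.get_text() for s in title_tag.select("strong"))

print(stripped)        # AFAS C1001 Introduction to African-American Studies.3 points.
print(with_separator)  # AFAS C1001 Introduction to African-American Studies. 3 points .
print(joined)          # AFAS C1001 Introduction to African-American Studies. 3 points
```

Under this assumed markup, the joined version loses the trailing period because that character sits outside both strong elements; append it manually if it is needed.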