Why is my web scraping function returning something unexpected?

Question

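My goal: I am trying to build a function, def retrieve_title(html), that takes a string html as input and returns the title element.

I have imported beautifulsoup to complete this task. Any guidance is appreciated, as I am still learning.

The function I attempted: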

def retrieve_title(html):
    soup = [html]
    result = soup.title.text
    return(result)

Using the function:

html = '<title>Jack and the bean stalk</title><header>This is a story about x y z</header><p>talk to you later</p>'
print(get_title(html))

Unexpected result:

"AttributeError: 'list' object has no attribute 'title'"

Expected result:

"Jack and the beanstalk"

Answer 1

You have to call the function by the name you defined it with, retrieve_title rather than get_title:

print(retrieve_title(html))
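
Renaming the call alone still leaves the reported AttributeError, because `soup = [html]` builds a plain Python list rather than a parsed document. A minimal sketch of the function with both problems fixed, assuming the `bs4` package is installed and the built-in `html.parser` is acceptable:

```python
from bs4 import BeautifulSoup


def retrieve_title(html):
    # Parse the markup; a plain list such as [html] has no .title attribute.
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is the first <title> tag; .text is its text content.
    return soup.title.text


html = '<title>Jack and the bean stalk</title><header>This is a story about x y z</header><p>talk to you later</p>'
print(retrieve_title(html))  # Jack and the bean stalk
```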

Answer 2

Jack and the bean stalk is the text node that immediately follows the opening title tag, so to grab it you can apply .find(text=True):

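from bs4 import BeautifulSoup

html = '''
<title>
 Jack and the beanstalk
</title>
<header>
 This is a story about x y z
</header>
<p>
 Once upon a time
</p>
'''

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, 'html.parser')

# print(soup.prettify())

# find(text=True) returns the first text node inside <title>, including
# its surrounding whitespace and line breaks. Newer Beautiful Soup
# releases prefer the name string= for this argument.
title = soup.title.find(text=True)
print(title)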

Output:

 Jack and the beanstalk
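
The printed text keeps the line breaks and padding around the title because the text node is returned verbatim. As a sketch of one way to tidy that up, `get_text(strip=True)` trims the surrounding whitespace. The `retrieve_title_clean` name and the choice to return `None` when there is no title tag are assumptions for illustration, not part of the answer above:

```python
from bs4 import BeautifulSoup


def retrieve_title_clean(html):
    # Hypothetical helper: returns the stripped title text, or None if absent.
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


html = '''
<title>
 Jack and the beanstalk
</title>
<p>
 Once upon a time
</p>
'''
print(retrieve_title_clean(html))  # Jack and the beanstalk
print(retrieve_title_clean('<p>no title here</p>'))  # None
```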
