Parse just the URL, based on the link declaration in Beautiful Soup

Ultimately I'm trying to parse a URL out of the page when it meets a specific condition: one of the keywords in xx_web_job_alt_keywords appears in job.get_text().

xx_good_jobs = []
xx_web_job_alt_keywords = ['Website']
# <a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
each_job_link_details = soup.find_all('a', class_='result-title hdrlnk')

for job in each_job_link_details:
    if xx_web_job_alt_keywords in job.get_text():
        #append '//mywebsite.com/web/123.html' to list:xx_good_jobs 
        xx_good_jobs.append(xx_web_job_alt_keywords.get('href',None))

What do you think of this approach?
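For reference, here is a minimal corrected sketch of that loop, assuming the goal is to keep a link whenever any keyword occurs in its text (ignoring case) and to collect its href attribute; it reuses the sample link from the comment above:

from bs4 import BeautifulSoup

# the sample link taken from the question
html = """<a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>"""
soup = BeautifulSoup(html, 'html.parser')

xx_web_job_alt_keywords = ['Website']
xx_good_jobs = []

each_job_link_details = soup.find_all('a', class_='result-title hdrlnk')

for job in each_job_link_details:
    link_text = job.get_text()
    # a string membership test needs a string on the left, so test each keyword;
    # lowercasing both sides makes the match case-insensitive
    if any(keyword.lower() in link_text.lower() for keyword in xx_web_job_alt_keywords):
        # the href lives on the tag itself (job), not on the keyword list
        xx_good_jobs.append(job.get('href'))

print(xx_good_jobs)  # ['//mywebsite.com/web/123.html']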

Or, you can take a more explicit approach with a searching function:

xx_web_job_alt_keywords = ['Website']

def desired_links(tag):
    """Filters 'header' links having desired keywords in the text."""

    class_attribute = tag.get('class', [])
    is_header_link = tag.name == 'a' and 'result-title' in class_attribute and 'hdrlnk' in class_attribute

    link_text = tag.get_text()
    has_keywords = any(keyword.lower() in link_text.lower() for keyword in xx_web_job_alt_keywords)

    return is_header_link and has_keywords

xx_good_jobs = [link['href'] for link in soup.find_all(desired_links)]

Note that we are using the any() built-in function to check whether any of the keywords appear in the text. Also note that we lowercase both the keywords and the text to handle differences in case.

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """
   ...:     <div>
   ...:         <a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
   ...:         <a class="result-title hdrlnk" href="//mywebsite.com/web/456.html" data-id="1234">Some other header link</a>
   ...:     </div>"""

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: xx_web_job_alt_keywords = ['Website']

In [5]: def desired_links(tag):
   ...:     """Filters 'header' links having desired keywords in the text."""
   ...: 
   ...:     class_attribute = tag.get('class', [])
   ...:     is_header_link = tag.name == 'a' and 'result-title' in class_attribute and 'hdrlnk' in class_attribute
   ...: 
   ...:     link_text = tag.get_text()
   ...:     has_keywords = any(keyword.lower() in link_text.lower() for keyword in xx_web_job_alt_keywords)
   ...: 
   ...:     return is_header_link and has_keywords
   ...: 

In [6]: xx_good_jobs = [link['href'] for link in soup.find_all(desired_links)]

In [7]: xx_good_jobs
Out[7]: [u'//mywebsite.com/web/123.html']

import bs4, re
html = '''<a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
        <a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
        <a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>'''
soup = bs4.BeautifulSoup(html, 'lxml')

keywords = ['Website', 'Website', 'business']
regex = '|'.join(keywords)
for a in soup.find_all('a', class_="result-title hdrlnk", text=re.compile(regex,re.IGNORECASE)):
    print(a.get('href'))

Output:

//mywebsite.com/web/123.html
//mywebsite.com/web/123.html
//mywebsite.com/web/123.html

Edit:

keywords = ['Website', 'Website', 'business']

regex = '|'.join(keywords)

Output:

'Website|Website|business'

Just use regex alternation (|) to match any of several keywords in the a tag's text.
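As a quick sanity check of what that pattern matches, here is a small sketch against the sample link text used above (standard library only, no values beyond those already shown):

import re

keywords = ['Website', 'Website', 'business']
pattern = re.compile('|'.join(keywords), re.IGNORECASE)

text = "Print business magazine's website management"  # the sample link text

print(bool(pattern.search(text)))  # True: at least one keyword matches
print(pattern.findall(text))       # ['business', 'website']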

Edit 2:

keyword_lists = [['Website', 'Website', 'business'], ['Website1', 'Website1', 'business1'], ['Website2', 'Website2', 'business2']]
sum(keyword_lists, [])

Output:

['Website',
 'Website',
 'business',
 'Website1',
 'Website1',
 'business1',
 'Website2',
 'Website2',
 'business2']
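
Putting the two edits together, a hedged sketch of how the flattened list could feed the same regex-based search; the HTML sample and keyword_lists values are the ones shown above:

import re
import bs4

html = '''<a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>'''
soup = bs4.BeautifulSoup(html, 'lxml')

keyword_lists = [['Website', 'Website', 'business'],
                 ['Website1', 'Website1', 'business1'],
                 ['Website2', 'Website2', 'business2']]

# flatten the nested keyword lists, then join them into one alternation pattern
keywords = sum(keyword_lists, [])
regex = '|'.join(keywords)

for a in soup.find_all('a', class_="result-title hdrlnk", text=re.compile(regex, re.IGNORECASE)):
    print(a.get('href'))  # //mywebsite.com/web/123.html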