I need to extract the id from the class text
I have the following HTML code:
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31121 status-publish first instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31301 status-publish instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-28416 status-publish last instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
I need to use Beautiful Soup to extract the ID of each product shown in the class attribute (31121, 31301 and 28416 are the IDs).
How can I do that?
Iterate over your selection, read the class attribute of each element, then iterate over its classes and pick the one that starts with post-:
[c.split('-')[-1] for e in soup.select('div.type-product') for c in e['class'] if c.startswith('post-')]
or
[c.split('-')[-1] for e in soup.select('div[class*="post-"]') for c in e['class'] if c.startswith('post-')]
Example
html = '''
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31121 status-publish first instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31301 status-publish instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-28416 status-publish last instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
[c.split('-')[-1] for e in soup.select('div.type-product') for c in e['class'] if c.startswith('post-')]
Output
['31121', '31301', '28416']
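The two selectors above are not equivalent, which is why the startswith('post-') filter matters. The following is a minimal sketch on a shortened, hypothetical snippet (the second div and its "widget" class are made up for illustration): the substring selector div[class*="post-"] also matches an element whose only hit is has-post-thumbnail, and the filter then discards it.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: the second div has no post-<id> class.
html = '''
<div class="product type-product post-31121 has-post-thumbnail"></div>
<div class="widget has-post-thumbnail"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# [class*="post-"] is a substring match on the whole attribute value,
# so "has-post-thumbnail" is enough for an element to be selected.
matched = soup.select('div[class*="post-"]')
print(len(matched))  # 2

# Filtering the individual class names keeps only the real post-<id> class.
ids = [c.split('-')[-1] for e in matched for c in e['class'] if c.startswith('post-')]
print(ids)  # ['31121']
```

If every product div carries type-product, as in the question, the div.type-product selector is the narrower choice of the two.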
- Select all divs that have a class name starting with post-.
- Iterate over all the class names of each div and keep the one starting with post-.
- Add the post id to the list.
import re
html_attr='''
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31121 status-publish first instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31301 status-publish instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-28416 status-publish last instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_attr, 'html.parser')
div_list = soup.find_all('div', {"class": re.compile("^post-")})
id_list = []
for div in div_list:
    # Take the first class name that starts with post- and keep the part after the hyphen
    post_id = [name.split('-')[1] for name in div['class'] if name.startswith('post-')][0]
    id_list.append(post_id)
print(id_list)
Output
['31121', '31301', '28416']
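Both answers rely on startswith('post-'), which would also accept a class such as post-thumbnail and return 'thumbnail' as an ID. A possible hardening, sketched below under the assumption that the IDs are always purely numeric, is to match each class name against a regular expression and convert the result to int. The function name extract_post_ids and the shortened sample markup are hypothetical, not part of the original answers.

```python
import re
from bs4 import BeautifulSoup

# Assumption: a post ID class is exactly "post-" followed by digits.
POST_CLASS = re.compile(r"post-(\d+)")

def extract_post_ids(html):
    """Return the numeric post IDs found in the class lists of product divs."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for div in soup.select("div.type-product"):
        for name in div.get("class", []):
            match = POST_CLASS.fullmatch(name)  # rejects "post-thumbnail"
            if match:
                ids.append(int(match.group(1)))
    return ids

# Shortened, hypothetical sample: the third div has no post-<id> class at all.
sample = '''
<div class="product type-product post-31121 has-post-thumbnail"></div>
<div class="product type-product post-31301 post-thumbnail"></div>
<div class="product type-product"></div>
'''
print(extract_post_ids(sample))  # [31121, 31301]
```

Divs without a matching class are skipped silently here; whether that or an error is preferable depends on how strictly the page structure should be checked.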