Python: splitting strings and converting them to a list that accounts for empty fields

Question

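I have spent the whole day trying to solve this problem without finding a solution, so I hope you can help me. I am extracting data from a website, but I don't know how to split the list so that 500g becomes 500,g. The difficulty is that the quantity on the site is sometimes 1 and sometimes 1 1/2 kg or something similar. I need to turn this into a CSV file and then load it into a MySQL database. What I ultimately want is a CSV file with the following columns: ingredient ID, ingredient, quantity, and unit of the quantity. For example: 0, meat, 500, g. This is the code I already have for extracting the data from this website: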

import re
from bs4 import BeautifulSoup
import requests
import csv

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
ingredients = []
menge = []

def read_recipes():
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                ingredients.append([id2, *[ingredient.get_text(strip=True)]])
            else:
                ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])

read_recipes()

I hope you can help me. Thanks!

Answer 1

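It looks like the strings that contain fractions use Unicode characters for ½ and the like, so I think a good starting point is to look up the specific code point and pass it to str.replace(). Splitting the unit and the amount is easy in this example because they are separated by a space. If you run into other combinations, you may need to generalize this further. The following code works for this specific example: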

import re
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
ingredients = []
menge = []
einheit = []


for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
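    # name the parser explicitly (same one as in the question) to avoid bs4's "no parser specified" warning
    soup2 = BeautifulSoup(requests.get(url).content, "lxml")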
    for ingredient in soup2.select('.td-left'):
        # get rid of multiple spaces and replace 1/2 unicode character
        raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
        # split into unit and number
        splitlist = raw_string.split(" ")
        menge.append(splitlist[0])
        if len(splitlist) == 2:
            einheit.append(splitlist[1])
        else:
            einheit.append('')
    for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
        if ingredient.name == 'h3':
            continue
        else:
            ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])

result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
result.loc[:, "unit"] = einheit
result.loc[:, "amount"] = menge

Output:

 >>> result
     ID                                        Ingredients   unit amount
 0    0  Beinscheibe(n), vom Rind, ca. 4 cm dick geschn...             4
 1    0                                               Mehl         etwas
 2    0                                         Zwiebel(n)             1
 3    0                                   Knoblauchzehe(n)             2
 4    0                                         Karotte(n)             1
 5    0                                     Lauchstange(n)             1
 6    0                                    Staudensellerie           0.5
 7    0                                Tomate(n), geschält   Dose      1
 8    0                                        Tomatenmark     EL      1
 9    0                              Rotwein zum Ablöschen
 10   0                       Rinderfond oder Fleischbrühe  Liter    0.5
 11   0                                Olivenöl zum Braten
 12   0                                     Gewürznelke(n)             2
 13   0                                       Pimentkörner            10
 14   0                                  Wacholderbeere(n)             5
 15   0                                      Pfefferkörner
 16   0                                               Salz
 17   0                    Pfeffer, schwarz, aus der Mühle
 18   0                                            Thymian
 19   0                                           Rosmarin
 20   0                            Zitrone(n), unbehandelt             1
 21   0                                   Knoblauchzehe(n)             2
 22   0                                    Blattpetersilie   Bund      1
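
If other combinations do show up (for example "500g" without a space, or "1 ½ kg" with a whole number and a fraction), the plain split(" ") above will not separate them cleanly. One possible generalization is sketched below. The name split_amount, the table of Unicode fractions, and the choice to return None as the amount when no leading number is found are all my own assumptions for illustration and not something the site or the code above prescribes. It has not been run against the live page.

```python
import re
from fractions import Fraction

# Unicode vulgar fractions mapped to their numeric value (extend as needed).
UNICODE_FRACTIONS = {
    "\u00bd": Fraction(1, 2),  # ½
    "\u00bc": Fraction(1, 4),  # ¼
    "\u00be": Fraction(3, 4),  # ¾
    "\u2153": Fraction(1, 3),  # ⅓
    "\u2154": Fraction(2, 3),  # ⅔
}

# One numeric token: an ASCII fraction ("1/2"), a decimal with "." or ",",
# a plain integer, or a single Unicode fraction character.
NUMBER_TOKEN = re.compile(
    r"\d+\s*/\s*\d+|\d+(?:[.,]\d+)?|[" + "".join(UNICODE_FRACTIONS) + "]"
)


def split_amount(raw):
    """Return (amount, unit); amount is a float, or None if no leading number."""
    text = re.sub(r"\s+", " ", raw).strip()
    total = Fraction(0)
    found = False
    pos = 0
    while True:
        match = NUMBER_TOKEN.match(text, pos)
        if match is None:
            break
        token = match.group(0)
        if token in UNICODE_FRACTIONS:
            total += UNICODE_FRACTIONS[token]
        elif "/" in token:
            numerator, denominator = token.split("/")
            total += Fraction(int(numerator), int(denominator))
        else:
            # Accept the German decimal comma as well as a decimal point.
            total += Fraction(token.replace(",", "."))
        found = True
        pos = match.end()
        # Skip the spaces between consecutive numeric tokens.
        while pos < len(text) and text[pos] == " ":
            pos += 1
    # Everything after the leading numbers is treated as the unit.
    unit = text[pos:].strip()
    return (float(total) if found else None), unit
```

Consecutive numeric tokens are added together, so "1 1/2 kg" is read as one and a half kilograms. Text with no leading number, such as "etwas", ends up entirely in the unit field, which differs from the code above where it lands in the amount column. Adjust that to whatever your database schema expects.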

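Since the question asks for a CSV file with ID, ingredient, amount and unit, here is a minimal sketch of that last step using only the standard csv module. The function name write_ingredients_csv, the column names, and the sample rows are hypothetical and only meant to show the shape of the data. Real rows would come from the scraped lists.

```python
import csv
import io


def write_ingredients_csv(rows, fileobj):
    """Write (recipe_id, ingredient, amount, unit) rows as CSV to fileobj."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "ingredient", "amount", "unit"])
    for recipe_id, ingredient, amount, unit in rows:
        # An empty field marks a missing amount or unit in the CSV.
        writer.writerow([recipe_id, ingredient, "" if amount is None else amount, unit])


# Illustrative rows only, loosely based on the output table above.
sample_rows = [
    (0, "Tomatenmark", 1, "EL"),
    (0, "Rinderfond oder Fleischbrühe", 0.5, "Liter"),
    (0, "Salz", None, ""),
]

# Writing to an in-memory buffer keeps the sketch free of file-system side effects.
buffer = io.StringIO()
write_ingredients_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
```

To write a real file, pass a handle opened with open("ingredients.csv", "w", newline="", encoding="utf-8") instead of the buffer (the file name is just an example). If you already have the DataFrame from above, result.to_csv("ingredients.csv", index=False) should give a comparable file with less code.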
Tags: python, mysql, csv, beautifulsoup, export-to-csv