Python 拆分字符串并将它们转换为通知空字段的列表
Python splitting strings and convert them to a list that notices empty fields
我花了一整天的时间试图解决这个问题,但我没有找到解决方案,所以我希望你能帮助我。我已经尝试从网站上提取数据。但问题是我不知道如何拆分列表以便将 500g 转换为 500,g。问题是在网站上有时数量是 1,有时是 1 1/2 公斤或某物。现在我需要将其转换为 CSV 文件,然后转换为 MySQL 数据库。我最后想要的是一个包含以下列的 CSV 文件:成分 ID、成分、数量和成分的数量单位。例如:
0,肉,500,克。这是我已经从 this 网站提取数据的代码:
import re
from bs4 import BeautifulSoup
import requests
import csv
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []
def read_recipes():
for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
soup2 = BeautifulSoup(requests.get(url).content, "lxml")
for ingredient in soup2.select('.td-left'):
menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
if ingredient.name == 'h3':
ingredients.append([id2, *[ingredient.get_text(strip=True)]])
else:
ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
read_recipes()
希望大家能帮帮我谢谢!
似乎包含分数的字符串使用 1/2 等的 unicode 符号,所以我认为开始的一个好方法是通过查找特定的 code 并将其传递给 str.replace()
。拆分此示例的单位和金额很容易,因为它们由 space 分隔。但如果您遇到其他组合,可能有必要对此进行更多概括。
以下代码适用于此特定示例:
import re
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []
einheit = []
for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
soup2 = BeautifulSoup(requests.get(url).content)
for ingredient in soup2.select('.td-left'):
# get rid of multiple spaces and replace 1/2 unicode character
raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
# split into unit and number
splitlist = raw_string.split(" ")
menge.append(splitlist[0])
if len(splitlist) == 2:
einheit.append(splitlist[1])
else:
einheit.append('')
for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
if ingredient.name == 'h3':
continue
else:
ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])
result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
result.loc[:, "unit"] = einheit
result.loc[:, "amount"] = menge
输出:
>>> result
ID Ingredients unit amount
0 0 Beinscheibe(n), vom Rind, ca. 4 cm dick geschn... 4
1 0 Mehl etwas
2 0 Zwiebel(n) 1
3 0 Knoblauchzehe(n) 2
4 0 Karotte(n) 1
5 0 Lauchstange(n) 1
6 0 Staudensellerie 0.5
7 0 Tomate(n), geschält Dose 1
8 0 Tomatenmark EL 1
9 0 Rotwein zum Ablöschen
10 0 Rinderfond oder Fleischbrühe Liter 0.5
11 0 Olivenöl zum Braten
12 0 Gewürznelke(n) 2
13 0 Pimentkörner 10
14 0 Wacholderbeere(n) 5
15 0 Pfefferkörner
16 0 Salz
17 0 Pfeffer, schwarz, aus der Mühle
18 0 Thymian
19 0 Rosmarin
20 0 Zitrone(n), unbehandelt 1
21 0 Knoblauchzehe(n) 2
22 0 Blattpetersilie Bund 1
我花了一整天的时间试图解决这个问题,但我没有找到解决方案,所以我希望你能帮助我。我已经尝试从网站上提取数据。但问题是我不知道如何拆分列表以便将 500g 转换为 500,g。问题是在网站上有时数量是 1,有时是 1 1/2 公斤或某物。现在我需要将其转换为 CSV 文件,然后转换为 MySQL 数据库。我最后想要的是一个包含以下列的 CSV 文件:成分 ID、成分、数量和成分的数量单位。例如: 0,肉,500,克。这是我已经从 this 网站提取数据的代码:
import re
from bs4 import BeautifulSoup
import requests
import csv
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []
def read_recipes():
for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
soup2 = BeautifulSoup(requests.get(url).content, "lxml")
for ingredient in soup2.select('.td-left'):
menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
if ingredient.name == 'h3':
ingredients.append([id2, *[ingredient.get_text(strip=True)]])
else:
ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
read_recipes()
希望大家能帮帮我谢谢!
似乎包含分数的字符串使用 1/2 等的 unicode 符号,所以我认为开始的一个好方法是通过查找特定的 code 并将其传递给 str.replace()
。拆分此示例的单位和金额很容易,因为它们由 space 分隔。但如果您遇到其他组合,可能有必要对此进行更多概括。
以下代码适用于此特定示例:
import re
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []
einheit = []
for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
soup2 = BeautifulSoup(requests.get(url).content)
for ingredient in soup2.select('.td-left'):
# get rid of multiple spaces and replace 1/2 unicode character
raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
# split into unit and number
splitlist = raw_string.split(" ")
menge.append(splitlist[0])
if len(splitlist) == 2:
einheit.append(splitlist[1])
else:
einheit.append('')
for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
if ingredient.name == 'h3':
continue
else:
ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])
result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
result.loc[:, "unit"] = einheit
result.loc[:, "amount"] = menge
输出:
>>> result
ID Ingredients unit amount
0 0 Beinscheibe(n), vom Rind, ca. 4 cm dick geschn... 4
1 0 Mehl etwas
2 0 Zwiebel(n) 1
3 0 Knoblauchzehe(n) 2
4 0 Karotte(n) 1
5 0 Lauchstange(n) 1
6 0 Staudensellerie 0.5
7 0 Tomate(n), geschält Dose 1
8 0 Tomatenmark EL 1
9 0 Rotwein zum Ablöschen
10 0 Rinderfond oder Fleischbrühe Liter 0.5
11 0 Olivenöl zum Braten
12 0 Gewürznelke(n) 2
13 0 Pimentkörner 10
14 0 Wacholderbeere(n) 5
15 0 Pfefferkörner
16 0 Salz
17 0 Pfeffer, schwarz, aus der Mühle
18 0 Thymian
19 0 Rosmarin
20 0 Zitrone(n), unbehandelt 1
21 0 Knoblauchzehe(n) 2
22 0 Blattpetersilie Bund 1