Map different column values with website context
I have a DataFrame like this:
df1 = pd.DataFrame({
    "index": ["EXEC sp_delete_job", "exec sp_add_job", "something else", "exec sp_add_jobserver"],
    "index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
    "index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()
                   index              index1              index2              index3
0     EXEC sp_delete_job                 NaN   EXEC sp_droplogin   EXEC sp_droplogin
1        exec sp_add_job                 NaN  EXEC sp_delete_job  EXEC sp_delete_job
2         something else                 NaN                 NaN     exec sp_add_job
3  exec sp_add_jobserver  exec sp_delete_job      something else  exec sp_delete_job
What I want is to map the column values to the descriptions on this site:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-server-ver15
So, for example, the value EXEC sp_droplogin can be mapped to the description here:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-server-ver15
So the output would look like this:
index
0 Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1 EXEC sp_delete_job
2 exec sp_add_job
3 exec sp_delete_job
4 exec sp_add_jobserver
The same has to be done for the values in the other columns.
What is the best way to do this? With BeautifulSoup?
Could you provide some ideas, direction, or code?
You can call a function for each index entry and replace the entry with the result of a requests / beautifulsoup lookup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def description(value):
    name = value.split(' ')[1].replace('_', '-')
    url = f"https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")
    div = soup.find('div', class_="content")
    return [p.text for p in div.find_all('p')][3]

df = pd.DataFrame({
    "index": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job", "exec sp_add_jobserver"],
})
df['index'] = df['index'].map(description)
print(df)
This changes your DataFrame as follows:
index
0 Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1 Deletes a job.
2 Adds a new job executed by the SQL Agent service.
3 Deletes a job.
4 Targets the specified job at the specified server.
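Take the value, for example EXEC sp_droplogin, and split it on the space.
Take the second part, sp_droplogin, and replace each _ with the - that the URL needs.
Build a suitable URL from name.
Use requests.get() to fetch the corresponding HTML from the Microsoft site.
Find the <div class='content'> that contains the description.
Inside the div, find all <p> elements and extract the text of each. The fourth entry contains the required text, so return that one. Note that this position depends on the layout of the page, so it may need adjusting if the site changes.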
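The first three steps (split, replace, build the URL) need no network access, so they can be checked on their own. A minimal sketch, where procedure_url and BASE are names made up for this illustration and lowercasing the name is an assumption (the code above keeps the original case):

```python
# Hypothetical helper: builds the documentation URL for an "EXEC sp_xxx" value.
BASE = "https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures"

def procedure_url(value: str) -> str:
    # "EXEC sp_droplogin" -> ["EXEC", "sp_droplogin"] -> "sp-droplogin"
    name = value.split()[1].lower().replace("_", "-")
    return f"{BASE}/{name}-transact-sql?view=sql-server-ver15"

print(procedure_url("EXEC sp_droplogin"))
```

For EXEC sp_droplogin this prints the same sp-droplogin URL that is quoted in the question.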
If there are None values, you need to test for them and return a suitable value:
def description(value):
    if value:
        .........existing code......
    else:
        return "Not found"
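One thing to watch: a plain truthiness test catches None and empty strings, but a float NaN and the string "NaN" (which is what the sample data contains) are both truthy and would slip through. A possible guard is sketched below; describe_or_default is a hypothetical name, lookup stands for whatever function does the real lookup, and treating the text "NaN" as missing is an assumption about your data:

```python
import math

def describe_or_default(value, lookup, default="Not found"):
    """Call lookup(value) only for real, non-empty strings; otherwise return default."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default  # a genuine NaN, e.g. from np.nan in the DataFrame
    if not isinstance(value, str):
        return default
    text = value.strip()
    if not text or text == "NaN":  # assumption: the literal text "NaN" means missing
        return default
    return lookup(value)
```

You would then pass the existing description function as lookup, for example describe_or_default(cell, description).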
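For your updated example, I suggest using a dictionary to hold the result of each request, so that the same value is not looked up more than once.
You can use .applymap() to run the function on every item in the DataFrame (pandas 2.1 and later deprecate .applymap() in favour of DataFrame.map()).
Finally, if value does not start with exec, just return the value unchanged (or anything else you prefer):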
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

procedures = {}  # cache of results

def description(value):
    if value.lower().startswith("exec "):
        name = value.lower().split(' ')[1].replace('_', '-')
        if name in procedures:  # already seen?
            return procedures[name]
        else:
            url = f"https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
            req = requests.get(url)
            soup = BeautifulSoup(req.content, "html.parser")
            div = soup.find('div', class_="content")
            text = [p.text for p in div.find_all('p')][3]
            procedures[name] = text
            return text
    else:
        return value

df = pd.DataFrame({
    "index": ["EXEC sp_delete_job", "exec sp_add_job", "something else", "exec sp_add_jobserver"],
    "index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
    "index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})
df = df.applymap(description)
print(df)
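A variation on the caching idea is to build the whole value-to-description mapping first and apply it afterwards, which also makes the logic testable without touching the network. This is only a sketch: describe_unique is a hypothetical name, and fetch stands for any function that takes a procedure name and returns its description (for instance the requests / BeautifulSoup part of the code above). Skipping a value when fetch raises is an assumption about how you want failures handled.

```python
from typing import Callable, Dict, Iterable, Optional

def describe_unique(values: Iterable[object], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Map each "EXEC name" value to a description, calling fetch once per distinct name."""
    by_name: Dict[str, Optional[str]] = {}  # cache keyed by lowercased procedure name
    result: Dict[str, str] = {}
    for value in values:
        if not isinstance(value, str):
            continue
        parts = value.split()
        if len(parts) < 2 or parts[0].lower() != "exec":
            continue  # not an EXEC call: leave it out so it stays unchanged
        name = parts[1].lower()
        if name not in by_name:
            try:
                by_name[name] = fetch(name)
            except Exception:
                by_name[name] = None  # remember the failure, do not retry
        cached = by_name[name]
        if cached is not None:
            result[value] = cached
    return result
```

With the mapping in hand, something like df.replace(mapping) should substitute the matching cells and leave everything else as it was.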