Map different column values with website context

I have a data frame like this:

df1 = pd.DataFrame({   
    "index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else","exec sp_add_jobserver"],
    "index1": ["NaN",  "NaN",  "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "NaN", "something else"],
    "index3": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()

      index                 index1                  index2          index3
0   EXEC sp_delete_job       NaN                EXEC sp_droplogin   EXEC sp_droplogin
1   exec sp_add_job          NaN                EXEC sp_delete_job  EXEC sp_delete_job
2   something else           NaN                  NaN                   exec sp_add_job
3   exec sp_add_jobserver    exec sp_delete_job   something else    exec sp_delete_job

What I want is to map the column values to the descriptions on this site: https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-server-ver15

So, for example, the value EXEC sp_droplogin would map to the description here: https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-server-ver15

The output would then look like this:

    index
    0   Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
    1   EXEC sp_delete_job
    2   exec sp_add_job
    3   exec sp_delete_job
    4   exec sp_add_jobserver

The same has to be done for the values in the other columns.

What is the best way to do this? With BeautifulSoup?

Can you provide some ideas, direction, or code?

You can call a function for each index entry and replace the entry with the result of a requests and BeautifulSoup lookup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

def description(value):
    # "EXEC sp_droplogin" -> "sp-droplogin"
    name = value.split(' ')[1].replace('_', '-')
    url = f"https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")
    div = soup.find('div', class_="content")
    # The fourth <p> inside the content div holds the description
    return [p.text for p in div.find_all('p')][3]


df = pd.DataFrame({   
    "index": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"],
})


df['index'] = df['index'].map(description)
print(df)

This changes your data frame as follows:

                                                                                                  index
0  Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1                                                                                        Deletes a job.
2                                                     Adds a new job executed by the SQL Agent service.
3                                                                                        Deletes a job.
4                                                    Targets the specified job at the specified server.
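
The function works as follows:

  1. Take the value, for example EXEC sp_droplogin, and split it on the space. Then take the second part, sp_droplogin, and replace each _ with the - that the URL needs.

  2. Build a suitable URL from the name.

  3. Use requests.get() to fetch the corresponding HTML from the Microsoft site.

  4. Find the <div class='content'> that contains the description.

  5. Inside the div, find all <p> elements and extract the text of each one. The fourth entry holds the text you want, so return that.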
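
Steps 1 and 2 can be checked without any network access. The following is only a minimal sketch: the helper name `doc_url` and the `BASE` constant are made up for illustration, and the URL pattern is simply the one used in the answer's code.

```python
# Hypothetical constant: the common prefix of the documentation URLs.
BASE = (
    "https://docs.microsoft.com/en-us/sql/relational-databases/"
    "system-stored-procedures"
)

def doc_url(value):
    """Build the documentation URL for a value such as 'EXEC sp_droplogin'."""
    # Step 1: split on the space, keep the second part (e.g. 'sp_droplogin'),
    # and swap '_' for the '-' that the URL uses.
    name = value.split(' ')[1].replace('_', '-')
    # Step 2: drop the name into the URL pattern.
    return f"{BASE}/{name}-transact-sql?view=sql-server-ver15"

print(doc_url("EXEC sp_droplogin"))
```

Printing the URL for a few values before making any requests is a cheap way to confirm that it matches the page you expect.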

If there are None values, you need to test for them and return a suitable value:

def description(value):
    if value:
        ...  # existing lookup code goes here
    else:
        return "Not found"
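
Note that `if value:` only catches `None` and empty strings. A real `float('nan')` is truthy, and the example data frame stores the literal string "NaN". Below is a minimal sketch of a stricter check, assuming you want to treat all of these as missing; the helper name `is_missing` is hypothetical.

```python
import math

def is_missing(value):
    """Return True for None, a float NaN, an empty string, or the text 'NaN'."""
    if value is None:
        return True
    # float('nan') is truthy, so it needs an explicit check.
    if isinstance(value, float) and math.isnan(value):
        return True
    # Treat blank strings and the literal text 'NaN' as missing too.
    if isinstance(value, str) and value.strip().lower() in ("", "nan"):
        return True
    return False

samples = [None, float("nan"), "NaN", "", "EXEC sp_droplogin"]
print([is_missing(v) for v in samples])
```

With a helper like this, the guard in `description` could become `if not is_missing(value):`.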

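For your updated example, I suggest using a dictionary to hold the result of each request, so that the same value is never looked up more than once.

You can use .applymap() to run the function on every item in the data frame. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which does the same thing.)

Finally, if the value does not start with exec, just return it unchanged (or return whatever you prefer).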

import pandas as pd
import requests
from bs4 import BeautifulSoup

procedures = {}     # cache of results

def description(value):   
    if value.lower().startswith("exec "):
        name = value.lower().split(' ')[1].replace('_', '-')
        
        if name in procedures:  # already seen?
            return procedures[name]
        else:
            url = f"https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
            req = requests.get(url)
            soup = BeautifulSoup(req.content, "html.parser")
            div = soup.find('div', class_="content")
            text = [p.text for p in div.find_all('p')][3]
            procedures[name] = text
            return text
    else:
        return value


df = pd.DataFrame({   
    "index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else", "exec sp_add_jobserver"],
    "index1": ["NaN", "NaN",  "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
    "index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})

df = df.applymap(description)
print(df)
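
To see the effect of the cache without hitting the network, the lookup can be swapped for a stand-in. This sketch is illustrative only: `make_describer`, `fake_fetch`, and the placeholder text it returns are invented for the example, and a real fetch would be the requests and BeautifulSoup code above.

```python
def make_describer(fetch):
    """Wrap fetch(name) so each procedure name is looked up only once."""
    cache = {}

    def describe(value):
        # Leave anything that is not an 'exec ...' call unchanged.
        if not value.lower().startswith("exec "):
            return value
        name = value.lower().split(' ')[1].replace('_', '-')
        if name not in cache:
            cache[name] = fetch(name)  # only reached on the first sighting
        return cache[name]

    return describe


calls = []

def fake_fetch(name):
    # Stand-in for the HTTP request; records which names were requested.
    calls.append(name)
    return f"<description of {name}>"

describe = make_describer(fake_fetch)
values = ["EXEC sp_delete_job", "exec sp_delete_job", "something else", "NaN"]
results = [describe(v) for v in values]
print(results)
print(calls)  # one entry per unique procedure name
```

Because the name is lower-cased before the lookup, `EXEC sp_delete_job` and `exec sp_delete_job` share one cache entry, so only one request would be made for both.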