Single column in dataframe needs to be broken up into 3 columns

Question

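I've been using Stack Overflow a lot lately, and I'm grateful for the community here.

I have some code I've been working on, and it is finally starting to look the way it should, but I've hit a snag I can't get past.

I pulled data for different states from a site. The tables all look the same but contain different data. I had to change the BeautifulSoup code to get the loop working, but now I have one very ugly column that contains all of the data. It's easy to see which row goes where, but I really don't know where to start in Python.

Any help would be appreciated.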

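import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

states = ["Washington", "Oregon"]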

period = "2020"

num_states = len(states)

state_list = []

df = pd.DataFrame()
#df.columns['COUNTY','PAYMENT','TOTAL ACRES']

for state in states:
    driver = webdriver.Chrome(executable_path = 'C:/webdrivers/chromedriver.exe')
    driver.get('https://www.nbc.gov/pilt/counties.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    drp = Select(state_s)
    drp.select_by_visible_text(state)
    year_s = driver.find_element(By.NAME, 'fiscal_yr')
    drp = Select(year_s)
    drp.select_by_visible_text(period)
    driver.implicitly_wait(10)
    link = driver.find_element(By.NAME, 'Search')
    link.click()
    url = driver.current_url
    page = requests.get(url)
    #dfs  = pd.read_html(addrss)[2]
    # Parse the HTML and take the third table on the page
    soup = BeautifulSoup(page.text, 'lxml')
    table = soup.find_all('table')[2]
    headers = []

    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)
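
    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        # Assumed: the post omits the append step. Concatenating each row as its
        # own 3-row frame reproduces the single-column output shown below.
        df = pd.concat([df, pd.DataFrame(row_data)])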

df

Output:

           0
0    ADAMS COUNTY
1         ,408
2          21,337
0   ASOTIN COUNTY
1        4,550
..            ...
1         ,627
2          58,311
0           TOTAL
1     ,321,995
2      31,312,205

[228 rows x 1 columns]

Answer 1

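You can set a custom index, then use unstack() and rename. This assumes that the headers in the output above match your target dataset (i.e. the index repeats in multiples of 3).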
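To make that assumption concrete, here is a minimal sketch with made-up values (the county names and numbers are placeholders, not data from the site). It shows what groupby(level=0).cumcount() produces on an index that repeats 0, 1, 2:

```python
import pandas as pd

# Made-up values laid out like the question's output: index 0, 1, 2 repeating.
df = pd.DataFrame(
    {0: ["COUNTY A", "100", "1,000", "COUNTY B", "200", "2,000"]},
    index=[0, 1, 2, 0, 1, 2],
)

# cumcount() numbers each occurrence of an index label:
# the first 0, 1, 2 get 0; the second 0, 1, 2 get 1; and so on.
record_id = df.groupby(level=0).cumcount()
print(record_id.tolist())  # [0, 0, 0, 1, 1, 1]
```

Each block of three rows shares one counter value, which is what lets unstack() line the three values up as a single record.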

df1 = df.set_index(df.groupby(level=0)\
                   .cumcount(),append=True).stack()\
                   .unstack(0)\
                   .rename(columns={0 : 'County', 1: 'Price', 2 : 'Population?'})


print(df1)

                County            Price      Population?
0 1       ADAMS COUNTY          ,408           21,337
1 1      ASOTIN COUNTY         4,550           58,311
2 1     ANOTHER COUNTY          ,627       31,312,205
3 1              TOTAL      ,321,995              NaN
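If every record really does occupy exactly three consecutive rows, a plain reshape may be a simpler alternative to the unstack() chain. The sketch below is not from the answer above: it uses made-up values, and the column names are an assumption borrowed from the commented-out line in the question.

```python
import pandas as pd

# Made-up values in the same layout as the question:
# one column, index 0, 1, 2 repeating.
df = pd.DataFrame(
    {0: ["COUNTY A", "$100", "1,000", "COUNTY B", "$2,500", "20,000"]},
    index=[0, 1, 2, 0, 1, 2],
)

# Assumed column names, taken from the commented-out line in the question.
columns = ["COUNTY", "PAYMENT", "TOTAL ACRES"]

# Every 3 consecutive values form one record, so reshape into rows of 3.
wide = pd.DataFrame(df[0].to_numpy().reshape(-1, 3), columns=columns)

# Optional: strip "$" and "," so the numeric columns can be converted.
for col in ["PAYMENT", "TOTAL ACRES"]:
    wide[col] = pd.to_numeric(wide[col].str.replace(r"[$,]", "", regex=True))

print(wide)
```

The reshape raises a ValueError if the number of rows is not a multiple of 3, which doubles as a sanity check on the scraped data. The 228 rows in the question divide evenly into 76 records.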

selenium

beautifulsoup

multiple-columns

pandas