Python - pandas xls 导入 - 删除特定行时遇到困难 +

Python - pandas xls import - difficulties removing certain row +

[迷你康达,python 3]

我的数据.xls下载:(密码:stack) Download .xls

0) 您会注意到我的 xls 文件在第一行有很大的合并单元格,在第 2 行和第 3 行也有一些合并单元格。这是一个问题吗?如果这是一个问题 - 我能以某种方式取消合并吗?

1) 我想删除此 xls 的第一行,因为对我来说没有重要信息。我想问题是该行已合并?我想为此使用 df = df.drop([0]),但它没有删除这个巨大的第一行,而是删除了列为 headers 的行(从 "ID klienta" 开始)。这是为什么?

2) 在我摆脱第一行之后,我想处理来自不同列的一些数字(在我的示例中,我想将数据与 "Stav" 列分开)。我怎么做?我在某处看到可以仅通过其 header 名称(字符串)对 rows/columns 进行索引。例如,我想使用 header "Stav" 将数据与列分开: Stav = df['Stav']

到目前为止我的代码是:

import pandas as pd
import numpy as np

print("\n\n*********************************************")
print("My xls processing script\n")
print("*********************************************\n")

#load data 
df = pd.read_excel("file.xls")

#My unsucessful attempt to get rid of first row 
#uncomment this and it will remove the second row instead of the first row
#df = df.drop([0])

#print preview of 6 rows 5 columnts
print(df.iloc[0:5, 0:4])
print("\n\n")

#My unsuccessful attempt to get column date with header 'ID'
Stav = df['Stav']
print(Stav)

控制台输出:

(xls_env) C:\Users\Slavek\Documents\PythonScripts>python xld_proj.py

*********************************************
My xls processing script

*********************************************

  Lidé, které jsem podpořil                 Unnamed: 1 Unnamed: 2  Unnamed: 3
0                ID klienta                      Název       Stav  ID příběhu
1                       NaN                        NaN        NaN         NaN
2               zonky214882                       Jeep   na cestě      181187
3               zonky235862  Notebook k práci i relaxu   na cestě      206317
4               zonky230378               Dětský pokoj  v pořádku      199686



Traceback (most recent call last):
  File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Stav'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "xld_proj.py", line 20, in <module>
    Stav = df['Stav']
  File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Stav'

我想你想要 header 函数选项读入

df = pd.read_excel("file.xls", header =[0,1,2])

然后您可以删除不需要的 headers:

 df.columns = df.columns.droplevel([0,1])

或类似的东西。 sheet 有点乱,因为变量名分散在两个子 header 中。我会清理它,以便它们都在同一条线上。

或保留所有 header 并查看此处: How do I change or access pandas MultiIndex column headers?

查看您输入的 excel 文件的屏幕截图以及打印的数据框,您遇到的问题可能是由于第二行和第三行中的合并单元格所致。

我建议使用文档 (Link Here) 中概述的 pandas.DataFrame.to_excel 的一些参数。特别是,headerskiprows 应该对您有所帮助。

我在下面提供了一个示例,我在其中创建了一个 excel 文件 (.xlsx),它复制了合并单元格的问题。然后我将 .xlsx 复制为 .xls 并使用 pandas.DataFrame.to_excel 和拼写的 headerskiprows 阅读它。

import pandas as pd
import numpy as np
import shutil

# Creating a dataframe and saving as test.xlsx in current directory
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', startrow=3, index=False, 
header=False)
wb  = writer.book
ws = writer.sheets['Sheet1']

ws.merge_range('A1:C1', 'Large Merged Cell in first Row')
ws.merge_range('A2:A3', 'A')
ws.merge_range('B2:B3', 'B')
ws.merge_range('C2:C3', 'C')

wb.close()

print(df)
#copying test.xlsx as a .xls file
shutil.copy(r"test.xlsx" , r"test.xls")

new_df = pd.read_excel('test.xls', header = 0, skiprows = [0,2])
print(new_df)

预期 test.xls 文件:

print(new_df) 应该显示:

          A         B         C
0  1.242498  0.512675 -1.370710
1  0.060366 -0.467702 -1.420735
2 -0.198547  0.042364  0.915423
3  0.340909  0.749019  0.272871
4  2.633348 -1.343251 -0.248733
5  0.892257  0.371924  0.023415
6 -0.809030 -0.633796  0.449373
7  0.322960  2.073352  1.362657
8 -0.848093  1.848489  0.813144
9  2.718069 -0.540174  1.411980