.xlsx to csv, including changing number formats and validating data
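I'm stuck on a small project that validates the data in .xlsx files. Each column holds a different type of data, ranging from numbers to plain text, dates, and so on.
What I want to do is open an .xlsx file and change the number formats (e.g. by applying Excel number formats: '@' for strings, 'dd.mm.yyyy' for dates, '0' for integers, '0.00' for decimals, etc.). In the next step I validate the data of each column by iterating over every cell and character (I replace characters using dictionaries) and then save the result as a UTF-8 encoded csv.
Currently I open the .xlsx file with the xlrd module using utf-8 encoding (encoding_override) and then change the number formats with xlsxwriter. The problem is that I have to save the changed data as an .xlsx file, reopen it with xlrd, validate the characters, and then save it as a .csv using the unicode csv module. I would like to skip saving the file and reopening it.
I have already tried openpyxl, but opening "big" files and manipulating the data takes too long.
Is there a way to pass the data of xlsxwriter's Workbook class to xlrd, or does xlsxwriter have a way to iterate over its data and change the values? (I couldn't find anything in the documentation of either library.) Or maybe there is a more powerful library that suits my case.
Any advice is appreciated.
Code sample: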
import codecs
import xlsxwriter
from xlrd import open_workbook

# Opening the .xlsx file (xlrd; reading .xlsx requires xlrd < 2.0)
input_file = open_workbook(curr_path + "/" + filename + '.xlsx', encoding_override="cp1252")
# Creating the dictionaries for replace
def create_dicts(file_flag):
    global dict_letters
    global dict_numbers
    global dict_special
    global dict_control
    global dict_greek
    global dict_all
    if file_flag != 'sonstige':
        dict_letters = create_dict("Buchstaben")
        dict_numbers = create_dict("Zahlen")
        dict_special = create_dict("Sonderzeichen")
        dict_control = create_dict("Kontrollzeichen")
        dict_greek = create_dict("Griechisch")
    else:
        dict_all = create_dict("All")
    print("Dicts created")
def create_dict(ws):
    # dict_xl_wb and skip_list are assumed to be defined elsewhere in the script
    keys = []
    values = []
    # Either create a dictionary containing all sheets or one for each sheet (depending on the parameter ws)
    if ws == "All":
        sheets = [dict_xl_wb.sheet_by_index(i) for i in range(dict_xl_wb.nsheets)]
    else:
        sheets = [dict_xl_wb.sheet_by_name(ws)]
    for dict_xl_ws in sheets:
        for curr_row in range(dict_xl_ws.nrows):
            for curr_col in [0, 1]:
                value = str(dict_xl_ws.cell_value(curr_row, curr_col))
                if value not in skip_list:
                    # Column 0 holds the keys, column 1 the replacement values
                    if curr_col == 0:
                        keys.append(value.upper())
                    elif curr_col == 1:
                        values.append(value.upper())
    return dict(zip(keys, values))
# Calling the create_dicts() function
create_dicts(file_flag)
# Creating Workbook and Worksheet from class (xlsxwriter)
test = xlsxwriter.Workbook("test.xlsx")
test_ws = test.add_worksheet("TEST")
# Defining the number formats
text_format = test.add_format({'num_format': '@'})
integer_format = test.add_format({'num_format': '0'})
double_format = test.add_format({'num_format': '0.00'})
date_format = test.add_format({'num_format': 'DD.MM.YYYY'})
# Creating dictionary (key = start column; value = end column)
integer_dict = {0:1,4:5,8:9,16:16,19:19,25:26,28:29,33:33,35:35,42:42,44:46,48:49}
for key, value in integer_dict.items():
    # Applying the formats for each column in test_ws
    test_ws.set_column(key, value, 20, integer_format)
# After that I'd like to iterate through the data of xlsxwriter's workbook/worksheet class and change the data by replacing characters with those in the dictionaries. That part is already coded, but I need a proper library to work with
# Creating a csv-file to write the validated data into
output_file = codecs.open(curr_path + "/" + filename + '_OUT.csv','wb', encoding='utf-8')
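One possible way to avoid the intermediate workbook is to keep the column-to-format mapping in plain Python instead of in xlsxwriter. The following is only a sketch in Python 3: `expand_ranges` and `column_types` are hypothetical names, the type labels "integer" and "text" are made up for illustration, and the range dictionary is a shortened version of the `integer_dict` above.

```python
def expand_ranges(ranges):
    """Turn {start_col: end_col} (both inclusive) into a set of column indexes."""
    columns = set()
    for start, end in ranges.items():
        columns.update(range(start, end + 1))
    return columns


def column_types(n_cols, integer_ranges, default="text"):
    """Return a list with the assumed type label of every column."""
    integer_cols = expand_ranges(integer_ranges)
    return ["integer" if col in integer_cols else default
            for col in range(n_cols)]


# Shortened version of integer_dict (key = start column, value = end column)
integer_dict = {0: 1, 4: 5, 8: 9}
types = column_types(10, integer_dict)
```

With such a list, each cell value read by xlrd could be converted according to `types[col]` right away, so nothing would need to be written back to an .xlsx file first.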
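As @lenz correctly pointed out, xlrd has a function that converts a cell's Excel date value into a date.
So I used the xlrd.xldate.xldate_as_datetime() function to convert the values. The solution looks like this: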
date_variable = xlrd.xldate.xldate_as_datetime(ws.cell_value(cell_row, cell_col), 0).strftime('%d.%m.%Y')
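For readers who want to see what this conversion amounts to, here is a standard-library sketch in Python 3. It is not xlrd's implementation: `excel_serial_to_string` is a hypothetical helper, and it assumes serial numbers of 61 or more in the 1900 date system, because earlier values are affected by Excel's fictitious 29 February 1900.

```python
from datetime import datetime, timedelta


def excel_serial_to_string(serial, datemode=0, fmt="%d.%m.%Y"):
    """Format an Excel serial date; datemode 0 = 1900 system, 1 = 1904 system."""
    if datemode == 1:
        epoch = datetime(1904, 1, 1)
    else:
        if serial < 61:
            # Assumption of this sketch: early 1900 dates are not handled
            raise ValueError("serials below 61 are not handled here")
        epoch = datetime(1899, 12, 30)
    # The fractional part of the serial is the time of day
    return (epoch + timedelta(days=serial)).strftime(fmt)


date_string = excel_serial_to_string(43831.0)
```

When using xlrd itself, passing the workbook's `datemode` attribute instead of a hard-coded 0 is probably the safer choice, since a workbook may use the 1904 date system.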
To correctly turn the floats into integers, I used the following snippet:
integer_variable = '{0:g}'.format(Decimal(ws.cell_value(cell_row, cell_col)))
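Since the goal is validation, it may be worth rejecting values that are not whole numbers before formatting them. The sketch below (Python 3) wraps the same `Decimal` formatting in a hypothetical helper, `float_to_integer_string`. Going through `Decimal` matters because `'{0:g}'` applied directly to a large float switches to exponent notation (for example `1.23457e+06`).

```python
from decimal import Decimal


def float_to_integer_string(value):
    """Return a whole-number float such as 5.0 as '5'; reject fractions."""
    number = Decimal(value)  # exact conversion of the float
    if number != number.to_integral_value():
        raise ValueError("not a whole number: {!r}".format(value))
    return "{0:g}".format(number)


small = float_to_integer_string(5.0)
large = float_to_integer_string(1234567.0)
```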
I hope these two lines help someone!
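Putting the pieces together, the remaining steps from the question (replacing characters and writing the csv) can be done in memory, without saving an .xlsx in between. This is a minimal Python 3 sketch using only the standard library. `replace_chars`, `rows_to_csv_text`, the sample mapping, and the sample rows are all made up for illustration; in the real script the rows would come from the xlrd sheet and the mapping from the replacement dictionaries.

```python
import csv
import io


def replace_chars(text, mapping):
    """Replace every character that has an entry in mapping; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


def rows_to_csv_text(rows, mapping):
    """Apply the character replacement to each cell and return csv text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([replace_chars(str(cell), mapping) for cell in row])
    return buffer.getvalue()


# Hypothetical replacement dictionary and already-converted cell values
sample_mapping = {"Ä": "AE", "Ö": "OE"}
sample_rows = [["ÄRGER", "5"], ["KÖLN", "01.01.2020"]]
csv_text = rows_to_csv_text(sample_rows, sample_mapping)
```

To store the result, `csv_text` could be written to a file opened with `encoding="utf-8"`, or the `csv.writer` could be pointed at such a file directly.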