Handle datatype issue while reading Excel file to pandas DataFrame

Question

I have an Excel (.xlsx) file with the following columns:

Location    Month       Desc            Position    Budget
EUR         1/1/2020    In Europe       Right       34%
AUS         1/1/2020    In Australia    Left        >22%

When I read this file into a pandas DataFrame, I run into a problem with the Budget column. I get the following error:

field Budget: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
Could not convert '>22%' with type str: tried to convert to double

This is the code I am trying:

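from pyspark.sql import SparkSession
# Types used by fileSchema below
from pyspark.sql.types import StructType, StructField, StringType, DateType
import pandas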

spark = SparkSession.builder.appName("Test").getOrCreate()

pdf = pandas.read_excel(parent_path+'file1.xlsx', sheet_name='Sheet1')

fileSchema = StructType([
  StructField("Location", StringType()),
  StructField("Month", DateType()),
  StructField("Desc", StringType()),
  StructField("Position", StringType()),
  StructField("Budget", StringType())])

pdf.fillna('')
df = spark.createDataFrame(pdf)

df.show()

I need to read multiple Excel files. How can I handle the datatype issue here? Any suggestions are welcome.

Answer 1

It looks like you can use a custom converter:

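import pandas as pd

# Strip the '>' and '%' characters, then turn the percentage into a fraction.
def bcvt(x):
    return float(x.replace('>','').replace('%',''))/100

# pd.read_excel accepts the same converters argument.
dfd = pd.read_csv(r'd:\jchtempnew\t1.csv', converters={'Budget': bcvt})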

dfd 

  Location     Month          Desc Position  Budget
0      EUR  1/1/2020     In Europe    Right    0.34
1      AUS  1/1/2020  In Australia     Left    0.22
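
One caveat when the source is a real .xlsx file rather than a CSV: percent-formatted cells probably arrive as numbers (for example 0.34) rather than as the text '34%', and a converter that calls .replace() on them would fail. The sketch below is a more defensive variant under that assumption. The name parse_budget and the set of prefixes it strips are hypothetical choices, not part of the original answer.

```python
import math

def parse_budget(cell):
    """Return a Budget cell as a fraction (0.34), or None if it is blank.

    Assumption: numeric cells are already fractions, and text cells look
    like '34%' or '>22%'.
    """
    if cell is None:
        return None
    if isinstance(cell, (int, float)):
        # Numeric cell: pass it through, mapping NaN to None.
        return None if math.isnan(cell) else float(cell)
    # Text cell: drop whitespace and a leading comparison sign.
    text = str(cell).strip().lstrip("<>=~ ")
    if not text:
        return None
    if text.endswith("%"):
        return float(text[:-1]) / 100
    return float(text)

print(parse_budget(">22%"), parse_budget("34%"), parse_budget(0.34))

# Possible use with the question's file:
# pdf = pandas.read_excel(parent_path + 'file1.xlsx', sheet_name='Sheet1',
#                         converters={'Budget': parse_budget})
```

Note that this discards the '>' sign, so ">22%" and "22%" become indistinguishable. If that distinction matters, keeping the column as text may be the better choice.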

(Updated following @user128029's suggestion)
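
If the Budget column should stay as text, as the StringType() entry in the question's schema suggests, an alternative is to make every cell a string before handing the frame to Spark. This is a rough sketch rather than a tested Spark pipeline. The helper name budget_as_text is hypothetical, the small DataFrame stands in for what pandas.read_excel is assumed to return, and numeric cells are assumed to be percent-formatted fractions.

```python
import pandas as pd

def budget_as_text(cell):
    """Render a Budget cell as text so the whole column has one type."""
    if isinstance(cell, str):
        return cell.strip()
    if pd.isna(cell):
        return None
    # Assumption: numeric cells are fractions, so 0.34 means 34%.
    return f"{cell * 100:g}%"

# Stand-in for the frame read from the sample sheet (mixed float and str).
pdf = pd.DataFrame({
    "Location": ["EUR", "AUS"],
    "Budget": [0.34, ">22%"],
})

pdf["Budget"] = pdf["Budget"].map(budget_as_text)
print(pdf["Budget"].tolist())

# With a single type per column, the frame can then be passed on:
# df = spark.createDataFrame(pdf)
```

For several Excel files, the same mapping can be applied to each frame in a loop before the frames are combined with pd.concat.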
