PySpark: How to judge column type of dataframe

Question

Suppose we have a DataFrame called df. I know I can use df.dtypes. However, I would prefer something similar to

type(123) == int # note here the int is not a string

I wonder whether there is something like:

type(df.select(<column_name>).collect()[0][1]) == IntegerType

Basically, I want to know how to get an object of a class such as IntegerType or StringType directly from the DataFrame and then test against it.

Thanks!

Answer 1

TL;DR Use external data types (plain Python types) to test values, and internal data types (DataType subclasses) to test the schema.

First of all, you should not use

type(123) == int

The correct way to check types in Python, one that also handles inheritance, is

isinstance(123, int)
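
To make the difference concrete, here is a minimal sketch using only built-in Python. The classes Base and Derived are hypothetical names introduced for illustration:

```python
# Hypothetical classes used only to illustrate inheritance.
class Base:
    pass

class Derived(Base):
    pass

obj = Derived()

# Exact-class comparison ignores inheritance.
exact_match = type(obj) == Base
# isinstance also accepts instances of subclasses.
inherit_match = isinstance(obj, Base)

# The same applies to built-ins: bool is a subclass of int.
bool_exact = type(True) == int
bool_inherit = isinstance(True, int)

print(exact_match, inherit_match, bool_exact, bool_inherit)
```

Here the type(...) == comparisons evaluate to False while the isinstance checks evaluate to True, which is why isinstance is usually the safer test.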

With that out of the way, let's talk about

Basically I want to know the way to directly get the object of the class like IntegerType, StringType from the dataframe and then judge it.

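That is not how it works. DataTypes describe the schema (the internal representation), not values. The external type is a plain Python object, so if the internal type is IntegerType, the external type is int, and so on, according to the rules defined in the Spark SQL Programming guide.

The only place where an instance of IntegerType (or of any other DataType) exists is your schema: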

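from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType

spark = SparkSession.builder.getOrCreate()

# Python ints are inferred as LongType, not IntegerType
df = spark.createDataFrame([(1, "foo")])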

isinstance(df.schema["_1"].dataType, LongType)
# True
isinstance(df.schema["_2"].dataType, StringType)
# True

_1, _2 = df.first()

isinstance(_1, int)
# True
isinstance(_2, str)
# True

Answer 2

How about trying:

df.printSchema()

This will print something like:

root
 |-- id: integer (nullable = true)
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: integer (nullable = true)
 |-- col4: date (nullable = true)
 |-- col5: long (nullable = true)

