Searching a DataFrame in polars

I'm trying to write a small Python script that reads a .parquet file with the following schema:

a b c d
0 x 2 y
2 1 x z

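The script takes the following arguments: an input file (-f), one or more columns to search (-c), and one or more search strings (-s).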

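It then searches the given columns for the given search strings and returns the full rows of the DataFrame that contain the given value in the given column.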

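My problem is how to write the search correctly. With the current implementation, if I try to search a column whose dtype is not utf8, I get the following error: RuntimeError: Any(SchemaMisMatch("Series dtype UInt64 != utf8"))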

The program is run like this: python ./pqtmgr.py -f './test.parquet' -c 'a' -s '2'

#!/usr/bin/python

# Imports
import polars
import argparse


### MAIN ###
# Main
def main():
    arguments = parse_arguments()

    dataframe = polars.read_parquet(arguments.files_input)

    dataframe = dataframe_search(arguments, dataframe)

    # Show the matching rows
    print(dataframe)


### MISC ###
# Search DataFrame and return a result DataFrame
def dataframe_search(arguments, dataframe) -> polars.DataFrame:
    dataframes = []

    for column in arguments.columns:
        for search in arguments.search:
            dataframes.append(
                dataframe.filter(
                    (polars.col(column).str.contains(search))
                )
            )

    return polars.concat(dataframes, rechunk=True, how="diagonal")

### ARGUMENTS ###
# Parse given arguments
def parse_arguments():
    parser = argparse.ArgumentParser(
        prog='pqtmgr.py'
    )

    # Add argument to take an input file
    parser.add_argument(
        '-f',
        '--file-input',
        dest='files_input',
        help='''
        Takes one filepath as input file which will be searched
        ''',
        required=True
    )

    # Add argument to take a list of columns to search
    parser.add_argument(
        '-c',
        '--columns',
        dest='columns',
        help='''
            Accepts one or multiple columns that will be searched
        ''',
        nargs='*',
        required=True
    )

    # Add argument to search the given strings
    parser.add_argument(
        '-s',
        '--search',
        dest='search',
        help='''
            Accepts one or more strings or regular expression that are searched for in the given columns
        ''',
        nargs='*'
    )

    return parser.parse_args()


# Execute Main
if __name__ == '__main__':
    main()

Assuming search is always a string, the simplest approach if you want to be able to search any column is to cast to Utf8 before using the str namespace. A short example:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["hello", "world", "everyone"]})
search = "hello"

df["b"].str.contains(search)  # this works
df["a"].str.contains(search)  # this fails, as "a" is not of type Utf8
df["a"].cast(pl.Utf8).str.contains(search)  # this works