Pandas str.contains doesn't work if the searched string contains the substring at the beginning of the string

Question

I'm using str.contains to find the rows where a column contains a specific string as a substring:

df[df['col_name'].str.contains('find_this')]

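This returns all the rows where 'find_this' appears somewhere in the string. However, in the rare but important case where the string in df['col_name'] starts with 'find_this', the row is not returned by the query above.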

str.contains() returns false where it should return true.

Any help would be greatly appreciated, thanks!

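Edit: I've added some example data as requested. Image of dataframe. I want to update the 'Eqvnt_id' column so that, for example, all rows whose 'Course_ID' column contains AAS 102 have the same 'Eqvnt_id' value.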

To do this, I need to be able to search the strings in 'Course_ID' for 'AAS 102' in order to find the appropriate rows. However, when I do this:

df[df['Course_ID'].str.contains('AAS 102')]

The row with 'AAS 102 (ENGL 102, JST 102, REL 102)' doesn't show up in the query!

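The dtypes are all object. I've tried mapping the values and casting them to the string type, but it had no effect on the success of the query.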

The data in the image can be found at https://github.com/isaachowen/Whosebugquestionfiles

Answer 1

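You can use pandas.Series.str.find() instead. It returns the index at which the string is found; if the match is at the beginning, the returned index is 0. If the string is not found, it returns -1.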

df[df['col_name'].str.find('find_this') != -1]

Let me know if this helps!
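As a quick sanity check, here is a minimal sketch comparing the two approaches on clean data. The sample strings are made up for illustration and are not from the asker's dataset. On strings without hidden characters, both str.contains and str.find match a substring at position 0, which suggests that the position of the match is not the root cause by itself.

```python
import pandas as pd

# Hypothetical sample data (not the asker's): one match at the start,
# one match in the middle, one non-match.
s = pd.Series(["find_this at start", "has find_this inside", "no match"])

# regex=False treats the pattern as a literal substring.
mask_contains = s.str.contains("find_this", regex=False)

# str.find returns the index of the first match, or -1 if there is none.
positions = s.str.find("find_this")
mask_find = positions != -1

print(positions.tolist())
print(mask_contains.tolist())
print(mask_find.tolist())
```

One caveat: if the column contains missing values, str.find yields NaN for them, and NaN != -1 evaluates to True, so those rows would need to be filtered out separately.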

Answer 2

TL;DR: try pandas.Series.str.normalize() with different Unicode forms until the problem is resolved. 'NFKC' worked for me.

The problem had to do with the format of the data in the column on which I was running...

df['column'].str.contains('substring')

Using the pandas.Series.str.normalize() function worked. Link here. Sometimes, under some circumstances that I can't deliberately recreate, the strings would have '\xa0' and '\n' attached at the beginning or the end of the string. This post helped me figure out how to handle that. Following the post, I for-looped over each string column and changed the Unicode form until I found one that worked: 'NFKC'.
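To illustrate, here is a minimal sketch of the normalization loop. The data is made up: I am assuming a non-breaking space ('\xa0') sits between 'AAS' and '102' and that a stray '\n' trails another value, which may not match exactly where the hidden characters were in the real file. The text_columns list is a hypothetical stand-in for "each string column".

```python
import pandas as pd

# Hypothetical data: '\xa0' (non-breaking space) looks like a normal space
# when printed, but it does not equal ' ' in a substring comparison.
df = pd.DataFrame(
    {"Course_ID": ["AAS\xa0102 (ENGL 102, JST 102, REL 102)", "MATH 101\n"]}
)

# Before normalization the literal search misses the first row.
before = df["Course_ID"].str.contains("AAS 102", regex=False)

# Hypothetical list of the string columns to clean.
text_columns = ["Course_ID"]
for col in text_columns:
    # NFKC maps compatibility characters such as '\xa0' to their plain
    # equivalents; strip() then drops leading/trailing whitespace like '\n'.
    df[col] = df[col].str.normalize("NFKC").str.strip()

after = df["Course_ID"].str.contains("AAS 102", regex=False)

print(before.tolist())
print(after.tolist())
```

Printing repr() of a suspect value (for example repr(df['Course_ID'].iloc[0]) before cleaning) is a simple way to reveal hidden characters like these.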

python

string

substring

contains

pandas