String pulled directly from source data seems to not match string in source data

Question

I have a string that will not evaluate as matching itself. I'm trying to do a simple subset based on one of 8 possible values in a column:

out <- df[df$`Var name` == "string",]

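I've gotten this to work many times with different strings, but for some reason this one fails. I've tried getting the exact string from the source in the four ways shown below (thinking there might be a character-encoding issue), with no luck. It fails even when I explicitly call a cell that I know contains the string and copy its output into the comparison: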

> df[i,j]
[1] "string"
df[i,j]=="string"  # pasted from above line

I don't understand how I can paste the exact output I was just given and have it not match.

## attempts to get exact string to paste into subset statement    
# from dput 
"IF APPLICABLE – Which of the following best characterizes the expectations with"

# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"

# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with

# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with

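I have no idea what is going on here. I am pulling the string directly from the data, and the comparison still fails to evaluate to TRUE. Is something happening in the background that I'm not seeing? Am I overlooking something really simple?

EDIT:

I subset the data a different way; here is the dput of the input and an actual example of what I'm doing: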

> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with", 
    `Item Response` = "It was required.", orgchar_group = "locale", 
    `Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475, 
    `Graphs note` = NA_character_, `Report note` = NA_character_, 
    `Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_, 
    statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem", 
"Item Response", "orgchar_group", "Org Characteristic", "N", 
"percent", "Graphs note", "Report note", "Other note", "subsig", 
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE

Answer 1

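It turned out to be a problem with a non-printable character after all. Thanks to the commenters who helped me figure this out by 1) suggesting it and 2) showing that the comparison worked for them.

I was able to solve this using insights from several related answers.

I used a grepl command (from @Tyler Rinker) to confirm that my string did contain a non-ASCII character, and a stringi command (from @hadley) to check how the string is encoded. I then removed the character with a base R solution from @Josh O'Brien. The culprit turned out to be the "hyphen", which is really a non-ASCII dash (–) rather than an ASCII hyphen (-).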

# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")  
[1] "IF APPLICABLE  Which of the following best characterizes the expectations with"

# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
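
The grepl check above only says that a non-ASCII character exists somewhere in the string. A minimal sketch for finding out which character it is and where it sits, using base R only, might look like this. The string `x` here is a shortened, hypothetical stand-in for the real cell, and it assumes the dash is an en dash (U+2013), which is what the pasted text appears to contain.

```r
# Hypothetical stand-in for temp[1,1]; "\u2013" is an en dash (assumed).
x <- "IF APPLICABLE \u2013 Which of the following"

# Unicode code point of every character in the string.
code_points <- utf8ToInt(x)

# Flag anything outside printable ASCII (space through tilde),
# the same range the "[^ -~]" pattern tests for.
non_ascii <- code_points < 32 | code_points > 126

# Position, character, and code point of each offending character.
offenders <- data.frame(
  position   = which(non_ascii),
  char       = intToUtf8(code_points[non_ascii], multiple = TRUE),
  code_point = sprintf("U+%04X", code_points[non_ascii])
)
print(offenders)
```

Running the same steps on the real value (`x <- temp[1,1]`) should show whether the dash is the only character involved.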

I still don't understand this well enough to explain why it happened, but it is fixed now.
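
One side effect of `iconv(..., sub="")` is visible in the output above: the dash is deleted outright, leaving two consecutive spaces. If keeping a readable separator matters, an alternative sketch is to replace the dash with an ASCII hyphen instead. This again assumes the character is an en dash (U+2013), and `x` is a shortened, hypothetical stand-in for the real column values.

```r
# Hypothetical stand-in for the column values; "\u2013" is an en dash (assumed).
x <- "IF APPLICABLE \u2013 Which of the following"

# Swap the en dash for a plain ASCII hyphen instead of dropping it.
fixed <- gsub("\u2013", "-", x, fixed = TRUE)

# The result is pure ASCII and now matches a literal typed with a hyphen.
print(grepl("[^ -~]", fixed))
print(fixed == "IF APPLICABLE - Which of the following")
```

Applied to the data frame, this would be `df$`Var name` <- gsub("\u2013", "-", df$`Var name`, fixed = TRUE)`, after which the subset string should be typed with an ordinary hyphen.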
