尝试清理 R 数据帧中的字符向量时遇到问题(UTF-8 编码问题)
Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)
我在在线手动提取数据后清理数据集时遇到一些问题 - 我猜这些是编码问题。我在尝试删除 "Athlete" 列 cels 中的 "U+00A0" 以及运算符括号时遇到问题。我查了下对应的UTF-8编码,是"No-Break-Space"的。我也不确定如何替换其他 UTF-8 字符以使名称清晰易读 - 例如让 U+008A 显示为 Š.
数据子集
head2007decathlon <- structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
这是我迄今为止尝试过但没有成功的方法:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0",replacement="",x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>",replacement="",x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete<- enc2utf8(head2007decathlon$Athlete)
以下将删除不间断 space。
head2007decathlon$Athlete <- gsub(pattern="<U\+00A0>",replacement="",x=head2007decathlon$Athlete)
不确定如何转换其他字符。一个问题可能是代码不完全是 R 视为 UTF-8 的格式。
一个例子:
iconv('\u008A', from="UTF-8", to="LATIN1")
这似乎有效果,与尝试转换 U+008A
相反。虽然
输出是:
[1] "\x8a"
不是你想要的角色。希望这能有所帮助。
我在在线手动提取数据后清理数据集时遇到一些问题 - 我猜这些是编码问题。我在尝试删除 "Athlete" 列 cels 中的 "U+00A0" 以及运算符括号时遇到问题。我查了下对应的UTF-8编码,是"No-Break-Space"的。我也不确定如何替换其他 UTF-8 字符以使名称清晰易读 - 例如让 U+008A 显示为 Š.
数据子集
head2007decathlon <- structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
这是我迄今为止尝试过但没有成功的方法:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0",replacement="",x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>",replacement="",x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete<- enc2utf8(head2007decathlon$Athlete)
以下将删除不间断 space。
head2007decathlon$Athlete <- gsub(pattern="<U\+00A0>",replacement="",x=head2007decathlon$Athlete)
不确定如何转换其他字符。一个问题可能是代码不完全是 R 视为 UTF-8 的格式。
一个例子:
iconv('\u008A', from="UTF-8", to="LATIN1")
这似乎有效果,与尝试转换 U+008A
相反。虽然
输出是:
[1] "\x8a"
不是你想要的角色。希望这能有所帮助。