R: Changing the encoding of a column in a data frame?
I am trying to change the encoding of a column in a data frame.
stri_enc_mark(data_updated$text)
# [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
# [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
# [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
# [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"
When I try to convert it, no error is thrown, but the vector is unchanged:
d <- enc2utf8(data_updated$text)
stri_enc_mark(d)
# [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
# [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
# [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
# [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"
Any suggestions?
I am on Windows 7, 32-bit. Adding a data snippet.
> Encoding(data_updated$text[1:35])
[1] "UTF-8" "unknown" "unknown" "UTF-8" "unknown" "unknown" "UTF-8"
[8] "UTF-8" "UTF-8" "unknown" "unknown" "UTF-8" "unknown" "UTF-8"
[15] "unknown" "UTF-8" "unknown" "UTF-8" "unknown" "UTF-8" "unknown"
[22] "UTF-8" "unknown" "UTF-8" "UTF-8" "unknown" "unknown" "unknown"
[29] "unknown" "UTF-8" "unknown" "unknown" "unknown" "UTF-8" "UTF-8"
The data looks like this.
> data_updated$text[1:35]
[1] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
[2] "Deal Talks for Here Mapping Service Expose Reliance on Location Data, via @nytimes #mapping #dilemma http://t.co/wGdiS5OlRq"
[3] "http://t.co/UZIyX1Rk7W The popping linksexploaded!! http://t.co/KpNntm1dH7 :) http://t.co/oku91uVxZ8"
[4] "RT @davidsunaria90: Wtch LIVE Mjlis Now\n http://t.co/GXNhe3eY7Y\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/YewOVcz8bb\n…"
[5] "Reliance Jio Infocomm: Indian carrier raises 0 million loan for 4G rollout http://t.co/B2aWlkmwXz"
[6] "RT @SurjeetInsan: Majlis started in Sirsa Ashram.\nLive @ http://t.co/PR6W5tzZes\nIVR Airtel 55252\nReliance 56300403\n\n#MSGPlsSaveTheEarth"
[7] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Techno… http://t.co/kyxTYIxks5"
[8] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
[9] "RT @jaameinsan: Watch LIVE Majlis Now\n http://t.co/nPQegnLXPa\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/txXMtw3zFP\n#M…"
[10] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Technology"
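These are tweets, and I think the "http://" links are what indicate the encoding here, since they contain expressions such as "wGdiS5OlRq". For the analysis I removed them with a regex, but to store the raw data in a database I need the tweets intact. MongoDB has no problem with them, but the RDBMS raises errors.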
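For what it is worth, the unchanged marks are expected rather than a failure. R never attaches a declared encoding to pure-ASCII strings, because their bytes are identical in every supported encoding, so Encoding() reports "unknown" (and stri_enc_mark() reports "ASCII") for them both before and after enc2utf8(). A minimal sketch using a made-up two-element vector, not the tweet data:

```r
# Hypothetical sample: one pure-ASCII string and one containing U+2026 (ellipsis)
x <- c("plain ascii", "ellipsis \u2026")

# The ASCII element has no declared encoding; the other is marked UTF-8
Encoding(x)

# enc2utf8() converts where needed but never marks ASCII strings
y <- enc2utf8(x)
identical(Encoding(x), Encoding(y))

# Both elements are nevertheless valid UTF-8 byte sequences
all(validUTF8(y))
```

So a mix of "ASCII"/"unknown" and "UTF-8" marks does not by itself mean the column is inconsistently encoded.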
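It seems we can use the iconv() function to convert the encoding after converting the vector to a factor and then back to a character vector. A bit weird, to be honest.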
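A rough sketch of that approach, assuming the comment refers to base R's iconv(). The sample string and the source encoding "latin1" are assumptions for illustration; the real encoding of the tweets is not stated in the question.

```r
# Hypothetical input: the bytes of "café" in Latin-1, declared as such
x <- "caf\xe9"
Encoding(x) <- "latin1"

# Round trip through a factor, as described in the comment above
f <- factor(x)
chr <- as.character(f)

# Convert the bytes from the assumed source encoding to UTF-8
out <- iconv(chr, from = "latin1", to = "UTF-8")

Encoding(out)
validUTF8(out)
```

If the column is already a character vector, calling iconv() on it directly should be enough; as.character() only matters when the column was read in as a factor.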
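In case anyone is still stuck: I used Encoding().
for (col in colnames(mydataframe)) {
  # Encoding<- only accepts character vectors, so skip other column types
  if (is.character(mydataframe[[col]])) Encoding(mydataframe[[col]]) <- "UTF-8"
}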
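One caveat worth checking: assigning to Encoding() only changes the declared mark and leaves the bytes alone, so it helps only when the bytes really are UTF-8 already. A small sketch with a hypothetical Latin-1 string shows the difference between re-marking and converting:

```r
# Hypothetical input: Latin-1 bytes for "café" (the last byte is 0xe9)
x <- "caf\xe9"

# Re-marking declares UTF-8 but does not touch the bytes
marked <- x
Encoding(marked) <- "UTF-8"
charToRaw(marked)   # same four bytes as before
validUTF8(marked)   # the lone 0xe9 byte is not valid UTF-8

# Declaring the true source encoding and then converting rewrites the bytes
converted <- x
Encoding(converted) <- "latin1"
converted <- enc2utf8(converted)
validUTF8(converted)
```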
I found stringi::stri_enc_toascii() very useful; it solved my problem.
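A minimal sketch of what that looks like, assuming the stringi package is installed; the sample vector is made up. Note that the conversion is lossy: non-ASCII characters are replaced by the ASCII SUBSTITUTE control character (\x1a) rather than transliterated, so it suits a database that only accepts ASCII but not one that should keep the original text.

```r
library(stringi)

# Hypothetical sample: one pure-ASCII string and one containing U+2026 (ellipsis)
x <- c("plain ascii", "ellipsis \u2026")

# Replace every non-ASCII character with the SUBSTITUTE character
ascii <- stri_enc_toascii(x)
stri_enc_mark(ascii)

# Optionally swap the substitute character for something printable
cleaned <- gsub("\x1a", "...", ascii, fixed = TRUE)
cleaned
```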
I also posted a case about this.