How do I remove Japanese characters?

Question

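I have some survey data that contains Japanese characters. Some survey questions and answers (multiple choice) are given in both English and Japanese, e.g. "Very rarely かなりまれ". In these cases it is helpful to remove the duplicated Japanese. How can this be done? I only want to remove the Japanese, not any other special characters.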

Answer 1

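The simplest approach is to keep only ASCII characters. This can be done by replacing every non-ASCII character with an empty string, e.g. str_replace_all("æøå かな", "[^\\x00-\\x7F]", ""), and then removing any leftover whitespace. However, this does not work if you want to keep special characters in general. In that case, you probably want to remove only the Japanese symbols (including the Chinese-derived kanji). This can be done by matching Unicode block ranges. I found the Japanese-related blocks at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml, but Wikipedia lists them as well, e.g. https://en.wikipedia.org/wiki/Katakana_(Unicode_block).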

Here is a ready-made function (requires tidyverse and assertthat):

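library(tidyverse)   # stringr functions and the %>% pipe
library(magrittr)    # equals(), used in the tests
library(assertthat)  # assert_that(), used in the tests

str_rm_jap <- function(x) {
  # Replace the Japanese Unicode blocks with nothing, then clean up
  # any double whitespace this leaves behind.
  # Reference: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
  x %>%
    # Japanese-style punctuation
    str_replace_all("[\u3000-\u303F]", "") %>%
    # Katakana
    str_replace_all("[\u30A0-\u30FF]", "") %>%
    # Hiragana
    str_replace_all("[\u3040-\u309F]", "") %>%
    # Kanji (common and uncommon CJK unified ideographs)
    str_replace_all("[\u4E00-\u9FAF]", "") %>%
    # Collapse runs of spaces and trim the ends
    str_replace_all("  +", " ") %>%
    str_trim()
}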

#tests
assert_that(
  #positive tests
  "Very rarely かなりまれ" %>% str_rm_jap() %>% equals("Very rarely"),
  "Comments ノートとコメント" %>% str_rm_jap() %>% equals("Comments"),

  #negative tests
  "Danish ok! ÆØÅ" %>% str_rm_jap() %>% equals("Danish ok! ÆØÅ")
)

Answer 2

You can use this to remove hiragana and katakana:

replace(/[\u30a0-\u30ff\u3040-\u309f]/g, '')
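The fragment above has no receiver string, so here is a minimal sketch of how it might be applied to a whole value. The helper name removeKana and the whitespace cleanup are assumptions added for illustration (the cleanup mirrors what Answer 1 does); only the character class comes from this answer.

```javascript
// Hypothetical helper; the character class is the one given in this answer.
function removeKana(text) {
  return text
    // Strip katakana (U+30A0-U+30FF) and hiragana (U+3040-U+309F).
    .replace(/[\u30a0-\u30ff\u3040-\u309f]/g, '')
    // Collapse runs of spaces left behind, then trim both ends.
    .replace(/ {2,}/g, ' ')
    .trim();
}

console.log(removeKana('Very rarely かなりまれ')); // prints: Very rarely
console.log(removeKana('Danish ok! ÆØÅ'));        // prints: Danish ok! ÆØÅ
```

Note that this class leaves kanji untouched. Adding the \u4e00-\u9faf range used in Answer 1 would be one way to cover them as well.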

See also: JavaScript to replace Chinese characters

string

r

cjk