使用 Python 过滤掉文本中的所有非汉字字符 3
Filtering out all non-kanji characters in a text with Python 3
我有一个文本,其中包含拉丁字母和日文字符(平假名、片假名和汉字)。
我想过滤掉所有拉丁字符、平假名和片假名,但我不确定如何优雅地执行此操作。
我的直接方法是过滤掉每个拉丁字母表中的每个字母 hiragana/katakana,但我相信有更好的方法。
我猜我必须使用正则表达式,但我不太确定如何去做。字母是否以某种方式分类为罗马字母、日语、中文等?
如果是的话,我能以某种方式使用它吗?
这里是一些示例文本:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
程序应该只return汉字(汉字)像这样:
`私、人,方,皆`
感谢 reddit 上的 Olsgaarddk,我找到了答案。
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
katakana_half_width = r'[⦅-゚]'
alphanum_full = r'[!-~]'
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
''' extracts and returns all texts from a unicode block from string argument.
Note that you must use the unicode blocks defined above, or patterns of similar form '''
return re.findall( unicode_block, string)
def remove_unicode_block(unicode_block, string):
''' removes all chaacters from a unicode block and returns all remaining texts from string argument.
Note that you must use the unicode blocks defined above, or patterns of similar form '''
return re.sub( unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))
我有一个文本,其中包含拉丁字母和日文字符(平假名、片假名和汉字)。
我想过滤掉所有拉丁字符、平假名和片假名,但我不确定如何优雅地执行此操作。 我的直接方法是过滤掉每个拉丁字母表中的每个字母 hiragana/katakana,但我相信有更好的方法。
我猜我必须使用正则表达式,但我不太确定如何去做。字母是否以某种方式分类为罗马字母、日语、中文等? 如果是的话,我能以某种方式使用它吗?
这里是一些示例文本:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
程序应该只return汉字(汉字)像这样:
`私、人,方,皆`
感谢 reddit 上的 Olsgaarddk,我找到了答案。
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
katakana_half_width = r'[⦅-゚]'
alphanum_full = r'[!-~]'
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
''' extracts and returns all texts from a unicode block from string argument.
Note that you must use the unicode blocks defined above, or patterns of similar form '''
return re.findall( unicode_block, string)
def remove_unicode_block(unicode_block, string):
''' removes all chaacters from a unicode block and returns all remaining texts from string argument.
Note that you must use the unicode blocks defined above, or patterns of similar form '''
return re.sub( unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))