How to tokenize all currency symbols using regex in Python?
I want to tokenize all currency symbols using NLTK's regular-expression tokenizer.
For example, these are my sentences:
The price of it is $5.00.
The price of it is RM5.00.
The price of it is €5.00.
I used this regex pattern:
pattern = r'''(['()""\w]+|\.+|\?+|\,+|\!+|\$?\d+(\.\d+)?%?)'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)
But as we can see, it only handles $.
I tried using \p{Sc} as explained in "What is regex for currency symbol?", but it still doesn't work for me.
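One likely reason, assuming the pattern was compiled with Python's built-in re module: re does not implement Unicode property escapes such as \p{Sc}, and Python 3 rejects them as a bad escape. The sketch below checks that, and shows the standard-library unicodedata module reporting the same "Sc" (Symbol, currency) category one character at a time. The helper names supports_property_escape and is_currency_symbol are made up for illustration.

```python
import re
import unicodedata


def supports_property_escape():
    """Return True if the built-in re module accepts the \\p{Sc} escape."""
    try:
        re.compile(r"\p{Sc}")
    except re.error:
        return False
    return True


def is_currency_symbol(ch):
    """Return True if ch is in the Unicode category Sc (Symbol, currency)."""
    return unicodedata.category(ch) == "Sc"


print(supports_property_escape())                   # False
print([is_currency_symbol(c) for c in "$€R"])       # [True, True, False]
```

Letters such as the "RM" in RM5.00 are ordinary uppercase letters, not currency symbols, so no Unicode category will pick them out.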
Try padding the numbers with spaces so they are separated from the currency symbols, then tokenize:
>>> import re
>>> from nltk import word_tokenize
>>> sents = """The price of it is $5.00.
... The price of it is RM5.00.
... The price of it is €5.00.""".split('\n')
>>>
>>> for sent in sents:
...     # Find every number, then surround it with spaces so that it
...     # splits off from the currency symbol and the final period.
...     numbers_in_sent = re.findall(r"[-+]?\d+\.?\d*", sent)
...     for num in numbers_in_sent:
...         sent = sent.replace(num, ' ' + num + ' ')
...     print(word_tokenize(sent))
...
['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
['The', 'price', 'of', 'it', 'is', '€', '5.00', '.']
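If a single regex is preferred over padding, one option is to build the currency character class from unicodedata and list it ahead of the other alternatives. This is only a sketch and not the method used above; it relies on re and unicodedata alone, so it runs without NLTK. The names CURRENCY_CHARS, ALPHA_CODES, TOKEN_RE and tokenize are illustrative, and ALPHA_CODES is an assumed hand-written list, since letter codes such as RM are not in the Sc category.

```python
import re
import sys
import unicodedata

# Every code point whose Unicode general category is "Sc" (Symbol, currency),
# escaped so that it is safe inside a character class.
CURRENCY_CHARS = "".join(
    re.escape(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)

# Assumption: alphabetic currency codes must be listed by hand.
ALPHA_CODES = ["RM"]

# Order matters: earlier alternatives win, so currency comes before words.
TOKEN_RE = re.compile("|".join([
    "(?:" + "|".join(map(re.escape, ALPHA_CODES)) + r")(?=\d)",  # RM in RM5.00
    "[" + CURRENCY_CHARS + "]",                                  # $, €, £, ...
    r"\d+(?:\.\d+)?%?",                                          # 5, 5.00, 5%
    r"\w+",                                                      # words
    r"[^\w\s]",                                                  # punctuation
]))


def tokenize(sentence):
    """Split a sentence into tokens, keeping currency markers separate."""
    return TOKEN_RE.findall(sentence)


for sent in ["The price of it is $5.00.",
             "The price of it is RM5.00.",
             "The price of it is €5.00."]:
    print(tokenize(sent))
```

All groups are non-capturing on purpose: with capturing groups, re.findall returns the group contents instead of the whole match.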