Tokenizing words while preserving certain words containing arithmetic and logical operators in Python 3?
When tokenizing multiple sentences from a large corpus, I need to preserve certain words in their original form, such as .Net, C#, and C++. I also want to remove punctuation (.,!_-()=*&^%$@~ etc.), but I need to keep words like .net, .htaccess, .htpassword, and c++.
I have tried nltk.word_tokenize and nltk.regexp_tokenize, but I am not getting the expected output. Please help me solve this.
Code:
import nltk
from nltk import regexp_tokenize
from nltk.corpus import stopwords

def pre_data():
    tokenized_sentences = nltk.sent_tokenize(tokenized_raw_data)
    sw0 = (stopwords.words('english'))
    sw1 = ["i.e", "dxint", "hrangle", "idoteq", "devs", "zero"]
    sw = sw0 + sw1
    tokens = [[word for word in regexp_tokenize(word, pattern=r"\s|\d|[^.+#\w a-z]", gaps=True)] for word in tokenized_sentences]
    print(tokens)

pre_data()
tokenized_raw_data is a plain text file. It contains multiple whitespace-separated sentences made up of words such as .blog, .net, c++, c#, asp.net, and .htaccess.
Sample:
['.blog is a generic top-level domain intended for use by blogs.',
 'C# is a general-purpose, multi-paradigm programming language.',
 'C++ is object-oriented programming language.']
This solution covers the given examples and preserves words such as C++, C#, and asp.net while removing normal punctuation.
import nltk

paragraph = (
    '.blog is a generic top-level domain intended for use by blogs. '
    'C# is a general-purpose, multi-paradigm programming language. '
    'C++ is object-oriented programming language. '
    'asp.net is something very strange. '
    'The most fascinating language is c#. '
    '.htaccess makes my day!'
)

def pre_data(raw_data):
    tokenized_sentences = nltk.sent_tokenize(raw_data)
    # \w*\.?\w+ keeps tokens such as '.blog' and 'asp.net';
    # [#+]* keeps the trailing operators of 'c#' and 'c++'.
    tokens = [nltk.regexp_tokenize(sentence, pattern=r'\w*\.?\w+[#+]*')
              for sentence in tokenized_sentences]
    return tokens

tokenized_data = pre_data(paragraph)
print(tokenized_data)
Output:
[['.blog', 'is', 'a', 'generic', 'top', 'level', 'domain', 'intended', 'for', 'use', 'by', 'blogs'],
['C#', 'is', 'a', 'general', 'purpose', 'multi', 'paradigm', 'programming', 'language'],
['C++', 'is', 'object', 'oriented', 'programming', 'language'],
['asp.net', 'is', 'something', 'very', 'strange'],
['The', 'most', 'fascinating', 'language', 'is', 'c#'],
['.htaccess', 'makes', 'my', 'day']]
However, this simple regex may not work for all the technical terms in your text, so please provide a complete example of your data for a more general solution.
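In the meantime, if you cannot enumerate every edge case in one pattern, a common workaround is to keep an explicit whitelist of protected terms and fall back to a plain word pattern for everything else. Here is a minimal sketch using plain re, assuming a hypothetical PROTECTED list that you would fill with your own terms:

import re
import nltk

# Hypothetical whitelist of terms to keep intact; extend it for your corpus.
PROTECTED = ['.htpassword', '.htaccess', 'asp.net', '.blog', '.net', 'c++', 'c#']

# One alternation: protected terms first (longest first, so '.htaccess'
# wins over shorter prefixes such as '.net'), then \w+ as the fallback.
protected_alt = '|'.join(re.escape(t) for t in sorted(PROTECTED, key=len, reverse=True))
PATTERN = r'(?:%s)|\w+' % protected_alt

def tokenize(raw_data):
    sentences = nltk.sent_tokenize(raw_data)
    # re.IGNORECASE lets 'C++' and 'ASP.NET' match the lowercase whitelist.
    return [re.findall(PATTERN, s, flags=re.IGNORECASE) for s in sentences]

print(tokenize('.htpassword files guard C++ and ASP.NET projects!'))
# [['.htpassword', 'files', 'guard', 'C++', 'and', 'ASP.NET', 'projects']]

This trades coverage for explicitness: nothing outside the whitelist can be mangled by the generic pattern, but every new technical term has to be added by hand.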