Lemmatizing txt file and replacing only lemmatized words
Can't figure out how to lemmatize the words in a txt file. I've got the words listed, but I'm not sure how to lemmatize them after the fact.
Here's what I have:
import nltk, re
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

def lemfile():
    f = open('1865-Lincoln.txt', 'r')
    text = f.read().lower()
    f.close()
    text = re.sub(r"[^a-z ']+", " ", text)
    words = list(text.split())
Initialise a WordNetLemmatizer object and lemmatize each word in your lines. You can perform in-place file I/O with the fileinput module.
import fileinput

lemmatizer = WordNetLemmatizer()
for line in fileinput.input('1865-Lincoln.txt', inplace=True, backup='.bak'):
    line = ' '.join(
        [lemmatizer.lemmatize(w) for w in line.rstrip().split()]
    )
    # overwrites current `line` in file
    print(line)
fileinput.input redirects stdout to the file being opened while it is in use.
You can also try a wrapper around NLTK's WordNetLemmatizer in the pywsd package, specifically https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L129
Install:
pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd
Code:
>>> from pywsd.utils import lemmatize_sentence
>>> lemmatize_sentence('These are foo bar sentences.')
['these', 'be', 'foo', 'bar', 'sentence', '.']
>>> lemmatize_sentence('These are foo bar sentences running.')
['these', 'be', 'foo', 'bar', 'sentence', 'run', '.']
Specifically for your problem:
from __future__ import print_function
from pywsd.utils import lemmatize_sentence

with open('file.txt') as fin, open('outputfile.txt', 'w') as fout:
    for line in fin:
        print(' '.join(lemmatize_sentence(line.strip())), file=fout, end='\n')
Lemmatizing a txt file and replacing only the lemmatized words can be done like this --
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from pywsd.utils import lemmatize_sentence

lmm = WordNetLemmatizer()
ps = PorterStemmer()
new_data = []

with open('/home/rahul/Desktop/align.txt', 'r') as f:
    f1 = f.read()
    f2 = f1.split()

en_stops = set(stopwords.words('english'))
hu_stops = set(stopwords.words('hungarian'))
all_words = f2
punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~[<p>]'''
# if lemmatization of one string is required then uncomment below line
# data = 'this is coming rahul schooling met happiness making'

for line in all_words:
    new_data = ' '.join(lemmatize_sentence(line))
    print(new_data)
PS - Adapt it as per your need.
Hope this helps!!!