Creating a dictionary for each word in a file and counting the frequency of words that follow it
I'm trying to solve a puzzle and I'm lost.
Here's what I'm supposed to do:
INPUT: file
OUTPUT: dictionary
Return a dictionary whose keys are all the words in the file (broken by
whitespace). The value for each word is a dictionary containing each word
that can follow the key and a count for the number of times it follows it.
You should lowercase everything.
Use strip and string.punctuation to strip the punctuation from the words.
Example:
>>> #example.txt is a file containing: "The cat chased the dog."
>>> with open('../data/example.txt') as f:
... word_counts(f)
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}
Here's what I have so far, which at least pulls out the right words:
def word_counts(f):
    i = 0
    orgwordlist = f.read().split()  # read the file contents before splitting
    for word in orgwordlist:
        if i < len(orgwordlist) - 1:
            print(orgwordlist[i])
            print(orgwordlist[i + 1])
        i += 1

with open('../data/example.txt') as f:
    word_counts(f)
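On the example file this prints each word followed by the word after it:
The
cat
cat
chased
chased
the
the
dog.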
I think I need to use the .count method somehow and eventually zip some dictionaries together, but I'm not sure how to count the second word for each first word.
I know I'm nowhere near solving it, but I'm trying to take it one step at a time. Any help is appreciated, even just a hint pointing me in the right direction.
We can solve this in two passes:
- in the first pass, we construct a Counter and count the tuples of two consecutive words using zip(..); and
- then we put that Counter into a dictionary.
This results in the following code:
import string
from collections import Counter, defaultdict

def word_counts(f):
    # lowercase, split on whitespace, and strip punctuation as the spec requires
    st = [w.strip(string.punctuation) for w in f.read().lower().split()]
    ctr = Counter(zip(st, st[1:]))  # count pairs of consecutive words
    dc = defaultdict(dict)
    for (k1, k2), v in ctr.items():
        dc[k1][k2] = v
    return dict(dc)
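A quick check against the question's example (assuming example.txt contains "The cat chased the dog."; dict key order may differ):
>>> with open('../data/example.txt') as f:
...     word_counts(f)
{'the': {'cat': 1, 'dog': 1}, 'cat': {'chased': 1}, 'chased': {'the': 1}}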
First of all, that's one brave cat, chasing a dog! Second, it's a bit tricky, since this isn't the kind of parsing we deal with every day. Here's the code:
k = "The cat chased the dog."
sp = k.split()
res = {}
prev = ''
for w in sp:
word = w.lower().replace('.', '')
if prev in res:
if word.lower() in res[prev]:
res[prev][word] += 1
else:
res[prev][word] = 1
elif not prev == '':
res[prev] = {word: 1}
prev = word
print res
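Running this prints (dict key order may vary with Python version):
{'the': {'cat': 1, 'dog': 1}, 'cat': {'chased': 1}, 'chased': {'the': 1}}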
You could:
- create the list of stripped words;
- create the word pairs with zip(list_, list_[1:]) or any way of iterating by pairs;
- create a dict with the first word of each pair as key and the list of the words that follow it as value;
- count the words in each list.
Like this:
from collections import Counter

s = "The cat chased the dog."
li = [w.lower().strip('.,') for w in s.split()]    # list of the words
di = {}
for a, b in zip(li, li[1:]):                       # words by pairs
    di.setdefault(a, []).append(b)                 # list of the words following the first
di = {k: dict(Counter(v)) for k, v in di.items()}  # count the words
>>> di
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}
If you have a file, just read the file into a string and proceed.
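For example (a minimal sketch, assuming the example file from the question):

with open('../data/example.txt') as f:
    s = f.read()
# then build the word list exactly as before
li = [w.lower().strip('.,') for w in s.split()]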
Alternatively, you could:
- keep the first two steps the same;
- use a defaultdict with Counter as the factory.
Like this:
from collections import Counter, defaultdict

li = [w.lower().strip('.,') for w in s.split()]
dd = defaultdict(Counter)
for a, b in zip(li, li[1:]):
    dd[a][b] += 1
>>> dict(dd)
{'the': Counter({'dog': 1, 'cat': 1}), 'chased': Counter({'the': 1}), 'cat': Counter({'chased': 1})}
Or,
>>> {k:dict(v) for k,v in dd.items()}
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}
We can do it in a single pass:
- use a defaultdict as the counter;
- iterate over the bigrams, counting in place.
So... for brevity, we'll skip the normalizing and cleaning:
>>> from collections import defaultdict
>>> counter = defaultdict(lambda: defaultdict(int))
>>> s = 'the dog chased the cat'
>>> tokens = s.split()
>>> from itertools import islice
>>> for a, b in zip(tokens, islice(tokens, 1, None)):
... counter[a][b] += 1
...
>>> counter
defaultdict(<function <lambda> at 0x102078950>, {'the': defaultdict(<class 'int'>, {'cat': 1, 'dog': 1}), 'dog': defaultdict(<class 'int'>, {'chased': 1}), 'chased': defaultdict(<class 'int'>, {'the': 1})})
And for more readable output:
>>> {k:dict(v) for k,v in counter.items()}
{'the': {'cat': 1, 'dog': 1}, 'dog': {'chased': 1}, 'chased': {'the': 1}}
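The assignment also asks for lowercasing and stripping with string.punctuation; a sketch of the same one-pass loop with that normalization added, assuming the example file from the question:
>>> import string
>>> from collections import defaultdict
>>> counter = defaultdict(lambda: defaultdict(int))
>>> with open('../data/example.txt') as f:
...     tokens = [w.strip(string.punctuation) for w in f.read().lower().split()]
...
>>> for a, b in zip(tokens, tokens[1:]):
...     counter[a][b] += 1
...
>>> {k: dict(v) for k, v in counter.items()}
{'the': {'cat': 1, 'dog': 1}, 'cat': {'chased': 1}, 'chased': {'the': 1}}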
Here is a one-pass solution that doesn't need to import defaultdict. It also strips the punctuation. I tried to optimize it for large files or for opening a file repeatedly:
from itertools import islice

class defaultdictint(dict):
    def __missing__(self, k):
        r = self[k] = 0
        return r

class defaultdictdict(dict):
    def __missing__(self, k):
        r = self[k] = defaultdictint()
        return r

# characters to keep; everything else (i.e. punctuation) is filtered out
keep = set('1234567890abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')

def count_words(file):
    d = defaultdictdict()
    with open(file, "r") as f:
        for line in f:
            line = ''.join(filter(keep.__contains__, line)).strip().lower().split()
            for one, two in zip(line, islice(line, 1, None)):
                d[one][two] += 1
    return d

print(count_words("example.txt"))
will output:
{'chased': {'the': 1}, 'cat': {'chased': 1}, 'the': {'dog': 1, 'cat': 1}}