Returning list of words counted on different list

Question

Good afternoon everyone,

Today I was asked to write the following function:

def compareurl(url1,url2,enc,n)

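This function compares two URLs and returns a list containing the following:

[word,occ_in_url1,occ_in_url2]

where:

word ---> a word of length n

occ_in_url1 ---> the number of times the word occurs in url1

occ_in_url2 ---> the number of times the word occurs in url2

So I started writing the function, and this is what I have so far: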

def compare_url(url1,url2,enc,n):
    from urllib.request import urlopen
    with urlopen('url1') as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode('enc')
    with urlopen('url2') as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode('enc')
    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()
    import string
    all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
    all_lower2nopunctuation = "".join(l for l in all_lower2 if l not in string.punctuation)
    for word1 in all_lower1nopunctuation:
        if len(word1) == k:
            all_lower1nopunctuation.count(word1)
    for word2 in all_lower2nopunctuation:
        if len(word2) == k:
            all_lower2opunctuation.count(word2)
    return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
    return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))

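But this code does not work the way I expected; in fact, it does not work at all.

I would also like to:

sort the returned list in decreasing order (starting with the word that occurs most often)
if two words occur the same number of times, return them in alphabetical order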

Answer 1

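Your code has a few typos (watch out for these in the future), but there are also some Python problems (or things that could be improved).

First, your imports should come at the top of the file: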

from urllib.request import urlopen
import string

You should call urlopen with a string, which is what you are doing, but the string you pass is 'url1' rather than 'http://...'. Do not put variable names inside quotes:

with urlopen(url1) as f1: #remove quotes
    readpage1 = f1.read()
    decodepage1 = readpage1.decode(enc) #remove quotes
with urlopen(url2) as f2: #remove quotes
    readpage2 = f2.read()
    decodepage2 = readpage2.decode(enc) #remove quotes
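
The same variable-versus-quoted-name mistake can be reproduced without any network access. The following is a minimal sketch: raw_page is a hypothetical stand-in for the bytes that f1.read() would return, and is not part of the original code.

```python
# Hypothetical stand-in for the bytes that f1.read() would return.
raw_page = "café au lait".encode("utf-8")
enc = "utf-8"

# Passing the variable: the bytes are decoded with the encoding it names.
decoded = raw_page.decode(enc)

# Passing the literal string 'enc': Python looks for an encoding that is
# literally called "enc", which does not exist, so LookupError is raised.
try:
    raw_page.decode("enc")
    quoted_name_failed = False
except LookupError:
    quoted_name_failed = True

print(decoded)
print(quoted_name_failed)
```

The quoted version fails for the same reason urlopen('url1') does: the function receives the four-character text, not the value stored in the variable.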

You need to improve the initialization of all_lower1nopunctuation. You are turning stackoverflow.com into stackoverflowcom, when it should actually become stackoverflow com.

#all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
#the if statement should be after 'l' and before 'for'
#you should include 'else' to replace the punctuation with a space
all_lower1nopunctuation = ''.join(l if l not in string.punctuation
                                  else ' ' for l in all_lower1)
all_lower2nopunctuation = ''.join(l if l not in string.punctuation
                                  else ' ' for l in all_lower2)
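
To see the difference between dropping punctuation and replacing it with a space, here is a small sketch on a made-up sentence (the sample text and variable names are assumptions for illustration). It also shows str.translate as one possible alternative that gives the same result.

```python
import string

text = "example.com is great!"

# Dropping punctuation glues neighbouring words together.
dropped = "".join(c for c in text if c not in string.punctuation)

# Replacing punctuation with a space keeps the words separate.
spaced = "".join(c if c not in string.punctuation else " " for c in text)

# Same result using a translation table that maps each punctuation
# character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
translated = text.translate(table)

print(dropped)         # examplecom is great
print(spaced.split())  # ['example', 'com', 'is', 'great']
```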

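Merge the two for loops into one. Also add the words you find to a set (a collection of unique elements).

all_lower1nopunctuation.count(word1) returns the number of times word1 appears in all_lower1nopunctuation. It does not increment a counter.

for word1 in all_lower1nopunctuation does not do what you want, because all_lower1nopunctuation is a string (not a list), so the loop visits one character at a time. Convert it to a list with .split(' ').

.replace('\n', '') removes all newline characters; otherwise they would also end up counted as words.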

#for word1 in all_lower1nopunctuation:
#    if len(word1) == k: #also, this should be == n, not == k
#        all_lower1nopunctuation.count(word1)
#for word2 in all_lower2nopunctuation:
#    if len(word2) == k:
#        all_lower2opunctuation.count(word2)

word_set = set()
for word in all_lower1nopunctuation.replace('\n', '').split(' '):
    if len(word) == n and word in all_lower2nopunctuation:
        word_set.add(word) #set uses .add() instead of .append()
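
One caveat about the loop above, shown as a hedged sketch on a made-up string: removing '\n' outright joins the last word of one line to the first word of the next, whereas .split() with no argument splits on any run of whitespace, newlines included. Note also that word in all_lower2nopunctuation is a substring test on a string, so it can match inside longer words.

```python
text = "red green\nblue red"

# Iterating over a string yields single characters, not words.
chars = [item for item in text][:3]

# split(' ') only splits on spaces, so the newline stays inside a "word".
by_space = text.split(" ")

# Removing '\n' first glues the words on either side of it together.
glued = text.replace("\n", "").split(" ")

# split() with no argument splits on any whitespace, newlines included.
by_whitespace = text.split()

print(chars)          # ['r', 'e', 'd']
print(by_space)       # ['red', 'green\nblue', 'red']
print(glued)          # ['red', 'greenblue', 'red']
print(by_whitespace)  # ['red', 'green', 'blue', 'red']
```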

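Now that you have a set of the words that appear in both URLs, you need to store how many times each word occurs in each URL. The following code gives you a list of tuples: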

count_list = []
for final_word in word_set:
    count_list.append((final_word,
                       all_lower1nopunctuation.count(final_word),
                       all_lower2nopunctuation.count(final_word)))
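
Be aware that str.count counts substrings, so a short word is also counted when it appears inside a longer one. If whole-word counts are what you need, one possible alternative (not what the code above does) is collections.Counter on the split text. The sample strings below are made up for illustration.

```python
from collections import Counter

text1 = "there is the other theme"
text2 = "the end of the line"
n = 3

# Whole-word counts for each text.
counts1 = Counter(text1.split())
counts2 = Counter(text2.split())

# str.count also matches "the" inside "there", "other" and "theme".
substring_hits = text1.count("the")
whole_word_hits = counts1["the"]

# Words of length n that occur in both texts, with their counts.
common = {w for w in counts1 if len(w) == n and w in counts2}
count_list = [(w, counts1[w], counts2[w]) for w in sorted(common)]

print(substring_hits, whole_word_hits)  # 4 1
print(count_list)                       # [('the', 1, 2)]
```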

Returning means the function is finished and the interpreter continues wherever it was before the function was called, so whatever comes after the return is irrelevant.

As RemcoGerlich said.

Your code will only ever reach the first return, so you need to merge the two returns into one.

#return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
#return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
return(count_list) # which contains a list of tuples with all words and its counts
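
A tiny sketch of that behaviour, using a hypothetical function name: anything written after the first return that executes is never run.

```python
def first_return_wins():
    results = []
    results.append("before return")
    return results
    # Unreachable: the function has already handed control back to the caller.
    results.append("after return")
    return "second return"

value = first_return_wins()
print(value)  # ['before return']
```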

TL;DR

from urllib.request import urlopen
import string

def compare_url(url1,url2,enc,n):
    with urlopen(url1) as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode(enc)
    with urlopen(url2) as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode(enc)

    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()

    all_lower1nopunctuation = ''.join(l if l not in string.punctuation
                                      else ' ' for l in all_lower1)
    all_lower2nopunctuation = ''.join(l if l not in string.punctuation
                                      else ' ' for l in all_lower2)

    word_set = set()
    for word in all_lower1nopunctuation.replace('\n', '').split(' '):
        if len(word) == n and word in all_lower2nopunctuation:
            word_set.add(word)

    count_list = []
    for final_word in word_set:
        count_list.append((final_word,
                           all_lower1nopunctuation.count(final_word),
                           all_lower2nopunctuation.count(final_word)))

    return count_list

url1 = 'https://www.tutorialspoint.com/python/list_count.htm'
url2 = '

for word_count in compare_url(url1, url2, 'utf-8', 5):
    print(word_count)
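
The question also asks for the result to be sorted by decreasing number of occurrences, with ties broken alphabetically. The code above does not do that. A minimal sketch follows, assuming that "most occurrences" means the total across both URLs (the question does not say which count to use); the sample count_list and the helper total_occurrences are made up for illustration.

```python
# Made-up sample in the same shape as the list returned by compare_url.
count_list = [("there", 2, 1), ("about", 4, 3), ("which", 1, 2), ("other", 5, 2)]

def total_occurrences(entry):
    """Sum the counts from both URLs (an assumption about the ranking)."""
    word, count1, count2 = entry
    return count1 + count2

# Negating the total sorts it in decreasing order, while the word itself
# stays in ascending (alphabetical) order to break ties.
ranked = sorted(count_list, key=lambda e: (-total_occurrences(e), e[0]))

for entry in ranked:
    print(entry)
```

If only one URL's count should decide the order, replace total_occurrences with that single field in the sort key.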

python

string

count

notepad++

python-3.5