How to optimize retrieval of the 10 most frequent words inside a JSON data object?

I am looking for ways to make this code more efficient (in both runtime and memory complexity). Should I use something like a max-heap? Is the poor performance caused by the string concatenation, by sorting the dictionary incorrectly, or by something else? Edit: I replaced the dictionary/map object with a Counter applied to a list of all retrieved names (including duplicates).
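
On the max-heap question: once the counts exist, Counter.most_common and heapq.nlargest both pick the top k without fully sorting, and for a couple of hundred names either is effectively instant, so the heap is not where the time goes. A minimal sketch with made-up sample data:

from collections import Counter
import heapq

names = ["Weber", "Kris", "Weber", "Rice", "Kris", "Weber"]   # made-up sample data

counts = Counter(names)

# Counter.most_common(k) already uses a heap (heapq.nlargest) under the hood for small k
print(counts.most_common(2))                                   # [('Weber', 3), ('Kris', 2)]

# explicit max-heap style selection gives the same result
print(heapq.nlargest(2, counts.items(), key=lambda kv: kv[1]))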

Minimum requirement: the script should take less than 30 seconds. Current runtime: 54 seconds.

# Try to implement the program efficiently (running the script should take less than 30 seconds)
import requests

# Requests is an elegant and simple HTTP library for Python, built for human beings.
# Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
# Requests is not a built-in module (it does not come with the default Python installation), so you will have to install it:
# http://docs.python-requests.org/en/v2.9.1/
# installing it from PyCharm is not so easy and can take a lot of troubleshooting (problems with pip's main version)
# use conda/pip install requests instead

import json

# dict subclass for counting hashable objects
from collections import Counter

#import heapq

import datetime

url = 'https://api.namefake.com'
# a "global" list object. TODO: try to make it "static" (local to the file)
words = []

#####################################################################################
# Calls the site http://www.namefake.com  100 times and retrieves random names
# Examples for the format of the names from this site:
# Dr. Willis Lang IV
# Lily Purdy Jr.
# Dameon Bogisich
# Ms. Zora Padberg V
# Luther Krajcik Sr.
# Prof. Helmer Schaden            etc....
#####################################################################################

requests.packages.urllib3.disable_warnings()

t = datetime.datetime.now()

for x in range(100):
    # for each name, break it into first and last name
    # no need for authentication
    # http://docs.python-requests.org/en/v2.3.0/user/quickstart/#make-a-request
    responseObj = requests.get(url, verify=False)

    # Decoding JSON data from returned response object text
    # Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    #    containing a JSON document) to a Python object.
    jsonData = json.loads(responseObj.text)
    x = jsonData['name']

    newName = ""
    for name_char in x:
        # rebuild the name string one character at a time (x is already a str, so this just copies it)
        newName += str(name_char)

    # split by whitespaces
    y = newName.split()

    # parse the first name (check first whether a title prefix exists: Prof., Dr., Mr., Miss)
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])

    words.append(y[1])

# Return the top 10 words that appear most frequently, together with the number of times each word appeared.
# Output example: ['Weber', 'Kris', 'Wyman', 'Rice', 'Quigley', 'Goodwin', 'Lebsack', 'Feeney', 'West', 'Marlen']
# (We don't care whether the word was a first or a last name)

# list of tuples
top_ten = Counter(words).most_common(10)

top_names_list = [name[0] for name in top_ten]

print((datetime.datetime.now()-t).total_seconds())

print(top_names_list)

You are calling an API endpoint that generates dummy data for one person at a time - that is what takes so long.

The rest of the code takes almost no time.

Change the endpoint you are using (the one you call offers no bulk name collection), or use the built-in dummy data that a Python module can generate.
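
If you have to stay with this one-name-per-call endpoint, the time is spent waiting on the network, so issuing the 100 requests in parallel should also get well under 30 seconds. A hedged sketch (the worker count and the simple title filter are my assumptions, not part of the original code):

import requests
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

URL = 'https://api.namefake.com'                           # same endpoint as in the question
TITLES = {'Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Miss', 'Prof.'}    # assumed set of title prefixes

requests.packages.urllib3.disable_warnings()

def fetch_name(_):
    # one HTTP round trip per fake person; only the 'name' field is kept
    return requests.get(URL, verify=False).json()['name']

with ThreadPoolExecutor(max_workers=20) as pool:           # worker count chosen arbitrarily
    names = list(pool.map(fetch_name, range(100)))

# keep first and last name, drop titles (same intent as the question's parsing)
words = []
for name in names:
    parts = [p for p in name.split() if p not in TITLES]
    words.extend(parts[:2])

print(Counter(words).most_common(10))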


You can clearly see that "counting and processing names" is not the bottleneck here:

from faker import Faker          # python module that generates dummy data
from collections import Counter
import datetime
fake = Faker()
c = Counter()

# generate 10,000 names, split them and count the first part
t = datetime.datetime.now()
c.update(fake.name().split()[0] for _ in range(10000))

print(c.most_common(10))
print((datetime.datetime.now()-t).total_seconds())

Output for 10,000 names:

[('Michael', 222), ('David', 160), ('James', 140), ('Jennifer', 134), 
 ('Christopher', 125), ('Robert', 124), ('John', 120), ('William', 111), 
 ('Matthew', 111), ('Lisa', 101)]

1.886564 # seconds
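
Note that the snippet above only counts the first part of each generated name, while the original task counts both first and last names; a variant closer to that (the title set is an assumption about faker's output) looks like this and is still fast:

from faker import Faker
from collections import Counter

fake = Faker()
c = Counter()
TITLES = {'Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Miss', 'Prof.'}   # assumed title prefixes

for _ in range(10000):
    parts = [p for p in fake.name().split() if p not in TITLES]
    c.update(parts[:2])                                   # first and last name only

print(c.most_common(10))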

General advice for code optimization: measure first, then optimize the bottleneck.

If you want a code review, check https://codereview.stackexchange.com/help/on-topic and see whether your code fits the requirements of the Code Review Stack Exchange site. As on SO, some effort should go into the question first - i.e. profile where most of your time is actually spent.
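
For that profiling step, the standard library's cProfile is usually enough to show whether the time sits in the network calls or in the parsing; a minimal sketch, assuming the question's loop has been wrapped in a main() function:

import cProfile
import pstats

cProfile.run('main()', 'stats.out')              # main() is assumed to contain the question's loop

stats = pstats.Stats('stats.out')
stats.sort_stats('cumulative').print_stats(10)   # 10 most expensive entries by cumulative time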


Edit - performance measurement:

import requests
import json
from collections import defaultdict
import datetime


# defaultdict is (in this case) better than Counter because you add 1 name at a time
# Counter is superior if you update whole iterables of names at a time
d = defaultdict(int)

def insertToDict(n):
    d[n] += 1

url = 'https://api.namefake.com'
api_times = []
process_times = []
requests.packages.urllib3.disable_warnings()
for x in range(10):
    # for each name, break it into first and last name
    try:
        t = datetime.datetime.now()      # start time for API call
        # no need for authentication
        responseObj = requests.get(url, verify=False)
        jsonData = json.loads(responseObj.text)

        # end time for API call
        api_times.append( (datetime.datetime.now()-t).total_seconds() )
        x = jsonData['name']

        t = datetime.datetime.now()      # start time for name processing
        newName = ""
        for name_char in x:
            # rebuild the name string one character at a time
            newName = newName + str(name_char)

        # split by whitespaces
        y = newName.split()

        # parse the first name (check first whether a title prefix exists: Prof., Dr., Mr., Miss)
        if "." in y[0] or "Miss" in y[0]:
            insertToDict(y[2])
        else:
            insertToDict(y[0])
        insertToDict(y[1])

        # end time for name processing
        process_times.append( (datetime.datetime.now()-t).total_seconds() )
    except Exception:
        # skip this iteration if the request or the JSON decoding fails
        continue

newA = sorted(d, key=d.get, reverse=True)[:10]
print(newA)
print(sum(api_times))
print(sum( process_times )) 

Output:

['Ruecker', 'Clare', 'Darryl', 'Edgardo', 'Konopelski', 'Nettie', 'Price',
 'Isobel', 'Bashirian', 'Ben']
6.533625
0.000206
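
Regarding the defaultdict-vs-Counter comment in the snippet above: both count correctly, the difference is only whether you feed names one at a time or as a whole iterable. A small sketch with made-up data:

from collections import Counter, defaultdict

names = ['Weber', 'Kris', 'Weber']   # made-up sample data

d = defaultdict(int)
for n in names:
    d[n] += 1                        # one name at a time

c = Counter()
c.update(names)                      # whole iterable at once

print(dict(d), dict(c))              # {'Weber': 2, 'Kris': 1} {'Weber': 2, 'Kris': 1}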

The parsing part could be done better; I did not bother, because it is not what matters here.
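
For completeness, a hedged sketch of a tidier parse; the title and suffix sets are assumptions based on the example names listed at the top of the question:

TITLES = {'Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Miss', 'Prof.'}   # assumed title prefixes
SUFFIXES = {'Jr.', 'Sr.', 'II', 'III', 'IV', 'V'}         # assumed suffixes

def first_and_last(full_name):
    # drop titles and suffixes, keep the first two remaining parts
    parts = [p for p in full_name.split() if p not in TITLES and p not in SUFFIXES]
    return parts[:2]

print(first_and_last('Dr. Willis Lang IV'))   # ['Willis', 'Lang']
print(first_and_last('Lily Purdy Jr.'))       # ['Lily', 'Purdy']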


It is better to use timeit for performance testing (it calls the code multiple times and averages the results, smoothing out artifacts due to caching/lag/...) (thanks @bruno desthuilliers) - I did not use timeit in this case because I did not want to call the API 100000 times to average the results.
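
A minimal timeit sketch for the local (non-API) part only, e.g. the counting step; the sample list is made up:

import timeit

setup = """
from collections import Counter
words = ['Weber', 'Kris', 'Wyman', 'Rice', 'Weber', 'Kris'] * 1000   # made-up sample data
"""

# average seconds per run of the counting step, over 1000 runs
print(timeit.timeit('Counter(words).most_common(10)', setup=setup, number=1000) / 1000)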