Grouping of documents having the same phone number
My database consists of a collection of a large number of hotels (approx. 121,000).
This is what a document in my collection looks like:
{
    "_id" : ObjectId("57bd5108f4733211b61217fa"),
    "autoid" : 1,
    "parentid" : "P01982.01982.110601173548.N2C5",
    "companyname" : "Sheldan Holiday Home",
    "latitude" : 34.169552,
    "longitude" : 77.579315,
    "state" : "JAMMU AND KASHMIR",
    "city" : "LEH Ladakh",
    "pincode" : 194101,
    "phone_search" : "9419179870|253013",
    "address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR",
    "email" : "",
    "website" : "",
    "national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/",
    "area" : "Leh Ladakh",
    "data_city" : "Leh Ladakh"
}
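Each document can have one or more phone numbers, separated by the "|" delimiter.
I have to group together documents that have the same phone number.
This has to work in real time: when a user opens a particular hotel to view its details in the web interface, I should be able to display all the hotels linked to it, grouped by common phone numbers.
The grouping is transitive: if one hotel links to another, and that hotel links to a third, all 3 hotels should be grouped together.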
Example : Hotel A has phone numbers 1|2, B has phone numbers 3|4 and C
has phone numbers 2|3, then A, B and C should be grouped together.
from pymongo import MongoClient
from pprint import pprint #Pretty print
import re #for regex
#import unicodedata
client = MongoClient()
cLen = 0
cLenAll = 0
flag = 0
countA = 0
countB = 0
list = []
allHotels = []
conContact = []
conId = []
hotelTotal = []
splitListAll = []
contactChk = []
#We'll be passing the value later as parameter via a function call
#hId = 37443;
regx = re.compile("^Vivanta", re.IGNORECASE)
#Connection
db = client.hotel
collection = db.hotelData
#Finding hotels wrt search input
for post in collection.find({"companyname":regx}):
    list.append(post)
#Copying all hotels in a list
for post1 in collection.find():
    allHotels.append(post1)
hotelIndex = 11 #Index of hotel selected from search result
conIndex = hotelIndex
x = list[hotelIndex]["companyname"] #Name of selected hotel
y = list[hotelIndex]["phone_search"] #Phone numbers of selected hotel
try:
    splitList = y.split("|") #Splitting of phone numbers and storing in a list 'splitList'
except:
    splitList = y
print "Contact details of",x,":"
#Printing all contacts...
for contact in splitList:
    print contact
    conContact.extend(contact)
    cLen = cLen+1
print "No. of contacts in",x,"=",cLen
for i in allHotels:
    yAll = allHotels[countA]["phone_search"]
    try:
        splitListAll.append(yAll.split("|"))
        countA = countA+1
    except:
        splitListAll.append(yAll)
        countA = countA + 1
# print splitListAll
#count = 0
#This block has errors
#Add code to stop when no new links occur and optimize the outer for loop
#for j in allHotels:
for contactAll in splitListAll:
    if contactAll in conContact:
        conContact.extend(contactAll)
#       contactChk = contactAll
#       if (set(conContact) & set(contactChk)):
#           conContact = contactChk
#           contactChk[:] = [] #drop contactChk list
        conId = allHotels[countB]["autoid"]
    countB = countB+1
print "Printing the list of connected hotels..."
for final in collection.find({"autoid":conId}):
    print final
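This is a piece of code I wrote in Python. In it, I tried performing a linear search in a for loop. I am getting some errors at the moment, but it should work once they are fixed.
I need an optimized version, because linear search has poor time complexity.
I am pretty new to this, so any other suggestions for improving the code are welcome.
Thanks.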
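The simplest answer to any Python in-memory search question is "use a dict". A dict gives O(1) average-case key access, while searching a list is O(N).
Also remember that you can put a Python object into as many dicts (or lists) as you need, and as many times into a single dict or list as you need. The objects are not copied; only references to them are stored.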
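To make the point about references concrete, here is a small sketch. The record is a cut-down, hypothetical version of the document shown in the question, and the names `by_phone` and `by_id` are only illustrative.

```python
# Hypothetical record, modelled on the document shown in the question.
hotel = {"autoid": 1, "phone_search": "9419179870|253013"}

by_phone = {}   # phone number -> hotel
by_id = {}      # autoid -> hotel

for phone in hotel["phone_search"].split("|"):
    by_phone[phone] = hotel
by_id[hotel["autoid"]] = hotel

# All three entries refer to the very same dict object; nothing was copied.
same_object = by_phone["253013"] is by_id[1] is by_phone["9419179870"]

# A change made through one reference is visible through the others.
by_id[1]["city"] = "LEH Ladakh"
city_via_phone = by_phone["253013"]["city"]
```

Because only references are stored, keeping several such indexes over 121,000 hotels costs roughly one pointer per entry rather than one full copy of each document.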
So the essentials will look like this:
hotelsbyphone = {}  # phone number -> list of hotels that have it
for hotel in hotels:
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone, []).append(hotel)
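At the end of this loop, hotelsbyphone["123456"] will be a list of the hotel objects that had "123456" as one of their phone_search strings. The key coding feature is the .setdefault(key, []) method, which stores an empty list under the key if the key is not already in the dict, so that you can then append to it.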
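As a quick illustration, here is the same loop run on three made-up records that mirror the A, B, C example from the question (the `autoid` values and the `ids_by_phone` summary are assumptions added for readability):

```python
# Hypothetical records: A has phones 1|2, B has 3|4, C has 2|3.
hotels = [
    {"autoid": 1, "phone_search": "1|2"},  # hotel A
    {"autoid": 2, "phone_search": "3|4"},  # hotel B
    {"autoid": 3, "phone_search": "2|3"},  # hotel C
]

hotelsbyphone = {}
for hotel in hotels:
    for phone in hotel["phone_search"].split("|"):
        # setdefault returns the list already stored under phone,
        # or stores and returns a new empty list the first time.
        hotelsbyphone.setdefault(phone, []).append(hotel)

# Summarise the index as phone -> list of autoids.
ids_by_phone = {
    phone: [h["autoid"] for h in group]
    for phone, group in hotelsbyphone.items()
}
# ids_by_phone == {"1": [1], "2": [1, 3], "3": [2, 3], "4": [2]}
```

Phones "2" and "3" each map to two hotels, which is exactly the overlap that links A to C and C to B.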
Once you have built this index, a lookup by phone number x will be fast:
try:
    hotels = hotelsbyphone[x]
    # and process a list of one or more hotels
except KeyError:
    # no hotels exist with that number
    pass
As an alternative to try ... except, test if x in hotelsbyphone:
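The index alone only finds hotels that directly share a number. The question also asks for transitive grouping (A links to C, C links to B, so all three belong together). One possible way to get that, not the only one, is a breadth-first walk over the phone index. The sketch below assumes each record has a unique `autoid`, as in the sample document; the function names `build_phone_index` and `linked_hotels` are made up for this example.

```python
from collections import deque


def build_phone_index(hotels):
    """Map each phone number to the list of hotels that have it."""
    index = {}
    for hotel in hotels:
        for phone in hotel["phone_search"].split("|"):
            if phone:  # skip empty strings from empty or malformed fields
                index.setdefault(phone, []).append(hotel)
    return index


def linked_hotels(start, index):
    """Return every hotel reachable from start through shared phone numbers."""
    seen_ids = {start["autoid"]}
    seen_phones = set()
    group = [start]
    queue = deque([start])
    while queue:
        hotel = queue.popleft()
        for phone in hotel["phone_search"].split("|"):
            if not phone or phone in seen_phones:
                continue
            seen_phones.add(phone)
            for other in index.get(phone, []):
                if other["autoid"] not in seen_ids:
                    seen_ids.add(other["autoid"])
                    group.append(other)
                    queue.append(other)
    return group


# Hypothetical data: A, B, C from the question, plus an unrelated hotel D.
hotels = [
    {"autoid": 1, "phone_search": "1|2"},  # A
    {"autoid": 2, "phone_search": "3|4"},  # B
    {"autoid": 3, "phone_search": "2|3"},  # C
    {"autoid": 4, "phone_search": "9"},    # D
]
index = build_phone_index(hotels)
group_ids = sorted(h["autoid"] for h in linked_hotels(hotels[0], index))
# group_ids == [1, 2, 3]
```

Each phone number and each hotel is visited at most once per lookup, so the cost is proportional to the size of the group rather than to the whole collection. If the groups change rarely, they could also be computed once for all hotels and stored, so that the web page only does a single lookup.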