Trying to find unique subarrays and sub-elements?
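I have an array, dirty_pages, that contains subarrays of the form [page_name, url, id]. The array contains duplicate subarrays.
I need to parse each subarray in dirty_pages into clean_pages such that:
there are no duplicates (repeated subarrays)
the 1st index of each subarray, i.e. the url, is unique! For example, these two urls should count as one (url/#review is still the same url):
file:///home/joe/Desktop/my-projects/FashionShop/product.html#review
and
file:///home/joe/Desktop/my-projects/FashionShop/product.html
My current attempt returns clean_pages with 6 subarrays (duplicates!), whereas the correct answer should be 4: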
# clean pages
clean_pages = []
# dirty pages
dirty_pages = [
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/#review', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/?123', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/', '1608093980462.042'],
]
# clean data - get unique pages for each session
for j in range(len(dirty_pages)):
    page_name = dirty_pages[j][0]
    page_url = dirty_pages[j][1]
    page_sessionId = dirty_pages[j][2]
    not_seen = False
    if len(clean_pages) == 0:
        clean_pages.append([page_name, page_url, page_sessionId])
    else:
        for i in range(len(clean_pages)):
            next_page_name = clean_pages[i][0]
            next_page_url = clean_pages[i][1]
            next_page_sessionId = clean_pages[i][2]
            if page_url != next_page_url and page_name != next_page_name \
                    and page_sessionId == next_page_sessionId:
                not_seen = True
            else:
                not_seen = False
    if not_seen is True:
        clean_pages.append([page_name, page_url, page_sessionId])
print("$$$ clean...", len(clean_pages))
# correct answer should be 4 - as anything after url e.g. #review is still duplicate!
Updated example - apologies if the example was not clear (as with a # after the url, these should all be treated as one url):
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/?123'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html'
You can do it like this:
for j in dirty_pages:
    page_name = j[0]
    long_url = j[1]
    # keep only the part of the url before the first '#'
    split_url = long_url.split('#')
    short_url = split_url[0]
    page_sessionID = j[2]
    edited_subarray = [page_name, short_url, page_sessionID]
    if edited_subarray not in clean_pages:
        clean_pages.append(edited_subarray)
That is, unless you need to keep the "#review" part of the url in the clean_pages list.
You can use furl to normalize the url:
from furl import furl

# Iterate over each page - subarray
for page in dirty_pages:
    # normalize url
    page[1] = furl(page[1]).remove(args=True, fragment=True).url.strip("/")
    # check if subarray already in clean_pages
    if page not in clean_pages:
        clean_pages.append(page)
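If installing furl is not an option, roughly the same normalization can be sketched with the standard library's urllib.parse. This is only a sketch under assumptions: the helper names normalize_url and unique_pages and the sample paths are made up for illustration, and treating "same url within the same session id" as a duplicate is my reading of the question rather than something it states outright.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Drop the query string, the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def unique_pages(pages):
    """Keep the first record seen for each (normalized url, session id) pair."""
    seen = set()  # set membership tests stay fast as the list grows
    result = []
    for name, url, session_id in pages:
        key = (normalize_url(url), session_id)
        if key not in seen:
            seen.add(key)
            result.append([name, key[0], session_id])
    return result


# Hypothetical sample data, shaped like dirty_pages in the question.
pages = [
    ["Home", "file:///home/joe/site/index.html", "s1"],
    ["Home", "file:///home/joe/site/index.html/", "s1"],
    ["Home", "file:///home/joe/site/index.html/?123", "s1"],
    ["Product", "file:///home/joe/site/product.html#review", "s1"],
    ["Product", "file:///home/joe/site/product.html", "s1"],
]

print(len(unique_pages(pages)))  # prints 2
```

Unlike the furl loop above, this version leaves the input list untouched and stores the normalized url in the result.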
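I think the question essentially boils down to having unique urls. If that is the case, your previous way of checking was a bit overcomplicated, and the result you got boils down to only adding the first item for each unique name. Hence 6, but you seem to want unique urls.
To deal with the # issue, I just split the url around the hash sign and used the first part. Note that this only takes the part of the string before the first hash sign, so it could cause problems if there is more than one #.
I also tidied things up a bit: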
# clean pages
clean_pages = []
# dirty pages
dirty_pages = [
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
]
used_url = []
used_names = []
# clean data - get unique pages for each session
for page in dirty_pages:
    page_name = page[0]
    page_url = page[1].split('#')[0]
    page_sessionId = page[2]
    if page_url not in used_url:
        used_url.append(page_url)
        clean_pages.append(page)
print(clean_pages)
print("$$$ clean...", len(clean_pages))
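On the worry about more than one #, a quick check may help. The sketch below compares three ways of dropping the fragment; the url with two hash signs is a made-up example, not one from the question. As far as I can tell, all three cut at the first #, so the extra # ends up in the discarded part.

```python
from urllib.parse import urldefrag

# Hypothetical url containing two '#' characters.
url = "file:///home/joe/site/product.html#review#more"

# split('#')[0] keeps everything before the first '#'.
by_split = url.split("#")[0]

# partition('#') also stops at the first '#', without splitting the rest.
by_partition = url.partition("#")[0]

# urldefrag returns a (url, fragment) named tuple.
by_urldefrag = urldefrag(url).url

print(by_split)
print(by_partition)
print(by_urldefrag)
```

Note that none of these remove a query string such as ?123 or a trailing slash, so the urls from the updated example would still need extra normalization.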