Website name extraction in Python
I want to extract the website name from a URL. For example, https://plus.google.com/in/test.html
should give the output "plus google".
More test cases:
- WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/AUTO_PARTS_MADISON_OH_7402.HTML
Output: OH MADISON STORES ADVANCEAUTOPARTS
- WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054
Output: LQ
- WWW.LOCATIONS.DENNYS.COM
Output: LOCATIONS DENNYS
- WV.WESTON.STORES.ADVANCEAUTOPARTS.COM
Output: WV WESTON STORES ADVANCEAUTOPARTS
- WOODYANDERSONFORDFAYETTEVILLE.NET/
Output: WOODYANDERSONFORDFAYETTEVILLE
- WILMINGTONMAYFAIRETOWNCENTER.HGI.COM
Output: WILMINGTONMAYFAIRETOWNCENTER HGI
- WHITEHOUSEBLACKMARKET.COM/
Output: WHITEHOUSEBLACKMARKET
- WINGATEHOTELS.COM
Output: WINGATEHOTELS
    string = str(input("Enter the url "))
    new_list = list(string)
    count=0
    flag=0
    if 'w' in new_list:
        index1 = new_list.index('w')
        new_list.pop(index1)
        count += 1
    if 'w' in new_list:
        index2 = new_list.index('w')
        if index2 != -1 and index2 == index1:
            new_list.pop(index2)
            count += 1
    if 'w' in new_list:
        index3= new_list.index('w')
        if index3!= -1 and index3== index2 and new_list[index3+1]=='.':
            new_list.pop(index3)
            count+=1
            flag = 1
    if flag == 0:
        start = string.find('/')
        start += 2
        end = string.rfind('.')
        new_string=string[start:end]
        print(new_string)
    elif flag == 1:
        start = string.find('.')
        start = start + 1
        end = string.rfind('.')
        new_string=string[start:end]
        print(new_string)
The code above works for some of the test cases, but not all. Please help me.
Thanks
Here is a base you can build on, using urllib.parse.urlparse:
    from urllib.parse import urlparse

    tests = ('https://plus.google.com/in/test.html',
             ('WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/'
              'AUTO_PARTS_MADISON_OH_7402.HTML'),
             'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054')

    def extract(url):
        # urlparse will not work without a 'scheme'
        if not url.startswith('http'):
            url = 'http://' + url
        parsed = urlparse(url).netloc
        split = parsed.split('.')[:-1]  # get rid of TLD
        if split[0].lower() == 'www':
            split = split[1:]
        ret = ' '.join(split)
        return ret

    for url in tests:
        print(extract(url))
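As a rough check of the urlparse approach against every test case in the question, the sketch below may help. The name `site_name` and the regex-based scheme check are illustrative choices, not part of the answer above. It assumes the last dot-separated label is the TLD (so something like `example.co.uk` would come out as "example co"), and it ignores ports and user info in the host.

```python
import re
from urllib.parse import urlparse

# Hypothetical helper name; the scheme test is an assumption made for this sketch.
_SCHEME = re.compile(r'[a-z][a-z0-9+.-]*://', re.IGNORECASE)

def site_name(url):
    """Return the host labels of url without a leading 'www' or the last label."""
    # urlparse only fills in netloc when a scheme (or '//') precedes the host.
    if not _SCHEME.match(url):
        url = 'http://' + url
    labels = urlparse(url).netloc.split('.')
    # Drop a leading 'www' regardless of case.
    if labels and labels[0].lower() == 'www':
        labels = labels[1:]
    # Assume the final label is the TLD and drop it.
    return ' '.join(labels[:-1])

cases = [
    'https://plus.google.com/in/test.html',
    'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054',
    'WV.WESTON.STORES.ADVANCEAUTOPARTS.COM',
    'WOODYANDERSONFORDFAYETTEVILLE.NET/',
]
for case in cases:
    print(site_name(case))
```

Using `netloc` rather than `hostname` keeps the original letter case, which matches the upper-case outputs the question expects.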
This function strips the URL down to the part between the double slash and the next single slash. The rest is clean-up:
    def stripURL(url, TwoSlashes, OneSlash):
        # Return the text between TwoSlashes and the next OneSlash,
        # or "" if either delimiter is missing.
        try:
            start = url.index(TwoSlashes) + len(TwoSlashes)
            end = url.index(OneSlash, start)
            return url[start:end]
        except ValueError:
            return ""

    url = input("URL : ")
    if "www." in url:
        url = url.replace("www.", "")
    Strip = stripURL(url, "//", "/")
    # Strips anything after the last period found
    Stripped = Strip[:Strip.rfind(".")]
    # get rid of any periods used in the name
    Stripped = Stripped.replace(".", " ")
    print(Stripped)
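The slicing approach above returns an empty string when the input has no "//" or no slash after the host, which is the case for several of the question's inputs (for example WINGATEHOTELS.COM). A minimal sketch of the same idea using `str.partition`, which tolerates missing delimiters, could look like this. The names `host_part` and `name_from_host` are made up for illustration, and the code assumes that "//" only ever appears directly after a scheme.

```python
def host_part(url):
    """Return the text between an optional '//' and the next '/'."""
    # partition gives (before, sep, after); sep is '' when '//' is absent.
    before, sep, after = url.partition('//')
    rest = after if sep else before
    # Everything up to the first '/' is treated as the host.
    return rest.partition('/')[0]

def name_from_host(host):
    """Drop a leading 'www.' and the last dotted label, then space-separate."""
    if host.lower().startswith('www.'):
        host = host[4:]
    stem, dot, _ = host.rpartition('.')
    # If there is no dot at all, keep the host unchanged.
    return (stem if dot else host).replace('.', ' ')

for url in ('https://plus.google.com/in/test.html',
            'WWW.LOCATIONS.DENNYS.COM',
            'WINGATEHOTELS.COM'):
    print(name_from_host(host_part(url)))
```

Unlike `replace("www.", "")`, the prefix check only removes "www." at the start of the host and works for upper-case input as well.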