Extract urls and their names from an html file stored on disk and print them respectively - Python
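I am trying to extract and print the urls and their names (the text between <a href='url' title='smth'> and </a>, i.e. NAME) from an html file saved on disk, without using BeautifulSoup or any other library. Just beginner's Python code.
The desired print format is:
http://..filepath/filename.pdf
File's Name
so on...
I can extract and print all the urls, or all the names, separately. What I can't do is pick up the name that follows each url in the markup and print it right below that url. My code is getting messy and I'm quite stuck.
Here is my code so far: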
    import os

    with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
        txt = html.read()

    # for urls
    nolp = 0
    urlarrow = []
    while nolp == 0:
        pos = txt.find("href")
        if pos >= 0:
            # drop everything before the next href
            txtcount = len(txt)
            txt = txt[pos:txtcount]
            # skip to just after the opening double quote
            pos = txt.find('"')
            txtcount = len(txt)
            txt = txt[pos+1:txtcount]
            # the url runs up to the closing double quote
            pos = txt.find('"')
            url = txt[0:pos]
            if url.startswith("http") and url.endswith("pdf"):
                urlarrow.append(url)
        else:
            nolp = 1

    for item in urlarrow:
        print(item)

    # for names
    # (almost identical code to the above)
    # No html.close() needed: the with block already closed the file.
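How do I make it work? I need to merge the two parts into one function (a def), but how?
P.S. I posted an answer below, but I think there may be a simpler and more Pythonic solution.
This produces the correct output that I needed, but I'm sure there's a better way.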
    import os

    # Read the page once; expanduser() resolves the leading '~'.
    with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
        txt = html.read()
    text = txt  # second reference to the same content, consumed by the name loop

    # for urls
    nolp = 0
    urlarrow = []
    while nolp == 0:
        pos = txt.find("href")
        if pos >= 0:
            txtcount = len(txt)
            txt = txt[pos:txtcount]
            pos = txt.find('"')
            txtcount = len(txt)
            txt = txt[pos+1:txtcount]
            pos = txt.find('"')
            url = txt[0:pos]
            if url.startswith("http") and url.endswith("pdf"):
                urlarrow.append(url)
        else:
            nolp = 1

    # for names
    noloop = 0
    namearrow = []
    while noloop == 0:
        posB = text.find("title")
        if posB >= 0:
            textcount = len(text)
            text = text[posB:textcount]
            posB = text.find('"')
            textcount = len(text)
            text = text[posB+19:textcount]  # because string starts 19 chars after the posB
            posB = text.find('</')
            name = text[1:posB]
            # text.startswith() is safe even when text is empty, unlike text[0]
            if text.startswith('>'):
                namearrow.append(name)
        else:
            noloop = 1

    # interleave the two lists: url, name, url, name, ...
    fullarrow = []
    for pair in zip(urlarrow, namearrow):
        for item in pair:
            fullarrow.append(item)

    for instance in fullarrow:
        print(instance)
    # No html.close() needed: the with block already closed the file.
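One way to merge the two loops into a single function, as the question asks, is to walk the text one <a ...>NAME</a> element at a time and take the url and the name from the same element, so the pairs can never drift out of step. The following is only a minimal sketch under some assumptions: the function name extract_pdf_links and the sample markup are made up for illustration, every anchor is assumed to be written as a lowercase `<a ` tag closed by `</a>`, and no attribute value is assumed to contain a `>` character. It accepts either single or double quotes around the href value.

```python
def extract_pdf_links(html_text):
    """Return a list of (url, name) pairs for anchors whose href is an http(s) PDF link."""
    pairs = []
    pos = 0
    while True:
        start = html_text.find("<a ", pos)       # next opening anchor tag
        if start == -1:
            break
        tag_end = html_text.find(">", start)     # end of the opening tag
        if tag_end == -1:
            break
        close = html_text.find("</a>", tag_end)  # matching closing tag
        if close == -1:
            break

        tag = html_text[start:tag_end]               # e.g. <a href="..." title="..."
        name = html_text[tag_end + 1:close].strip()  # text between > and </a>
        pos = close + len("</a>")                    # continue after this element

        href_pos = tag.find("href=")
        if href_pos == -1:
            continue
        quote = tag[href_pos + 5:href_pos + 6]   # quote character that opens the value
        if quote not in ('"', "'"):
            continue
        url_start = href_pos + 6
        url_end = tag.find(quote, url_start)     # matching closing quote
        if url_end == -1:
            continue
        url = tag[url_start:url_end]

        # Same filter as the original code: keep only http(s) links to PDFs.
        if url.startswith("http") and url.endswith("pdf"):
            pairs.append((url, name))
    return pairs


# Hypothetical sample markup, used only to demonstrate the function.
sample = (
    '<p><a href="http://example.com/files/a.pdf" title="First">First File</a> '
    "<a href='http://example.com/b.pdf' title='x'>Second</a> "
    '<a href="http://example.com/page.html">Not a PDF</a></p>'
)

for url, name in extract_pdf_links(sample):
    print(url)
    print(name)
# prints:
# http://example.com/files/a.pdf
# First File
# http://example.com/b.pdf
# Second
```

To run it on the saved page, read the file as in the code above (open(os.path.expanduser('~/SomeFolder/page.html'))) and pass the resulting string to the function. Because each name is cut from the same element as its url, the hard-coded offset of 19 characters is no longer needed. If standard-library modules are acceptable, html.parser.HTMLParser copes with uppercase tags and other markup variations that this sketch ignores.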