How can I write multiple .html file names into a single .txt output file, listing every href link found in each HTML file along with the name of the file it came from?
import glob

from bs4 import BeautifulSoup

# Collect the href of every anchor that has visible text.
with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')  # explicit parser avoids a bs4 warning
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        print(links_with_text)
        for element in links_with_text:
            textfile.write(element + "\n")
Sample output:
filename:
- link1
- link2
- link3
filename2:
- link1
- link2
- link3
filename3:
- link1
- link2
- link3
I found a post that covers something related, but it prints the output into multiple text files, whereas I want all the file names and their links in a single text file:
BeautifulSoup on multiple .html files
Any recommendations? Thanks in advance.
I did something similar, but with img tags; maybe it will help you:
import re
import urllib.request
from urllib.request import urlopen

from bs4 import BeautifulSoup

link = input('Url is: ')
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.jpg')})

# Write every image URL to a cache file, one per line.
with open("cache.txt", "w") as f:
    for image in images:
        f.write('https:' + image['src'] + '\n')

# Read the cache back and download each image.
with open('cache.txt') as f:
    for line in f:
        url = line.rstrip('\n')
        path = 'image' + url.split('/')[-1]
        urllib.request.urlretrieve(url, path)
Try this:
with open("a_file.txt", "a") as textfile:  # "a" to append
    for filename in glob.iglob('*.html'):
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        links_with_text = "\n".join(links_with_text)
        textfile.write(f"{filename}\n{links_with_text}\n")
To put the file name at the top of each block, just add another .write() line, like this:
from bs4 import BeautifulSoup
import glob

with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        textfile.write(f"{filename}:\n")
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        for element in links_with_text:
            textfile.write(f"  {element}\n")
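For reference, here is a minimal, self-contained sketch of the same approach that also produces the exact "- link" bullet format from the question. The sample pages and the temporary working directory are assumptions added purely to make the snippet runnable on its own:

```python
import glob
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample pages, only here to make the sketch self-contained.
SAMPLES = {
    "page1.html": '<a href="https://example.com/a">A</a><a href="https://example.com/b">B</a>',
    "page2.html": '<a href="https://example.org/c">C</a><a href="#"></a>',
}

with tempfile.TemporaryDirectory() as workdir:
    for name, markup in SAMPLES.items():
        with open(os.path.join(workdir, name), "w") as f:
            f.write(markup)

    out_path = os.path.join(workdir, "a_file.txt")
    with open(out_path, "w") as textfile:
        # sorted() keeps the block order deterministic across runs.
        for filename in sorted(glob.iglob(os.path.join(workdir, "*.html"))):
            with open(filename) as f:
                soup = BeautifulSoup(f, "html.parser")
            textfile.write(f"{os.path.basename(filename)}:\n")
            for a in soup.find_all("a", href=True):
                if a.text:  # skip anchors with no visible text
                    textfile.write(f"- {a['href']}\n")

    with open(out_path) as f:
        result = f.read()

print(result)
```

The anchor with no text is skipped, so the output contains one header line per file followed by its bulleted links.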