How can I write multiple .html file names into a single .txt output file, listing every href link found in each HTML file along with the name of the file it came from?
import glob

from bs4 import BeautifulSoup

# Collect the href of every anchor that has visible text.
with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')  # explicit parser avoids a bs4 warning
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        print(links_with_text)
        for element in links_with_text:
            textfile.write(element + "\n")
Sample output:
filename:
- link1
- link2
- link3
filename2:
- link1
- link2
- link3
filename3:
- link1
- link2
- link3
I found a post that covers something related, but it prints the output into multiple text files, whereas I want all the file names and their links in a single text file:
BeautifulSoup on multiple .html files
Any recommendations? Thanks in advance.
I did something similar, but with img tags; maybe it will help you:
import re
import urllib.request
from urllib.request import urlopen

from bs4 import BeautifulSoup

link = input('Url is: ')
html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.jpg')})

# Write every image URL to a cache file, one per line.
with open("cache.txt", "w") as f:
    for image in images:
        f.write('https:' + image['src'] + '\n')

# Read the cache back and download each image.
with open('cache.txt') as f:
    for line in f:
        url = line.rstrip('\n')
        path = 'image' + url.split('/')[-1]
        urllib.request.urlretrieve(url, path)
Try this:
with open("a_file.txt", "a") as textfile:  # "a" to append
    for filename in glob.iglob('*.html'):
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        links_with_text = "\n".join(links_with_text)
        textfile.write(f"{filename}\n{links_with_text}\n")
To put the file name at the top of each block, just add another .write() line, like this:
from bs4 import BeautifulSoup
import glob

with open("a_file.txt", "w") as textfile:
    for filename in glob.iglob('*.html'):
        textfile.write(f"{filename}:\n")
        with open(filename) as f:
            soup = BeautifulSoup(f, 'html.parser')
        links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
        for element in links_with_text:
            textfile.write(f"  {element}\n")
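For reference, here is a minimal, self-contained sketch of the same approach that also produces the exact "- link" bullet format from the question. The sample pages and the temporary working directory are assumptions added purely to make the snippet runnable on its own:

```python
import glob
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample pages, only here to make the sketch self-contained.
SAMPLES = {
    "page1.html": '<a href="https://example.com/a">A</a><a href="https://example.com/b">B</a>',
    "page2.html": '<a href="https://example.org/c">C</a><a href="#"></a>',
}

with tempfile.TemporaryDirectory() as workdir:
    for name, markup in SAMPLES.items():
        with open(os.path.join(workdir, name), "w") as f:
            f.write(markup)

    out_path = os.path.join(workdir, "a_file.txt")
    with open(out_path, "w") as textfile:
        # sorted() keeps the block order deterministic across runs.
        for filename in sorted(glob.iglob(os.path.join(workdir, "*.html"))):
            with open(filename) as f:
                soup = BeautifulSoup(f, "html.parser")
            textfile.write(f"{os.path.basename(filename)}:\n")
            for a in soup.find_all("a", href=True):
                if a.text:  # skip anchors with no visible text
                    textfile.write(f"- {a['href']}\n")

    with open(out_path) as f:
        result = f.read()

print(result)
```

The anchor with no text is skipped, so the output contains one header line per file followed by its bulleted links.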