How to save Python web scraper output in JSON file format?

Question

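I recently started coding and learning Python, and I am currently developing a web scraper. I want to scrape data from multiple websites and save it in JSON file format. At the moment the script only prints the search results. I wrote the code below, but it fails with an "is not JSON serializable" error and nothing is written to the file named filename. I am using Python 2.7.14 in the Mac terminal. Below is the Scraper.py file.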

from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip
import json

urls = ['http://www.ctex.cn', 'http://www.ss-gate.org/']
#scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    #open the file "filename" in write ("w") mode
    file = open("filename", "w")
    json_data = json.dumps(my_list,file)
    #json.dump(soup, file)
    file.close()

I also tried different code, but it still does not write to the file named filename and fails with the same "is not JSON serializable" error. Below is the Scraper2.py file.

from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip

urls = ['http://www.ctex.cn', 'http://www.ss-gate.org/']
#scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    #print(soup)

import json
# open the file "filename" in write ("w") mode
file = open("filename", "w")
#output = soup
# dumps "output" encoded in the JSON format into "filename"
json.dump(soup, file)
file.close()

Answer 1

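Logically

Your question is somewhat ambiguous,
because I am not sure whether you want to make the requests or write the parser.
It is better not to mix the two.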

Technically

HTML does not map cleanly onto JSON.
I suggest two approaches.
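
For context on the error in the question: the json module can only encode dicts, lists, strings, numbers, booleans and None, so handing it a parsed document object raises TypeError, while handing it the page text as a string works. The sketch below illustrates this with a hypothetical stand-in class (FakeSoup, which is not part of any library) in place of a real BeautifulSoup object, so that it runs without third-party packages.

```python
import json


class FakeSoup(object):
    """Hypothetical stand-in for a parsed-document object such as BeautifulSoup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


soup = FakeSoup("<html><title>demo</title></html>")

# Passing an arbitrary object to json fails: json has no rule for encoding it.
try:
    json.dumps(soup)
    error = None
except TypeError as exc:
    error = str(exc)

# Converting the object to a string first gives json something it can encode.
as_text = json.dumps(str(soup))
```

Both approaches below rely on the same idea: write out plain text (or a list of plain strings) rather than the parsed object itself.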

Save each page's text as an HTML file

You can save response.text (not response.content) to an HTML file,
like this:

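import io
import requests

# urls is the list defined in the question
for index, url in enumerate(urls):
    res = requests.get(url)
    # One file per page; page_<index>.html is a placeholder name.
    # io.open with an explicit encoding writes res.text (unicode) safely on Python 2.7.
    with io.open('page_{}.html'.format(index), 'w', encoding='utf-8') as html_file:
        html_file.write(res.text)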

or

Save multiple results to a single JSON file

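import json
import requests

out_list = []
for url in urls:
    res = requests.get(url)
    out_list.append(res.text)  # res.text is a plain string, which json can serialize
# the with block closes the file even if json.dump raises
with open('out.json', 'w') as json_file:
    json.dump(out_list, json_file)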

and write another program to parse them.
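
That second program could look roughly like the following. It is only a minimal sketch, assuming out.json holds a JSON list of HTML strings as written by the snippet above. For illustration it extracts just the page title and the link targets, and it uses the standard library's HTMLParser rather than BeautifulSoup so that it has no third-party dependencies. The names TitleAndLinkParser, extract, parse_file and parsed.json are made up for this example.

```python
import json

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every <a href="..."> value."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract(html):
    """Turn one HTML string into a plain dict that json can serialize."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


def parse_file(in_path, out_path):
    """Read a JSON list of HTML strings and write a JSON list of dicts."""
    with open(in_path) as in_file:
        pages = json.load(in_file)
    results = [extract(html) for html in pages]
    with open(out_path, "w") as out_file:
        json.dump(results, out_file, indent=2)
    return results

# Example call (run after the scraping script has produced out.json):
# parse_file("out.json", "parsed.json")
```

Because the result contains only dicts, lists and strings, json.dump accepts it without the serialization error from the question.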

Good luck.

Tags: beautifulsoup, scrapy, web-scraping, python-2.7, scrapy-spider
