Web scraper output split into different files

I've been working on a Python web scraper for a while. I want to save the scraped information in separate files: the URLs must go in one file, and the captions (the blog names) in another.

The URLs work fine, but when I try to scrape the names of the blogs I'm searching for, I get this result:

w
a
t
a
s
h
i
n
o
s
e
k
a
i
...

I've pinned down the problem, and I think it has to do with '\n', but I haven't found a solution yet.

Here is my code:

import requests
from bs4 import BeautifulSoup

search_term = "landscape/recent"
posts_scrape = requests.get(f"https://www.tumblr.com/search/{search_term}")
soup = BeautifulSoup(posts_scrape.text, "html.parser")

articles = soup.find_all("article", class_="FtjPK")

data = {}
for article in articles:
    try:
        source = article.find("div", class_="vGkyT").text
        for imgvar in article.find_all("img", alt="Image"):
            data.setdefault(source, []).extend(
                [
                    i.replace("500w", "").strip()
                    for i in imgvar["srcset"].split(",")
                    if "500w" in i
                ]
            )
    except AttributeError:
        continue


archivo = open("Sites.txt", "w")
for source, image_urls in data.items():
    for url in image_urls:
        archivo.write(url + '\n')
archivo.close()


archivo = open("Source.txt", "w")
for source, image_urls in data.items():
    for sources in source:
        archivo.write(sources + '\n')
archivo.close()
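As an aside, the srcset handling above can be checked in isolation. Below is a minimal sketch using a made-up srcset value (the URLs are hypothetical, not real Tumblr assets); the comprehension keeps only the 500w candidate and strips its width descriptor:

```python
# Hypothetical srcset value, as it might appear in an <img> tag
srcset = ("https://64.media.tumblr.com/abc/s250/img.jpg 250w, "
          "https://64.media.tumblr.com/abc/s500/img.jpg 500w")

# Same logic as in the question: split the candidates on commas,
# keep the one tagged 500w, then remove the descriptor and whitespace
urls = [i.replace("500w", "").strip() for i in srcset.split(",") if "500w" in i]
print(urls)  # ['https://64.media.tumblr.com/abc/s500/img.jpg']
```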

Change the last loop to:

archivo = open("Source.txt", "w")
for source in data:
    archivo.write(source + "\n")
archivo.close()

Then the contents of Source.txt will be:

harshvardhan25
mikeahrens
amazinglybeautifulphotography
landscaperrosebay
danielapelli
sahrish-acrylic-painting
sweetd3lights
pensamentsisomnis
pics-bae
oneshotolive
scattopermestesso
huariqueje

Or, using with:

with open("Source.txt", "w") as archivo:
    archivo.write("\n".join(data))
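The root cause is what the inner `for sources in source:` loop iterates over. Iterating a string yields one character at a time, while iterating a dict yields its keys. A minimal sketch with a single hypothetical scraped entry:

```python
# One hypothetical entry, shaped like the question's `data` dict
data = {"huariqueje": ["https://64.media.tumblr.com/abc/s500/img.jpg"]}

source = "huariqueje"
# Iterating over a string yields individual characters -- this is why the
# original loop wrote one letter per line
print([c for c in source][:4])  # ['h', 'u', 'a', 'r']

# Iterating over a dict yields its keys, one blog name at a time
print([s for s in data])        # ['huariqueje']
```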