How to iterate URLs to the curl command?

Question

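I'm new to web scraping, and I'm using Python and bash scripts to get the information I need. I'm running WSL (Windows Subsystem for Linux), and for some reason the scripts run in git-bash.
I'm trying to create a bash script that downloads a web page's HTML and passes it to a Python script, which returns 2 txt files containing links to other web pages. The original script then iterates over the links in one of the txt files and downloads each page's HTML into a file named after a specific part of the link. But this last loop doesn't work.
If I write the links into the curl command by hand, it works. But if I try to run the script, it doesn't.
Here is the bash script: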

#!/bin/bash

curl http://mythicspoiler.com/sets.html |
cat >>mainpage.txt
python creatingAListOfAllExpansions.py #returns two txt files containing the expansion links and the commander decks' links
rm mainpage.txt

#get the pages from the links
cat commanderDeckLinks.txt |
while read a ; do
    curl $a |          ##THIS DOESN'T WORK
    cat >>$(echo $a | cut --delimiter="/" -f4).txt
done

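I've tried several different approaches and looked at similar questions, but for the life of me I can't figure this one out. I always get the same error: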

curl: (3) URL using bad/illegal format or missing URL

This is the content of commanderDeckLinks.txt:

http://mythicspoiler.com/cmd/index.html
http://mythicspoiler.com/c13/index.html
http://mythicspoiler.com/c14/index.html
http://mythicspoiler.com/c15/index.html
http://mythicspoiler.com/c16/index.html
http://mythicspoiler.com/c17/index.html
http://mythicspoiler.com/c18/index.html
http://mythicspoiler.com/c19/index.html
http://mythicspoiler.com/c20/index.html

This is the Python script:

#reads the main page of the website
with open("mainpage.txt") as datafile:
    data = datafile.read()

#gets the content after the first appearance of the introduced string
def getContent(data, x):
    j=0
    content=[]
    for i in range(len(data)):
        if(data[i].strip().startswith(x) and j == 0):
            j=i
        if(i>j and j != 0):
            content.append(data[i])
    return content

#gets the content of the website that is inside the body tag
mainNav = getContent(data.splitlines(), "<!--MAIN NAVIGATION-->")

#gets the content of the website that is inside of the outside center tags
content = getContent(mainNav, "<!--CONTENT-->")

#removes extra content from list
def restrictNoise(data, string):
    content=[]
    for i in data:
        if(i.startswith(string)):
            break
        content.append(i)
    return content

#return only lines which are links
def onlyLinks(data):
    content=[]
    for i in data:
        if(i.startswith("<a")):
            content.append(i)
    return content


#creates a list of the ending of the links to later fetch
def links(data):
    link=[]
    for i in data:
        link.append(i.split('"')[1])
    return link

#adds the rest of the link
def completLinks(data):
    completeLinks=[]
    for i in data:
        completeLinks.append("http://mythicspoiler.com/"+i)
    return completeLinks

#getting the commander decks
commanderDecksAndNoise = getContent(content,"<!---->")
commanderDeck = restrictNoise(commanderDecksAndNoise, "<!---->")
commanderDeckLinks = onlyLinks(commanderDeck)
commanderDecksCleanedLinks = links(commanderDeckLinks)

#creates a txt file and writes in it
def writeInTxt(nameOfFile, restrictions, usedList):
    file = open(nameOfFile,restrictions)
    for i in usedList:
        file.write(i+"\n")
    file.close()

#creating the commander deck text file
writeInTxt("commanderDeckLinks.txt", "w+", completLinks(commanderDecksCleanedLinks))

#getting the expansions
expansionsWithNoise = getContent(commanderDecksAndNoise, "<!---->")
expansionsWithoutNoise = restrictNoise(expansionsWithNoise, "</table>")
expansionsLinksWNoise = onlyLinks(expansionsWithoutNoise)
expansionsCleanedLinks = links(expansionsLinksWNoise)

#creating the expansions text file
writeInTxt("expansionLinks.txt", "w+", completLinks(expansionsCleanedLinks))

If more information is needed to solve my problem, please let me know. And thanks to everyone who tries to help.

Answer 1

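The problem here is that bash (Linux) and Windows use different line endings, LF and CRLF respectively (I'm not entirely sure, since this is all new to me). So when I create a file in Python with one item per line, the bash script can't read it properly: the file is written with CRLF endings, while bash only treats LF as the end of a line. Each URL therefore ends with a CR that shouldn't be there, which makes it unusable for curl.

I don't know how to fix this in bash code, but what I did was create a file (using Python) with the items separated by an underscore "_" and a final extra item, n, so that I don't have to deal with line endings at all. Then I just run a for loop in bash that iterates over each underscore-separated item except the last one. This solved the problem.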
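For readers who would rather keep one URL per line, here is a minimal sketch of handling the stray CR on the bash side instead. It assumes the diagnosis above is right, that is, each line of commanderDeckLinks.txt ends in CRLF. The helper names `strip_cr`, `output_name` and `fetch_all` are made up for this illustration and are not part of the original script. Unlike the original loop, it overwrites each output file (`>`) rather than appending (`>>`).

```shell
#!/bin/bash

# strip_cr: hypothetical helper; deletes carriage returns from its argument.
strip_cr() {
    printf '%s' "$1" | tr -d '\r'
}

# output_name: hypothetical helper; builds the file name from the 4th
# "/"-separated field of the URL (the "c13" in .../c13/index.html).
output_name() {
    printf '%s.txt\n' "$(printf '%s\n' "$1" | cut -d/ -f4)"
}

# fetch_all: download every URL listed in the file passed as $1.
fetch_all() {
    # "|| [ -n "$line" ]" also handles a last line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        url=$(strip_cr "$line")
        [ -n "$url" ] || continue      # skip blank lines
        curl -sS "$url" >"$(output_name "$url")"
    done <"$1"
}

# Run only when the link file is present in the current directory.
if [ -f commanderDeckLinks.txt ]; then
    fetch_all commanderDeckLinks.txt
fi
```

Quoting `"$url"` also keeps the shell from word-splitting the value before curl sees it. To check whether a file really contains carriage returns, `od -c commanderDeckLinks.txt` prints them as `\r`.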


python

bash

scripting

web-scraping

windows-subsystem-for-linux