Pandas parallel URL downloads with pd.read_html

Question

I know I can download a table from a web page as CSV like this:

import pandas as pd
import numpy as np
from io import StringIO    

URL = "http://www.something.com"
data = pd.read_html(URL)[0].to_csv(index=False, header=True)
file = pd.read_csv(StringIO(data), sep=',')

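Now I want to do the same for several URLs at once, the way you would open different tabs in a browser. In other words, given a set of URLs, I am looking for a way to parallelize this instead of looping over them or handling them one at a time. My idea was to put the list of URLs in a DataFrame and then create a new column holding the 'data' string for each URL.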

list_URL = ["http://www.something.com", "http://www.something2.com", 
            "http://www.something3.com"]
df = pd.DataFrame(list_URL, columns =['URL'])    
df['data'] = pd.read_html(df['URL'])[0].to_csv(index=False, header=True)

But it gives me the error: cannot parse from 'Series'

Is there a better syntax, or does this mean I cannot do this for multiple URLs at the same time?

Answer 1

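pd.read_html accepts a single URL (or an HTML string or file-like object), not a Series, which is why the call fails. You can apply it to each URL with Series.map instead: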

import pandas as pd

URLS = [
    "https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
    "https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]

df = pd.DataFrame(URLS, columns=["URL"])
df["data"] = df["URL"].map(
    lambda x: pd.read_html(x)[0].to_csv(index=False, header=True)
)

print(df)
# Output
                                           URL                                         data
0  https://en.wikipedia.org/wiki/Periodic_t...  0\r\nPart of a series on the\r\nPeriodic...
1  https://en.wikipedia.org/wiki/Planet#Pla...  0\r\n"The eight known planets of the Sol...
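
Note that the Series.map call above still fetches the URLs one after another. If the goal is to have the downloads overlap, one option is a thread pool from the standard library, since the work is mostly waiting on the network. The following is only a minimal sketch under that assumption: the helper names `first_table_as_csv` and `fetch_all` and the default of 8 workers are made up for illustration and are not part of pandas.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def first_table_as_csv(url):
    """Download `url` and return its first HTML table as a CSV string."""
    return pd.read_html(url)[0].to_csv(index=False, header=True)


def fetch_all(urls, fetch=first_table_as_csv, max_workers=8):
    """Apply `fetch` to every URL using a pool of worker threads.

    Results are returned in the same order as `urls`. If any call
    raises, the exception propagates when the results are collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

With the DataFrame from the answer, this could be used as df["data"] = fetch_all(df["URL"]). Whether it is actually faster depends on the sites and on how many requests they tolerate at once, so the worker count is something to tune rather than a recommended value.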

html

web-scraping

pandas