How do I put my scraped data into a data frame?

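I need help: I have been unable to put my scraped data into a data frame with 3 columns, i.e. the date, source and keywords extracted from each scraped website, for further text analysis. My code was borrowed from https://whosebug.com/users/12229253/foreverlearning and is given below: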

from newspaper import Article
import nltk
nltk.download('punkt')
urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']
results = {}
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    results[url] = article
for url in urls:
    print(url)
    article = results[url]
    print(article.authors)
    print(article.publish_date)
    print(article.keywords)

I tried it out, and here is how to turn it into a data frame. Assuming you want to use pandas, start with:

import nltk
import pandas as pd

from newspaper import Article


nltk.download('punkt')


urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']

# create an empty data frame with the 3 needed columns: date, source and keywords
saved_data = pd.DataFrame(columns=['Date', 'Source', 'KeyWords'])


def add_data_to_df(urls, saved_data):
    records = []
    for url in urls:  # process each url separately
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        # create a row with the data you need using the article's attributes
        records.append({'Date': article.publish_date, 'Source': url, 'KeyWords': article.keywords})

    # DataFrame.append was removed in pandas 2.0, so build the new rows in one go
    new_rows = pd.DataFrame(records, columns=saved_data.columns)
    if saved_data.empty:
        return new_rows
    # add the new rows below the ones that are already saved
    return pd.concat([saved_data, new_rows], ignore_index=True)

Now, when you run add_data_to_df(urls, saved_data), you should see a data frame whose contents are similar to what I got during testing:

   Date                       Source                                              KeyWords
0  2022-02-02 00:00:00        https://dailypost.ng/2022/02/02/securing-niger...  [nigeria, securing, oyo, state, senator, ...
1  2021-09-30 04:25:24+00:00  https://guardian.ng/news/declare-bandits-as-te...  [shutdown, terrorists, nigeria, guardian, decl...
2  2021-10-24 00:00:00        https://www.thisdaylive.com/index.php/2021/10/...  [terrorists, nigeria, declare, state, military...
3  2021-10-05 14:41:48+00:00  https://punchng.com/rep-wants-buhari-to-name-l...  [president, buhari, house, national, lawmaker,...
4  2021-07-31 00:30:47+00:00  https://punchng.com/national-assembly-plans-to...  [plans, congress, deal, nigeria, nigerian, equipment...

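(Sorry about the formatting: I am showing the output as plain text because I am not allowed to attach screenshots, but you will get nicely formatted pandas output.)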
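
Since the goal is further text analysis, here is a rough, network-free sketch of what you could do with such a data frame next. The records below are made-up stand-ins for scraped articles (the example.com URLs and keywords are not real results), and the names long_df and keyword_counts are only for illustration. It builds the frame from a list of row dicts, converts the mixed naive and timezone-aware dates to UTC (assuming the naive ones can be treated as UTC, which may not hold for every site), and produces one row per keyword:

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical records standing in for scraped articles (not real results).
records = [
    {'Date': datetime(2022, 2, 2),
     'Source': 'https://example.com/a',
     'KeyWords': ['nigeria', 'senator']},
    {'Date': datetime(2021, 9, 30, 4, 25, 24, tzinfo=timezone.utc),
     'Source': 'https://example.com/b',
     'KeyWords': ['senate', 'guardian', 'nigeria']},
]

# Build the frame once from the list of row dicts.
df = pd.DataFrame(records, columns=['Date', 'Source', 'KeyWords'])

# Assumption: naive dates are treated as UTC so the column gets a single dtype.
df['Date'] = pd.to_datetime(df['Date'], utc=True)

# One row per keyword, which is convenient for counting terms.
long_df = df.explode('KeyWords', ignore_index=True)
keyword_counts = long_df['KeyWords'].value_counts()
```

Here keyword_counts is a Series indexed by keyword, so keyword_counts['nigeria'] gives the number of articles that mention it.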

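Edit: added a function that saves the data frame to a csv file. Note that this is one of the shortest ways to do it, and it saves the file to the current working directory, i.e. the directory you are running the code from: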

# this function saves the given data frame to a csv file
def save_to_csv(saved_data):
    saved_data.to_csv('output.csv', index=False, sep=',')

# process the articles and create a csv
save_to_csv(add_data_to_df(urls, saved_data))
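
One thing to watch for when you load the csv again: a column that holds Python lists, like KeyWords, is written out as text, so it comes back as a string unless you convert it. A minimal sketch of one way to handle that, assuming the lists contain only plain strings; the single row is made up, and an in-memory buffer stands in for output.csv so nothing is written to disk:

```python
import ast
import io

import pandas as pd

# Hypothetical row standing in for a scraped article (not a real result).
df = pd.DataFrame({
    'Date': ['2022-02-02'],
    'Source': ['https://example.com/a'],
    'KeyWords': [['nigeria', 'senator']],
})

# Write to an in-memory buffer instead of output.csv for this sketch.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Without the converter, KeyWords would be read back as the string "['nigeria', 'senator']".
restored = pd.read_csv(buffer, converters={'KeyWords': ast.literal_eval})
```

With a real file you would pass 'output.csv' to pd.read_csv instead of the buffer.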