How do I put my scraped data into a data frame
I need help: I cannot manage to put my scraped data into a data frame with 3 columns, i.e. the date, source and keywords extracted from each scraped website, for further text analysis. My code is adapted from https://whosebug.com/users/12229253/foreverlearning and is given below:
from newspaper import Article
import nltk

nltk.download('punkt')

urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']

results = {}
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    results[url] = article

for url in urls:
    print(url)
    article = results[url]
    print(article.authors)
    print(article.publish_date)
    print(article.keywords)
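For reference, once a `results` dictionary like the one above is populated, it can be turned into a data frame in a single step. A minimal sketch; the `DummyArticle` class below is a stand-in for the real `Article` objects purely so the example runs offline:

```python
import pandas as pd

# stand-in for newspaper.Article, used only so this sketch runs without network access
class DummyArticle:
    def __init__(self, publish_date, keywords):
        self.publish_date = publish_date
        self.keywords = keywords

results = {
    'https://example.com/a': DummyArticle('2022-02-02', ['nigeria', 'security']),
    'https://example.com/b': DummyArticle('2021-09-30', ['senate', 'terrorists']),
}

# one row per url, pulling the same attributes that are printed above
df = pd.DataFrame(
    [{'Date': a.publish_date, 'Source': url, 'KeyWords': a.keywords}
     for url, a in results.items()]
)
print(df)
```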
I have tried it, and here is how to turn it into a data frame, assuming you want to use pandas:
import nltk
import pandas as pd
from newspaper import Article

nltk.download('punkt')

urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']

# create a data frame with the needed columns
saved_data = pd.DataFrame(columns=['Date', 'Source', 'KeyWords'])

# put the data into a data frame that has 3 columns, i.e. date, source and keywords
def add_data_to_df(urls, saved_data):
    for url in urls:  # process each url separately
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        # create a row with the data you need, using the article attributes
        record = {'Date': article.publish_date, 'Source': url, 'KeyWords': article.keywords}
        # append info about each url as a new row
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
        saved_data = pd.concat([saved_data, pd.DataFrame([record])], ignore_index=True)
    return saved_data
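Concatenating inside the loop copies the frame on every iteration. An equally short alternative sketch collects the rows in a plain list and builds the frame once at the end; the dummy rows below stand in for the scraped article attributes:

```python
import pandas as pd

def records_to_df(records):
    # build the frame once from a list of row dicts
    return pd.DataFrame(records, columns=['Date', 'Source', 'KeyWords'])

# dummy rows standing in for article.publish_date / url / article.keywords
rows = [
    {'Date': '2022-02-02', 'Source': 'https://example.com/a', 'KeyWords': ['nigeria', 'security']},
    {'Date': '2021-09-30', 'Source': 'https://example.com/b', 'KeyWords': ['senate', 'terrorists']},
]
saved = records_to_df(rows)
print(saved.shape)  # (2, 3)
```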
Now, when you run the function
add_data_to_df(urls, saved_data)
you should see a data frame whose contents are similar to what I got during testing:
                        Date                                             Source                                           KeyWords
0        2022-02-02 00:00:00  https://dailypost.ng/2022/02/02/securing-niger...  [nigeria, securing, oyo, state, senator, secur...
1  2021-09-30 04:25:24+00:00  https://guardian.ng/news/declare-bandits-as-te...  [shutdown, terrorists, nigeria, guardian, decl...
2        2021-10-24 00:00:00  https://www.thisdaylive.com/index.php/2021/10/...  [terrorists, nigeria, declare, state, military...
3  2021-10-05 14:41:48+00:00  https://punchng.com/rep-wants-buhari-to-name-l...  [president, buhari, house, national, lawmaker,...
4  2021-07-31 00:30:47+00:00  https://punchng.com/national-assembly-plans-to...  [plans, congress, deal, nigeria, nigerian, wea...
(Apologies for the formatting: I am showing the output as plain text because I am not allowed to attach screenshots, but you will get nicely formatted pandas output.)
Edit: adding a function that saves the data frame to a csv file. Note that this is one of the shortest ways to do it, and it saves the file to the current working directory, i.e. wherever you are executing the code:
# this function saves the given data frame to csv
def save_to_csv(saved_data):
    saved_data.to_csv('output.csv', index=False, sep=',')

# process the articles and create a csv
save_to_csv(add_data_to_df(urls, saved_data))
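One caveat for the further text analysis mentioned in the question: `to_csv` stores each keyword list as its string representation, so when you read the file back you need to parse the lists again. A minimal sketch, using an in-memory string in place of the actual output.csv:

```python
import ast
import io

import pandas as pd

# stand-in for output.csv as written by to_csv; the list column round-trips as its repr string
csv_text = 'Date,Source,KeyWords\n2022-02-02,https://example.com/a,"[\'nigeria\', \'securing\']"\n'

df = pd.read_csv(io.StringIO(csv_text))
# turn the stringified lists back into real Python lists
df['KeyWords'] = df['KeyWords'].apply(ast.literal_eval)
print(df.loc[0, 'KeyWords'])  # ['nigeria', 'securing']
```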