Saving as a DataFrame after looping in Tweepy: works without a loop, but saves as a list once the loop is added
Problem: pulling multiple users' timelines from Twitter and saving them as a DataFrame.
Here is a solution that works perfectly, but only for one user at a time:
import tweepy
import pandas as pd
import numpy as np
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
# Creation of the actual interface, using authentication
api = tweepy.API(auth, wait_on_rate_limit=True)
# Running on only one handle returns a DataFrame
tweets = api.user_timeline(screen_name='pycon', count=10)
print("Number of tweets extracted: {}.\n".format(len(tweets)))
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns= ['Tweets'])
data['len'] = np.array([len(tweet.text) for tweet in tweets])
data['ID'] = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes'] = np.array([tweet.favorite_count for tweet in tweets])
data['RTs'] = np.array([tweet.retweet_count for tweet in tweets])
print(data)
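The above works well and returns the 10 most recent tweets from the user pycon in a DataFrame. The next step is to add multiple handles to query. Here is the code that does the same thing with multiple handles: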
# Added list of handles
handles = ['pycon', 'gvanrossum']
# Added empty DF to fill
test = []
# Added loop
for handle in handles:
    tweets = api.user_timeline(screen_name=handle, count=10)
    print("Number of tweets extracted: {}.\n".format(len(tweets)))
    data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
    data['len'] = np.array([len(tweet.text) for tweet in tweets])
    data['ID'] = np.array([tweet.id for tweet in tweets])
    data['Date'] = np.array([tweet.created_at for tweet in tweets])
    data['Source'] = np.array([tweet.source for tweet in tweets])
    data['Likes'] = np.array([tweet.favorite_count for tweet in tweets])
    data['RTs'] = np.array([tweet.retweet_count for tweet in tweets])
    test.append(data)
print(test)
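Running this gives two outputs. data will be a DataFrame containing the 10 most recent tweets from gvanrossum (which makes sense, since it is the second handle in the list). The second output is test, which is a list. Interestingly, test holds all 20 tweets from pycon and gvanrossum, but as a list. The loop is working, but it is not saving the result as a DataFrame.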
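The behaviour described above can be reproduced without the Twitter API. This is only a minimal sketch: fake_timeline is a hypothetical stand-in for api.user_timeline, and the tweet texts are made up. It shows that appending DataFrames to a Python list leaves the container a list with one DataFrame per handle, while data keeps only the rows from the last iteration.

```python
import pandas as pd

def fake_timeline(screen_name, count=10):
    # Hypothetical stand-in for api.user_timeline; returns made-up tweet texts.
    return ["{} tweet {}".format(screen_name, i) for i in range(count)]

handles = ['pycon', 'gvanrossum']
test = []  # a plain Python list, not a DataFrame

for handle in handles:
    data = pd.DataFrame(data=fake_timeline(handle), columns=['Tweets'])
    test.append(data)  # list.append stores the whole DataFrame as one element

print(type(test))     # the container is still a list
print(len(test))      # one element per handle
print(type(test[0]))  # each element is itself a DataFrame
print(len(data))      # data holds only the last handle's rows
```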
Question: how do I save the results of looping over multiple handles as a single DataFrame?
If you want to store the data in a single DataFrame:
merged = pd.DataFrame()
# Added loop
for handle in handles:
    tweets = api.user_timeline(screen_name=handle, count=10)
    print("Number of tweets extracted: {}.\n".format(len(tweets)))
    data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
    data['len'] = np.array([len(tweet.text) for tweet in tweets])
    data['ID'] = np.array([tweet.id for tweet in tweets])
    data['Date'] = np.array([tweet.created_at for tweet in tweets])
    data['Source'] = np.array([tweet.source for tweet in tweets])
    data['Likes'] = np.array([tweet.favorite_count for tweet in tweets])
    data['RTs'] = np.array([tweet.retweet_count for tweet in tweets])
    # Create a new column, Handle, to identify the source of each tweet. Comment it out if you do not need it.
    data['Handle'] = handle
    # Merge the data frames
    merged = pd.concat([merged, data])
print(merged)
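A possible variation, not part of the answer above, is to collect one DataFrame per handle in a list and call pd.concat once at the end, which avoids re-copying the growing frame on every iteration. The sketch below makes several assumptions: fetch_tweets and timeline_frame are hypothetical helper names, the SimpleNamespace objects stand in for the status objects that api.user_timeline returns, only a few of the attributes are mimicked, and all values are made up.

```python
from types import SimpleNamespace

import pandas as pd

def fetch_tweets(handle, count=10):
    # Hypothetical stand-in for api.user_timeline(screen_name=handle, count=count).
    return [
        SimpleNamespace(text="{} tweet {}".format(handle, i), id=i,
                        favorite_count=i, retweet_count=2 * i)
        for i in range(count)
    ]

def timeline_frame(handle, count=10):
    # Build one DataFrame for a single handle, one row per tweet.
    tweets = fetch_tweets(handle, count)
    return pd.DataFrame({
        'Tweets': [t.text for t in tweets],
        'len': [len(t.text) for t in tweets],
        'ID': [t.id for t in tweets],
        'Likes': [t.favorite_count for t in tweets],
        'RTs': [t.retweet_count for t in tweets],
        'Handle': handle,  # scalar is broadcast to every row
    })

handles = ['pycon', 'gvanrossum']
frames = [timeline_frame(h) for h in handles]

# Concatenate once; ignore_index=True renumbers the rows 0..n-1.
merged = pd.concat(frames, ignore_index=True)
print(merged.shape)
```

With the real API, fetch_tweets would be replaced by the api.user_timeline call from the question; the rest of the structure stays the same.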