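Dataframe not showing twitter sources from Android
I'm trying to do some analysis on a Twitter account, but I'm having trouble getting sources from Android to show up. What I did was merge two JSON files. I think I merged them correctly, but in case I got it wrong, here is the code I used.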
old_tweets = load_tweets("real_tweets/real_old_tweets.json")
print(len(old_tweets))
# `tweets` is the list loaded from the other file; append only tweets not already in it
for aLis1 in old_tweets:
    if aLis1 not in tweets:
        tweets.append(aLis1)
load_tweets is a custom function that just opens and loads the JSON file at the given path:
import json

def load_tweets(path):
    with open(path, "rb") as f:
        return json.load(f)
After merging the two JSON tweet files, I run this code to create the dataframe and clean it so that it shows only the information I want.
df_tweets1 = pd.DataFrame(tweets)
df_tweets2 = df_tweets1[['id','created_at','source','full_text','retweet_count']]
df_tweets = df_tweets2.drop_duplicates('id', keep=False)
df_tweets.set_index('id', inplace=True)
df_tweets = df_tweets.rename(columns={"created_at": "time", "full_text": "text"})
df_tweets["time"] = pd.to_datetime(df_tweets["time"])
The problem is that when I call df_tweets["source"].unique(), I don't see any tweets from Android.
array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
'<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
'<a href="https://studio.twitter.com" rel="nofollow">Twitter Media Studio</a>',
'<a href="https://studio.twitter.com" rel="nofollow">Media Studio</a>',
'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'],
dtype=object)
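Did I do something wrong when merging the two sets of tweet data, or when creating the dataframe?
Edit: here is a sample of the contents of real_old_tweets.json to give you an idea of the format. I'm only going to post one tweet, because a single tweet contains a lot of information.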
[{'created_at': 'Tue Oct 16 16:22:11 +0000 2018',
'id': 1052233253040640001,
'id_str': '1052233253040640001',
'full_text': 'REGISTER TO https://url/0pWiwCHGbh! #MAGA https://url/ACTMe53TZU',
'truncated': False,
'display_text_range': [0, 44],
'entities': {'hashtags': [{'text': 'MAGA', 'indices': [37, 42]}],
'symbols': [],
'user_mentions': [],
'urls': [{'url': 'url/0pWiwCHGbh',
'expanded_url': 'linkVote.GOP',
'display_url': 'Vote.GOP',
'indices': [12, 35]},
{'url': 'url/ACTMe53TZU',
'expanded_url': 'linktwitter.com/erictrump/status/1052174007708147714',
'display_url': 'twitter.com/erictrump/stat…',
'indices': [45, 68]}]},
'source': '<a href="linktwitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'in_reply_to_screen_name': None,
'user': {'id': 25073877,
'id_str': '25073877',
'name': 'Donald J. Trump',
'screen_name': 'realDonaldTrump',
'location': 'Washington, DC',
'description': '45th President of the United States of America',
'url': 'url/OMxB0x7xC5',
'entities': {'url': {'urls': [{'url': 'url/OMxB0x7xC5',
'expanded_url': 'linkwww.Instagram.com/realDonaldTrump',
'display_url': 'Instagram.com/realDonaldTrump',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 55165024,
'friends_count': 47,
'listed_count': 94709,
'created_at': 'Wed Mar 18 13:46:38 +0000 2009',
'favourites_count': 25,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 39296,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': True,
'profile_background_color': '6D5C18',
'profile_background_image_url': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': True,
'profile_image_url': 'linkpbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
'profile_image_url_https': 'linkpbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
'profile_banner_url': 'linkpbs.twimg.com/profile_banners/25073877/1539493274',
'profile_link_color': '1B95E0',
'profile_sidebar_border_color': 'BDDCAD',
'profile_sidebar_fill_color': 'C5CEC0',
'profile_text_color': '333333',
'profile_use_background_image': True,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'regular'},
'geo': None,
'coordinates': None,
'place': None,
'contributors': None,
'is_quote_status': True,
'quoted_status_id': 1052174007708147714,
'quoted_status_id_str': '1052174007708147714',
'quoted_status_permalink': {'url': 'url/ACTMe53TZU',
'expanded': 'linktwitter.com/erictrump/status/1052174007708147714',
'display': 'twitter.com/erictrump/stat…'},
'quoted_status': {'created_at': 'Tue Oct 16 12:26:46 +0000 2018',
'id': 1052174007708147714,
'id_str': '1052174007708147714',
'full_text': 'Friends: Quick reminder that today is that last day to register to vote in Oregon, Kansas, Louisiana, West Virginia, New Jersey and Maryland. It is very quick and easy - simply go to url/GE5BO5ONN1! Let’s #MakeAmericaGreatAgain ',
'truncated': False,
'display_text_range': [0, 243],
'entities': {'hashtags': [{'text': 'MakeAmericaGreatAgain',
'indices': [214, 236]}],
'symbols': [],
'user_mentions': [],
'urls': [{'url': 'url/GE5BO5ONN1',
'expanded_url': 'linkwww.Vote.GOP',
'display_url': 'Vote.GOP',
'indices': [183, 206]}]},
'source': '<a href="linktwitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'in_reply_to_screen_name': None,
'user': {'id': 39349894,
'id_str': '39349894',
'name': 'Eric Trump',
'screen_name': 'EricTrump',
'location': '',
'description': "Executive Vice President of The @Trump Organization. Husband to @LaraLeaTrump. Large advocate of @StJude Children's Research Hospital. #MakeAmericaGreatAgain",
'url': 'url/uwwNiWyamR',
'entities': {'url': {'urls': [{'url': 'url/uwwNiWyamR',
'expanded_url': 'linkwww.Trump.com',
'display_url': 'Trump.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 2191617,
'friends_count': 715,
'listed_count': 5736,
'created_at': 'Mon May 11 21:42:30 +0000 2009',
'favourites_count': 8638,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 5601,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': False,
'profile_background_color': '000000',
'profile_background_image_url': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_link': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': True,
'profile_image_url': 'linkpbs.twimg.com/profile_images/974045997268529152/R0CuVYHM_normal.jpg',
'profile_image_url_link': 'linkpbs.twimg.com/profile_images/974045997268529152/R0CuVYHM_normal.jpg',
'profile_banner_url': 'linkpbs.twimg.com/profile_banners/39349894/1516709628',
'profile_link_color': '116AB8',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '616161',
'profile_text_color': '000000',
'profile_use_background_image': True,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'none'},
'geo': None,
'coordinates': None,
'place': None,
'contributors': None,
'is_quote_status': False,
'retweet_count': 1945,
'favorite_count': 3828,
'favorited': False,
'retweeted': False,
'possibly_sensitive': False,
'lang': 'en'},
'retweet_count': 5415,
'favorite_count': 16565,
'favorited': False,
'retweeted': False,
'possibly_sensitive': False,
'lang': 'en'},
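I assume you do have "Android" sources, but it isn't clear to me what your data looks like or how "id" relates to the source. That said, there is a mistake in how you prepare your data: you are dropping every row whose id is duplicated, including its first occurrence.
For example: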
>>> import pandas as pd
>>> df = pd.DataFrame(data={'col1':[1,2,2],'col2':[3,4,3],'col3':[1,4,1]})
>>> df
   col1  col2  col3
0     1     3     1
1     2     4     4
2     2     3     1
>>> df.drop_duplicates('col1',keep=False)
   col1  col2  col3
0     1     3     1
As the code above shows, keep=False drops every row that has a duplicate, so no copy survives. Compare with keep='first':
>>> df.drop_duplicates('col1',keep='first')
   col1  col2  col3
0     1     3     1
1     2     4     4
Use keep='first' or keep='last' instead and see whether that helps. Also, it would help to know more about the data so that we can figure out where things go wrong.
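To check whether this is what removes the Android rows, it may help to look at the overlapping ids before dropping anything. The sketch below uses a made-up three-row list as a stand-in for the merged `tweets` (the ids and source strings are hypothetical, not taken from the real files). One possible reason for overlap despite the `not in` check during merging is that the same tweet saved at different times can differ in fields such as retweet_count, so the two dicts do not compare equal.

```python
import pandas as pd

# Hypothetical stand-in for the merged tweet list: id 2 appears in both files.
tweets = [
    {"id": 1, "source": "Twitter for iPhone"},
    {"id": 2, "source": "Twitter for Android"},
    {"id": 2, "source": "Twitter for Android"},
]
df = pd.DataFrame(tweets)

# Rows whose id occurs more than once; keep=False marks every copy.
overlap = df[df.duplicated("id", keep=False)]

# keep=False discards all copies of a duplicated id ...
dropped_all = df.drop_duplicates("id", keep=False)
# ... while keep="first" retains one row per id.
kept_first = df.drop_duplicates("id", keep="first")

print(len(overlap))                    # 2
print(dropped_all["source"].tolist())  # ['Twitter for iPhone']
print(kept_first["source"].tolist())   # ['Twitter for iPhone', 'Twitter for Android']
```

If `overlap` on the real data contains the Android tweets, then keep=False is what hides them.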
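Edit
After a while, I saved your JSON objects to a file called "me.json", in the format:
[{},{}]
The source of the first object is iPhone and the source of the second is Android. I loaded the data using your code: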
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import json
>>> with open('me.json','rb') as file:
...     json_list = json.load(file)
...
>>> len(json_list)
2
>>> df = pd.DataFrame(json_list)
>>> df1 = df[['id','source']]
>>> df1['source'].value_counts()
<a href="linktwitter.com/download/Android" rel="nofollow">Twitter for Android</a> 1
<a href="linktwitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> 1
Name: source, dtype: int64
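In the output above you can see that "Android" shows up. My conclusion is that in your data there may simply be no "Android" value in the df['source'] column.
There are two "source" keys inside each JSON object, and one of them is nested inside quoted_status; look carefully. You may have seen "Android" in that nested key.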
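A quick way to tell the two "source" keys apart is to count matches at each level separately. This is only a sketch on hypothetical, heavily trimmed tweets (the ids and the shortened source strings are made up); it assumes that tweets which quote nothing simply lack the quoted_status key, as the second object does here.

```python
import pandas as pd

# Hypothetical trimmed tweets: the first was posted from an iPhone but quotes
# a tweet posted from Android; the second quotes nothing.
tweets = [
    {
        "id": 1,
        "source": "Twitter for iPhone",
        "quoted_status": {"id": 2, "source": "Twitter for Android"},
    },
    {"id": 3, "source": "Twitter for iPhone"},
]
df = pd.DataFrame(tweets)

# Top-level source: the client the account itself tweeted from.
top_level = df["source"].str.contains("Android", case=False)

# Nested source: the client used by the author of the quoted tweet.
# Rows without a quoted tweet hold NaN, hence the isinstance check.
quoted = df["quoted_status"].apply(
    lambda q: isinstance(q, dict) and "android" in q.get("source", "").lower()
)

print(int(top_level.sum()))  # 0
print(int(quoted.sum()))     # 1
```

Only the top-level count says anything about the device the account tweeted from; a non-zero nested count alone would explain seeing "Android" in the raw JSON but not in df['source'].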