Loop text data based on column value in data frame in python

Question

I have a dataset named data_set_tweets.csv, shown below:

created_at,tweet,retweet_count
7/29/2021 2:40,Great Sunny day for Cricket at London,3
7/29/2021 10:40,Great Score put on by England batting,0
7/29/2021 11:50,England won the match,1

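What I want is to get the following output into a data frame.
That is, I want to repeat the text in the tweet column, with the same created_at value, according to the retweet_count value of that particular tweet: each tweet should appear once, plus once more per retweet.
Below is the expected output for my dataset: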

created_at,tweet
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 10:40,Great Score put on by England batting
7/29/2021 11:50,England won the match
7/29/2021 11:50,England won the match

Here is how I started my approach:

import pandas as pd

def iterateTweets():
    tweets = pd.read_csv(r'data_set_tweets.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet', 'retweet_count'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    df['retweet_count'] = df['retweet_count'].apply(lambda x: str(x))

    # print(df)
    return df

if __name__ == '__main__':

    print(iterateTweets())

I am a beginner with data frames and Python. Can someone help me?

Answer 1

Use Index.repeat with DataFrame.loc to duplicate the rows; DataFrame.pop both returns the column for use and removes it from the frame:

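import pandas as pd

df = pd.read_csv(r'data_set_tweets.csv')

df['created_at'] = pd.to_datetime(df['created_at'])
# repeat each row label retweet_count + 1 times (the original tweet plus one copy per retweet)
df = df.loc[df.index.repeat(df.pop('retweet_count') + 1)].reset_index(drop=True)
print(df)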
           created_at                                  tweet
0 2021-07-29 02:40:00  Great Sunny day for Cricket at London
1 2021-07-29 02:40:00  Great Sunny day for Cricket at London
2 2021-07-29 02:40:00  Great Sunny day for Cricket at London
3 2021-07-29 02:40:00  Great Sunny day for Cricket at London
4 2021-07-29 10:40:00  Great Score put on by England batting
5 2021-07-29 11:50:00                  England won the match
6 2021-07-29 11:50:00                  England won the match
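
To try the approach above without the CSV file on disk, here is a minimal sketch that reads the sample rows from the question out of an in-memory string. The io.StringIO buffer and the helper name repeat_by_retweets are assumptions made for illustration and are not part of the original answer. The + 1 keeps the original tweet and adds one copy per retweet, which matches the expected output in the question.

```python
import io

import pandas as pd

# Sample rows from the question, held in memory so the sketch runs without the CSV file.
CSV_TEXT = """created_at,tweet,retweet_count
7/29/2021 2:40,Great Sunny day for Cricket at London,3
7/29/2021 10:40,Great Score put on by England batting,0
7/29/2021 11:50,England won the match,1
"""


def repeat_by_retweets(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: repeat each row retweet_count + 1 times."""
    out = df.copy()  # leave the caller's frame untouched
    counts = out.pop("retweet_count") + 1  # pop removes the column and returns it
    # Index.repeat repeats each row label; loc then selects the rows in that order.
    return out.loc[out.index.repeat(counts)].reset_index(drop=True)


df = pd.read_csv(io.StringIO(CSV_TEXT))
df["created_at"] = pd.to_datetime(df["created_at"])
result = repeat_by_retweets(df)
print(result)
```

Because the helper works on a copy, the original df still has its retweet_count column afterwards, which can be handy when the counts are needed again later.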

Answer 2

Or use:

df = df.apply(lambda x: x.repeat(df['retweet_count'] + 1)).reset_index(drop=True)

If you want to drop the retweet_count column:

df = df.apply(lambda x: x.repeat(df['retweet_count'] + 1)).reset_index(drop=True).drop('retweet_count', axis=1)

Or:

col = df.pop('retweet_count') + 1
df = df.apply(lambda x: x.repeat(col)).reset_index(drop=True)

Output of df:

           created_at                                  tweet
0 2021-07-29 02:40:00  Great Sunny day for Cricket at London
1 2021-07-29 02:40:00  Great Sunny day for Cricket at London
2 2021-07-29 02:40:00  Great Sunny day for Cricket at London
3 2021-07-29 02:40:00  Great Sunny day for Cricket at London
4 2021-07-29 10:40:00  Great Score put on by England batting
5 2021-07-29 11:50:00                  England won the match
6 2021-07-29 11:50:00                  England won the match

Or use loc and enumerate:

df.loc[sum([[i] * (v + 1) for i, v in enumerate(df['retweet_count'])], [])].reset_index(drop=True)
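
Note that the enumerate version passes row positions to loc, which selects by label, so it relies on the frame having the default 0..n-1 index. As one possible alternative (not from the original answer), the sketch below builds the positions with numpy.repeat and selects them with iloc. The small example frame and its index labels 10, 20, 30 are made up to show that the labels do not matter.

```python
import numpy as np
import pandas as pd

# Made-up frame with a non-default index, to show that positions are used, not labels.
df = pd.DataFrame(
    {
        "tweet": ["a", "b", "c"],
        "retweet_count": [3, 0, 1],
    },
    index=[10, 20, 30],
)

# Row positions, each repeated retweet_count + 1 times, built in one vectorized call.
positions = np.repeat(np.arange(len(df)), df["retweet_count"].to_numpy() + 1)

# iloc selects by position, so the original index labels are irrelevant here.
expanded = df.iloc[positions].drop(columns="retweet_count").reset_index(drop=True)
print(expanded)
```

This also avoids concatenating Python lists with sum(..., []), which gets slow as the number of rows grows.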
