pandas: expand strings in a column to substrings and add them to the rows
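I have a DataFrame with many columns, and each cell holds several comma-separated strings. I want to split each cell into its substrings and add them as rows of a new DataFrame, together with an extra column that records which original column each substring came from, as in the example below. I know how to do this for a single column of the original DataFrame, but I want to do it for all columns at once.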
import pandas as pd
data = {'First': ['First string, second string, third string,...', 'NaN', 'First string, second string, third string,...'],
        'Second': ['NaN', 'First string, second string, third string,...', 'First string, second string, third string,...'],
        'third': ['First string, second string, third string,...', 'First string, second string, third string,...', 'NaN'],
        'forth': ['First string, second string, third string,...', 'NaN', 'First string, second string, third string,...'],
        # ....
}
df = pd.DataFrame(data, columns=['First', 'Second', ...])
For one column:
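# the column is named 'First' in the data above, so df['first'] would raise a KeyError
lst = df['First'].dropna().tolist()
# flatten: split every cell on commas and collect all substrings in one list
my_list = [x for xs in lst for x in xs.split(',')]
df_new = pd.DataFrame(my_list, columns=['text'])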
However, I am not sure how to add a second column of the same length as 'my_list' that holds the name of the source column, 'first' in this case.
Desired output for one column:
df_new:
text name
0 First string first
1 second string first
2 third string first
... ...
What I want is for all columns of df to be added as rows of df_new, with the 'name' column holding, for each string, the name of the column it came from.
Hope this helps!
#create the columns as rows
df_new = pd.DataFrame({'text':df.T.index})
df_new['text'] = df_new['text'].str.strip("'")
#create a new column for group
df_new['group']=1
#cumsum the column names
df_new['name'] = df_new.groupby('group')['text'].apply(lambda x: (x + ' ').cumsum().str.strip() + ",")
del df_new['group']
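For reference, the desired 'text'/'name' layout from the question can also be built by reshaping to long form and then splitting. The following is only a minimal sketch, not the code from the answer above: the sample data and the helper name explode_columns are made up for illustration, and it assumes that cells are comma-separated and that a missing cell is either a real NaN or the literal string 'NaN'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the question): one cell has a trailing
# comma, one holds the literal string 'NaN', and one is a real missing value.
df = pd.DataFrame({
    'First': ['a, b, c,', 'NaN', 'd, e'],
    'Second': [np.nan, 'f, g', 'h'],
})

def explode_columns(frame):
    """Return one row per comma-separated substring, tagged with its source column."""
    # Long form: one row per original cell, with the column name in 'name'.
    long = frame.melt(var_name='name', value_name='text')
    # Treat the literal string 'NaN' as missing, then drop all missing cells.
    long = long.assign(text=long['text'].mask(long['text'] == 'NaN'))
    long = long.dropna(subset=['text'])
    # Split each cell into a list and give every list element its own row.
    long = long.assign(text=long['text'].str.split(',')).explode('text')
    # Remove surrounding whitespace and the empty pieces left by trailing commas.
    long = long.assign(text=long['text'].str.strip())
    long = long[long['text'] != '']
    return long[['text', 'name']].reset_index(drop=True)

df_new = explode_columns(df)
print(df_new)
```

Because melt stacks the columns one after another, the rows of df_new come out grouped by source column, so the 'name' column lines up with each substring without any manual bookkeeping.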