将 for 循环输出转换为数据帧 python
convert for-loop output into dataframe python
我正在尝试将此代码的输出转换为数据帧,但不知道如何操作。将输出列(字符串和频率)转换为数据帧的好方法是什么?
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor += str(i).lower().split(',')
# Calculate frequency
for string in all_investor:
string = string.strip()
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
.....
使用str.split
和value_counts
:
>>> df['Select Investors'].str.split(', ').explode().str.lower().value_counts()
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
.406 ventures 1
transamerica ventures 1
crane venture partners 1
geekdom fund 1
endiya partners 1
Name: Select Investors, Length: 1187, dtype: int64
使用Series.str.split
, reshape by DataFrame.stack
, convert to lowercase by Series.str.lower
and last count by Series.value_counts
:
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
almaz capital partners 1
commerzventures 1
sunley house capital management 1
lockheed martin ventures 1
14w 1
Length: 1187, dtype: int64
对于 DataFrame 使用:
df = s.rename_axis('values').reset_index(name='count')
print (df)
values count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1182 almaz capital partners 1
1183 commerzventures 1
1184 sunley house capital management 1
1185 lockheed martin ventures 1
1186 14w 1
[1187 rows x 2 columns]
如果想修改你的解决方案:
from collections import Counter
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
column = "Select Investors"
all_investor = [j.strip() for i in df[column] for j in str(i).lower().split(',')]
df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
.sort_values(by='count',ascending=False, ignore_index=True))
print (df1)
vals count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1180 futurex capital 1
1181 quiet capital 1
1182 white star capital 1
1183 almaz capital partners 1
1184 endiya partners 1
[1185 rows x 2 columns]
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor += [j.strip() for j in str(i).lower().split(',') if j.strip()]
all_investor = set(all_investor)
all_data = []
# Calculate frequency
for string in all_investor:
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
all_data.append([string, frequency])
new_df = pd.DataFrame(all_data, columns=["Investor", "Frequency"])
new_df = new_df.sort_values(by='Frequency',ascending=False).reset_index()
new_df
Output:
index Investor Frequency
0 854 sequoia capital 115
1 626 ing 90
2 765 accel 62
3 1025 tiger global 50
4 964 softbank 50
... ... ... ...
1180 486 quiet capital 1
1181 487 york capital management 1
1182 489 ewtp capital 1
1183 490 kleiner perkins caulfield & byers 1
1184 1184 google capital 1
我正在尝试将此代码的输出转换为数据帧,但不知道如何操作。将输出列(字符串和频率)转换为数据帧的好方法是什么?
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor += str(i).lower().split(',')
# Calculate frequency
for string in all_investor:
string = string.strip()
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
.....
使用str.split
和value_counts
:
>>> df['Select Investors'].str.split(', ').explode().str.lower().value_counts()
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
.406 ventures 1
transamerica ventures 1
crane venture partners 1
geekdom fund 1
endiya partners 1
Name: Select Investors, Length: 1187, dtype: int64
使用Series.str.split
, reshape by DataFrame.stack
, convert to lowercase by Series.str.lower
and last count by Series.value_counts
:
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
almaz capital partners 1
commerzventures 1
sunley house capital management 1
lockheed martin ventures 1
14w 1
Length: 1187, dtype: int64
对于 DataFrame 使用:
df = s.rename_axis('values').reset_index(name='count')
print (df)
values count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1182 almaz capital partners 1
1183 commerzventures 1
1184 sunley house capital management 1
1185 lockheed martin ventures 1
1186 14w 1
[1187 rows x 2 columns]
如果想修改你的解决方案:
from collections import Counter
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
column = "Select Investors"
all_investor = [j.strip() for i in df[column] for j in str(i).lower().split(',')]
df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
.sort_values(by='count',ascending=False, ignore_index=True))
print (df1)
vals count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1180 futurex capital 1
1181 quiet capital 1
1182 white star capital 1
1183 almaz capital partners 1
1184 endiya partners 1
[1185 rows x 2 columns]
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor += [j.strip() for j in str(i).lower().split(',') if j.strip()]
all_investor = set(all_investor)
all_data = []
# Calculate frequency
for string in all_investor:
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
all_data.append([string, frequency])
new_df = pd.DataFrame(all_data, columns=["Investor", "Frequency"])
new_df = new_df.sort_values(by='Frequency',ascending=False).reset_index()
new_df
Output:
index Investor Frequency
0 854 sequoia capital 115
1 626 ing 90
2 765 accel 62
3 1025 tiger global 50
4 964 softbank 50
... ... ... ...
1180 486 quiet capital 1
1181 487 york capital management 1
1182 489 ewtp capital 1
1183 490 kleiner perkins caulfield & byers 1
1184 1184 google capital 1