Pandas: why does string matching stop at the first letter?
I have a Series of companies in which each stock symbol is concatenated with the company name:
stock
0 AAPLApple
1 AMZNAmazon.com
2 FBFacebook
3 NFLXNetflix
4 INTCIntel
5 TSLATesla
6 MUMicron Technology
7 MSFTMicrosoft
8 NVDANVIDIA
9 CSCOCisco Systems
11 LULULululemon Athletica
12 EBAYeBay
13 AVGOBroadcom
14 QCOMQUALCOMM
15 GILDGilead Sciences
16 WDCWestern Digital
17 GOOGLAlphabet
18 BIIBBiogen
19 GOOGAlphabet
20 URBNUrban Outfitters
21 NTAPNetApp
22 AABAAltaba
23 SBUXStarbucks
24 CELGCelgene
25 SPLKSplunk
26 COSTCostco Wholesale
27 AMDAdvanced Micro Devices
28 PYPLPaypal
29 REGNRegeneron Pharmaceuticals
30 AMATApplied Materials
...
Name: stock, Length: 243, dtype: object
I also have a list of stock symbols to match against:
['ETSY',
'COUP',
'TSLA',
'CRWD',
'ROKU',
'A',
'AAL',
'AAP',
'AAPL',
'ABBV',
'AEP',
'AES',
'AFL',
'HUBS',
'AIG',
'AIV',
'AIZ',
'AJG',
'AKAM',
'ALB',
'ALGN',
'ALK',
'ALL',
'ALLE',
'ALXN',
'AMAT',
'AMCR',
'AMD',
'AME',
'AMGN',
'AMP',
'AMT',
'AMZN',
...]
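I want to match each entry in the stock column so that the full company name is stripped and only the symbol remains; if no symbol is found, the row should be dropped.
My code so far: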
def clean_name(name):
    companies = list(COMPANIES.keys())
    for company in companies:
        if company in name:
            return company
    return None

def sort_df():
    df[STOCK] = df[STOCK].apply(lambda x: clean_name(x))
    df = df.dropna()
    return df
The problem is that, in most cases, the match for each string returns only a single letter.
So the output is:
0 A
1 A
2 F
3 F
4 C
5 TSLA
6 MU
7 F
8 A
9 C
11 A
12 A
13 A
14 A
15 D
16 C
17 A
18 BIIB
19 A
20 O
21 A
22 A
23 SBUX
24 C
25 SPLK
26 C
27 A
28 L
29 RE
30 A
...
One idea is to sort the symbols by length in descending order, so that the longest symbol is matched first:
def clean_name(name):
    companies = list(COMPANIES.keys())
    for company in sorted(companies, key=len, reverse=True):
        if company in name:
            return company
    return None
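To see why the original loop returns a single letter, note that `in` tests for a substring anywhere in the name, and the loop returns the first symbol that passes. The sketch below reproduces that behaviour and contrasts it with a variant that tries the longest symbols first and only accepts a match at the start of the string. The symbol list and both function names are made up for illustration; `str.startswith` is swapped in here as an assumption about what is wanted and is not part of the answer above.

```python
# Hypothetical symbol list; 'A' and 'F' are single-letter symbols.
SYMBOLS = ['TSLA', 'A', 'AAL', 'AAP', 'AAPL', 'AMZN', 'F', 'FB', 'GOOG', 'GOOGL']


def first_substring_match(name, symbols):
    """Mimic the question's loop: first symbol found anywhere in name."""
    for symbol in symbols:
        if symbol in name:
            return symbol
    return None


def longest_prefix_match(name, symbols):
    """Return the longest symbol that name starts with, or None."""
    for symbol in sorted(symbols, key=len, reverse=True):
        if name.startswith(symbol):
            return symbol
    return None


print(first_substring_match('AAPLApple', SYMBOLS))     # A
print(first_substring_match('NFLXNetflix', SYMBOLS))   # F
print(longest_prefix_match('AAPLApple', SYMBOLS))      # AAPL
print(longest_prefix_match('GOOGLAlphabet', SYMBOLS))  # GOOGL
print(longest_prefix_match('NFLXNetflix', SYMBOLS))    # None
```

Anchoring at the start avoids picking up a symbol that merely occurs somewhere inside the company name. The concatenated format is still ambiguous in principle: a short symbol followed by a name whose first capital letters spell a longer symbol would be matched as the longer one.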
Another way to solve this is to compile a regular expression from the stock symbols and run it against the stock column of the DataFrame.
For example:
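import re
# Build the regex string: longest symbols first, each anchored at the start.
# re.escape guards against symbols that contain regex metacharacters such as '.'.
exp_s = '|'.join('^{}'.format(re.escape(i)) for i in sorted(s, key=len, reverse=True))
exp = re.compile('({})'.format(exp_s))
# Extract the matching symbol; rows without a match become NaN and are dropped.
df['stock'].str.extract(exp).dropna()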
where:
# Stock symbols list.
s = ['AAPL', 'AMZN', 'FB', 'NFLX', ...]
Output:
0 AAPL
1 AMZN
2 FB
3 NFLX
...
26 AMD
27 PYPL
28 REGN
29 AMAT
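As a self-contained illustration of the regex approach, here is a minimal sketch on a small sample. The sample Series and the `symbols` list are hypothetical stand-ins for the real data, and `expand=False` is used so that `str.extract` returns a Series instead of a one-column DataFrame.

```python
import re

import pandas as pd

# Hypothetical sample data in the same "SYMBOL + company name" format.
stock = pd.Series(
    ['AAPLApple', 'AMZNAmazon.com', 'FBFacebook', 'NFLXNetflix', 'GOOGLAlphabet'],
    name='stock',
)
# Hypothetical symbol list; note that 'NFLX' is deliberately missing.
symbols = ['A', 'F', 'AAPL', 'AMZN', 'FB', 'GOOG', 'GOOGL']

# Longest symbols first, so 'GOOGL' is tried before 'GOOG' and 'AAPL' before 'A'.
alternatives = '|'.join(
    re.escape(sym) for sym in sorted(symbols, key=len, reverse=True)
)
# A single '^' anchors the whole alternation at the start of the string.
pattern = '^({})'.format(alternatives)

# Rows with no matching symbol come back as NaN and are dropped.
cleaned = stock.str.extract(pattern, expand=False).dropna()
print(cleaned.tolist())
```

Because regex alternation takes the first alternative that matches, sorting by length is what makes the longest symbol win; without it, `A` would match before `AAPL`.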