Extract part of a URL from a column of URLs in Python

I have a column of URLs. I want to retrieve the digits that come after "/show" and before the next "/", and I want them as integers.

sn    URL
1     https://tvseries.net/show/51/johnny155
2     https://tvseries.net/show/213/kimble2
3     https://tvseries.net/show/46/forceps
4     https://tvseries.net/show/90/tr9
5     https://tvseries.net/show/22/candlenut

The expected output is:

sn    URL                                          show_id
1     https://tvseries.net/show/51/johnny155       51
2     https://tvseries.net/show/213/kimble2        213
3     https://tvseries.net/show/46/forceps         46 
4     https://tvseries.net/show/90/tr9             90
5     https://tvseries.net/show/22/candlenut       22

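So far I have tried the code below to retrieve the digits after "show". It produces a column in which each show_id is wrapped in square brackets (i.e. [51], [213]), and the column's type is pandas.core.series.Series.

Is there a more efficient way to get the show_id as an integer, without the brackets? Any help is appreciated, thanks.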

import urllib.parse as urlparse

df['protocol'],df['domain'],df['path'], df['query'], df['fragment'] = zip(*df['URL'].map(urlparse.urlsplit))

df['UID'] = df['path'].str.findall(r'(?<=show)[^,.\d\n]+?(\d+)')

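You can use str.extract with a capture group that matches the digits between the forward slashes after show, and assign the result to a new column. Note that str.extract returns strings, so append .astype(int) if you need integers: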

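import pandas as pd

df = pd.DataFrame({'sn': [1, 2, 3, 4, 5],
                   'URL': ['https://tvseries.net/show/51/johnny155',
                           'https://tvseries.net/show/213/kimble2',
                           'https://tvseries.net/show/46/forceps',
                           'https://tvseries.net/show/90/tr9',
                           'https://tvseries.net/show/22/candlenut'
                           ]})
# Capture the digits between "show/" and the next "/", then cast str -> int.
# expand=False returns a Series rather than a one-column DataFrame.
df['show_id'] = df['URL'].str.extract(r'show/(\d+)/', expand=False).astype(int)
df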

Output:

   sn                                     URL show_id
0   1  https://tvseries.net/show/51/johnny155      51
1   2   https://tvseries.net/show/213/kimble2     213
2   3    https://tvseries.net/show/46/forceps      46
3   4        https://tvseries.net/show/90/tr9      90
4   5  https://tvseries.net/show/22/candlenut      22
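
As a side note, the brackets in the question come from str.findall, which returns a list of matches for every row, whereas str.extract returns only the first match. The sketch below contrasts the two and shows one possible way to get integers even when some rows do not match. The last two sample rows (a URL without a show segment and a missing value) are assumptions added for illustration and are not part of the original data; whether you need the nullable Int64 dtype depends on whether your real column can contain such rows.

```python
import pandas as pd

# Hypothetical sample: the last two entries are assumptions, added only to
# show what happens when a row has no "show/<digits>/" segment or is missing.
urls = pd.Series([
    "https://tvseries.net/show/51/johnny155",
    "https://tvseries.net/show/213/kimble2",
    "https://tvseries.net/about",
    None,
])

# str.findall returns a list of all matches per row, hence the [51], [213].
as_lists = urls.str.findall(r"show/(\d+)/")

# str.extract returns the first capture group as text, NaN where no match.
as_text = urls.str.extract(r"show/(\d+)/", expand=False)

# A plain astype(int) would raise on the NaN rows, so convert to numbers
# first and then to pandas' nullable integer dtype.
show_id = pd.to_numeric(as_text).astype("Int64")

print(as_lists.iloc[0])
print(show_id)
```

If every URL is guaranteed to contain a show id, the simpler .astype(int) is enough and keeps the ordinary int64 dtype.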