python pass extra information through nested for/in/if loop that iterates over items in a dataframe

The basis of my problem is passing "extra" information along while iterating over a dataframe. I pass each value in the dataframe to a function to be checked, but I also need to pass along and return some extra identifying information (an ID number). So I need a way to carry this extra information into the next function so that I can return it together with that function's result.

Basic problem:

import pandas as pd

urls = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'link1':['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2':['http://www.bing.com','http://www.linkedin.com','',' please call now','http://www.reddit.com' ],
    'link3':['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk','more random text that could be really long and annoying','over the hills and through the woods']
    })

for col in urls.columns:
    for url in urls[col]:
        if url:
            print(url,col)
#I need to be able to print the corresponding ID that belongs to each URL

Desired output:

ID    URL     COL
1     apple   link1
etc...
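For reference, pandas can produce exactly this shape with melt, which reshapes the wide frame into one row per (ID, column, value); a minimal sketch using the urls frame above (this is one possible approach, not the only one):

# melt keeps ID on every row and stacks the link columns into (COL, URL)
# pairs, so the ID travels with each value
long_urls = urls.melt(id_vars='ID', var_name='COL', value_name='URL')
long_urls = long_urls[long_urls['URL'].astype(bool)]  # drop empty strings
for row in long_urls.itertuples(index=False):
    print(row.ID, row.URL, row.COL)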

I think that if it can be done with a for/in/if structure, it can be applied to the real code below.

The actual code is a bit more involved. I'm using asyncio.gather to work through the dataframe. Passing the column name is easy, but I can't figure out how to get at the ID.

import asyncio, aiohttp, time, pandas as pd
from validator_collection import checkers

url_df = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'link1':['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2':['http://www.bing.com','http://www.linkedin.com','',' please call now','http://www.reddit.com' ],
    'link3':['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk','more random text that could be really long and annoying','over the hills and through the woods']
    })

async def get(url, sem, session, col):
    try:
        async with sem, session.get(url=url, raise_for_status=True, timeout=20) as response:
            resp = await response.read()
            print("Successfully got url {} from column {} with response of length {}.".format(url, col, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))


async def main(urls):
    sem = asyncio.BoundedSemaphore(50)
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, sem, session, col)
                                     for col in urls.columns    # for each column in the dataframe
                                     for url in urls[col]       # for each row in the column
                                     if url                     # if the item isn't null
                                     if checkers.is_url(url)])  # if url is valid
    print("Finalized all. ret is a list of len {} outputs.".format(len(ret)))

amount = url_df.count(axis='columns').sum()
start = time.time()
asyncio.run(main(url_df))
end = time.time()

print("Took {} seconds to pull {} websites.".format(end - start, amount))

The following works better if you can clean up the urls strings beforehand so that you can split them (e.g. looping with for link in s.split('|'):, or splitting on ;, etc.). It all comes down to the data cleaning and wrangling you can do up front.
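For example, a minimal splitting sketch (assuming '~|~' is the only delimiter used in this data, as in the link3 column):

s = 'http://www.whosebug.com~|~http://www.ebay.com'
# assumption: '~|~' is the only join character; split before checking each piece
for link in s.split('~|~'):
    print(link)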

Attempt:

import pandas as pd
from urllib.parse import urlparse

urls = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'link1':['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2':['http://www.bing.com','http://www.linkedin.com','',' please call now','http://www.reddit.com' ],
    'link3':['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk','more random text that could be really long and annoying','over the hills and through the woods']
    })

def getDomain(s):
    # only treat strings that parse with a scheme (http:// etc.) as URLs
    parsed = urlparse(s)
    if parsed.scheme:
        return parsed.netloc

for row in urls.itertuples():
    print(row[1], getDomain(row[2]), getDomain(row[3]), getDomain(row[4]))
    print('\n')

1 None www.bing.com www.whosebug.com~|~http:

2 None www.linkedin.com www.imdb.com

5 None None www.google.co.uk

25 www.youtube.com None None

26 None www.reddit.com None

Addendum:

To clean up the URL columns, you might need something like this:

import pandas as pd
from urllib.parse import urlparse
import re
import numpy as np

urls = pd.DataFrame({
    'ID':[1,2,5,25,26],
    'link1':['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2':['http://www.bing.com','http://www.linkedin.com','',' please call now','http://www.reddit.com' ],
    'link3':['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk','more random text that could be really long and annoying','over the hills and through the woods']
    })

startURL = re.compile(r'(?:www|http://www)')  # locate where each URL begins
sURL = re.compile(r'((?:www\.|https?://www\.|https?://).*?\.(?:$|[A-Za-z0-9.]+))')  # capture the URL itself

def extractDomain(s):
    l = []
    # find every position where a URL starts, then parse from there
    for m in startURL.finditer(s):
        match = sURL.search(s[m.start():]).group(1)
        parsed = urlparse(match)
        # bare 'www...' strings have no scheme, so urlparse puts the domain in .path
        l.append(parsed.netloc or parsed.path)
    if len(l) > 0:
        #return ', '.join(l) #return a string
        return l # return a list

filter_col = [col for col in urls if col.startswith('link')]

df_links = urls[filter_col].copy()

df_links.columns = [str(col) + '_clean' for col in df_links.columns]

df_links = df_links.applymap(extractDomain)

urls = urls.join(df_links)

# example of link3 column change
print(urls[['ID', 'link3', 'link3_clean']])

    ID  link3                                               link3_clean
0   1   http://www.whosebug.com~|~http://www.ebay...   [www.whosebug.com, www.ebay.com]
1   2   http://www.imdb.com                                 [www.imdb.com]
2   5   http://www.google.co.uk                             [www.google.co.uk]
3   25  more random text that could be really long and...   None
4   26  over the hills and through the woods                None
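If you then want one URL per row (say, to feed into asyncio.gather), one possible next step, not part of the answer above, is DataFrame.explode:

# explode turns each list in link3_clean into one row per element,
# repeating the ID; rows holding None are dropped here
per_url = urls[['ID', 'link3_clean']].explode('link3_clean').dropna()
print(per_url)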

Rather than looping over all the columns, it sounds like it would make more sense to loop over the rows(?).

You could approach this in a number of ways.

For starters, it probably makes sense to index your dataframe by the ID column:

url_df = url_df.set_index('ID')

Then, among other possibilities, you could use the itertuples() method:

for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    
    # Then do whatever you want for the other columns:
    for link in row[1:]:
        print(ID, link)

Output:

1 apple
1 http://www.bing.com
1 http://www.whosebug.com~|~http://www.ebay.com
2 www.google.com
2 http://www.linkedin.com
2 http://www.imdb.com
5 gm@yahoo.com
5 
5 http://www.google.co.uk
25 http://www.youtube.com
25  please call now
25 more random text that could be really long and annoying
26 888-555-5556 Ryan Parkes rp@abc.io
26 http://www.reddit.com
26 over the hills and through the woods

If you also want to include the column names, you could do something like this:

for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    
    # Then do whatever you want for the other columns:
    for link, col in zip(row[1:], row._fields[1:]):
        print(ID, link, col)

For use in your actual code, it would probably be clearer to wrap this up in a generator:

def iter_links(df):
    for row in df.itertuples():
        # The first item will always be the index, so:
        ID = row[0]  # or ID = row.Index

        # Then do whatever you want for the other columns:
        for url, col in zip(row[1:], row._fields[1:]):
            if url and checkers.is_url(url):
                yield (ID, col, url)

Then use it in your code, for example:

await asyncio.gather(*(get(sess, sem, ID, col, url)
                       for ID, col, url in iter_links(df)))
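Note that the get signature here differs from the one in your question. As a sketch (one way among others), the coroutine could be adapted to return the ID alongside its result, which was the original goal, reusing the question's aiohttp setup:

async def get(sess, sem, ID, col, url):
    # return the identifying info with each result so the list returned
    # by asyncio.gather() can be matched back to the dataframe rows
    try:
        async with sem, sess.get(url=url, raise_for_status=True, timeout=20) as response:
            resp = await response.read()
            return (ID, col, url, len(resp))
    except Exception as e:
        return (ID, col, url, e.__class__)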