Passing extra information through a nested for/in/if loop that iterates over items in a DataFrame
The basis of my question is passing "extra" information along while iterating through a dataframe. I pass each value in the dataframe to a function to be checked, but I also need to pass along and return some extra identifying information (an ID number). So I need a way to carry that extra information through to the next function so I can return it together with that function's result.
The basic problem:
import pandas as pd

urls = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

for col in urls.columns:
    for url in urls[col]:
        if url:
            print(url, col)
            # I need to be able to print the corresponding ID that belongs to each URL
Desired output:
ID URL COL
1 apple link1
etc...
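For reference, a minimal sketch of one way to produce this pairing while keeping the for/in/if shape is to zip each link column against the ID column (this uses only the names defined in the snippet above):

for col in urls.columns.drop('ID'):          # skip the ID column itself
    for ID, url in zip(urls['ID'], urls[col]):
        if url:
            print(ID, url, col)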
I figured that if it can be done with the for/in/if structure, it can then be applied to the real code below.
The actual code is a bit more involved. I'm using asyncio.gather to work through the dataframe. Passing the column name is easy, but I can't work out how to get hold of the ID.
import asyncio, aiohttp, time, pandas as pd
from validator_collection import checkers

url_df = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

async def get(url, sem, session, col):
    try:
        async with sem, session.get(url=url, raise_for_status=True, timeout=20) as response:
            resp = await response.read()
            print("Successfully got url {} from column {} with response of length {}.".format(url, col, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    sem = asyncio.BoundedSemaphore(50)
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, sem, session, col)
                                     for col in urls.columns            # for each column in the dataframe
                                     for url in urls[col]               # for each row in the column
                                     if url                             # if the item isn't null
                                     if checkers.is_url(url) == True])  # if the url is valid
        print("Finalized all. ret is a list of len {} outputs.".format(len(ret)))

amount = url_df.count(axis='columns').sum()
start = time.time()
asyncio.run(main(url_df))
end = time.time()
print("Took {} seconds to pull {} websites.".format(end - start, amount))
The following works better if you can clean up the url strings so that you can split them (for example, looping with for link in s.split('|'):, splitting on ;, and so on). It all comes down to how much data cleaning and wrangling you can do beforehand.
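For example, a multi-link cell like the one in link3 could be broken apart first; this quick sketch just splits on the '~|~' separator that appears in the sample data:

# split a multi-link cell on the '~|~' separator before validating each piece
cell = 'http://www.whosebug.com~|~http://www.ebay.com'
for link in cell.split('~|~'):
    print(link)
# http://www.whosebug.com
# http://www.ebay.com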
Try this:
import pandas as pd
from urllib.parse import urlparse

urls = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

def getDomain(s):
    if bool(urlparse(s).scheme):
        return urlparse(s).netloc

for row in urls.itertuples():
    print(row[1], getDomain(row[2]), getDomain(row[3]), getDomain(row[4]))
print('\n')
1 None www.bing.com www.whosebug.com~|~http:
2 None www.linkedin.com www.imdb.com
5 None None www.google.co.uk
25 www.youtube.com None None
26 None www.reddit.com None
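The None entries come from getDomain itself: urlparse only reports a scheme when the string actually starts with one, so bare hosts such as 'www.google.com' fail the bool(urlparse(s).scheme) check and nothing is returned. A quick illustration:

from urllib.parse import urlparse

print(urlparse('www.google.com').scheme)         # '' -> getDomain falls through and returns None
print(urlparse('http://www.google.com').scheme)  # 'http'
print(urlparse('http://www.google.com').netloc)  # 'www.google.com'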
Addendum:
To clean up the URL columns, you might want something like this:
import pandas as pd
from urllib.parse import urlparse
import re
import numpy as np

urls = pd.DataFrame({
    'ID': [1, 2, 5, 25, 26],
    'link1': ['apple', 'www.google.com', 'gm@yahoo.com', 'http://www.youtube.com', '888-555-5556 Ryan Parkes rp@abc.io'],
    'link2': ['http://www.bing.com', 'http://www.linkedin.com', '', ' please call now', 'http://www.reddit.com'],
    'link3': ['http://www.whosebug.com~|~http://www.ebay.com', 'http://www.imdb.com', 'http://www.google.co.uk', 'more random text that could be really long and annoying', 'over the hills and through the woods']
})

startURL = re.compile(r'(?:www|http://www)')
sURL = re.compile(r'((?:www\.|https?://www\.|https?://).*?\.(?:$|[A-Za-z0-9.]+))')

def extractDomain(s):
    pos = []
    l = []
    for m in startURL.finditer(s):
        pos.append(m.start())
    for p in pos:
        n = urlparse(sURL.search(s[p:]).group(1)).netloc
        if n == '':
            l.append(urlparse(sURL.search(s[p:]).group(1)).path)
        else:
            l.append(urlparse(sURL.search(s[p:]).group(1)).netloc)
    if len(l) > 0:
        # return ', '.join(l)  # return a string
        return l  # return a list

filter_col = [col for col in urls if col.startswith('link')]
df_links = urls[filter_col].copy()
df_links.columns = [str(col) + '_clean' for col in df_links.columns]
df_links = df_links.applymap(extractDomain)
urls = urls.join(df_links)

# example of the link3 column change
print(urls[['ID', 'link3', 'link3_clean']])
ID link3 link3_clean
0 1 http://www.whosebug.com~|~http://www.ebay... [www.whosebug.com, www.ebay.com]
1 2 http://www.imdb.com [www.imdb.com]
2 5 http://www.google.co.uk [www.google.co.uk]
3 25 more random text that could be really long and... None
4 26 over the hills and through the woods None
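If you take this route, one possible way to turn the cleaned list columns back into (ID, url) pairs for the async loop is DataFrame.explode; a sketch, assuming a pandas version that has explode and the urls frame from the block above:

# flatten the cleaned list column into one (ID, url) row per link
pairs = urls[['ID', 'link3_clean']].explode('link3_clean').dropna()
for ID, url in pairs.itertuples(index=False):
    print(ID, url)
# 1 www.whosebug.com
# 1 www.ebay.com
# 2 www.imdb.com
# 5 www.google.co.uk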
Rather than iterating over all the columns, iterating over the rows seems to make more sense(?).
You could approach this in a number of ways.
For starters, it probably makes sense to index your dataframe by the ID column:
url_df = url_df.set_index('ID')
Then, among other possibilities, you can use the itertuples() method:
for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    # Then do whatever you want with the other columns:
    for link in row[1:]:
        print(ID, link)
Output:
1 apple
1 http://www.bing.com
1 http://www.whosebug.com~|~http://www.ebay.com
2 www.google.com
2 http://www.linkedin.com
2 http://www.imdb.com
5 gm@yahoo.com
5
5 http://www.google.co.uk
25 http://www.youtube.com
25 please call now
25 more random text that could be really long and annoying
26 888-555-5556 Ryan Parkes rp@abc.io
26 http://www.reddit.com
26 over the hills and through the woods
If you also want to include the column name, you could do:
for row in url_df.itertuples():
    # The first item will always be the index, so:
    ID = row[0]  # or ID = row.Index
    # Then do whatever you want with the other columns:
    for link, col in zip(row[1:], row._fields[1:]):
        print(ID, link, col)
For use in your actual code, it may be clearer if you wrap this up in a small generator function:
def iter_links(df):
    for row in df.itertuples():
        # The first item will always be the index, so:
        ID = row[0]  # or ID = row.Index
        # Then do whatever you want with the other columns:
        for url, col in zip(row[1:], row._fields[1:]):
            if url and checkers.is_url(url):
                yield (ID, col, url)
and then use it in your code along the lines of:
await asyncio.gather(*(get(sess, sem, ID, col, url)
                       for ID, col, url in iter_links(df)))
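Putting the pieces together, here is a sketch of how get and main might look with the ID and column threaded through. The (sess, sem, ID, col, url) ordering follows the call above, and returning a tuple is just one way to hand the identifying information back alongside the result; it reuses the asyncio/aiohttp/checkers imports and url_df from the question.

async def get(sess, sem, ID, col, url):
    try:
        async with sem, sess.get(url=url, raise_for_status=True, timeout=20) as response:
            resp = await response.read()
            # return the identifying info together with this url's result
            return (ID, col, url, len(resp))
    except Exception as e:
        return (ID, col, url, e.__class__)

async def main(df):
    sem = asyncio.BoundedSemaphore(50)
    async with aiohttp.ClientSession() as sess:
        ret = await asyncio.gather(*(get(sess, sem, ID, col, url)
                                     for ID, col, url in iter_links(df)))
    print("Finalized all. ret is a list of len {} outputs.".format(len(ret)))

url_df = url_df.set_index('ID')  # as above, so row.Index inside iter_links is the ID
asyncio.run(main(url_df))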