Append string to empty pandas column in a for loop
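The code uses OCR to read text from the URLs in the list 'url_list'. I am trying to append the output, the string 'txt', to the empty pandas column 'url_text'. However, the code does not append anything to the 'url_text' column.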
df = pd.read_csv(r'path') # main dataframe
df['url_text'] = "" # create empty column that will later contain the text of the url_image
url_list = (df.iloc[:, 5]).tolist() # convert column with urls to a list
print(url_list)
['https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg',
'https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg',
'https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg',
' ',
'https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg',
'https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg',
'https://pbs.twimg.com/media/ExrGetoWUAEhOt0.jpg',
' ',
' ']
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format
        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr) # read tweet image text
        df['url_text'].append(txt)
        print(txt)
    except: # ignore any errors. Some of the rows do not contain a URL, causing the loop to fail
        pass
print(df)
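I can't test this, but please try it: you probably need to build the list first and then add it to df as a new column (I converted the list itself into a dataframe and then concatenated it to the original df).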
txtlst = []
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format
        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr) # read tweet image text
        txtlst.append(txt)
        print(txt)
    except: # ignore any errors. Some of the rows do not contain a URL, causing the loop to fail
        txtlst.append("") # placeholder so the list stays aligned with the rows of df
dftxt = pd.DataFrame({"url_text": txtlst}) # note the spelling: pd.DataFrame, not pd.Dataframe
df = pd.concat([df, dftxt], axis=1)
print(df)
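One thing to watch with the concat approach: the question already creates an empty 'url_text' column, so concatenating a second frame with the same column name should leave two columns called 'url_text'. Below is a minimal sketch of that effect and of assigning the list directly instead. The `fake_ocr` helper and the example.com URLs are hypothetical stand-ins for the io.imread + pytesseract steps, used only so the snippet runs without network access or Tesseract.

```python
import pandas as pd


def fake_ocr(url):
    """Hypothetical stand-in for the io.imread + pytesseract pipeline."""
    if not url.strip():
        raise ValueError("no URL in this row")
    return "text from " + url


# Toy frame; the example.com URLs are placeholders, not real images.
df = pd.DataFrame(
    {"url": ["https://example.com/a.jpg", " ", "https://example.com/b.jpg"]}
)
df["url_text"] = ""  # empty column created up front, as in the question

txtlst = []
for url in df["url"]:
    try:
        txtlst.append(fake_ocr(url))
    except ValueError:
        txtlst.append("")  # placeholder keeps the list aligned with the rows

# Concatenating adds a second column with the same name ...
concatenated = pd.concat([df, pd.DataFrame({"url_text": txtlst})], axis=1)
print(list(concatenated.columns))

# ... while direct assignment overwrites the existing column.
df["url_text"] = txtlst
print(df)
```

Catching a specific exception (here `ValueError`) instead of a bare `except` is a design choice for the sketch: it still skips rows without a URL but lets unexpected errors surface.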
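As stated in the documentation for Series.append(), the append call only works between two Series. With a plain string the call raises an exception, and the bare except in the question hides it, which is why nothing seems to happen. Even between two Series, append returns a new Series rather than modifying the column in place (and Series.append() was removed in pandas 2.0).
A better approach is to create an empty list outside the loop, append to that list of strings inside the loop, and then assign the list to the column with df["url_list"] = list_of_urls. This is also much faster at run time than repeatedly appending two Series together.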
url_list = []
for ...:
    ...
    url_list.append(url_text)
df["url_list"] = url_list
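To see why the original loop could never fill the column, here is a small sketch of the two behaviours described above. It uses pd.concat rather than Series.append(), on the assumption that a recent pandas version is installed where Series.append() no longer exists; the sample values are made up for illustration.

```python
import pandas as pd

s = pd.Series(["first", "second"])

# Joining two Series returns a new Series; the original is left untouched.
combined = pd.concat([s, pd.Series(["third"])], ignore_index=True)
print(len(s), len(combined))

# A plain string is not a Series, so pandas rejects it with a TypeError.
error = None
try:
    pd.concat([s, "third"])
except TypeError as exc:
    error = exc
print(type(error).__name__, error)
```

In the question, the equivalent error was swallowed by the bare `except`, so the loop ran to the end without printing or storing anything.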