Append string to empty pandas column in a for loop

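The code uses OCR to read text from the URLs in the list 'url_list'. I am trying to append the output, the string 'txt', to the empty pandas column 'url_text'. However, the code does not append anything to the 'url_text' column.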

df = pd.read_csv(r'path') # main dataframe

df['url_text'] = "" # create empty column that will later contain the text of the url_image
url_list = (df.iloc[:, 5]).tolist() # convert column with urls to a list 

print(url_list)

['https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg', 
'https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg', 
'https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg', 
' ',
'https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg', 
'https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg', 
'https://pbs.twimg.com/media/ExrGetoWUAEhOt0.jpg',
' ',
' ']
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format

        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

        txt = pytesseract.image_to_string(thr)  # read tweet image text

        df['url_text'].append(txt)

        print(txt)
    except: # ignore any errors. Some of the rows does not contain a URL causing the loop to fail
        pass

print(df)

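I can't test this, but try the following. You probably need to build the list first and then add it to df as a new column (I convert the list itself to a DataFrame and then concatenate it with the original df).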

txtlst=[]
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format

        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

        txt = pytesseract.image_to_string(thr)  # read tweet image text
        txtlst.append(txt)


        print(txt)
    except Exception: # some rows contain no URL, so reading the image fails
        txtlst.append("") # keep one entry per row so the list stays aligned with df
dftxt = pd.DataFrame({"url_text": txtlst}) # note the spelling: pd.DataFrame, not pd.Dataframe
df = pd.concat([df, dftxt], axis=1)
print(df)
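One thing to watch with the concat approach: the question already creates an empty 'url_text' column, so concatenating a second column with the same name leaves two of them. The sketch below uses made-up sample data (the column 'url' and its values are assumptions for illustration) to show the effect and one way around it, dropping the placeholder column first.

```python
import pandas as pd

# Made-up stand-in for the main dataframe, including the empty placeholder column
df = pd.DataFrame({"url": ["a.jpg", " "], "url_text": ["", ""]})
dftxt = pd.DataFrame({"url_text": ["hello", ""]})

# Concatenating as-is keeps both columns, so 'url_text' now appears twice
dup = pd.concat([df, dftxt], axis=1)
print(list(dup.columns))

# Dropping the placeholder first leaves a single 'url_text' column
clean = pd.concat([df.drop(columns="url_text"), dftxt], axis=1)
print(list(clean.columns))
```

Alternatively, skip creating the placeholder column at all and add 'url_text' only once the list is complete.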

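As stated in the documentation for Series.append(), the append call only works between two Series, not between a Series and a plain string. It also returns a new Series instead of modifying the column in place, so df['url_text'] would stay unchanged even if the call succeeded. Here the call raises an exception, which the bare except silently swallows. Note that Series.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the replacement.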
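To make the "returns a new Series" point concrete, here is a minimal sketch using pd.concat, which behaves the same way in this respect. The Series names and values are invented for the demonstration.

```python
import pandas as pd

# Hypothetical stand-in for the empty 'url_text' column
col = pd.Series(["", ""], name="url_text")

# Joining two Series returns a NEW Series; 'col' itself is left untouched
extra = pd.Series(["some OCR text"])
combined = pd.concat([col, extra], ignore_index=True)

print(len(col))       # length of the original column
print(len(combined))  # length of the newly created Series
```

Unless the result is assigned back, nothing in the dataframe changes, which is why a list that is assigned once at the end is the simpler route.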

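A better approach is to create an empty list outside the loop, append each string to that list inside the loop, and then assign the list to the column in one step, as in df["url_list"] = url_list. This is also much faster at runtime than repeatedly appending two Series together, since each append copies the data into a new Series.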

url_list = []

for ...:
    ...
    url_list.append(url_text)

df["url_list"] = url_list
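Applied to the question, the pattern might look like the sketch below. It is runnable without network access or Tesseract because the OCR step is replaced by read_text, a hypothetical stand-in for the io.imread and pytesseract calls; the sample data and the column name 'url' are also assumptions.

```python
import pandas as pd

def read_text(img_url):
    """Hypothetical stand-in for the OCR step (io.imread + pytesseract)."""
    if not img_url.strip():
        raise ValueError("no URL in this row")
    return "text from " + img_url

# Made-up sample data; the blank entry mimics rows without a URL
df = pd.DataFrame({"url": ["a.jpg", " ", "b.jpg"]})

url_texts = []
for img_url in df["url"]:
    try:
        url_texts.append(read_text(img_url))
    except ValueError:
        # Append a placeholder so the list keeps one entry per row
        url_texts.append("")

# Assign once, after the loop; the list length must equal len(df)
df["url_text"] = url_texts
print(df)
```

Appending a placeholder in the except branch matters: if failed rows were skipped, the list would be shorter than the dataframe and the assignment would raise a length mismatch error.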