A loop to extract URLs from several text files
I am trying to use a for loop to extract a list of URLs from several files, but it only extracts the list of URLs from the first file, repeated 10 times. I'm not sure what I'm doing wrong. Also, I'm an absolute beginner at this, so I'll presume there is a much better way of trying to achieve what I want, but this is what I have so far.
type_urls = []
y = 0
for files in cwk_dir:
    while y < 10:
        open('./cwkfiles/cwkfile{}.crawler.idx'.format(y))
        lines = r.text.splitlines()
        header_loc = 7
        name_loc = lines[header_loc].find('Company Name')
        type_loc = lines[header_loc].find('Form Type')
        cik_loc = lines[header_loc].find('CIK')
        filedate_loc = lines[header_loc].find('Date Filed')
        url_loc = lines[header_loc].find('URL')
        firstdata_loc = 9
        for line in lines[firstdata_loc:]:
            company_name = line[:type_loc].strip()
            form_type = line[type_loc:cik_loc].strip()
            cik = line[cik_loc:filedate_loc].strip()
            file_date = line[filedate_loc:url_loc].strip()
            page_url = line[url_loc:].strip()
            typeandurl = (form_type, page_url)
            type_urls.append(typeandurl)
        y = y + 1
When you get to the second file, the while condition fails because y is already 10. Try setting y back to 0 just before the while loop:
for files in cwk_dir:
    y = 0
    while y < 10:
        ...
And since you open the file on the first line inside the while loop, you should probably close it when you exit the loop.
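A minimal sketch of both fixes together (this assumes the data really comes from the files on disk; your snippet reads from r.text, which is never defined):

type_urls = []
for y in range(10):
    # 'with' closes the file automatically when the block ends
    with open('./cwkfiles/cwkfile{}.crawler.idx'.format(y)) as f:
        lines = f.read().splitlines()
    # ... parse lines and append to type_urls as before ...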
Here is a more Pythonic way to do it, using pathlib and Python 3:
from pathlib import Path

cwk_dir = Path('./cwkfiles')
type_urls = []
header_loc = 7
firstdata_loc = 9

for cwkfile in cwk_dir.glob('cwkfile*.crawler.idx'):
    with cwkfile.open() as f:
        lines = f.readlines()
    name_loc = lines[header_loc].find('Company Name')
    type_loc = lines[header_loc].find('Form Type')
    cik_loc = lines[header_loc].find('CIK')
    filedate_loc = lines[header_loc].find('Date Filed')
    url_loc = lines[header_loc].find('URL')
    for line in lines[firstdata_loc:]:
        company_name = line[:type_loc].strip()
        form_type = line[type_loc:cik_loc].strip()
        cik = line[cik_loc:filedate_loc].strip()
        file_date = line[filedate_loc:url_loc].strip()
        page_url = line[url_loc:].strip()
        type_urls.append((form_type, page_url))
If you want to test on a small batch of files first, replace cwk_dir.glob('cwkfile*.crawler.idx') with cwk_dir.glob('cwkfile[0-9].crawler.idx'). That will give you the first ten files, provided they are numbered sequentially starting from 0.
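For example, a quick sketch to see which files a pattern actually matches (sorted() is just there to make the output order deterministic, since glob() makes no ordering guarantee):

from pathlib import Path

cwk_dir = Path('./cwkfiles')
# '[0-9]' matches exactly one character, so only
# cwkfile0.crawler.idx through cwkfile9.crawler.idx are picked up
for cwkfile in sorted(cwk_dir.glob('cwkfile[0-9].crawler.idx')):
    print(cwkfile.name)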
And here is a better way of putting it all together, in a more readable form:
from pathlib import Path

def get_offsets(header):
    return dict(
        company_name = header.find('Company Name'),
        form_type = header.find('Form Type'),
        cik = header.find('CIK'),
        file_date = header.find('Date Filed'),
        page_url = header.find('URL')
    )

def get_data(line, offsets):
    return dict(
        company_name = line[:offsets['form_type']].strip(),
        form_type = line[offsets['form_type']:offsets['cik']].strip(),
        cik = line[offsets['cik']:offsets['file_date']].strip(),
        file_date = line[offsets['file_date']:offsets['page_url']].strip(),
        page_url = line[offsets['page_url']:].strip()
    )

cwk_dir = Path('./cwkfiles')
types_and_urls = []
header_line = 7
first_data_line = 9

for cwkfile in cwk_dir.glob('cwkfile*.crawler.idx'):
    with cwkfile.open() as f:
        lines = f.readlines()
    offsets = get_offsets(lines[header_line])
    for line in lines[first_data_line:]:
        data = get_data(line, offsets)
        types_and_urls.append((data['form_type'], data['page_url']))
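To convince yourself that the offset slicing does what you expect, here is a quick check against a made-up header and data row (the column widths below are illustrative assumptions, not real index-file output):

# Hypothetical fixed-width header and row, aligned the way the
# real files are assumed to be
header = 'Company Name      Form Type  CIK      Date Filed  URL'
row = 'ACME CORP         10-K       1234567  2019-01-02  https://example.com/filing.htm'

offsets = get_offsets(header)
data = get_data(row, offsets)
print(data['form_type'], data['page_url'])
# prints: 10-K https://example.com/filing.htm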