How do I parse all 400 franchises from this website if pages 2, 3, 4 all have the same URL?
I am scraping the website https://www.franchisetimes.com/top-400-2021/. I need to scrape the data inside each franchise. I'm building the skeleton (not actually scraping yet), but I can't parse anything beyond franchise #25 and I don't know how to advance to the next page.
Thanks in advance for your comments and suggestions.
So this is where I'm stuck:
from bs4 import BeautifulSoup as bs
import requests

DOMAIN = 'https://www.franchisetimes.com'
URL = 'https://www.franchisetimes.com/top-400-2021/'
FILETYPE = '.html'

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

#get_soup(DOMAIN)
i = 0
for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    try:
        if "top-400-2021" in file_link and not "block_id" in file_link and FILETYPE in file_link:
            i += 1
            print(file_link)
            print(i)
    except:
        # href can be None, which makes the "in" test raise a TypeError
        print("nonetype")
The page uses JavaScript to load JSON data from
https://www.franchisetimes.com/search/?bl=1111254&o=0&l=25&f=json&altf=widget
(I found it using DevTools in Firefox/Chrome, tab: Network, filter: XHR).
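As a quick sanity check (a minimal sketch based on the parameters visible in that URL), you can request the endpoint directly and confirm it parses as JSON:

import requests

# Fetch the first page of the endpoint found in DevTools and confirm it
# parses as JSON with 25 entries (l=25 is the page size).
url = 'https://www.franchisetimes.com/search/?bl=1111254&o=0&l=25&f=json&altf=widget'
data = requests.get(url).json()
print(len(data['assets']))  # expected: 25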
If you use o=25 instead of o=0, you get the JSON data for the second page; with o=50 the third page, and so on.
import requests

payload = {
    'bl': '1111254',
    'o': 0,       # offset: index of the first result on the page
    'l': 25,      # limit: results per page
    'f': 'json',
    'altf': 'widget',
}

url = 'https://www.franchisetimes.com/search/'

for offset in range(0, 400, 25):
    print('\n--- offset:', offset, '---\n')
    payload['o'] = offset
    response = requests.get(url, params=payload)
    data = response.json()
    for item in data['assets']:
        print(item['title'])
Result:
--- offset: 0 ---
1. McDonald’s
2. 7-Eleven
3. KFC
4. Ace Hardware
5. Burger King
6. Domino's
7. Circle K
8. Chick-fil-A
9. Subway
10. Pizza Hut
11. Taco Bell
12. RE/MAX
13. Wendy’s
14. Keller Williams Realty
15. Dunkin’
16. Marriott Hotels & Resorts
17. Sonic Drive-In
18. Tim Hortons
19. Popeyes Louisiana Kitchen
20. Panera Bread
21. Dairy Queen
22. Little Caesars
23. Hampton by Hilton
24. Holiday Inn Express
25. Arby’s
--- offset: 25 ---
26. Papa John’s
27. Hyatt
28. Jack In The Box
29. Courtyard
30. Berkshire Hathaway HomeServices
31. Chili's
32. Hilton Hotels & Resorts
33. Buffalo Wild Wings
34. Applebee’s
35. Express Employment Professionals
36. The UPS Store
37. SERVPRO
38. Paris Baguette
39. Whataburger
40. Holiday Inn Hotels & Resorts
41. Outback Steakhouse
42. Residence Inn
43. H&R Block
44. Comfort Inn & Suites
45. Planet Fitness
46. Five Guys
47. IHOP
48. Home Instead Senior Care
49. Aaron’s
50. Baskin Robbins
--- offset: 50 ---
51. Renaissance
52. Zaxby’s
53. Hardee’s
54. G.J. Gardner Homes
55. Culver’s Butterburgers & Frozen Custard
56. Wingstop
57. Jimmy John’s
58. DoubleTree by Hilton
59. Denny’s
60. Jiffy Lube
61. Quality Inn & Suites
62. Snap-on Tools
63. Jersey Mike’s Subs
64. HomeVestors
65. Carl’s Jr.
66. Midas
67. Roto-Rooter
68. Anytime Fitness
69. Valvoline Instant Oil Change
70. ampm
71. Bojangles’ Famous Chicken 'n Biscuits
72. InterContinental Hotels & Resorts
73. Church’s Chicken
74. Crowne Plaza Hotels & Resorts
75. La Quinta Inn & Suites
--- offset: 75 ---
76. Pet Supplies Plus
77. Super 8
78. CARSTAR
79. Days Inn
80. Orangetheory Fitness
81. Interim HealthCare
82. Red Robin
83. Great Clips
84. Massage Envy
85. Big O Tires
86. Paul Davis Restoration
87. Home2 Suites by Hilton
88. Color Glo International
89. El Pollo Loco
90. Window World
91. Firehouse Subs
92. Checkers/Rally’s
93. American Family Care
94. Del Taco
95. Boston Pizza
96. Qdoba Mexican Eats
97. Linc Service
98. Papa Murphy's
99. Marco’s Pizza
100. Ramada
etc.
If you display data['assets'][0].keys(), you will see what you get in the data:
['title', 'uuid', 'published', 'type', 'url', 'canonical', 'byline', 'starttime', 'updated', 'last_updated', 'pretty_date', 'kicker', 'hammer', 'keywords', 'flags', 'comment_count', 'time_to_consume', 'preview', 'new_window', 'is_premium', 'icon', 'rank', 'name', 'sales', 'loc', 'sections', 'summary', 'authors']
For example:

item["rank"]            "1"
item["name"]            "McDonald’s"
item["sales"]           ",317,000,000"
item["loc"]             "39,198"
# image with logo
item["preview"]["url"]  "https://bloximages.newyork1.vip.townnews.com/franchisetimes.com/content/tncms/assets/v3/editorial/8/ab/8ab10429-92c1-56e8-91fc-0ecf5f5504cd/5f7e8e875707b.image.jpg"