How do I parse all 400 franchises from this website if pages 2, 3, 4 all have the same URL?
I am scraping the website https://www.franchisetimes.com/top-400-2021/. I need to scrape the data inside each franchise. I'm building the skeleton (not actually scraping yet), but I can't parse anything beyond franchise #25 and I don't know how to advance to the next page.
Thanks in advance for your comments and suggestions.
So this is where I'm stuck:
from bs4 import BeautifulSoup as bs
import requests

DOMAIN = 'https://www.franchisetimes.com'
URL = 'https://www.franchisetimes.com/top-400-2021/'
FILETYPE = '.html'

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

#get_soup(DOMAIN)
i = 0
for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    try:
        if "top-400-2021" in file_link and not "block_id" in file_link and FILETYPE in file_link:
            i += 1
            print(file_link)
            print(i)
    except:
        # href can be None, which makes the "in" test raise a TypeError
        print("nonetype")
The page uses JavaScript to load JSON data from
https://www.franchisetimes.com/search/?bl=1111254&o=0&l=25&f=json&altf=widget
(I found it using DevTools in Firefox/Chrome, tab: Network, filter: XHR).
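As a quick sanity check (a minimal sketch based on the parameters visible in that URL), you can request the endpoint directly and confirm it parses as JSON:

import requests

# Fetch the first page of the endpoint found in DevTools and confirm it
# parses as JSON with 25 entries (l=25 is the page size).
url = 'https://www.franchisetimes.com/search/?bl=1111254&o=0&l=25&f=json&altf=widget'
data = requests.get(url).json()
print(len(data['assets']))  # expected: 25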
If you use o=25 instead of o=0, you get the JSON data for the second page; with o=50 the third page, and so on.
import requests

payload = {
    'bl': '1111254',
    'o': 0,       # offset: index of the first result on the page
    'l': 25,      # limit: results per page
    'f': 'json',
    'altf': 'widget',
}

url = 'https://www.franchisetimes.com/search/'

for offset in range(0, 400, 25):
    print('\n--- offset:', offset, '---\n')
    payload['o'] = offset
    response = requests.get(url, params=payload)
    data = response.json()
    for item in data['assets']:
        print(item['title'])
Result:
--- offset: 0 ---
1. McDonald’s
2. 7-Eleven
3. KFC
4. Ace Hardware
5. Burger King
6. Domino's
7. Circle K
8. Chick-fil-A
9. Subway
10. Pizza Hut
11. Taco Bell
12. RE/MAX
13. Wendy’s
14. Keller Williams Realty
15. Dunkin’
16. Marriott Hotels & Resorts
17. Sonic Drive-In
18. Tim Hortons
19. Popeyes Louisiana Kitchen
20. Panera Bread
21. Dairy Queen
22. Little Caesars
23. Hampton by Hilton
24. Holiday Inn Express
25. Arby’s
--- offset: 25 ---
26. Papa John’s
27. Hyatt
28. Jack In The Box
29. Courtyard
30. Berkshire Hathaway HomeServices
31. Chili's
32. Hilton Hotels & Resorts
33. Buffalo Wild Wings
34. Applebee’s
35. Express Employment Professionals
36. The UPS Store
37. SERVPRO
38. Paris Baguette
39. Whataburger
40. Holiday Inn Hotels & Resorts
41. Outback Steakhouse
42. Residence Inn
43. H&R Block
44. Comfort Inn & Suites
45. Planet Fitness
46. Five Guys
47. IHOP
48. Home Instead Senior Care
49. Aaron’s
50. Baskin Robbins
--- offset: 50 ---
51. Renaissance
52. Zaxby’s
53. Hardee’s
54. G.J. Gardner Homes
55. Culver’s Butterburgers & Frozen Custard
56. Wingstop
57. Jimmy John’s
58. DoubleTree by Hilton
59. Denny’s
60. Jiffy Lube
61. Quality Inn & Suites
62. Snap-on Tools
63. Jersey Mike’s Subs
64. HomeVestors
65. Carl’s Jr.
66. Midas
67. Roto-Rooter
68. Anytime Fitness
69. Valvoline Instant Oil Change
70. ampm
71. Bojangles’ Famous Chicken 'n Biscuits
72. InterContinental Hotels & Resorts
73. Church’s Chicken
74. Crowne Plaza Hotels & Resorts
75. La Quinta Inn & Suites
--- offset: 75 ---
76. Pet Supplies Plus
77. Super 8
78. CARSTAR
79. Days Inn
80. Orangetheory Fitness
81. Interim HealthCare
82. Red Robin
83. Great Clips
84. Massage Envy
85. Big O Tires
86. Paul Davis Restoration
87. Home2 Suites by Hilton
88. Color Glo International
89. El Pollo Loco
90. Window World
91. Firehouse Subs
92. Checkers/Rally’s
93. American Family Care
94. Del Taco
95. Boston Pizza
96. Qdoba Mexican Eats
97. Linc Service
98. Papa Murphy's
99. Marco’s Pizza
100. Ramada
etc.
If you display data['assets'][0].keys(), you will see what you get in the data:
['title', 'uuid', 'published', 'type', 'url', 'canonical', 'byline', 'starttime', 'updated', 'last_updated', 'pretty_date', 'kicker', 'hammer', 'keywords', 'flags', 'comment_count', 'time_to_consume', 'preview', 'new_window', 'is_premium', 'icon', 'rank', 'name', 'sales', 'loc', 'sections', 'summary', 'authors']
For example:

item["rank"]            "1"
item["name"]            "McDonald’s"
item["sales"]           ",317,000,000"
item["loc"]             "39,198"
# image with logo
item["preview"]["url"]  "https://bloximages.newyork1.vip.townnews.com/franchisetimes.com/content/tncms/assets/v3/editorial/8/ab/8ab10429-92c1-56e8-91fc-0ecf5f5504cd/5f7e8e875707b.image.jpg"