How to scrape all App Store apps on a Google Play Search

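I'm trying to use find_all(), but I seem to be having trouble finding the tags that hold specific information.

I'd love to build a wrapper so that I can extract data from the app store, such as the title, the publisher, etc. (public HTML information).

The code is not correct, I know. The closest thing to a div identifier I could find is "c4".

Any insight would help.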

# Imports
import requests
from bs4 import BeautifulSoup

# Data Defining
url = "https://play.google.com/store/search?q=weather%20app"

# Getting HTML

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
soup.get_text()

results = soup.find_all(id="c4")

I'm expecting output listing the different weather apps and their information:

Weather App 1
Develop Company 1

Google Weather App
Develop Company 2

Bing Weather App
Bing Developers

I got the following output from the URL with this code:

from bs4 import BeautifulSoup
import requests

url='https://play.google.com/store/search?q=weather%20app'
req=requests.get(url)

soup = BeautifulSoup(req.content, 'html.parser')

cards= soup.find_all("div",class_="vU6FJ p63iDd")

for card in cards:
    app_name= card.find("div",class_="WsMG1c nnK0zc").text
    company = card.find("div",class_="KoLSrc").text
    print("Name: " + app_name)
    print("Company: " + company)

Output:

Name: Weather app
Company: Accurate Weather Forecast & Weather Radar Map  
Name: AccuWeather: Weather Radar
Company: AccuWeather
Name: Weather Forecast - Accurate Local Weather & Widget
Company: Weather Forecast & Widget & Radar
Name: 1Weather Forecasts & Radar
Company: OneLouder Apps
Name: MyRadar Weather Radar
Company: ACME AtronOmatic LLC
Name: Weather data & microclimate : Weather Underground
Company: Weather Underground
Name: Weather & Widget - Weawow
Company: weawow weather app
Name: Weather forecast
Company: smart-pro android apps
Name: The Secret World of Weather: How to Read Signs in Every Cloud, Breeze, Hill, Street, Plant, Animal, and Dewdrop
Company: Tristan Gooley
Name: The Weather Machine: A Journey Inside the Forecast
Company: Andrew Blum
Name: The Mobile Mind Shift: Engineer Your Business to Win in the Mobile Moment
Company: Julie Ask
Name: Together: The Healing Power of Human Connection in a Sometimes Lonely World
Company: Vivek H. Murthy
Name: The Meadow
Company: James Galvin
Name: The Ancient Egyptian Culture Revealed, 2nd edition
Company: Moustafa Gadalla
Name: The Ancient Egyptian Culture Revealed, 2nd edition
Company: Moustafa Gadalla
Name: Chaos Theory
Company: Introbooks Team
Name: Survival Training: Killer Tips for Toughness and Secret Smart Survival Skills       
Company: Wesley Jones
Name: Kiasunomics 2: Economic Insights for Everyday Life
Company: Ang Swee Hoon
Name: Summary of We Are The Weather by Jonathan Safran Foer
Company: QuickRead
Name: Learn Swift by Building Applications: Explore Swift programming through iOS app development
Company: Emil Atanasov
Name: Weather Hazard Warning Application in Car-to-X Communication: Concepts, Implementations, and Evaluations
Company: Attila Jaeger
Name: Mobile App Development with Ionic, Revised Edition: Cross-Platform Apps with Ionic, Angular, and Cordova
Company: Chris Griffith
Name: Good Application Makes a Good Roof Better: A Simplified Guide: Installing Laminated Asphalt Shingles for Maximum Life & Weather Protection
Company: ARMA Asphalt Roofing Manufacturers Association
Name: The Secret World of Weather: How to Read Signs in Every Cloud, Breeze, Hill, Street, Plant, Animal, and Dewdrop
Company: Tristan Gooley
Name: The Weather Machine: A Journey Inside the Forecast
Company: Andrew Blum
Name: Space Physics and Aeronomy, Space Weather Effects and Applications
Company: Book 5
Name: How to Build Android Apps with Kotlin: A hands-on guide to developing, testing, and publishing your first apps with Android
Company: Alex Forrester
Name: Android 6 for Programmers: An App-Driven Approach, Edition 3
Company: Paul J. Deitel

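Note: Working with identifiers that are generated very dynamically (e.g. class names) is only partially reliable.

So the strategy should be based on more constant identifiers, such as tags and their structure or, in some cases, ids: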

for e in soup.select('a[href^="/store/apps/details?id"]:has(div[title])'):
    data.append({
        'title': e.select_one('div[title]').get('title'),
        'company':e.find_next('a').text,
        'url':'https://play.google.com'+e.get('href')
    })
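
To illustrate why matching on structure is more robust than matching on class names, here is a minimal, dependency-free sketch. It swaps BeautifulSoup for the standard library's html.parser and runs on made-up markup, so SAMPLE_HTML, AppLinkParser and the package name are all hypothetical and only mimic the link-wrapping-a-titled-div pattern that the selector above targets:

```python
from html.parser import HTMLParser

# Made-up markup that mimics the structure described above. The class names
# are deliberately meaningless because the parser never looks at them.
SAMPLE_HTML = """
<div class="x1">
  <a href="/store/apps/details?id=com.example.weather"><div class="zz9" title="Example Weather"></div></a>
  <a href="/store/apps/developer?id=Example+Inc">Example Inc</a>
  <a href="/store/books/details?id=BOOK123"><div title="A Weather Book"></div></a>
</div>
"""

APP_PREFIX = "/store/apps/details?id"


class AppLinkParser(HTMLParser):
    """Collects title and url for app links that wrap a <div title=...>."""

    def __init__(self):
        super().__init__()
        self.apps = []
        self._current_href = None  # href of the app link we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith(APP_PREFIX):
            self._current_href = href
        elif tag == "div" and self._current_href and attrs.get("title"):
            self.apps.append({
                "title": attrs["title"],
                "url": "https://play.google.com" + self._current_href,
            })

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


parser = AppLinkParser()
parser.feed(SAMPLE_HTML)
apps = parser.apps
print(apps)
```

The book link and the developer link are skipped purely because of their href, which is the same idea the CSS selector expresses with `a[href^="/store/apps/details?id"]`.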

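Example

Also note that a real app search should use https://play.google.com/store/search?q=weather&c=apps, and to get all of those apps you have to deal with dynamically rendered/loaded content and scrolling. That is why this example is based on selenium: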

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://play.google.com/store/search?q=weather&c=apps'

# Assumes Selenium 4+, where the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 10)

# Keep scrolling to the footer link until the scroll position stops changing
while True:
    last_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
    e = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'a[href="https://policies.google.com/privacy"]')))[-1]
    driver.execute_script("arguments[0].scrollIntoView();", e)
    time.sleep(0.5)  # give lazily loaded results time to render

    if last_height == driver.execute_script("return window.pageYOffset + window.innerHeight"):
        break

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

data = []

for e in soup.select('a[href^="/store/apps/details?id"]:has(div[title])'):
    data.append({
        'title': e.select_one('div[title]').get('title'),
        'company': e.find_next('a').text,
        'url': 'https://play.google.com' + e.get('href')
    })

# to_csv() returns None when given a path, so write the file instead of printing it
pd.DataFrame(data).to_csv('app.csv', index=False)

Output

title | company | url
Weather app | Accurate Weather Forecast & Weather Radar Map | https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel
The Weather Channel - Radar | The Weather Channel | https://play.google.com/store/apps/details?id=com.weather.Weather
AccuWeather: Weather Radar | AccuWeather | https://play.google.com/store/apps/details?id=com.accuweather.android
Weather by WeatherBug | WeatherBug | https://play.google.com/store/apps/details?id=com.aws.android
Weather Forecast - Accurate Local Weather & Widget | Weather Forecast & Widget & Radar | https://play.google.com/store/apps/details?id=com.accurate.weather.forecast.live
The Weather Channel | Weather Group, LLC | https://play.google.com/store/apps/details?id=com.weathergroup.twc
WeatherNation | WeatherNation TV, Inc. | https://play.google.com/store/apps/details?id=com.weathernationtv
1Weather Forecasts & Radar | OneLouder Apps | https://play.google.com/store/apps/details?id=com.handmark.expressweather
Weather data & microclimate : Weather Underground | Weather Underground | https://play.google.com/store/apps/details?id=com.wunderground.android.weather
Weather & Widget - Weawow | weawow weather app | https://play.google.com/store/apps/details?id=com.weawow
Weather forecast | smart-pro android apps | https://play.google.com/store/apps/details?id=com.graph.weather.forecast.channel

...
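
The scroll loop in the selenium example boils down to "keep scrolling until the reported position stops changing". A minimal sketch of that idea, decoupled from the browser so it can be run anywhere: scroll_until_stable, FakePage and the max_rounds safety cap are hypothetical names added for illustration and are not part of the answer's code.

```python
def scroll_until_stable(get_position, scroll_once, max_rounds=50):
    """Call scroll_once() until get_position() stops changing.

    Returns the number of scrolls performed. max_rounds is a safety cap
    added by this sketch so a page that never settles cannot loop forever.
    """
    rounds = 0
    last = get_position()
    while rounds < max_rounds:
        scroll_once()
        rounds += 1
        current = get_position()
        if current == last:
            break  # nothing new was loaded, so we reached the end
        last = current
    return rounds


class FakePage:
    """Stand-in for a browser page that loads more content three times."""

    def __init__(self):
        self.position = 0
        self.loads_left = 3

    def scroll(self):
        if self.loads_left > 0:
            self.position += 1000
            self.loads_left -= 1


page = FakePage()
rounds = scroll_until_stable(lambda: page.position, page.scroll)
print(rounds, page.position)
```

With selenium, get_position would wrap the `driver.execute_script("return window.pageYOffset + window.innerHeight")` call and scroll_once would wrap the scrollIntoView call plus the short sleep.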

Make sure you're using a user-agent so the request looks like a "real" user request. Without a user-agent in the request headers you can sometimes receive different HTML, with different elements and selectors, or some sort of error.

Check what your user-agent is and update it when possible, because if the user-agent is old, for example Chrome version 70, the website may block the request.
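
As a rough way to spot a stale user-agent before sending it, the Chrome major version can be read out of the string. This is only a sketch: chrome_major_version, looks_outdated and the MINIMUM_MAJOR threshold are hypothetical, and the threshold in particular is an arbitrary value chosen for the example, not a limit published by Google.

```python
import re

# Arbitrary example threshold (an assumption of this sketch)
MINIMUM_MAJOR = 90


def chrome_major_version(user_agent):
    """Return the Chrome major version found in a user-agent string, or None."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return int(match.group(1)) if match else None


def looks_outdated(user_agent, minimum_major=MINIMUM_MAJOR):
    """True only when a Chrome version is present and below minimum_major."""
    version = chrome_major_version(user_agent)
    return version is not None and version < minimum_major


current_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
)
old_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)

print(chrome_major_version(current_ua), looks_outdated(current_ua))
print(chrome_major_version(old_ua), looks_outdated(old_ua))
```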

Also, have a look at the SelectorGadget Chrome extension, which lets you grab CSS selectors visually by clicking on the desired element in your browser.


Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, json, lxml, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "weather",  # search query
    "c": "apps"      # display list of apps
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://play.google.com/store/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

apps_data = []

for app in soup.select(".mpg5gc"):
    title = app.select_one(".nnK0zc").text
    company = app.select_one(".b8cIId.KoLSrc").text
    description = app.select_one(".b8cIId.f5NCO a").text
    app_link = f'https://play.google.com{app.select_one(".b8cIId.Q9MA7b a")["href"]}'
    developer_link = f'https://play.google.com{app.select_one(".b8cIId.KoLSrc a")["href"]}'
    app_id = app.select_one(".b8cIId a")["href"].split("id=")[1]
    developer_id = app.select_one(".b8cIId.KoLSrc a")["href"].split("id=")[1]
    
    try:
        # https://regex101.com/r/SZLPRp/1
        rating = re.search(r"\d{1}\.\d{1}", app.select_one(".pf5lIe div[role=img]")["aria-label"]).group(0)
    except (AttributeError, TypeError, KeyError):
        # no rating element, no aria-label attribute, or no X.Y number in it
        rating = None

    thumbnail = app.select_one(".yNWQ8e img")["data-src"]

    apps_data.append({
        "title": title,
        "company": company,
        "description": description,
        "rating": float(rating) if rating else None,  # float when a rating was found, else None
        "app_link": app_link,
        "developer_link": developer_link,
        "app_id": app_id,
        "developer_id": developer_id,
        "thumbnail": thumbnail
    })

print(json.dumps(apps_data, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "title": "Weather app",
    "company": "Accurate Weather Forecast & Weather Radar Map",
    "description": "The weather channel, tiempo weather forecast, weather radar & weather map",
    "rating": 4.6,
    "app_link": "https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel",
    "developer_link": "https://play.google.com/store/apps/developer?id=Accurate+Weather+Forecast+%26+Weather+Radar+Map",
    "app_id": "com.weather.forecast.weatherchannel",
    "developer_id": "Accurate+Weather+Forecast+%26+Weather+Radar+Map",
    "thumbnail": "https://play-lh.googleusercontent.com/GdXjVGXQ90eVNpb1VoXWGT3pff2M9oe3yDdYGIsde7W9h3s2S6FDLfo1uO-gljBZ1QXO=s128-rw"
  },
  {
    "title": "The Weather Channel - Radar",
    "company": "The Weather Channel",
    "description": "Weather Forecast & Snow Radar: local rain tracker, weather maps & alerts",
    "rating": 4.6,
    "app_link": "https://play.google.com/store/apps/details?id=com.weather.Weather",
    "developer_link": "https://play.google.com/store/apps/dev?id=5938833519207566184",
    "app_id": "com.weather.Weather",
    "developer_id": "5938833519207566184",
    "thumbnail": "https://play-lh.googleusercontent.com/RV3DftXlA7WUV7w-BpE8zM0X7Y4RQd2vBvZVv6A01DEGb_eXFRjLmUhSqdbqrEl9klI=s128-rw"
  },
  {
    "title": "Weather - By Xiaomi",
    "company": "Xiaomi Inc.",
    "description": "Always with you, rain or shine. Get temperature, forecast, AQI for you city.",
    "rating": 4.4,
    "app_link": "https://play.google.com/store/apps/details?id=com.miui.weather2",
    "developer_link": "https://play.google.com/store/apps/dev?id=5113340212256272297",
    "app_id": "com.miui.weather2",
    "developer_id": "5113340212256272297",
    "thumbnail": "https://play-lh.googleusercontent.com/sAZ2AZ16r5ThHiYCTWg8x1UUNQOhsxexRaDrDZKDlUy1hoZlggen6QogpJmQk8BwmgI=s128-rw"
  }, ... other results
]
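
The id and rating extraction in the code above can also be written without string splitting. The following is a minimal sketch using only the standard library; id_from_href and parse_rating are hypothetical helper names, and the aria-label wording in the usage lines is an assumed example rather than text taken from the page.

```python
import re
from urllib.parse import parse_qs, urlparse


def id_from_href(href):
    """Return the decoded `id` query parameter of a Play Store href, or None."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


def parse_rating(aria_label):
    """Return the first X.Y number in an aria-label as a float, or None."""
    match = re.search(r"\d\.\d", aria_label or "")
    return float(match.group(0)) if match else None


# Extra query parameters after the id do not leak into the result
app_id = id_from_href("/store/apps/details?id=com.weather.Weather&hl=en")

# parse_qs also URL-decodes the value, unlike split("id=")[1]
developer_id = id_from_href(
    "/store/apps/developer?id=Accurate+Weather+Forecast+%26+Weather+Radar+Map"
)

# Assumed label wording, for illustration only
rating = parse_rating("Rated 4.6 stars out of five stars")

print(app_id, developer_id, rating)
```

Note the design difference: because parse_qs decodes the value, the developer id comes back as readable text instead of the `+` and `%26` form shown in the output above. Keep the split-based version if you need the raw, still-encoded string.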

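Another solution is to use the Google Play Store API from SerpApi. It's a paid API with a free plan.

The difference is that you don't need to create a parser from scratch, maintain it, figure out how to extract the data, or bypass blocks from Google or other search engines.

Code to integrate: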

from serpapi import GoogleSearch
import json

params = {
    "api_key": "API KEY",      # your serpapi api key
    "engine": "google_play",   # search engine
    "hl": "en",                # language
    "store": "apps",           # apps search
    "gl": "us",                # country to search from; different countries return different results
    "q": "weather"             # search query
}

search = GoogleSearch(params)  # where data extracts
results = search.get_dict()    # JSON -> Python dictionary

apps_data = []

for apps in results["organic_results"]:
    for app in apps["items"]:
        apps_data.append({
            "title": app.get("title"),
            "link": app.get("link"),
            "description": app.get("description"),
            "product_id": app.get("product_id"),
            "rating": app.get("rating"),
            "thumbnail": app.get("thumbnail"),
            })

print(json.dumps(apps_data, indent=2, ensure_ascii=False))

Part of the output (it contains other data that you can see in the Playground):

[
  {
    "title": "Weather app",
    "link": "https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel",
    "description": "The weather channel, tiempo weather forecast, weather radar & weather map",
    "product_id": "com.weather.forecast.weatherchannel",
    "rating": 4.7,
    "thumbnail": "https://play-lh.googleusercontent.com/GdXjVGXQ90eVNpb1VoXWGT3pff2M9oe3yDdYGIsde7W9h3s2S6FDLfo1uO-gljBZ1QXO=s128-rw"
  },
  {
    "title": "The Weather Channel - Radar",
    "link": "https://play.google.com/store/apps/details?id=com.weather.Weather",
    "description": "Weather Forecast & Snow Radar: local rain tracker, weather maps & alerts",
    "product_id": "com.weather.Weather",
    "rating": 4.6,
    "thumbnail": "https://play-lh.googleusercontent.com/RV3DftXlA7WUV7w-BpE8zM0X7Y4RQd2vBvZVv6A01DEGb_eXFRjLmUhSqdbqrEl9klI=s128-rw"
  },
  {
    "title": "AccuWeather: Weather Radar",
    "link": "https://play.google.com/store/apps/details?id=com.accuweather.android",
    "description": "Your local weather forecast, storm tracker, radar maps & live weather news",
    "product_id": "com.accuweather.android",
    "rating": 4.0,
    "thumbnail": "https://play-lh.googleusercontent.com/EgDT3XrIaJbhZjINCWsiqjzonzqve7LgAbim8kHXWgg6fZnQebqIWjE6UcGahJ6yugU=s128-rw"
  },
  {
    "title": "Weather by WeatherBug",
    "link": "https://play.google.com/store/apps/details?id=com.aws.android",
    "description": "The Most Accurate Weather Forecast. Alerts, Radar, Maps & News from WeatherBug",
    "product_id": "com.aws.android",
    "rating": 4.7,
    "thumbnail": "https://play-lh.googleusercontent.com/_rZCkobaGZzXN3iquPr4u2KOe7C-ljnrSkBfw6sVL1kpUfq3sBl5MoRJEisBSnxaD-M=s128-rw"
  }, ... other results
]
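
If a response might lack "organic_results", or a section might lack "items", the nested loop above can be made tolerant of that. A minimal sketch on a hand-written dictionary: flatten_apps is a hypothetical helper, and the second section in sample (the one without "items") is invented purely to exercise the fallback. The nesting itself follows the shape used in the code above.

```python
def flatten_apps(results):
    """Flatten organic_results -> items into one list of app dictionaries."""
    apps = []
    for section in results.get("organic_results", []):
        for app in section.get("items", []):
            apps.append({
                "title": app.get("title"),
                "product_id": app.get("product_id"),
                "rating": app.get("rating"),
            })
    return apps


# Hand-written stand-in for search.get_dict(); no API call is made here
sample = {
    "organic_results": [
        {
            "items": [
                {
                    "title": "Weather app",
                    "product_id": "com.weather.forecast.weatherchannel",
                    "rating": 4.7,
                }
            ]
        },
        {"title": "section without items"},  # invented for this sketch
    ]
}

flat = flatten_apps(sample)
print(flat)
```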

I also have a dedicated blog post, Scrape Google Play Search Apps in Python, with a step-by-step explanation that would be too much for this answer.

Disclaimer, I work for SerpApi.