Unable to scrape different owner names from different box-like containers out of a map

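I'm trying to click on a map using selenium so that I can scrape the parcel ID and owner name from a box-like container. Such a container appears when you click on the map, and it is from these containers that I want to scrape the parcel ID and owner name. This is how the box-like container looks. I tried using requests but could not find any way to locate the information shown in such containers, so I'm now trying selenium. The script below neither clicks on the map nor throws any error.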

website with map

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "http://app01.cityofboston.gov/parcelviewer/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 20)
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "svg#mapDiv_gc"))):
    item.click()
driver.quit()

How can I get the parcel ID and owner name from the different box-like containers in that map?

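Try the .move_to_element_with_offset(to_element, xoffset, yoffset) method of the ActionChains class, which lets you click at a specific x, y position relative to an element. The script below clicks at each x, y position given in two lists.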

The starting x is placed just past the width of the left navigation panel, i.e.:

left_nav = driver.find_element(By.ID, 'searchBox')
xstart = left_nav.size['width']

The starting y is placed just past the height of the top navigation bar, i.e.:

top_nav = driver.find_element(By.ID, 'headerFrame')
ystart = top_nav.size['height']

The following code clicks along a constant y position:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#add following import
from selenium.webdriver import ActionChains

link = "http://app01.cityofboston.gov/parcelviewer/"

driver = webdriver.Chrome()
driver.get(link)
driver.maximize_window()
map_element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'svg#mapDiv_gc')))

left_nav = driver.find_element(By.ID, 'searchBox')
xstart = left_nav.size['width']

top_nav = driver.find_element(By.ID, 'headerFrame')
ystart = top_nav.size['height']

#random x y here
xlist_increment = [100, 200, 300, 400, 500, 600, 700, 800, 900]
ylist_increment = [300, 300, 300, 300, 300, 300, 300, 300, 300]

wait = WebDriverWait(driver, 1)
action = ActionChains(driver)

for x, y in zip(xlist_increment, ylist_increment):
    xoffset = xstart + x
    yoffset = ystart + y
    action.move_to_element_with_offset(map_element, xoffset, yoffset)
    action.click()
    action.perform()

    try:
        parcel_id = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='esriPopupWrapper']//b[contains(text(), 'Parcel ID')]//parent::div")))
        owner_name = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='esriPopupWrapper']//b[contains(text(), 'Owner')]//parent::div")))
        print(parcel_id.text)
        print(owner_name.text)
        driver.find_element(By.CSS_SELECTOR, 'div.close').click()
    except Exception:
        print("popup doesn't appear")

driver.quit()

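Because the x, y positions are chosen arbitrarily, there is no guarantee that every click brings up the popup with the parcel ID and owner name that you mentioned, but I did get it more than once.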

Output:

Parcel ID: 0302895000 Owner: SIXTY3-65 COURT ST LLC Land Use: C

popup doesn't appear

Parcel ID: 0302897000 Owner: SEARS CRESCENT BUILDING LLC Land Use: C

popup doesn't appear

Parcel ID: 0303694000 Owner: TWENTY-8 STATE STREET LLC Land Use: C

popup doesn't appear

Parcel ID: 0303685000 Owner: ANBECA 60 LLC Land Use: C

popup doesn't appear

Parcel ID: 0303746000 Owner: STATE ENTERPRISES LIMITED PA Land Use: C
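
Since a single row of clicks misses many parcels, one possible refinement is to generate a regular grid of offsets instead of two hand-written lists. The helper below is only a sketch: grid_offsets is a hypothetical name, and the start values, area size, and step are made-up numbers that you would replace with the measured xstart, ystart, and the real size of the map element.

```python
from itertools import product


def grid_offsets(xstart, ystart, width, height, step=100):
    """Yield (xoffset, yoffset) pairs on a regular grid inside a width x height area.

    xstart and ystart play the same role as in the script above: they push the
    clicks past the left search panel and the top header.
    """
    xs = range(step, width, step)
    ys = range(step, height, step)
    for x, y in product(xs, ys):
        yield xstart + x, ystart + y


# Hypothetical numbers, for illustration only
offsets = list(grid_offsets(xstart=300, ystart=80, width=400, height=300, step=100))
print(offsets)
```

Each pair could then be passed to move_to_element_with_offset in place of the values produced by zip(xlist_increment, ylist_increment). A smaller step gives denser coverage at the cost of more clicks.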

This data comes from an ArcGIS REST service.

I found the ArcGIS query call that returns the required data:

GET https://services.arcgis.com/sFnw0xNflSi8J0uh/arcgis/rest/services/Parcels19WMFull/FeatureServer/0/query

I looked into what generates this URL and found the following:

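This query call is made when you search for data in the input at the top left. You can edit the URL parameters so that the query matches all the data: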

{
    "f": "json",
    "where": "1=1",
    "returnGeometry": "true",
    "spatialRel": "esriSpatialRelIntersects",
    "outFields": "*",
    "outSR": "102100"
}
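
As a quick way to try these parameters in a browser, they can be encoded into a full query URL with the standard library. This is a minimal sketch: build_query_url is a hypothetical helper, and the narrowed field list in the second call uses the PID and OWNER field names that appear in the sample entry further down.

```python
from urllib.parse import urlencode

QUERY_URL = (
    "https://services.arcgis.com/sFnw0xNflSi8J0uh/arcgis/rest/services/"
    "Parcels19WMFull/FeatureServer/0/query"
)


def build_query_url(where="1=1", out_fields="*"):
    """Return the query endpoint URL with the parameters URL-encoded."""
    params = {
        "f": "json",
        "where": where,
        "returnGeometry": "true",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": out_fields,
        "outSR": "102100",
    }
    return f"{QUERY_URL}?{urlencode(params)}"


# Match everything, all fields
print(build_query_url())
# Same filter, but only the parcel ID and owner fields
print(build_query_url(out_fields="PID,OWNER"))
```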

It returns at most 2000 items, so we need to iterate. To see how to iterate, we can look at what the features array contains. Checking this query gives something like:

{
  "attributes": {
    "FID": 1,
    "FULL_ADDRE": "104 A 104 PUTNAM ST, 02128",
    "PID": "0100001000"
  }
},
{
  "attributes": {
    "FID": 2,
    "FULL_ADDRE": "18 LEVERETT AV #10-B, 02128",
    "PID": "0101399120"
  }
},
{
  "attributes": {
    "FID": 3,
    "FULL_ADDRE": "197 LEXINGTON ST, 02128",
    "PID": "0100002000"
  }
}
....

So we can iterate on the FID field using where=FID > 2000. For each subsequent iteration, we just store the last FID we received and edit the where clause to FID > {last_fid}.

So you can build a script like this:

import requests

base_url = "http://app01.cityofboston.gov/parcelviewer"

# get map id
r = requests.get(f"{base_url}/config/ParcelViewer.json")
map_id = r.json()["values"]["webmap"]

# get the query url
r = requests.get(f"https://www.arcgis.com/sharing/rest/content/items/{map_id}/data", params = {
    "f": "json"
})
url = r.json()["operationalLayers"][0]["url"]

params = {
    "f": "json",
    "where": "1=1",
    "returnGeometry": "true",
    "spatialRel": "esriSpatialRelIntersects",
    "outFields": "*",
    "outSR": "102100"
}

data = []
count = 1
finish = False

while not finish:
    print(f"[{count}] requesting...")
    r = requests.get(f"{url}/query", params = params)
    entries = r.json()["features"]
    if len(entries) < 2000:
        finish = True
    else:
        last_fid = entries[-1]["attributes"]["FID"]
        print(f"next fid : {last_fid}")
        params["where"] = f"FID > {last_fid}"
    data.extend(entries)
    print(f"[{count}] received {len(entries)} items - total received : {len(data)}")
    count +=1

print(f"TOTAL: {len(data)}")

# print the last element (just to check)
print(data[-1])

After a few minutes, the script had extracted 171922 records:


This is what an entry looks like:

{
    'attributes': {
        'FID': 171922,
        'PID_LONG': '2205670000',
        'PID': '2205670000',
        'GIS_ID': '2205670000',
        'FULL_ADDRE': '2203 COMMONWEALTH AV, 02135',
        'OWNER': 'COMMWLTH OF MASS',
        'LAND_USE': 'E',
        'LAND_SF': 34125,
        'LIVING_ARE': 7386,
        'AV_LAND': 1325400,
        'AV_BLDG': 841100,
        'AV_TOTAL': 2166500,
        'GROSS_TAX': 0,
        'ID': 0,
        'SHAPE_Leng': 1003.12908156,
        'SHAPE_Area': 33512.6220608,
        'Shape__Area': 5702.6640625,
        'Shape__Length': 414.046143349521
    },
    'geometry': {
        'rings': [
            [
                [-7922244.91043368, 5212145.61745703],
                [-7922247.98527419, 5212105.5446644],
                [-7922243.75007186, 5212106.29247827],
                [-7922235.83595224, 5212062.80771992],
                [-7922239.05526106, 5212062.68000813],
                [-7922327.54387782, 5212214.66112252],
                [-7922281.74795739, 5212208.62518937],
                [-7922266.82960043, 5212207.97287607],
                [-7922241.02937963, 5212204.61661323],
                [-7922244.0269726, 5212158.45234151],
                [-7922244.91043368, 5212145.61745703]
            ]
        ]
    }
}
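
Given entries shaped like the one above, pulling out just the parcel ID and owner name asked about in the question could look like the following. It is a small sketch: parcel_owner_rows is a hypothetical helper, and the sample list is a trimmed copy of the entry above rather than live API output.

```python
def parcel_owner_rows(features):
    """Return a list of (PID, OWNER) tuples from ArcGIS feature dicts."""
    rows = []
    for feature in features:
        # Missing keys yield None instead of raising
        attrs = feature.get("attributes", {})
        rows.append((attrs.get("PID"), attrs.get("OWNER")))
    return rows


# Trimmed copy of the entry shown above
sample = [
    {"attributes": {"FID": 171922, "PID": "2205670000", "OWNER": "COMMWLTH OF MASS"}},
]

for pid, owner in parcel_owner_rows(sample):
    print(f"Parcel ID: {pid} Owner: {owner}")
```

With the script above, the same function could be called on data once the loop has finished.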

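One last thing: to check the result count directly against the API, we can use the query parameters from the ArcGIS query UI, such as this one (which, by the way, is the one the map on the website uses). When asking only for a count, it adds the field returnCountOnly=true, so let's do the same on the query endpoint: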

https://services.arcgis.com/sFnw0xNflSi8J0uh/arcgis/rest/services/Parcels19WMFull/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=FID%2CFULL_ADDRE%2CPID&outSR=102100&returnCountOnly=true

Which returns the matching count:

{"count":171922}
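
The count also gives a rough idea of how many requests the pagination loop needs. The sketch below assumes the server returns full pages of 2000 until the data runs out, which is an assumption rather than something the API guarantees. expected_requests is a hypothetical helper.

```python
def expected_requests(total_count, page_size=2000):
    """Estimate how many requests a loop that stops on a short page will make.

    The loop only stops when a page comes back with fewer than page_size
    items, so an exact multiple of page_size costs one extra (empty) request.
    """
    return total_count // page_size + 1


print(expected_requests(171922))
```

For 171922 records this works out to 85 full pages plus one final page of 1922 items.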

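Note that you can apply some variation of this script to any ArcGIS REST service of the query type. I made an example in this gist that fetches data from a map (of cities). Also note that the maximum number of results returned by the API may vary from service to service.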