Web scraping remax.com with Python
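This is similar to a question I asked earlier, which received a perfect answer. Now that I have something to work with, I would like to stop entering the URL manually to get the data. I want to write a function that takes only the address and zip code and returns the data I want.
The problem now is modifying the URL to get the correct one. For example: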
url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
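I see that besides the address, state, and zip code, the URL also contains a number, gid100012499996, which appears to be unique to each address. So I am not sure how to build the function I want.
Here is my code: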
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
from bs4 import BeautifulSoup
import pandas as pd

def get_data(url):
    # Browser-like headers for the request
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    request = urllib.request.Request(url, headers=hdr)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, 'html.parser')
    # The square footage is held in this span on the listing page
    foot = soup.find('span', class_="listing-detail-sqft-val")
    print(foot.text.strip())

url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
get_data(url)
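What I want is something like the above, except that get_data() would take the address, state, and zip code. My apologies if this is not a suitable question for this site.
The site has a JSON API that returns the details of all properties within a given rectangle. The rectangle is defined by the latitude and longitude of its NW and SE corners. The following request shows a possible search: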
import requests

params = {
    "nwlat" : 41.841966864112,       # Calculate from address
    "nwlong" : -74.08774571289064,   # Calculate from address
    "selat" : 41.64189784194883,     # Calculate from address
    "selong" : -73.61430363525392,   # Calculate from address
    "Count" : 100,
    "pagenumber" : 1,
    "SiteID" : "68000000",
    "pageCount" : "10",
    "tab" : "map",
    "sh" : "true",
    "forcelatlong" : "true",
    "maplistings" : "1",
    "maplistcards" : "0",
    "sv" : "true",
    "sortorder" : "newest",
    "view" : "forsale",
}

req_properties = requests.get("https://www.remax.com/api/listings", params=params)
matching_properties_json = req_properties.json()

for p in matching_properties_json[0]:
    print(f"{p['Address']:<40} {p.get('BedRooms', 0)} beds | {int(p.get('BathRooms',0))} baths | {p['SqFt']} sqft")
This returns 100 results (a tighter rectangle would obviously return fewer). For example:
3 Pond Ridge Road 2 beds | 3.0 baths | 2532 sqft
84 Hudson Avenue 3 beds | 1.0 baths | 1824 sqft
116 HUDSON POINTE DR 2 beds | 3.0 baths | 2455 sqft
6 Falcon Drive 4 beds | 3.0 baths | 1993 sqft
53 MAPLE 5 beds | 2.0 baths | 3511 sqft
4 WOODLAND CIR 3 beds | 2.0 baths | 1859 sqft
.
.
.
95 S HAMILTON ST 3 beds | 1.0 baths | 2576 sqft
40 S Manheim Boulevard 2 beds | 2.0 baths | 1470 sqft
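The loop above indexes 'Address' and 'SqFt' directly, so a listing without one of those keys would raise a KeyError. As a minimal sketch, assuming the API may omit or null out any of these fields (not verified against the live API), a helper such as the hypothetical summarize_listing below formats one listing defensively:

```python
def summarize_listing(listing):
    """Format one listing dict as a single line, tolerating missing fields.

    The key names come from the loop above; summarize_listing itself is a
    hypothetical helper, not part of any remax.com API.
    """
    address = listing.get("Address", "(no address)")
    beds = listing.get("BedRooms") or 0
    # int() truncates fractional bath counts, as the original loop does.
    baths = int(listing.get("BathRooms") or 0)
    sqft = listing.get("SqFt")
    sqft_text = "n/a" if sqft is None else str(sqft)
    return f"{address:<40} {beds} beds | {baths} baths | {sqft_text} sqft"


if __name__ == "__main__":
    sample = {"Address": "84 Hudson Avenue", "BedRooms": 3, "BathRooms": 1.0, "SqFt": 1824}
    print(summarize_listing(sample))
```

The print call in the loop could then become print(summarize_listing(p)).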
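If you have an address, you would need to calculate its latitude and longitude, create a small rectangle around that point by choosing NW and SE corners, and then build the request from those numbers. You would then get a list of all properties in that area (hopefully just one).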
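The first of those steps, turning an address into coordinates, needs a geocoding service, which remax.com is not shown to provide here. A rough sketch, assuming you supply your own geocoder as a callable (for example one backed by a geocoding library or web service): locate_address and the geocode parameter are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]
# Hypothetical interface: takes a free-form address string and returns
# (latitude, longitude), or None when the address cannot be resolved.
Geocoder = Callable[[str], Optional[Coordinates]]


def locate_address(address: str, state: str, zip_code: str,
                   geocode: Geocoder) -> Coordinates:
    """Build a single query string and ask the supplied geocoder for coordinates."""
    query = f"{address}, {state} {zip_code}"
    coords = geocode(query)
    if coords is None:
        raise ValueError(f"could not geocode {query!r}")
    lat, lon = coords
    return float(lat), float(lon)


if __name__ == "__main__":
    # Stand-in geocoder returning fixed placeholder coordinates (borrowed
    # from the answer's example); a real one would call a geocoding service.
    def fake_geocode(query: str) -> Optional[Coordinates]:
        return (41.841966864112, -74.08774571289064)

    print(locate_address("25 Montage Way", "CA", "92651", fake_geocode))
```

Passing the geocoder in as an argument keeps the function independent of any particular service, so the provider can be swapped without touching the search code.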
To build the search square, you could use something like:
lat = 41.841966864112
long = -74.08774571289064
square_size = 0.001

params = {
    "nwlat" : lat + square_size,
    "nwlong" : long - square_size,
    "selat" : lat - square_size,
    "selong" : long + square_size,
    "Count" : 100,
    "pagenumber" : 1,
    "SiteID" : "68000000",
    "pageCount" : "10",
    "tab" : "map",
    "sh" : "true",
    "forcelatlong" : "true",
    "maplistings" : "1",
    "maplistcards" : "0",
    "sv" : "true",
    "sortorder" : "newest",
    "view" : "forsale",
}
square_size would need to be adjusted depending on how accurate your address is.
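Even a small square may return more than one property, so the results still have to be narrowed down to the address you asked for. A minimal sketch, assuming each result is a dict with an 'Address' key as in the loop earlier: find_listing is a hypothetical helper that compares addresses after normalizing case and whitespace. It does not reconcile abbreviations such as "DR" versus "Drive", so treat it as a starting point only.

```python
def normalize_address(text):
    """Upper-case the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.upper().split())


def find_listing(listings, street_address):
    """Return the first listing whose 'Address' matches, or None.

    Matching is exact after normalization; abbreviations ("DR" vs "Drive")
    are treated as different strings.
    """
    wanted = normalize_address(street_address)
    for listing in listings:
        if normalize_address(listing.get("Address", "")) == wanted:
            return listing
    return None


if __name__ == "__main__":
    # Sample data shaped like the results printed earlier.
    results = [
        {"Address": "84 Hudson Avenue", "SqFt": 1824},
        {"Address": "116 HUDSON POINTE DR", "SqFt": 2455},
    ]
    print(find_listing(results, "116 hudson  pointe dr"))
```

With the request shown earlier, the listings argument would be matching_properties_json[0].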