Web scraping Craigslist apartment prices in Python not showing the highest-cost apartment
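It reports the highest apartment price as $4700, but I can see listings priced at over a million. Why aren't those showing up? What am I doing wrong?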
import requests
import re
r = requests.get("http://orlando.craigslist.org/search/apa")
r.raise_for_status()
html = r.text
matches = re.findall(r'<span class="price">\$(\d+)</span>', html)
prices = map(int, matches)
print "Highest price: ${}".format(max(prices))
print "Lowest price: ${}".format(min(prices))
print "Average price: ${}".format(sum(prices)/len(prices))
Use an HTML parser; bs4 is very easy to use. You can sort by price by appending ?sort=pricedsc to the URL, so the first match will be the max and the last will be the lowest (for that page):
r = requests.get("http://orlando.craigslist.org/search/apa?sort=pricedsc")
from bs4 import BeautifulSoup
html = r.content
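soup = BeautifulSoup(html)
# collect every listed price on the page as an int
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]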
print "Highest price: ${}".format(prices[0])
print "Lowest price: ${}".format(prices[-1])
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))
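To make the parsing step concrete, here is a minimal sketch that runs without touching the network. It is written for Python 3, unlike the Python 2 snippets in this thread. The sample markup and the `extract_prices` helper are assumptions made for illustration; only the `span.price` selector comes from the answer, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one page of search results.
SAMPLE_HTML = """
<ul>
  <li><span class="price">$4700</span></li>
  <li><span class="price">$1250</span></li>
  <li><span class="price">$850</span></li>
</ul>
"""


def extract_prices(html):
    """Return every span.price value in the markup as an int."""
    soup = BeautifulSoup(html, "html.parser")
    prices = []
    for tag in soup.select("span.price"):
        # Drop whitespace, the dollar sign and any thousands separators.
        digits = tag.get_text().strip().lstrip("$").replace(",", "")
        if digits.isdigit():
            prices.append(int(digits))
    return prices


prices = extract_prices(SAMPLE_HTML)
print("Highest price: ${}".format(max(prices)))
print("Lowest price: ${}".format(min(prices)))
print("Average price: ${:.2f}".format(sum(prices) / len(prices)))
```

Skipping anything that is not purely digits is a design choice for this sketch: a listing with an unexpected price format is ignored instead of raising ValueError halfway through the page.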
If you want the lowest price, you need to sort in ascending order:
r = requests.get("http://orlando.craigslist.org/search/apa?sort=priceasc")
from bs4 import BeautifulSoup
html = r.content
soup = BeautifulSoup(html)
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]
print "Highest price: ${}".format(prices[-1])
print "Lowest price: ${}".format(prices[0])
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))
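Since the two requests differ only in the sort parameter, the URL can be built from one place. The following is a small Python 3 sketch using only the standard library; `BASE_URL` and `search_url` are hypothetical names introduced here, and the two sort values are the ones used in this answer.

```python
from urllib.parse import urlencode

BASE_URL = "http://orlando.craigslist.org/search/apa"


def search_url(sort=None):
    """Build the search URL, optionally with a sort order such as 'pricedsc'."""
    if sort is None:
        return BASE_URL
    return "{}?{}".format(BASE_URL, urlencode({"sort": sort}))


print(search_url("pricedsc"))
print(search_url("priceasc"))
```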
The output is now very different:
Highest price:
Lowest price:
Average price: .89
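If you want the average over all listings, you need to add more logic. By default you only see 100 of the 2500 results, but we can change that by following the link to the next page.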
r = requests.get("http://orlando.craigslist.org/search/apa")
from bs4 import BeautifulSoup
html = r.content
soup = BeautifulSoup(html)
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]
# link to next 100 results
nxt = soup.select_one("a.button.next")["href"]
# keep looping until we find a page with no next button
while nxt:
    url = "http://orlando.craigslist.org{}".format(nxt)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # extend prices to our list
    prices.extend([int(pr.text.strip("$")) for pr in soup.select("span.price")])
    nxt = soup.select_one("a.button.next")
    if nxt:
        nxt = nxt["href"]
That will give you all of the listings, from 1 to 2500.
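The same pagination loop can be sketched as a reusable function that is easy to try offline. This is Python 3 and only an illustration: `collect_prices`, the `fetch` argument, and the two fake pages are assumptions, while the `span.price` and `a.button.next` selectors are the ones used in the loop in this answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_prices(start_url, fetch):
    """Follow a.button.next links from start_url, gathering all prices.

    fetch is any callable that maps a URL to that page's HTML.
    """
    prices = []
    url = start_url
    seen = set()
    while url and url not in seen:
        seen.add(url)  # guard against a page linking back to itself
        soup = BeautifulSoup(fetch(url), "html.parser")
        for tag in soup.select("span.price"):
            prices.append(int(tag.get_text().strip().strip("$")))
        nxt = soup.select_one("a.button.next")
        # The last page has no next button, or one without a usable href.
        href = nxt.get("href") if nxt else None
        url = urljoin(url, href) if href else None
    return prices


# Hypothetical two-page site used in place of real network requests.
FAKE_PAGES = {
    "http://example.com/search/apa":
        '<span class="price">$900</span>'
        '<a class="button next" href="/search/apa?s=100">next</a>',
    "http://example.com/search/apa?s=100":
        '<span class="price">$1500</span>',
}

all_prices = collect_prices("http://example.com/search/apa", FAKE_PAGES.__getitem__)
print(all_prices)
```

Against the live site, something like `collect_prices(url, lambda u: requests.get(u).content)` should behave like the loop in the answer, assuming the page still exposes those two selectors. Passing the fetcher in as an argument is what lets the sketch run against canned pages.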