Python: How do I iterate through an HTML Element Object with LXML/Requests?
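I am trying to build a data table from a website using LXML and Requests. I need both the values stored in the tag attributes and the text contained inside the tags. Here is the HTML: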
<div class="houses">
<input type="hidden" class="houseNumber" value="107">
<input type="hidden" class="houseState" value="MT">
<input type="hidden" class="houseStatus" value="Occupied">
<div class="houseInfo">
<div class="houseCity">Helena</div>
<div class="houseArea">Helena Valley</div>
</div>
</div>
<div class="houses">
<input type="hidden" class="houseNumber" value="237">
<input type="hidden" class="houseState" value="MT">
<input type="hidden" class="houseStatus" value="Occupied">
<div class="houseInfo">
<div class="houseCity">East Helena</div>
<div class="houseArea">Helena Valley</div>
</div>
</div>
<div class="houses">
<input type="hidden" class="houseNumber" value="104">
<input type="hidden" class="houseState" value="MT">
<input type="hidden" class="houseStatus" value="Vacant">
<div class="houseInfo">
<div class="houseCity">Helena</div>
<div class="houseArea">Helena Valley</div>
</div>
</div>
From this, I want to create a table like this:
['107', 'MT', 'Occupied', 'Helena', 'Helena Valley']
['237', 'MT', 'Occupied', 'East Helena', 'Helena Valley']
['104', 'MT', 'Vacant', 'Helena', 'Helena Valley']
Using Requests and LXML, I tried to loop over each div class="houses"
to get what I need, but every time I print the values, I get:
['107', '237', '104']
['MT', 'MT', 'MT']
['Occupied', 'Occupied', 'Vacant']
['Helena', 'East Helena', 'Helena']
['Helena Valley', 'Helena Valley', 'Helena Valley']
['107', '237', '104']
['MT', 'MT', 'MT']
['Occupied', 'Occupied', 'Vacant']
['Helena', 'East Helena', 'Helena']
['Helena Valley', 'Helena Valley', 'Helena Valley']
['107', '237', '104']
['MT', 'MT', 'MT']
['Occupied', 'Occupied', 'Vacant']
['Helena', 'East Helena', 'Helena']
['Helena Valley', 'Helena Valley', 'Helena Valley']
Here is part of my code:
link = "example.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(link, headers=headers, allow_redirects=False)
sourceCode = response.content
htmlElem = html.document_fromstring(sourceCode)
houses = htmlElem.find_class('houses')
for house in houses:
    houseNumber = house.xpath('//input[@class="houseNumber"]/@value')
    houseState = house.xpath('//input[@class="houseState"]/@value')
    houseStatus = house.xpath('//input[@class="houseStatus"]/@value')
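How can I capture the data in a table like the one shown above? Is there a different way to iterate over the houses object?
Update: @efirvida, I have changed my code to the following: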
link = "example.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(link, headers=headers, allow_redirects=False)
sourceCode = response.content
htmlElem = html.document_fromstring(sourceCode)
houses = htmlElem.find_class('houses')
houseNumber = []
houseState = []
houseStatus = []
for house in houses:
    houseNumber.append(house.xpath('//input[@class="houseNumber"]/@value'))
    print(houseNumber)
    houseState.append(house.xpath('//input[@class="houseState"]/@value'))
    houseStatus.append(house.xpath('//input[@class="houseStatus"]/@value'))
data = map(list, zip(*[houseNumber,houseState,houseStatus]))
When I do this, the following is printed:
[['107', '237', '104']]
[['107', '237', '104'], ['107', '237', '104']]
[['107', '237', '104'], ['107', '237', '104'], ['107', '237', '104']]
Try transposing the result; see this thread for the idea behind my code.
# create one list per field
houseNumber = []
houseState = []
houseStatus = []
# append each house's value to its list; the leading "." keeps the XPath
# relative to the current house, and [0] takes the single matching value
for house in houses:
    houseNumber.append(house.xpath('.//input[@class="houseNumber"]/@value')[0])
    houseState.append(house.xpath('.//input[@class="houseState"]/@value')[0])
    houseStatus.append(house.xpath('.//input[@class="houseStatus"]/@value')[0])
# transpose the lists, and turn the result into a list of lists
data = map(list, zip(*[houseNumber, houseState, houseStatus]))
>>> list(data)
#[['107', 'MT', 'Occupied'], ['237', 'MT', 'Occupied'], ['104', 'MT', 'Vacant']]
If tuples work for you, just drop the map:
#just transpose
data = zip(*[houseNumber,houseState,houseStatus])
>>> list(data)
#[('107', 'MT', 'Occupied'), ('237', 'MT', 'Occupied'), ('104', 'MT', 'Vacant') ]
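The question also asks for the city and area columns. As a rough sketch of one way to get all five fields, the rows can be built directly inside the loop instead of being transposed afterwards. An XPath that starts with `//` searches the whole document even when it is called on a single house element, which is likely why every house printed all three values; starting it with `.//` keeps the search inside the current house. The `SOURCE` string and the `extract_rows` helper below are names made up for this illustration, and the markup is pasted from the question so the snippet runs without a network request. It also assumes every house contains each field exactly once.

```python
from lxml import html

# Sample markup copied from the question. In real use this would be
# response.content from the requests.get(...) call.
SOURCE = """
<html><body>
<div class="houses">
  <input type="hidden" class="houseNumber" value="107">
  <input type="hidden" class="houseState" value="MT">
  <input type="hidden" class="houseStatus" value="Occupied">
  <div class="houseInfo">
    <div class="houseCity">Helena</div>
    <div class="houseArea">Helena Valley</div>
  </div>
</div>
<div class="houses">
  <input type="hidden" class="houseNumber" value="237">
  <input type="hidden" class="houseState" value="MT">
  <input type="hidden" class="houseStatus" value="Occupied">
  <div class="houseInfo">
    <div class="houseCity">East Helena</div>
    <div class="houseArea">Helena Valley</div>
  </div>
</div>
<div class="houses">
  <input type="hidden" class="houseNumber" value="104">
  <input type="hidden" class="houseState" value="MT">
  <input type="hidden" class="houseStatus" value="Vacant">
  <div class="houseInfo">
    <div class="houseCity">Helena</div>
    <div class="houseArea">Helena Valley</div>
  </div>
</div>
</body></html>
"""


def extract_rows(source):
    """Return one [number, state, status, city, area] list per house."""
    doc = html.document_fromstring(source)
    rows = []
    for house in doc.find_class('houses'):
        # Each XPath starts with "." so it only looks inside this house.
        # [0] assumes the field is present exactly once per house.
        rows.append([
            house.xpath('.//input[@class="houseNumber"]/@value')[0],
            house.xpath('.//input[@class="houseState"]/@value')[0],
            house.xpath('.//input[@class="houseStatus"]/@value')[0],
            house.xpath('.//div[@class="houseCity"]/text()')[0],
            house.xpath('.//div[@class="houseArea"]/text()')[0],
        ])
    return rows


rows = extract_rows(SOURCE)
for row in rows:
    print(row)
```

With the sample markup this should print the three rows from the table in the question. If some houses may lack a field, the `[0]` lookups would raise an IndexError, so a check on the returned list would be needed there.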