Extracting rows from a table
I am trying to extract the rows, together with their cells, from the following table:
<table border="0" cellspacing="1" cellpading="3" width="100%">
<tr bgcolor="#505050">
<td><b></b></td>
<td colspan="2" align="center" class="white"><b>Last Day</b></td>
<td colspan="2" align="center" class="white"><b>Last Week</b></td>
</tr>
<tr bgcolor="#505050">
<td class="white"><b>Race</b></td>
<td align="center" class="white"><b>Killed Players</b></td>
<td align="center" class="white"><b>Killed by Players</b></td>
<td align="center" class="white"><b>Killed Players</b></td>
<td align="center" class="white"><b>Killed by Players</b></td>
</tr>
<tr bgcolor="#F1E0C6">
<td>A</td>
<td align="right">0</td>
<td align="right">3</td>
<td align="right">0</td>
<td align="right">13</td>
</tr>
<tr bgcolor="#D4C0A1">
<td>B</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr bgcolor="#F1E0C6">
<td>C</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">1</td>
</tr>
<tr bgcolor="#D4C0A1">
<td>D</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr bgcolor="#505050">
<td class=white><b>Total</b></td>
<td align="right" class="white"><b>210</b></td>
<td align="right" class="white"><b>1060458</b></td>
<td align="right" class="white"><b>1132</b></td>
<td align="right" class="white"><b>5585115</b></td>
</tr>
The rows I am interested in are the ones labeled A, B, C, etc., with the numbers next to them.
The solution I came up with is:
table = tree.xpath("//table/tr[td[not(contains(@class, 'white'))]]")
for tr in table:
    print(tr.xpath('td/text()'))
However, the output still includes the first row, the one with the empty cell and the Last Day/Last Week cells, and looks like this:
['', 'Last Day', 'Last Week']
['A', '0', '3', '0', '13']
['B', '0', '0', '2', '0']
['C', '0', '3', '0', '5']
How can I get rid of it?
Just change the tr step in your XPath to:
tr[not(contains(@bgcolor, "505050"))]
So your code should look like this:
from lxml import html
HTML = """<table border="0" cellspacing="1" cellpading="3" width="100%">
<tr bgcolor="#505050">
<td><b></b></td>
<td colspan="2" align="center" class="white"><b>Last Day</b></td>
<td colspan="2" align="center" class="white"><b>Last Week</b></td>
</tr>
<tr bgcolor="#505050">
<td class="white"><b>Race</b></td>
<td align="center" class="white"><b>Killed Players</b></td>
<td align="center" class="white"><b>Killed by Players</b></td>
<td align="center" class="white"><b>Killed Players</b></td>
<td align="center" class="white"><b>Killed by Players</b></td>
</tr>
<tr bgcolor="#F1E0C6">
<td>A</td>
<td align="right">0</td>
<td align="right">3</td>
<td align="right">0</td>
<td align="right">13</td>
</tr>
<tr bgcolor="#D4C0A1">
<td>B</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr bgcolor="#F1E0C6">
<td>C</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">1</td>
</tr>
<tr bgcolor="#D4C0A1">
<td>D</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr bgcolor="#505050">
<td class=white><b>Total</b></td>
<td align="right" class="white"><b>210</b></td>
<td align="right" class="white"><b>1060458</b></td>
<td align="right" class="white"><b>1132</b></td>
<td align="right" class="white"><b>5585115</b></td>
</tr>"""
tree = html.fromstring(HTML)
# Skip every row whose bgcolor is the dark header/footer color.
for item in tree.xpath('//table/tr[not(contains(@bgcolor, "505050"))]'):
    print(item.xpath('.//td/text()'))
This outputs:
['A', '0', '3', '0', '13']
['B', '0', '0', '0', '7']
['C', '0', '0', '0', '1']
['D', '0', '0', '0', '7']
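As a side note on why the original predicate let the first header row through: tr[td[not(contains(@class, 'white'))]] matches any row that has at least one td without the white class, and the first header row has exactly such a cell (the empty one). Negating the other way round, tr[not(td[contains(@class, 'white')])], should keep only rows with no white cell at all. The sketch below mimics both conditions in plain Python with the standard library's xml.etree.ElementTree on a trimmed-down, well-formed copy of the table. SNIPPET, is_white and first_cell are names made up for this illustration; it is not lxml and does not evaluate the XPath itself.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed copy of the table (illustrative only).
SNIPPET = """<table>
<tr bgcolor="#505050"><td><b></b></td><td class="white"><b>Last Day</b></td></tr>
<tr bgcolor="#505050"><td class="white"><b>Race</b></td><td class="white"><b>Killed Players</b></td></tr>
<tr bgcolor="#F1E0C6"><td>A</td><td align="right">0</td></tr>
<tr bgcolor="#505050"><td class="white"><b>Total</b></td><td class="white"><b>210</b></td></tr>
</table>"""


def is_white(td):
    # Same idea as contains(@class, 'white') in the XPath.
    return "white" in td.get("class", "")


def first_cell(tr):
    # Text of the row's first cell, used only to identify the row.
    return "".join(tr.find("td").itertext())


rows = ET.fromstring(SNIPPET).findall("tr")

# Like tr[td[not(contains(@class, 'white'))]]: at least one non-white cell.
any_non_white = [first_cell(tr) for tr in rows
                 if any(not is_white(td) for td in tr.findall("td"))]

# Like tr[not(td[contains(@class, 'white')])]: no white cell at all.
no_white = [first_cell(tr) for tr in rows
            if not any(is_white(td) for td in tr.findall("td"))]

print(any_non_white)  # the first header row (empty first cell) slips through
print(no_white)       # only the data row remains
```

Filtering on bgcolor, as in the answer, and filtering on "no white cell" both work for this table; which one is more robust depends on which attribute the page is less likely to change.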
That said, I would still suggest storing the rows in a dict(). See:
tree = html.fromstring(HTML)
results = dict()

def unpack(data):
    # Split a row into its label (first cell) and the remaining values.
    return data[0], data[1:]

for item in tree.xpath('//table/tr[not(contains(@bgcolor, "505050"))]'):
    key, values = unpack(item.xpath('.//td/text()'))
    results[key] = values
print(results)
Output (key order may differ; plain dicts were unordered before Python 3.7):
{
'A': ['0', '3', '0', '13'],
'C': ['0', '0', '0', '1'],
'B': ['0', '0', '0', '7'],
'D': ['0', '0', '0', '7']
}
In Python 3, there is no need for an unpack() function like the one above; you just need to change
key, values = unpack(item.xpath('.//td/text()'))
to
key, *values = item.xpath('.//td/text()')
See: https://www.python.org/dev/peps/pep-3132/
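A minimal Python 3 sketch of the starred assignment from PEP 3132. The rows list is a hand-written stand-in for what item.xpath('.//td/text()') returned for each row in the output above, so the snippet runs without lxml.

```python
# Stand-in for the per-row lists returned by item.xpath('.//td/text()').
rows = [
    ['A', '0', '3', '0', '13'],
    ['B', '0', '0', '0', '7'],
]

results = {}
for cells in rows:
    # The starred name collects everything after the first element as a list.
    key, *values = cells
    results[key] = values

print(results)
```

Note that the starred target always yields a list, so a row with a single cell would give an empty values list rather than an error.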
Also, if needed, you can use sorted() to sort the results alphabetically (by key):
[
('A', ['0', '3', '0', '13']),
('B', ['0', '0', '0', '7']),
('C', ['0', '0', '0', '1']),
('D', ['0', '0', '0', '7'])
]
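The answer does not show the call itself. One plausible form, assuming results is the dict built earlier, is sorted(results.items()), which returns a list of (key, values) tuples ordered by key. The dict below is copied from the output shown above so the sketch is self-contained.

```python
# Same data as the dict output above, in the same (unsorted) order.
results = {
    'A': ['0', '3', '0', '13'],
    'C': ['0', '0', '0', '1'],
    'B': ['0', '0', '0', '7'],
    'D': ['0', '0', '0', '7'],
}

# items() yields (key, values) pairs; tuples compare by their first
# element, so the pairs end up ordered alphabetically by key.
sorted_results = sorted(results.items())

for key, values in sorted_results:
    print(key, values)
```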