Parsing <div> HTML content with
I have the following output from a monitoring link, and I am trying to parse it into variables.
<html>
<head>
<style type="text/css"></style>
</head>
<body>
<div style="float:left;margin-right:50px">
<div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:
<div><br><br> DataCenter: DC1 NY [ENABLED]
<div><br> Active Zone : BW Zone 1[1], VIP = 192.168.254.10</div>
<div><br> <a href=https://192.168.254.10/checkGlobalReplicationTier>https://192.168.254.10/checkGlobalReplicationTier</a>
[ACTIVE]</div>
<div> <a href=https://192.168.254.10/checkReplication>https://192.168.254.10/checkReplication</a></div>
<div><br> <a href=https://192.168.254.11/checkGlobalReplicationTier>https://192.168.254.11/checkGlobalReplicationTier</a>
[STANDBY]</div>
<div> <a href=https://192.168.254.11/checkReplication>https://192.168.254.11/checkReplication</a></div>
<div><br> Local Zones:</div>
<div> LC Zone 3[3], VIP = 192.168.254.13
<div> <a href=https://192.168.254.13/checkReplication>https://192.168.254.13/checkReplication</a>
[ACTIVE]</div>
<div><br><br> DataCenter: DC2 NJ [ENABLED]
[DEFAULT DC]</div>
<div><br> Active Portal Zone : BW Zone 2[2], VIP = 192.168.253.10</div>
<div><br> <a href=https://192.168.253.10/checkGlobalReplicationTier>https://192.168.253.10/checkGlobalReplicationTier</a>
[ACTIVE]</div>
<div> <a href=https://192.168.253.10/checkReplication>https://192.168.253.10/checkReplication</a></div>
<div><br> <a href=https://192.168.253.11/checkGlobalReplicationTier>https://192.168.253.11/checkGlobalReplicationTier</a>
[STANDBY]</div>
<div> <a href=https://192.168.253.11/checkReplication>https://192.168.253.11/checkReplication</a></div>
<div><br> Local Zones:</div>
<div> LC Zone 4[4], VIP = 192.168.253.13
<div> <a href=https://192.168.253.13/checkReplication>https://192.168.253.13/checkReplication</a>
[ACTIVE]</div>
<div> <a href=https://192.168.253.14/checkReplication>https://192.168.253.14/checkReplication</a>
[STANDBY]</div>
--> </div>
</div>
</body>
</html>
I would like to parse this to get:
Data Center                    Active Zone   VIP             Local Zone    VIP
DC1 NY [Enabled]               BW Zone 1[1]  192.168.254.10  LC Zone 3[3]  192.168.254.13
DC2 NJ [Enabled] [DEFAULT DC]  BW Zone 2[2]  192.168.253.10  LC Zone 4[4]  192.168.253.13
My code below does not seem to parse it. Is a regular expression the best way to parse this page, or should I try a different approach?
$zone = "https://192.168.0.90/checkConfiguration"
$html = Invoke-WebRequest -Uri $zone -ErrorAction Stop
$DC = ($html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div><br><br> DataCenter: *' }) | Foreach-Object {$_.outerText -replace '(?<!:.*):', '='} | %{new-object psobject -prop (ConvertFrom-StringData $_)}
For this, you could do:
$div = $html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>*DataCenter:*' }
$DC = if ($div -and $div.outerText -match '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d\.]+)') {
    [PsCustomObject]@{
        'DataCenter'  = $matches[1]
        'Active Zone' = $matches[2]
        'VIP'         = $matches[3]
    }
}
$DC | Format-Table -AutoSize
Output:
DataCenter Active Zone VIP
---------- ----------- ---
DC1        BW Zone     192.168.0.95
Or as a list:
$DC | Format-List
Output:
DataCenter  : DC1
Active Zone : BW Zone
VIP         : 192.168.0.95
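Note that ParsedHtml relies on the Internet Explorer engine and is only populated in Windows PowerShell; in PowerShell 6 and later, Invoke-WebRequest does not expose it. A minimal sketch of a fallback that works on the raw markup instead is shown here. The variable $raw is a hypothetical stand-in for $html.Content, and the tag-stripping regex is an assumption that is good enough for simple markup like this page, not a general HTML parser.

```powershell
# Hypothetical stand-in for $html.Content (the raw HTML returned by Invoke-WebRequest).
$raw = '<div><br><br> DataCenter: DC1 NY [ENABLED]<div><br> Active Zone : BW Zone 1[1], VIP = 192.168.254.10</div></div>'

# Replace every tag with a newline, then decode entities such as &amp;.
$text = [System.Net.WebUtility]::HtmlDecode(($raw -replace '<[^>]+>', "`n"))

# Trim each line and drop the empty ones so the text resembles outerText.
$text = ($text -split "`n" | ForEach-Object { $_.Trim() } | Where-Object { $_ }) -join "`n"

# Same regex as in the answer above, applied to the plain text.
$DC = if ($text -match '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)') {
    [PsCustomObject]@{
        'DataCenter'  = $Matches[1]
        'Active Zone' = $Matches[2]
        'VIP'         = $Matches[3]
    }
}
$DC | Format-List
```

Because the text is built from the raw string, this variant does not depend on which PowerShell edition runs it.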
When there are multiple data centers in the HTML file, here is a different approach:
# use outerText to get the plain text for the surrounding <div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED ...</div>
$content = ($html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.innerHtml -like '<div>DATA CENTERS*' }).outerText
$DC = $content -split 'DataCenter\s*:\s*' |
    Where-Object { $_ -match '(?s)([\w ]+(?:[ [\w\]]*)).*Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' } |
    ForEach-Object {
        [PsCustomObject]@{
            'DataCenter'  = $matches[1]
            'Active Zone' = $matches[2]
            'VIP'         = $matches[3]
        }
    }
$DC | Format-Table -AutoSize
Output:
DataCenter                    Active Zone  VIP
----------                    -----------  ---
DC1 NY [ENABLED]              BW Zone 1[1] 192.168.254.10
DC2 NJ [ENABLED] [DEFAULT DC] BW Zone 2[2] 192.168.253.10
Regex details:
(?s)              Match the remainder of the regex with the options: dot matches newline (s)
(                 Match the regular expression below and capture its match into backreference number 1
   [\w ]          Match a single character present in the list below
                  A word character (letters, digits, etc.)
                  The character “ ”
      +           Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:            Match the regular expression below
      [ [\w\]]    Match a single character present in the list below
                  One of the characters “ [”
                  A word character (letters, digits, etc.)
                  A ] character
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   )
)
.                 Match any single character
   *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Active\           Match the characters “Active ” literally
(?:               Match the regular expression below
   Portal\        Match the characters “Portal ” literally
)?                Between zero and one times, as many times as possible, giving back as needed (greedy)
Zone              Match the characters “Zone” literally
\s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
:                 Match the character “:” literally
\s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                 Match the regular expression below and capture its match into backreference number 2
   [^,]           Match any character that is NOT a “,”
      +           Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
,                 Match the character “,” literally
\s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   +              Between one and unlimited times, as many times as possible, giving back as needed (greedy)
VIP               Match the characters “VIP” literally
\s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
=                 Match the character “=” literally
\s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                 Match the regular expression below and capture its match into backreference number 3
   [\d.]          Match a single character present in the list below
                  A single digit 0..9
                  The character “.”
      +           Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
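The question also asked for the Local Zone and its VIP, which the regex above does not capture. A possible extension is sketched below with named groups. The sample text in $content is an assumption that only approximates what outerText returns for this page, and the property names 'Local Zone' and 'Local VIP' are made up for illustration. Only the first local zone of each data center is captured.

```powershell
# Assumed plain text, roughly what outerText yields for the surrounding <div>.
$content = @'
DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:

DataCenter: DC1 NY [ENABLED]
Active Zone : BW Zone 1[1], VIP = 192.168.254.10
https://192.168.254.10/checkGlobalReplicationTier [ACTIVE]
Local Zones:
LC Zone 3[3], VIP = 192.168.254.13
https://192.168.254.13/checkReplication [ACTIVE]

DataCenter: DC2 NJ [ENABLED] [DEFAULT DC]
Active Portal Zone : BW Zone 2[2], VIP = 192.168.253.10
Local Zones:
LC Zone 4[4], VIP = 192.168.253.13
'@

# dc = first line of the block, zone/vip = active zone, lzone/lvip = first local zone.
$pattern = '(?s)^(?<dc>[^\r\n]+).*?Active (?:Portal )?Zone\s*:\s*(?<zone>[^,]+),\s*VIP\s*=\s*(?<vip>[\d.]+)' +
           '.*?Local Zones:\s*(?<lzone>[^,]+),\s*VIP\s*=\s*(?<lvip>[\d.]+)'

# Split into one block per data center; the first piece is the page header, so skip it.
$blocks = $content -split 'DataCenter\s*:\s*' | Select-Object -Skip 1

$rows = foreach ($block in $blocks) {
    if ($block -match $pattern) {
        [PsCustomObject]@{
            'DataCenter'  = $Matches['dc'].Trim()
            'Active Zone' = $Matches['zone'].Trim()
            'VIP'         = $Matches['vip']
            'Local Zone'  = $Matches['lzone'].Trim()
            'Local VIP'   = $Matches['lvip']
        }
    }
}
$rows | Format-Table -AutoSize
```

The lazy `.*?` keeps each capture tied to the nearest following label, so the active zone's VIP is not confused with the local zone's VIP.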