XML with nested siblings to data frame in R
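I'm new to parsing XML in R. I'm trying to parse XML into a workable data frame. I've tried some of the XPath functions in the XML package, but I can't seem to get the right result.
Here is my XML: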
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11111</StreetNumber>
<StreetName>111th</StreetName>
<StreetSuffix>Avenue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pierce</County>
</GeographicData>
<SchoolData>
<SchoolDistrict>Puyallup</SchoolDistrict>
</SchoolData>
<View>Territorial</View>
</Listing>
<YearBuilt>1997</YearBuilt>
<InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
<Occupant>
<Name>Vacant</Name>
</Occupant>
<WaterFront/>
<Roof>Composition</Roof>
<Exterior>Brick,Cement Planked,Wood,Wood Products</Exterior>
</ResidentialProperty>
When I run:
ResidentialProperty <- xmlToDataFrame(nodes=getNodeSet(doc,"//ResidentialProperty"))
the values of the child nodes inside each parent node are squashed together:
11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial
If I move one node down, the same thing happens:
11111111thAvenue CtWA
The values of the child nodes are all pasted together.
I also tried a brute-force approach, which sort of works:
StreetAddress <- xmlToDataFrame(nodes=getNodeSet(doc,"//StreetAddress"))
MLSInformation <- xmlToDataFrame(nodes=getNodeSet(doc,"//MLSInformation"))
GeographicData <- xmlToDataFrame(nodes=getNodeSet(doc,"//GeographicData"))
SchoolData <- xmlToDataFrame(nodes=getNodeSet(doc,"//SchoolData"))
YearBuilt <- xmlToDataFrame(nodes=getNodeSet(doc,"//YearBuilt"))
InteriorFeatures <- xmlToDataFrame(nodes=getNodeSet(doc,"//InteriorFeatures"))
Occupant <- xmlToDataFrame(nodes=getNodeSet(doc,"//Occupant"))
Roof <- xmlToDataFrame(nodes=getNodeSet(doc,"//Roof"))
Exterior <- xmlToDataFrame(nodes=getNodeSet(doc,"//Exterior"))
df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt, InteriorFeatures, Occupant, Roof, Exterior)
But some of the column names are not what I expected:
> colnames(df)
[1] "StreetNumber" "StreetName" "StreetSuffix" "StateOrProvince" "ListingStatus"
[6] "StatusChangeDate" "Latitude" "Longitude" "County" "SchoolDistrict"
[11] "text" "text" "Name" "text" "text"
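colnames(df)[c(11, 12, 14, 15)] should be "YearBuilt", "InteriorFeatures", "Roof", and "Exterior", respectively. (Side note: why does this happen?)
I'm trying to find a way to sort every atomic value into the appropriate column of a data frame, with the column name taken from the node name, even for nested child nodes. Also, my data will probably change over time, so I'm looking for a dynamic function that adapts to the data and produces the expected result wherever possible.
I imagine this is a fairly common XML layout (with nested child levels), so I'm surprised I haven't found much on the topic, although I may just be searching with the wrong jargon. I'm guessing there is a simple answer. Any suggestions?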
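On the side note about the "text" column names, and on the values being pasted together: as far as I can tell, both come from how the XML package treats text nodes. xmlValue() on an element returns the text of all its descendants concatenated, and xmlToDataFrame() appears to name columns after the children of each node it is given. For a leaf element such as YearBuilt the only child is a text node, whose name is "text". The snippet below is a minimal sketch on a cut-down version of the sample (xml_snippet, street_value and child_names are names made up for this illustration):

```r
library(XML)

# Cut-down version of the sample document, written without whitespace
# between tags so that no blank text nodes can interfere.
xml_snippet <- paste0(
  "<ResidentialProperty>",
  "<Listing><StreetAddress>",
  "<StreetNumber>11111</StreetNumber><StreetName>111th</StreetName>",
  "</StreetAddress></Listing>",
  "<YearBuilt>1997</YearBuilt>",
  "</ResidentialProperty>"
)
doc <- xmlParse(xml_snippet, asText = TRUE)

# xmlValue() on a parent element concatenates the text of every descendant,
# which is why the nested values end up pasted together.
street <- getNodeSet(doc, "//StreetAddress")[[1]]
street_value <- xmlValue(street)

# The only child of <YearBuilt> is a text node, and xmlName() reports
# "text" for a text node, hence the "text" column names.
year <- getNodeSet(doc, "//YearBuilt")[[1]]
child_names <- unname(sapply(xmlChildren(year), xmlName))

print(street_value)
print(child_names)
```

So selecting leaf elements directly with "//YearBuilt" hands xmlToDataFrame() nodes whose only field is an unnamed piece of text.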
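Assuming xml contains your sample string, here is another strategy, which also handles residential properties with differing numbers of items: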
library(XML)
library(plyr)
# xml <- '<ResidentialProperty>........'
doc <- xmlParse(xml, asText = TRUE)
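# Build a one-row data frame per <ResidentialProperty>; rbind.fill pads missing columns with NA
df <- do.call(rbind.fill, lapply(doc['//ResidentialProperty'], function(x) {
  # Names of every node at or below x, in document order; text nodes are reported as "text"
  names <- xpathSApply(x, './/.', xmlName)
  # The entry just before each "text" is the element that holds that text
  names <- names[which(names == "text") - 1]
  # The text values themselves, in the same order
  values <- xpathSApply(x, ".//text()", xmlValue)
  return(as.data.frame(t(setNames(values, names)), stringsAsFactors = FALSE))
}))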
df
# StreetNumber StreetName StreetSuffix StateOrProvince StatusChangeDate Latitude Longitude County SchoolDistrict View YearBuilt InteriorFeatures Name Roof Exterior
# 1 11111 111th Avenue Ct WA 2015-07-05T23:48:53.410 11.111111 -111.111111 Pierce Puyallup Territorial 1997 Bath Off Master,Dbl Pane/Storm Windw Vacant Composition Brick,Cement Planked,Wood,Wood Products
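Note that the output above has no column for ListingStatus or WaterFront, because those elements contain no text. If you want to keep them, one possible variation is to select the leaf elements directly with the XPath expression .//*[not(*)] (elements that have no element children) and read attributes separately with xmlGetAttr(). This is only a sketch under my own assumptions: leaf_row, xml_small and doc_small are hypothetical names, the sample is shortened, and empty elements are mapped to NA by choice.

```r
library(XML)

# Shortened version of the sample document (illustrative only).
xml_small <- paste0(
  "<ResidentialProperty>",
  "<Listing>",
  "<StreetAddress><StreetNumber>11111</StreetNumber>",
  "<StreetName>111th</StreetName></StreetAddress>",
  "<MLSInformation><ListingStatus Status=\"Active\"/></MLSInformation>",
  "</Listing>",
  "<YearBuilt>1997</YearBuilt>",
  "<WaterFront/>",
  "</ResidentialProperty>"
)
doc_small <- xmlParse(xml_small, asText = TRUE)

# Hypothetical helper: one-row data frame built from the leaf elements of a node.
leaf_row <- function(node) {
  # Elements without element children, in document order
  leaves <- getNodeSet(node, ".//*[not(*)]")
  # Text of each leaf; empty elements become NA instead of being dropped
  values <- vapply(leaves, function(n) {
    v <- xmlValue(n)
    if (length(v) == 0 || !nzchar(v)) NA_character_ else v
  }, character(1))
  names(values) <- vapply(leaves, xmlName, character(1))
  as.data.frame(t(values), stringsAsFactors = FALSE)
}

# One row per property, as a list of one-row data frames
rows <- lapply(getNodeSet(doc_small, "//ResidentialProperty"), leaf_row)

# Attribute values are not part of the element text, so fetch them explicitly
status <- xpathSApply(doc_small, "//ListingStatus", xmlGetAttr, "Status")

print(rows[[1]])
print(status)
```

The rows can be combined with rbind.fill exactly as in the code above. If the same leaf name occurs more than once within a property, the column names will collide, so that case would need extra handling.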