Generate XML format from TXT file
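I have the input txt file below and I am trying to generate the XML file shown after it. I am trying to do this with awk, but I think I am reinventing the wheel. How would you suggest I do it? Thanks.
Input txt file (a sample; the real input may be larger)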
Usw 1:1 Desktop
Usw 1:2 Netbooks
Usw 1:3 Servers, mainframes and supercomputers
Usw 1:4 Smart devices
Usw 1:5 Embedded devices
Usw 1:6 Gaming
Usw 1:7 Specialized uses
Usw 2:1 Precursors
Usw 2:2 Creation
Usw 2:5 Naming
Usw 2:6 Commercial and popular uptake
Usw 2:9 Current development
Des 1:1 User interface
Des 1:2 Video input infrastructure
Des 1:3 Hardware
Des 2:1 Community
Des 2:2 Programming on Linux
Required XML file
<?xml version="1.0" encoding="utf-8"?>
<XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
<INTRO>
<title>Some title</title>
<creator>
</creator>
<subject>Some subject</subject>
<description>Some description</description>
<date>2010-05-12</date>
<type>Some text</type>
</INTRO>
<RTBLOCK bname="Usw" bnumber="1" bsname="1U">
<CTR cnumber="1">
<ES vnumber="1">Desktop</ES>
<ES vnumber="2">Netbooks</ES>
<ES vnumber="3">Servers, mainframes and supercomputers</ES>
<ES vnumber="4">Smart devices</ES>
<ES vnumber="5">Embedded devices</ES>
<ES vnumber="6">Gaming</ES>
<ES vnumber="7">Specialized uses</ES>
</CTR>
<CTR cnumber="2">
<ES vnumber="1">Precursors</ES>
<ES vnumber="2">Creation</ES>
<ES vnumber="5">Naming</ES>
<ES vnumber="6">Commercial and popular uptake</ES>
<ES vnumber="9">Current development</ES>
</CTR>
</RTBLOCK>
<RTBLOCK bname="Des" bnumber="1" bsname="1D">
<CTR cnumber="1">
<ES vnumber="1">User interface</ES>
<ES vnumber="2">Video input infrastructure</ES>
<ES vnumber="3">Hardware</ES>
</CTR>
<CTR cnumber="2">
<ES vnumber="1">Community</ES>
<ES vnumber="2">Programming on Linux</ES>
</CTR>
</RTBLOCK>
</XMLRT>
Just to show that you don't need an XML-aware tool to generate the specific XML you need for a given purpose, here is one way to do it for your example:
$ cat tst.awk
BEGIN {
    print "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
    print ""
    print "<XMLRT xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"SomeSchema.xsd\" bename=\"The name\" status=\"v\" version=\"1.4\" revision=\"1\" type=\"x-rt\">"
    print "<INTRO>"
    print " <title>Some title</title>"
    print " <creator>"
    print " </creator>"
    print " <subject>Some subject</subject>"
    print " <description>Some description</description>"
    print " <date>2010-05-12</date>"
    print " <type>Some text</type>"
    print "</INTRO>"

    # printf templates for the repeating parts of the document
    rtBeg = "<RTBLOCK bname=\"%s\" bnumber=\"1\" bsname=\"1%s\">\n"
    ctrBeg = " <CTR cnumber=\"%d\">\n"
    esBody = " <ES vnumber=\"%d\">%s</ES>\n"
    ctrEnd = " </CTR>\n"
    rtEnd = "</RTBLOCK>\n"
    xmlEnd = "</XMLRT>\n"
}

# Split each line into block name, CTR number, ES number and text
{
    bname = $1
    split($2,tmp,/:/)
    cnum = tmp[1]
    vnum = tmp[2]
    text = $0
    sub(/([^[:space:]]+[[:space:]]+){2}/,"",text)   # drop the first two fields
}

# A new block name closes any open CTR and RTBLOCK, then opens a new RTBLOCK
bname != prevBname {
    if (prevCnum != "") printf ctrEnd
    if (prevBname != "") printf rtEnd
    printf rtBeg, bname, substr(bname,1,1)
    prevCnum = ""
    prevBname = bname
}

# A new CTR number closes the open CTR, if any, and opens the next one
cnum != prevCnum {
    if (prevCnum != "") printf ctrEnd
    printf ctrBeg, cnum
    prevCnum = cnum
}

{ printf esBody, vnum, text }

END {
    if (prevCnum != "") printf ctrEnd
    if (prevBname != "") printf rtEnd
    printf xmlEnd
}
$ awk -f tst.awk file
<?xml version="1.0" encoding="utf-8"?>
<XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
<INTRO>
<title>Some title</title>
<creator>
</creator>
<subject>Some subject</subject>
<description>Some description</description>
<date>2010-05-12</date>
<type>Some text</type>
</INTRO>
<RTBLOCK bname="Usw" bnumber="1" bsname="1U">
<CTR cnumber="1">
<ES vnumber="1">Desktop</ES>
<ES vnumber="2">Netbooks</ES>
<ES vnumber="3">Servers, mainframes and supercomputers</ES>
<ES vnumber="4">Smart devices</ES>
<ES vnumber="5">Embedded devices</ES>
<ES vnumber="6">Gaming</ES>
<ES vnumber="7">Specialized uses</ES>
</CTR>
<CTR cnumber="2">
<ES vnumber="1">Precursors</ES>
<ES vnumber="2">Creation</ES>
<ES vnumber="5">Naming</ES>
<ES vnumber="6">Commercial and popular uptake</ES>
<ES vnumber="9">Current development</ES>
</CTR>
</RTBLOCK>
<RTBLOCK bname="Des" bnumber="1" bsname="1D">
<CTR cnumber="1">
<ES vnumber="1">User interface</ES>
<ES vnumber="2">Video input infrastructure</ES>
<ES vnumber="3">Hardware</ES>
</CTR>
<CTR cnumber="2">
<ES vnumber="1">Community</ES>
<ES vnumber="2">Programming on Linux</ES>
</CTR>
</RTBLOCK>
</XMLRT>
The above will work efficiently, robustly, and portably using any POSIX awk in any shell on any UNIX box.
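One caveat worth keeping in mind: the script prints the text field verbatim, so an input line containing `&`, `<` or `>` would produce XML that is not well-formed. The sample input has no such characters, so the output above is unaffected. Below is a minimal sketch of an escaping filter, assuming a hypothetical helper named `xml_escape` that is not part of the original answer; the same three `gsub` calls could be applied to `text` inside tst.awk before the `printf esBody` line.

```shell
# xml_escape: hypothetical helper, not part of the original script.
# Escapes the characters that must not appear literally in element text.
xml_escape() {
    awk '{
        gsub(/&/, "\\&amp;")   # ampersand first, so the entities added below stay intact
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        print
    }'
}

# Example: escape one line of text
printf '%s\n' 'R&D <labs>' | xml_escape
```

The doubled backslash matters: in a `gsub` replacement a bare `&` stands for the matched text, so `\\&` is needed to emit a literal ampersand. Attribute values such as `bname` would additionally need `"` escaped if they could ever contain one.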
How do you suggest me to do it?
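I suggest using an XSLT 2.0+ processor such as Saxon from Saxonica to produce the desired XML file. Other XSLT 2.0 processors should work as well.
The following XSLT 2.0 stylesheet works in three steps:
- Retrieve the unparsed text into an <xsl:variable>
- Parse this (plain) text variable with a RegEx via <xsl:analyze-string>
- Group the resulting flat XML nodes with <xsl:for-each-group>
So the stylesheet could look like this: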
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
  <xsl:output method="xml" />
  <xsl:param name="text-encoding" as="xs:string" select="'utf-8'"/>
  <xsl:param name="text-uri" as="xs:string" select="'file:///home/kubuntu/Downloads/input.txt'"/>

  <xsl:template match="/">
    <XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
      <!-- Step 1 ### get unparsed text -->
      <xsl:variable name="input-text" select="unparsed-text($text-uri, $text-encoding)"/>
      <!-- Step 2 ### Apply RegEx to every line to create <Line...> elements -->
      <xsl:variable name="xmlStepOne">
        <!-- Split on LF or CRLF line endings -->
        <xsl:for-each select="tokenize($input-text,'\r?\n')">
          <xsl:if test=".!=''"> <!-- Skip empty lines -->
            <xsl:analyze-string select="." regex="([^\s]+)\s([^:]+):([^\s]+)\s(.*)$">
              <xsl:matching-substring> <!-- Parse line with RegEx and create <Line...> XML -->
                <Line str="{regex-group(1)}" idx1="{regex-group(2)}" idx2="{regex-group(3)}"><xsl:value-of select="regex-group(4)"/></Line>
              </xsl:matching-substring>
              <xsl:non-matching-substring> <!-- Output an error if a line cannot be processed -->
                <xsl:message terminate="yes">Error processing line
                  <xsl:value-of select="current()"/>
                </xsl:message>
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </xsl:if>
        </xsl:for-each>
      </xsl:variable>
      <!-- Step 3 ### Group the linear flow of <Line...> elements -->
      <xsl:for-each-group select="$xmlStepOne/Line" group-by="@str">
        <RTBLOCK bname="{current-grouping-key()}" bnumber="1" bsname="{concat('1',substring(current-grouping-key(),1,1))}">
          <xsl:for-each-group select="current-group()" group-by="@idx1">
            <xsl:sort select="@idx1" data-type="number" /> <!-- numeric order, so 10 sorts after 9 -->
            <CTR cnumber="{@idx1}">
              <xsl:for-each select="current-group()">
                <xsl:sort select="@idx2" data-type="number" />
                <ES vnumber="{@idx2}"><xsl:value-of select="."/></ES>
              </xsl:for-each>
            </CTR>
          </xsl:for-each-group>
        </RTBLOCK>
      </xsl:for-each-group>
    </XMLRT>
  </xsl:template>
</xsl:stylesheet>
You can set the input file name and the encoding in the two parameters at the top.
The output for the sample file above is:
<?xml version="1.0" encoding="UTF-8"?>
<XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
<RTBLOCK bname="Usw" bnumber="1" bsname="1U">
<CTR cnumber="1">
<ES vnumber="1">Desktop</ES>
<ES vnumber="2">Netbooks</ES>
<ES vnumber="3">Servers, mainframes and supercomputers</ES>
<ES vnumber="4">Smart devices</ES>
<ES vnumber="5">Embedded devices</ES>
<ES vnumber="6">Gaming</ES>
<ES vnumber="7">Specialized uses</ES>
</CTR>
<CTR cnumber="2">
<ES vnumber="1">Precursors</ES>
<ES vnumber="2">Creation</ES>
<ES vnumber="5">Naming</ES>
<ES vnumber="6">Commercial and popular uptake</ES>
<ES vnumber="9">Current development</ES>
</CTR>
</RTBLOCK>
<RTBLOCK bname="Des" bnumber="1" bsname="1D">
<CTR cnumber="1">
<ES vnumber="1">User interface</ES>
<ES vnumber="2">Video input infrastructure</ES>
<ES vnumber="3">Hardware</ES>
</CTR>
<CTR cnumber="2">
<ES vnumber="1">Community</ES>
<ES vnumber="2">Programming on Linux</ES>
</CTR>
</RTBLOCK>
</XMLRT>
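Another advantage of this approach is that you handle everything with XML/XSLT, so it is aware of character encodings and all the other things that awk or similar simpler solutions do not cover.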