How to fetch data from a website using LibreOffice's Calc?

Question

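I am looking for a way to fetch some data from a website using LibreOffice's Calc.

Previously I used Google Sheets with the IMPORTXML function, but since it is very unreliable I want to switch to Calc.

My functions look like this: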

=IMPORTXML(E2; "//h3[@class='product-name']")

=IMPORTXML(E2; "//span[@class='price']")

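As you may have guessed, the URL is in E2 (for instance http://www.killis.at/gin/monkey-47-gin-distiller-s-cut-2016-0-5-lt.html).

In Calc I tried =FILTERXML(WEBSERVICE(E2);"//h3[@class='product-name']"), but the result is just #VALUE!.

My LibreOffice version is 6.0.4.2 with a German locale. I use the English function names with ";" as the separator.

So what is the equivalent of this function in Calc? What would the appropriate formulas for the product name and the price look like?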

Answer 1

The problem is that while IMPORTXML claims to be able to parse tag-soup HTML (which is not true in all cases), FILTERXML by definition needs a valid XML stream. Tag-soup HTML is not a valid XML stream. Honestly, HTML is in most cases the opposite of a valid XML stream.
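To illustrate the difference between valid XML and tag soup, here is a minimal sketch in Python. It uses `xml.etree.ElementTree` as a stand-in for a strict XML parser; the parser behind FILTERXML may behave differently in detail, and both sample strings are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, not taken from the real page.
valid_xml = "<div><h3 class='product-name'>Gin</h3></div>"
# Unquoted attribute value and an unclosed <br>: fine for browsers, not XML.
tag_soup = "<div><h3 class=product-name>Gin<br></h3></div>"

print(is_well_formed_xml(valid_xml))
print(is_well_formed_xml(tag_soup))
```

A browser renders both strings without complaint, but a strict parser gives up on the second one, which is the kind of input a typical web page delivers.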

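So the only options are to use a third-party tag-soup parser, or to treat the HTML tag soup as a string and use string operations to find the needed part of it.

The second approach could look like this: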
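For the first option, a tolerant parser could look roughly like the following sketch. It uses Python's standard `html.parser` module rather than a Calc function, so it would have to run as a Python macro or outside LibreOffice. The class `FirstTextByClass`, the helper `text_by_class` and the sample markup are hypothetical names and data chosen for this illustration.

```python
from html.parser import HTMLParser


class FirstTextByClass(HTMLParser):
    """Collect the text of the first <tag> element carrying a given class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.capturing = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or self.capturing:
            return
        # A valueless attribute is reported as None, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.capturing = True

    def handle_endtag(self, tag):
        # Simplification: nested elements with the same tag name end the capture early.
        if self.capturing and tag == self.tag:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)


def text_by_class(html, tag, css_class):
    """Return the whitespace-normalized text, or "" if nothing matches."""
    parser = FirstTextByClass(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up tag soup: unclosed <p> and <br>, line breaks inside the heading.
sample = ('<div><h3 class="product-name">\n Monkey 47 Gin\n</h3>'
          '<p>unclosed<br><span class="price">49,90 EUR</span></div>')

print(text_by_class(sample, "h3", "product-name"))
print(text_by_class(sample, "span", "price"))
```

Unlike a strict XML parser, this one does not fail on unclosed tags, and unlike plain string search it does not depend on the exact spelling of the start tag.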

Public Function GETFROMHTML(sURL as String, sStartTag as String) as String

   ' Any failure (unreachable URL, no closing tag, ...) leaves the result empty.
   on error goto onErrorExit
   ' Read the whole page at sURL into one string.
   oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
   oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
   oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sURL))
   dim delimiters() as long
   sContent = oInpDataStream.readString(delimiters(), false)
   oInpDataStream.closeInput()

   ' Find the start tag; sStartTag is passed without its closing ">".
   lStartPos = instr(1, sContent, sStartTag )
   if lStartPos = 0 then
     GETFROMHTML = "tag " & sStartTag & " not found"
     exit function
   end if
   ' The wanted text runs from just after the tag's ">" up to the next "</".
   lEndPos = instr(lStartPos, sContent, "</")
   lStartPos = lStartPos + 1 + len(sStartTag)
   ' Strip line breaks and surrounding blanks.
   sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))
   GETFROMHTML = sText

 onErrorExit:
   on error goto 0
End Function

Use it in a Calc cell like this:

=GETFROMHTML(E2; "<h3 class=""product-name""")

or

=GETFROMHTML(E2; "<span class=""price""")

Using a Sub, it could look like this:

sub getProductNameAndPrice()
   on error resume next

   oDoc = ThisComponent
   oSheet = oDoc.CurrentController.ActiveSheet

   for r = 0 to 9 'row 1 to 10 (0 based)

     sURL = oSheet.getCellByPosition(4, r).String 'get string value from column 4 (E)

     oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
     oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
     oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
     if not isNull(oInpDataStream.InputStream) then 
       dim delimiters() as long
       sContent = oInpDataStream.readString(delimiters(), false)

       sStartTag = "<h3 class=""product-name"""

       lStartPos = instr(1, sContent, sStartTag)
       if lStartPos <> 0 then
         lEndPos = instr(lStartPos, sContent, "</")
         lStartPos = lStartPos + 1 + len(sStartTag)
         sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))

         oSheet.getCellByPosition(5, r).String = sText
       end if 

       sStartTag = "<span class=""price"""

       lStartPos = instr(1, sContent, sStartTag)
       if lStartPos <> 0 then
         lEndPos = instr(lStartPos, sContent, "</")
         lStartPos = lStartPos + 1 + len(sStartTag)
         sText = trim(replace(replace(mid(sContent, lStartPos, lEndPos-lStartPos), chr(10), ""), chr(13), ""))

         oSheet.getCellByPosition(6, r).String = sText
       end if   

     end if

   next

   on error goto 0
end sub

This code gets the URLs from column E, rows 1 to 10, and writes the product name to column F and the price to column G of the same row.

libreoffice-calc