How to use a regex to get a particular part of a string

Question

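I'm building a Sinatra-based application and trying to use a regex to parse a long string and extract a link from it.

Here is an excerpt of the string containing the relevant information I need to extract: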

time=18ms\n[INFO] Calculating CPD for 0 files\n[INFO] CPD calculation finished\n[INFO] Analysis report generated in 325ms, dir size=14 KB\n[INFO] Analysis reports compressed in 187ms, zip size=8 KB\n[INFO] Analysis report uploaded in 31ms\n[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n[INFO] Note that you will be able to access the updated dashboard once the server has processed the submitted analysis report\n[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n[INFO] -----------------------------------------------------------------------

I need to be able to pull out the following:

http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn

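The closest I've gotten is /(?=http).[a*-z]*/, but that is nowhere near what I need: it finds 615 matches instead of 1.

A further complication is that the ID AVhFxTkyob-dgWZqnfIn is not static; it changes with every build.

I've been using Rubular.com to work out the right regex.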

Answer 1

>> string = '[your long string here]'
>> regex = /(http:[\w\/.?=-]+)(\n)/
>> string.scan(regex).first.first
=> "http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn"
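To make the `.first.first` chain less opaque, here is a small sketch run against a shortened version of the log. The `log` variable and the shortened text are assumptions for illustration; the regex is the one from the answer. When a pattern contains capture groups, `String#scan` returns an array of arrays, one inner array per match.

```ruby
# Shortened, hypothetical stand-in for the real build output.
log = "[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n" \
      "[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n" \
      "[INFO] done"

regex = /(http:[\w\/.?=-]+)(\n)/

# One inner array per match; each inner array holds the two capture groups.
matches = log.scan(regex)

# First match, first capture group: the URL without the trailing newline.
task_url = matches.first.first
puts task_url
```

In this shortened log the dashboard URL is skipped only because it contains a colon, which the character class does not allow, so that URL is never immediately followed by a newline. Any other URL that ends a line and uses only the allowed characters would also match.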

Following the example above, I ended up modifying the regex to the following:

(http:\/\/sonar[\w\/.?=-]+task[\w\/.?=-]+(?!.\n))

... and the result is retrieved like this:

string.scan(regex).first.first

I modified the regex because the previous one returned many results when I fed it the complete string rather than the excerpt in the OP.
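As a possible simplification, the same idea (anchor on the literal `sonar` and `task` parts of the URL) can be sketched without capture groups or the lookahead. This is an assumption-laden sketch, not the poster's code: the `TASK_URL` constant and the shortened `log` are hypothetical, and it relies on the task URL always containing `task` and never containing characters outside the class.

```ruby
# Shortened, hypothetical stand-in for the real build output.
log = "[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n" \
      "[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n" \
      "[INFO] done"

# %r{...} avoids having to escape the forward slashes.
TASK_URL = %r{http://sonar[\w/.?=-]+task[\w/.?=-]+}

# String#[] with a regex returns the first matching substring, or nil.
task_url = log[TASK_URL]
puts task_url
```

Because the character class excludes whitespace, the match stops at the end of the URL on its own, and `log[TASK_URL]` returns `nil` rather than raising when no task URL is present.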

Answer 2

There are well-tested tools that make this task easier. I'd recommend URI's extract method:

require 'uri'

str = "time=18ms\n[INFO] Calculating CPD for 0 files\n[INFO] CPD calculation finished\n[INFO] Analysis report generated in 325ms, dir size=14 KB\n[INFO] Analysis reports compressed in 187ms, zip size=8 KB\n[INFO] Analysis report uploaded in 31ms\n[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n[INFO] Note that you will be able to access the updated dashboard once the server has processed the submitted analysis report\n[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n[INFO] -----------------------------------------------------------------------"

URI.extract(str)
# => ["http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject",
#     "http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn"]

From there it's simple to find the link you want and use it.
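One way that selection step might look is sketched below. The shortened `log` and the rule of picking the URL whose path contains `/api/ce/task` are assumptions for illustration. Note also that newer Ruby versions mark `URI.extract` as obsolete and print a warning in verbose mode, although the method is still available.

```ruby
require 'uri'

# Shortened, hypothetical stand-in for the real build output.
log = "[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n" \
      "[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n" \
      "[INFO] done"

# Limit extraction to http/https so other scheme-like tokens are ignored.
urls = URI.extract(log, %w[http https])

# Assumed selection rule: the report-processing link points at /api/ce/task.
task_url = urls.find { |url| url.include?('/api/ce/task') }
puts task_url
```

Since the ID changes with every build, matching on the fixed path rather than on the ID keeps the selection stable.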

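It's also worth getting to know all the other methods URI brings to the party, because it understands how to split and build URIs according to the RFC.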
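As a rough illustration of that splitting and building, the sketch below parses the task URL from the question, reads the build-specific ID out of the query string, and reassembles the URL. The variable names are hypothetical; the calls are from Ruby's standard `uri` library.

```ruby
require 'uri'

uri = URI.parse('http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn')

# Split: individual components are available as accessors.
host = uri.host   # "sonar.company.com"
path = uri.path   # "/api/ce/task"

# Decode the query string into a hash to get at the changing ID.
params  = URI.decode_www_form(uri.query).to_h
task_id = params['id']

# Build: reassemble a URI from its components.
rebuilt = URI::HTTP.build(
  host:  host,
  path:  path,
  query: URI.encode_www_form(id: task_id)
)
puts rebuilt
```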

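Don't write your own code or regex to do something others have already done, especially when their code is well tested. You'll avoid the pitfalls others have fallen into. URI's authors/maintainers manage the built-in pattern so we don't have to. And satisfying the RFC is a lot more complicated than you'd think, for example: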

URI::REGEXP::PATTERN::ABS_URI
"[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?://(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?|/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)(?:\?(?:(?:[\-_.!~*'()a-zA-Z\d;/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?|(?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)"


ruby

regex

sinatra