如何懒惰地清理和 trim 两个字符串?

How to clean and trim two strings lazily?

这是我之前 .

的后续

假设我需要从两个输入字符串 s1s2 中删除某些字符,然后 return 它们的子字符串 t1t2 如下所示:

我可以编写一个扫描整个输入的次优实现,如下所示:

def cleanTrim(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  val cleaned1 = s1.filterNot(chars)
  val cleaned2 = s2.filterNot(chars)
  val k1 = math.min(cleaned1.length, k)
  val k2 = math.min(cleaned2.length, k)
  val n = math.min(k1, k2)
  val t1 = cleaned1.substring(0, n)
  val t2 = cleaned2.substring(0, n)
  (t1, t2)
}

您建议如何懒惰地编写它(例如使用 Stream)?

您可以像这样使用 Streams 延迟过滤您的字符串:

def filterCharsLazy(s: String, chars: Set[Char], k: Int): String = {
  val s2: Stream[Char] = s.toStream
  s2.filter(a => !chars(a)).take(k).mkString
}

有趣的是 filterNot 似乎不允许延迟执行,所以我将其替换为普通的 filter

测试:

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block    // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
val str = "asdfasdfasdfasdfasdf"*1000000
val chars = Set('a','s','b','c','e','g','h','j','k')

time { filterCharsLazy(str, chars, 10) }

懒惰地做这件事的关键是,您可以 zip 两个过滤后的 streams/iterators/views 同时遍历它们并将较长的切割成与较短的相同的大小。

我已经完成了这种方法的多个实现,以比较函数式实现和命令式实现的性能。这是方法的代码:

def cleanTrim_Streams(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  def stream(s: String) = s.toStream.filterNot(chars)
  val (stream1, stream2) = stream(s1).zip(stream(s2)).take(k).unzip
  (stream1.mkString, stream2.mkString)
}

def cleanTrim_IteratorsFold(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  def iter(s: String) = s.iterator.filterNot(chars)
  iter(s1).zip(iter(s2)).take(k).foldLeft(("", "")) {
    case ((r1, r2), (c1, c2)) => (r1 + c1, r2 + c2)
  }
}

def cleanTrim_Views(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  def view(s: String) = s.view.filterNot(chars)
  val (v1, v2) = view(s1).zip(view(s2)).take(k).unzip
  (v1.mkString, v2.mkString)
}

def cleanTrim_FullTraverse(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  val cleaned1 = s1.filterNot(chars)
  val cleaned2 = s2.filterNot(chars)
  val k1 = math.min(cleaned1.length, k)
  val k2 = math.min(cleaned2.length, k)
  val n = math.min(k1, k2)
  val t1 = cleaned1.substring(0, n)
  val t2 = cleaned2.substring(0, n)
  (t1, t2)
}

def cleanTrim_IteratorsImperative(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  def iter(s: String) = s.iterator.filterNot(chars)

  val b1 = new StringBuilder
  val b2 = new StringBuilder
  for ((c1, c2) <- iter(s1).zip(iter(s2)).take(k)) {
    b1 += c1
    b2 += c2
  }

  (b1.result(), b2.result())
}

def cleanTrim_Imperative(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
  var i1 = 0
  var i2 = 0

  val b1 = new StringBuilder
  val b2 = new StringBuilder

  while (b1.size < k && b2.size < k) {

    while (i1 < s1.length && chars.contains(s1(i1))) i1 += 1
    while (i2 < s2.length && chars.contains(s2(i2))) i2 += 1

    if (i1 >= s1.length || i2 >= s2.length) return (b1.result(), b2.result())

    b1 += s1(i1); i1 += 1
    b2 += s2(i2); i2 += 1
  }

  (b1.result(), b2.result())
}

这是我的基准测试结果,s1.size = 100,s2.size = 200,chars.size = 3,修剪大小 = 86 和 k = 任一个50 或 100。

[info] Benchmark                       (maxLength)  Mode  Cnt    Score    Error  Units
[info] Benchmarks.fullTraverse                  50  avgt   10    5,591 ±  2,586  us/op
[info] Benchmarks.fullTraverse                 100  avgt   10    5,678 ±  2,799  us/op
[info] Benchmarks.imperative                    50  avgt   10    1,091 ±  0,066  us/op
[info] Benchmarks.imperative                   100  avgt   10    2,384 ±  0,931  us/op
[info] Benchmarks.iteratorsFold                 50  avgt   10    4,164 ±  0,214  us/op
[info] Benchmarks.iteratorsFold                100  avgt   10   11,783 ±  8,251  us/op
[info] Benchmarks.iteratorsImperative           50  avgt   10    4,104 ±  1,241  us/op
[info] Benchmarks.iteratorsImperative          100  avgt   10    9,695 ±  5,554  us/op
[info] Benchmarks.streams                       50  avgt   10   38,670 ±  3,547  us/op
[info] Benchmarks.streams                      100  avgt   10  116,573 ± 72,291  us/op
[info] Benchmarks.views                         50  avgt   10   17,209 ± 30,554  us/op
[info] Benchmarks.views                        100  avgt   10   17,124 ±  0,818  us/op

这些结果的一些要点:

  • 当然,没有什么比直接的命令式实现更好的了。
  • fullTraverse 代码(取自您的问题)实际上在测试数据大小时非常有效。最好使用它。
  • 出于惰性,函数式实现 iteratorsFold 做得最好并且击败了 fullTraverse
  • 懒惰地完成如此简单的任务的开销非常大。
  • 正如预期的那样,
  • Stream效率极低。