如何懒惰地清理和 trim 两个字符串?
How to clean and trim two strings lazily?
这是我之前 .
的后续
假设我需要从两个输入字符串 s1
和 s2
中删除某些字符,然后 return 它们的子字符串 t1
和 t2
如下所示:
t1
和 t2
是 "cleaned"
t1
和t2
长度相同
t1
和 t2
长度最多为 k
t1
和t2
尽可能长
我可以编写一个扫描整个输入的次优实现,如下所示:
def cleanTrim(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
val cleaned1 = s1.filterNot(chars)
val cleaned2 = s2.filterNot(chars)
val k1 = math.min(cleaned1.length, k)
val k2 = math.min(cleaned2.length, k)
val n = math.min(k1, k2)
val t1 = cleaned1.substring(0, n)
val t2 = cleaned2.substring(0, n)
(t1, t2)
}
您建议如何懒惰地编写它(例如使用 Stream
)?
您可以像这样使用 Streams 延迟过滤您的字符串:
def filterCharsLazy(s: String, chars: Set[Char], k: Int): String = {
val s2: Stream[Char] = s.toStream
s2.filter(a => !chars(a)).take(k).mkString
}
有趣的是 filterNot
似乎不允许延迟执行,所以我将其替换为普通的 filter
。
测试:
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
val str = "asdfasdfasdfasdfasdf"*1000000
val chars = Set('a','s','b','c','e','g','h','j','k')
time { filterCharsLazy(str, chars, 10) }
懒惰地做这件事的关键是,您可以 zip
两个过滤后的 streams/iterators/views 同时遍历它们并将较长的切割成与较短的相同的大小。
我已经完成了这种方法的多个实现,以比较函数式实现和命令式实现的性能。这是方法的代码:
def cleanTrim_Streams(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def stream(s: String) = s.toStream.filterNot(chars)
val (stream1, stream2) = stream(s1).zip(stream(s2)).take(k).unzip
(stream1.mkString, stream2.mkString)
}
def cleanTrim_IteratorsFold(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def iter(s: String) = s.iterator.filterNot(chars)
iter(s1).zip(iter(s2)).take(k).foldLeft(("", "")) {
case ((r1, r2), (c1, c2)) => (r1 + c1, r2 + c2)
}
}
def cleanTrim_Views(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def view(s: String) = s.view.filterNot(chars)
val (v1, v2) = view(s1).zip(view(s2)).take(k).unzip
(v1.mkString, v2.mkString)
}
def cleanTrim_FullTraverse(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
val cleaned1 = s1.filterNot(chars)
val cleaned2 = s2.filterNot(chars)
val k1 = math.min(cleaned1.length, k)
val k2 = math.min(cleaned2.length, k)
val n = math.min(k1, k2)
val t1 = cleaned1.substring(0, n)
val t2 = cleaned2.substring(0, n)
(t1, t2)
}
def cleanTrim_IteratorsImperative(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def iter(s: String) = s.iterator.filterNot(chars)
val b1 = new StringBuilder
val b2 = new StringBuilder
for ((c1, c2) <- iter(s1).zip(iter(s2)).take(k)) {
b1 += c1
b2 += c2
}
(b1.result(), b2.result())
}
def cleanTrim_Imperative(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
var i1 = 0
var i2 = 0
val b1 = new StringBuilder
val b2 = new StringBuilder
while (b1.size < k && b2.size < k) {
while (i1 < s1.length && chars.contains(s1(i1))) i1 += 1
while (i2 < s2.length && chars.contains(s2(i2))) i2 += 1
if (i1 >= s1.length || i2 >= s2.length) return (b1.result(), b2.result())
b1 += s1(i1); i1 += 1
b2 += s2(i2); i2 += 1
}
(b1.result(), b2.result())
}
这是我的基准测试结果,s1.size
= 100,s2.size
= 200,chars.size
= 3,修剪大小 = 86 和 k
= 任一个50 或 100。
[info] Benchmark (maxLength) Mode Cnt Score Error Units
[info] Benchmarks.fullTraverse 50 avgt 10 5,591 ± 2,586 us/op
[info] Benchmarks.fullTraverse 100 avgt 10 5,678 ± 2,799 us/op
[info] Benchmarks.imperative 50 avgt 10 1,091 ± 0,066 us/op
[info] Benchmarks.imperative 100 avgt 10 2,384 ± 0,931 us/op
[info] Benchmarks.iteratorsFold 50 avgt 10 4,164 ± 0,214 us/op
[info] Benchmarks.iteratorsFold 100 avgt 10 11,783 ± 8,251 us/op
[info] Benchmarks.iteratorsImperative 50 avgt 10 4,104 ± 1,241 us/op
[info] Benchmarks.iteratorsImperative 100 avgt 10 9,695 ± 5,554 us/op
[info] Benchmarks.streams 50 avgt 10 38,670 ± 3,547 us/op
[info] Benchmarks.streams 100 avgt 10 116,573 ± 72,291 us/op
[info] Benchmarks.views 50 avgt 10 17,209 ± 30,554 us/op
[info] Benchmarks.views 100 avgt 10 17,124 ± 0,818 us/op
这些结果的一些要点:
- 当然,没有什么比直接的命令式实现更好的了。
fullTraverse
代码(取自您的问题)实际上在测试数据大小时非常有效。最好使用它。
- 出于惰性,函数式实现
iteratorsFold
做得最好并且击败了 fullTraverse
。
- 懒惰地完成如此简单的任务的开销非常大。
正如预期的那样,Stream
效率极低。
这是我之前
假设我需要从两个输入字符串 s1
和 s2
中删除某些字符,然后 return 它们的子字符串 t1
和 t2
如下所示:
t1
和t2
是 "cleaned"t1
和t2
长度相同t1
和t2
长度最多为k
t1
和t2
尽可能长
我可以编写一个扫描整个输入的次优实现,如下所示:
def cleanTrim(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
val cleaned1 = s1.filterNot(chars)
val cleaned2 = s2.filterNot(chars)
val k1 = math.min(cleaned1.length, k)
val k2 = math.min(cleaned2.length, k)
val n = math.min(k1, k2)
val t1 = cleaned1.substring(0, n)
val t2 = cleaned2.substring(0, n)
(t1, t2)
}
您建议如何懒惰地编写它(例如使用 Stream
)?
您可以像这样使用 Streams 延迟过滤您的字符串:
def filterCharsLazy(s: String, chars: Set[Char], k: Int): String = {
val s2: Stream[Char] = s.toStream
s2.filter(a => !chars(a)).take(k).mkString
}
有趣的是 filterNot
似乎不允许延迟执行,所以我将其替换为普通的 filter
。
测试:
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
val str = "asdfasdfasdfasdfasdf"*1000000
val chars = Set('a','s','b','c','e','g','h','j','k')
time { filterCharsLazy(str, chars, 10) }
懒惰地做这件事的关键是,您可以 zip
两个过滤后的 streams/iterators/views 同时遍历它们并将较长的切割成与较短的相同的大小。
我已经完成了这种方法的多个实现,以比较函数式实现和命令式实现的性能。这是方法的代码:
def cleanTrim_Streams(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def stream(s: String) = s.toStream.filterNot(chars)
val (stream1, stream2) = stream(s1).zip(stream(s2)).take(k).unzip
(stream1.mkString, stream2.mkString)
}
def cleanTrim_IteratorsFold(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def iter(s: String) = s.iterator.filterNot(chars)
iter(s1).zip(iter(s2)).take(k).foldLeft(("", "")) {
case ((r1, r2), (c1, c2)) => (r1 + c1, r2 + c2)
}
}
def cleanTrim_Views(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def view(s: String) = s.view.filterNot(chars)
val (v1, v2) = view(s1).zip(view(s2)).take(k).unzip
(v1.mkString, v2.mkString)
}
def cleanTrim_FullTraverse(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
val cleaned1 = s1.filterNot(chars)
val cleaned2 = s2.filterNot(chars)
val k1 = math.min(cleaned1.length, k)
val k2 = math.min(cleaned2.length, k)
val n = math.min(k1, k2)
val t1 = cleaned1.substring(0, n)
val t2 = cleaned2.substring(0, n)
(t1, t2)
}
def cleanTrim_IteratorsImperative(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
def iter(s: String) = s.iterator.filterNot(chars)
val b1 = new StringBuilder
val b2 = new StringBuilder
for ((c1, c2) <- iter(s1).zip(iter(s2)).take(k)) {
b1 += c1
b2 += c2
}
(b1.result(), b2.result())
}
def cleanTrim_Imperative(s1: String, s2: String, chars: Set[Char], k: Int): (String, String) = {
var i1 = 0
var i2 = 0
val b1 = new StringBuilder
val b2 = new StringBuilder
while (b1.size < k && b2.size < k) {
while (i1 < s1.length && chars.contains(s1(i1))) i1 += 1
while (i2 < s2.length && chars.contains(s2(i2))) i2 += 1
if (i1 >= s1.length || i2 >= s2.length) return (b1.result(), b2.result())
b1 += s1(i1); i1 += 1
b2 += s2(i2); i2 += 1
}
(b1.result(), b2.result())
}
这是我的基准测试结果,s1.size
= 100,s2.size
= 200,chars.size
= 3,修剪大小 = 86 和 k
= 任一个50 或 100。
[info] Benchmark (maxLength) Mode Cnt Score Error Units
[info] Benchmarks.fullTraverse 50 avgt 10 5,591 ± 2,586 us/op
[info] Benchmarks.fullTraverse 100 avgt 10 5,678 ± 2,799 us/op
[info] Benchmarks.imperative 50 avgt 10 1,091 ± 0,066 us/op
[info] Benchmarks.imperative 100 avgt 10 2,384 ± 0,931 us/op
[info] Benchmarks.iteratorsFold 50 avgt 10 4,164 ± 0,214 us/op
[info] Benchmarks.iteratorsFold 100 avgt 10 11,783 ± 8,251 us/op
[info] Benchmarks.iteratorsImperative 50 avgt 10 4,104 ± 1,241 us/op
[info] Benchmarks.iteratorsImperative 100 avgt 10 9,695 ± 5,554 us/op
[info] Benchmarks.streams 50 avgt 10 38,670 ± 3,547 us/op
[info] Benchmarks.streams 100 avgt 10 116,573 ± 72,291 us/op
[info] Benchmarks.views 50 avgt 10 17,209 ± 30,554 us/op
[info] Benchmarks.views 100 avgt 10 17,124 ± 0,818 us/op
这些结果的一些要点:
- 当然,没有什么比直接的命令式实现更好的了。
fullTraverse
代码(取自您的问题)实际上在测试数据大小时非常有效。最好使用它。- 出于惰性,函数式实现
iteratorsFold
做得最好并且击败了fullTraverse
。 - 懒惰地完成如此简单的任务的开销非常大。 正如预期的那样,
Stream
效率极低。