Scala:在常规语言中查找长度不超过 n 的所有字符串
Scala: Find all strings of length up to n in regular language
我有一种(可能是无限的)正则语言,我用正则表达式来描述它。从这种常规语言中,我想使用 scala 获取所有长度不超过 n 的字符串。一些快速的谷歌搜索告诉我有一些图书馆可以帮助我。在使用外部库之前,我想知道在 Scala 中这是否很容易(像一个体面的程序员可以在 15 分钟内实现的东西)。如果没有,您是否可以为此推荐一些好的库?
让我想要的更具体。假设我有语言 A*B*
而我的 n
是 3,那么我想要字符串 "", "A", "B", "AA", "AB", "BB", "AAA", "AAB", "ABB", "BBB"
.
我没有完全理解你的意思,这样可以吗?
scala> def generator(chars: Seq[Char], n: Int): Iterator[String] =
| (0 to n).iterator flatMap (i => (chars flatMap (_.toString*i) mkString) combinations i)
generator: (chars: Seq[Char], n: Int)Iterator[String]
scala>
scala> generator("AB", 3) toList
res0: List[String] = List("", A, B, AA, AB, BB, AAA, AAB, ABB, BBB)
scala> generator("ABC", 3) toList
res1: List[String] = List("", A, B, C, AA, AB, AC, BB, BC, CC, AAA, AAB, AAC, ABB, ABC, ACC, BBB, BBC, BCC, CCC)
回答
编辑
- 26Nov,4:30pm - 添加了 基于迭代器的 版本以减少运行时间和内存消耗。基于序列的 canonic 版本位于 (1)
下方的底部
- 26Nov,2:45pm - 为 canonic 添加了工作 seq-based 版本,非工作旧版本的 canonic 在底部 (2)
方法
- 规范地生成给定字母表的所有可能的单词,最大长度为
n
。
- 通过正则表达式(在这种情况下是您的常规语言)过滤生成的单词
代码
object SO {
import scala.annotation.tailrec
import scala.collection.{AbstractIterator, Iterator}
import scala.util.matching.Regex
def canonic(alphabet: Seq[Char], n: Int): Iterator[String] =
if (n < 0) Iterator.empty
else {
val r: IndexedSeq[Iterator[String]] = for (i <- 1 to n)
yield new CanonicItr(alphabet, i)
r.reduce(_ ++ _)
}
private class CanonicItr(alphabet: Seq[Char], width: Int) extends AbstractIterator[String] {
val aSize = alphabet.size
val alph = alphabet.toVector
val total = aSizePower(width)
println("total " + total)
private var pos = 0L
private def aSizePower(r: Int): Long = scala.math.pow(aSize, r).toLong
def stringFor(id: Long): String = {
val r = for {
i <- (0 until width).reverse
// (738 / 10^0) % 10 = 8
// (738 / 10^1) % 10 = 3
// (738 / 10^2) % 10 = 7
charIdx = ((id / (aSizePower(i))) % aSize).toInt
} yield alph(charIdx)
r.mkString("")
}
override def hasNext: Boolean = pos < total
override def next(): String = {
val s = stringFor(pos)
pos = pos + 1
s
}
}
def main(args: Array[String]): Unit = {
// create all possible words with the given alphabet
val canonicWordSet = canonic(Seq('a', 'b', 'c'), 8)
// formal regular language definition
val languageDef: Regex = "a*b".r
// determine words of language by filtering the canocic set.
val wordsOfLanguage = canonicWordSet.filter(word => languageDef.pattern.matcher(word).matches)
println(wordsOfLanguage.toList)
}
}
1) canonic
的工作版本,但内存要求高
object SO {
import scala.util.matching.Regex
/**
* Given a sequence of characters (e.g. Seq('a', 'b', 'c') )
* generates all combinations up to lneght of n (incl.).
*
* @param alphabet sequence of characters
* @param n is the max length
* @return all combinations of up to length n.
*/
def canonic(alphabet:Seq[Char], n: Int): Seq[String] = {
def combination( input: Seq[String], chars: Seq[Char]) = {
for {
i <- input
c <- chars
} yield (i+c)
}
@tailrec
def rec(left: Int, current: Seq[String], accum: Seq[String] ) : Seq[String] = {
left match {
case 0 => accum
case _ => {
val next = combination( current, alphabet )
rec( left-1, next, accum ++ next )
}
}
}
rec(n, Seq(""), Seq(""))
}
def main(args: Array[String]) : Unit = {
// create all possible words with the given alphabet
val canonicWordSet= canonic( Seq('a', 'b', 'c'), 3)
// formal regular language definition
val languageDef: Regex = "a*b".r
// determine words of language by filtering the canocic set.
val wordsOfLanguage = canonicWordSet.filter( word => languageDef.pattern.matcher(word).matches )
println( wordsOfLanguage.toList )
}
}
2) canonic
的非工作版本无法正常工作
def canonic(alphabet:Seq[Char], n: Int): Iterator[String] = {
for {
i <- (0 to n).iterator
combi <- alphabet.combinations(i).map(cs => cs.mkString)
} yield combi
}
我有一种(可能是无限的)正则语言,我用正则表达式来描述它。从这种常规语言中,我想使用 scala 获取所有长度不超过 n 的字符串。一些快速的谷歌搜索告诉我有一些图书馆可以帮助我。在使用外部库之前,我想知道在 Scala 中这是否很容易(像一个体面的程序员可以在 15 分钟内实现的东西)。如果没有,您是否可以为此推荐一些好的库?
让我想要的更具体。假设我有语言 A*B*
而我的 n
是 3,那么我想要字符串 "", "A", "B", "AA", "AB", "BB", "AAA", "AAB", "ABB", "BBB"
.
我没有完全理解你的意思,这样可以吗?
scala> def generator(chars: Seq[Char], n: Int): Iterator[String] =
| (0 to n).iterator flatMap (i => (chars flatMap (_.toString*i) mkString) combinations i)
generator: (chars: Seq[Char], n: Int)Iterator[String]
scala>
scala> generator("AB", 3) toList
res0: List[String] = List("", A, B, AA, AB, BB, AAA, AAB, ABB, BBB)
scala> generator("ABC", 3) toList
res1: List[String] = List("", A, B, C, AA, AB, AC, BB, BC, CC, AAA, AAB, AAC, ABB, ABC, ACC, BBB, BBC, BCC, CCC)
回答
编辑
- 26Nov,4:30pm - 添加了 基于迭代器的 版本以减少运行时间和内存消耗。基于序列的 canonic 版本位于 (1) 下方的底部
- 26Nov,2:45pm - 为 canonic 添加了工作 seq-based 版本,非工作旧版本的 canonic 在底部 (2)
方法
- 规范地生成给定字母表的所有可能的单词,最大长度为
n
。 - 通过正则表达式(在这种情况下是您的常规语言)过滤生成的单词
代码
object SO {
import scala.annotation.tailrec
import scala.collection.{AbstractIterator, Iterator}
import scala.util.matching.Regex
def canonic(alphabet: Seq[Char], n: Int): Iterator[String] =
if (n < 0) Iterator.empty
else {
val r: IndexedSeq[Iterator[String]] = for (i <- 1 to n)
yield new CanonicItr(alphabet, i)
r.reduce(_ ++ _)
}
private class CanonicItr(alphabet: Seq[Char], width: Int) extends AbstractIterator[String] {
val aSize = alphabet.size
val alph = alphabet.toVector
val total = aSizePower(width)
println("total " + total)
private var pos = 0L
private def aSizePower(r: Int): Long = scala.math.pow(aSize, r).toLong
def stringFor(id: Long): String = {
val r = for {
i <- (0 until width).reverse
// (738 / 10^0) % 10 = 8
// (738 / 10^1) % 10 = 3
// (738 / 10^2) % 10 = 7
charIdx = ((id / (aSizePower(i))) % aSize).toInt
} yield alph(charIdx)
r.mkString("")
}
override def hasNext: Boolean = pos < total
override def next(): String = {
val s = stringFor(pos)
pos = pos + 1
s
}
}
def main(args: Array[String]): Unit = {
// create all possible words with the given alphabet
val canonicWordSet = canonic(Seq('a', 'b', 'c'), 8)
// formal regular language definition
val languageDef: Regex = "a*b".r
// determine words of language by filtering the canocic set.
val wordsOfLanguage = canonicWordSet.filter(word => languageDef.pattern.matcher(word).matches)
println(wordsOfLanguage.toList)
}
}
1) canonic
的工作版本,但内存要求高
object SO {
import scala.util.matching.Regex
/**
* Given a sequence of characters (e.g. Seq('a', 'b', 'c') )
* generates all combinations up to lneght of n (incl.).
*
* @param alphabet sequence of characters
* @param n is the max length
* @return all combinations of up to length n.
*/
def canonic(alphabet:Seq[Char], n: Int): Seq[String] = {
def combination( input: Seq[String], chars: Seq[Char]) = {
for {
i <- input
c <- chars
} yield (i+c)
}
@tailrec
def rec(left: Int, current: Seq[String], accum: Seq[String] ) : Seq[String] = {
left match {
case 0 => accum
case _ => {
val next = combination( current, alphabet )
rec( left-1, next, accum ++ next )
}
}
}
rec(n, Seq(""), Seq(""))
}
def main(args: Array[String]) : Unit = {
// create all possible words with the given alphabet
val canonicWordSet= canonic( Seq('a', 'b', 'c'), 3)
// formal regular language definition
val languageDef: Regex = "a*b".r
// determine words of language by filtering the canocic set.
val wordsOfLanguage = canonicWordSet.filter( word => languageDef.pattern.matcher(word).matches )
println( wordsOfLanguage.toList )
}
}
2) canonic
的非工作版本无法正常工作
def canonic(alphabet:Seq[Char], n: Int): Iterator[String] = {
for {
i <- (0 to n).iterator
combi <- alphabet.combinations(i).map(cs => cs.mkString)
} yield combi
}