T-SQL - Count unique characters in a variable
Goal: count the number of distinct characters in a variable, as fast as possible.
DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
I found some posts about distinct characters in a column, grouping by character, and so on, but nothing that fits this scenario.
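Using NGrams8K as a base, you can change the input parameter to nvarchar(4000) and adjust the DATALENGTH calculation, making NGramsN4K. You can then use it to split the string into individual characters and count them: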
SELECT COUNT(DISTINCT NG.token) AS DistinctCharacters
FROM dbo.NGramsN4k(@String1,1) NG;
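The DATALENGTH adjustment is needed because DATALENGTH returns bytes, not characters, and nvarchar stores each character as 2 bytes (one UTF-16 code unit), so the byte count has to be halved. A quick check, using a throwaway variable @s that is not part of the original answer:

```sql
-- Illustration only: characters vs. bytes for an nvarchar value.
DECLARE @s NVARCHAR(4000) = N'*A-zz';

SELECT Chars          = LEN(@s),             -- 5 characters
       Bytes          = DATALENGTH(@s),      -- 10 bytes (2 per character)
       CharsFromBytes = DATALENGTH(@s) / 2;  -- 5, the value the function needs
```

This is why the nvarchar version of the function divides DATALENGTH(@string) by 2 wherever the varchar version used it directly.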
The altered NGrams8K:
IF OBJECT_ID('dbo.NGramsN4k','IF') IS NOT NULL DROP FUNCTION dbo.NGramsN4k;
GO
CREATE FUNCTION dbo.NGramsN4k
(
@string nvarchar(4000), -- Input string
@N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGrams8k(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGrams8K(s.SomeValue,@N) ng;
Parameters:
@string = The input string to split into tokens.
@N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
1. NGrams8k is not case sensitive
2. Many functions that use NGrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL)
4. NGrams8k can also be used as a tally table with the position column being your "N"
row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to split
it into unigrams then only return the position column. NGrams8k will get you up to
8000 numbers. There will be no performance penalty for sorting by position in
ascending order but there is for sorting in descending order. To get the numbers in
descending order without forcing a sort in the query plan use the following formula:
N = <highest number>-position+1.
Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100 in ascending order:
SELECT N = position
FROM dbo.NGrams8k(REPLICATE(0,100),1);
--===== (2) Get the numbers 1 to 100 in descending order:
DECLARE @maxN int = 100;
SELECT N = @maxN-position+1
FROM dbo.NGrams8k(REPLICATE(0,@maxN),1)
ORDER BY position;
5. NGrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string varchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.NGrams8k(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less
than the length of @string. Updated comment section. - Alan Burstein
Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N
parameters to prevent a NULL string or NULL @N from causing "an
improper value" being passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES -- 90 NULL values used to create the CTE Tally table
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t(N)
),
iTally(N) AS -- my cte Tally table
(
SELECT TOP(ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,N''))/2)-(ISNULL(@N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0 AND @N <= (DATALENGTH(@string)/2); -- Protection against bad parameter values
Grab a copy of NGrams8k and you can do this:
DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
SELECT s.String, Total = COUNT(DISTINCT ng.token)
FROM (VALUES(@String1),(@String2),(@String3),(@String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1) AS ng
GROUP BY s.String;
Returns:
String   Total
-------- -----------
*        1
*A-zz    4
11       1
1A^      3
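UPDATED
Just a quick update based on @Larnu's post and comments. I did not notice that the OP was dealing with Unicode, e.g. NVARCHAR. My version is similar to what @Larnu posted above; I just updated the returned token to use the Latin1_General_BIN collation.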
SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N)
This returns the correct answer:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT ng.token)
FROM dbo.NGramsN4k(@String5,1) AS ng;
Without the proper collation, you can use what Larnu posted and get the right answer like this:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT UNICODE(ng.token))
FROM dbo.NGramsN4k(@String5,1) AS ng;
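To see why the collation matters for COUNT(DISTINCT ...), here is a minimal sketch that does not depend on NGramsN4K. It uses the case difference between 'a' and 'A' as a stand-in, because that behavior is easy to verify: a case-insensitive collation collapses the two into one value, while a binary collation compares code points and keeps them apart. The inline VALUES list and the alias c are only for illustration.

```sql
-- Illustration only: the same two characters counted under two collations.
SELECT
    CaseInsensitive = COUNT(DISTINCT v.c COLLATE Latin1_General_CI_AS), -- 1
    BinaryCompare   = COUNT(DISTINCT v.c COLLATE Latin1_General_BIN)    -- 2
FROM (VALUES (N'a'), (N'A')) AS v(c);
```

Comparing by code point is also what COUNT(DISTINCT UNICODE(ng.token)) does, which is why both variants agree on the two-character example.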
Here is my updated NGramsN4K function:
ALTER FUNCTION dbo.NGramsN4K
(
@string nvarchar(4000), -- Input string
@N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGramsN4K(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGramsN4K(s.SomeValue,@N) ng;
Parameters:
@string = The input string to split into tokens.
@N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string nvarchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.NGramsN4K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
SELECT
TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (64^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0 -- Protection against bad parameter values:
AND @N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)));
You can do this natively in SQL Server using a CTE and some string manipulation:
DECLARE @TestString NVARCHAR(4000);
SET @TestString = N'*A-zz';
WITH letters AS
(
SELECT 1 AS Pos,
@TestString AS Stri,
MAX(LEN(@TestString)) AS MaxPos,
SUBSTRING(@TestString, 1, 1) AS [Char]
UNION ALL
SELECT Pos + 1,
@TestString,
MaxPos,
SUBSTRING(@TestString, Pos + 1, 1) AS [Char]
FROM letters
WHERE Pos + 1 <= MaxPos
)
SELECT COUNT(*) AS LetterCount
FROM (
SELECT UPPER([Char]) AS [Char]
FROM letters
GROUP BY [Char]
) a
Sample output:
SET @TestString = N'*A-zz';
{execute code}
LetterCount = 4
SET @TestString = N'1A^';
{execute code}
LetterCount = 3
SET @TestString = N'1';
{execute code}
LetterCount = 1
SET @TestString = N'*';
{execute code}
LetterCount = 1
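One caveat with the recursive CTE: SQL Server stops a recursive CTE after 100 recursion levels by default, so as written the query raises an error for strings longer than roughly 100 characters. Below is a sketch of the same idea with the limit lifted through OPTION (MAXRECURSION 0). The 300-character test string built with REPLICATE is an assumption for demonstration only, and I replaced the GROUP BY subquery with COUNT(DISTINCT ...) to keep it short.

```sql
DECLARE @TestString NVARCHAR(4000) = REPLICATE(N'ab', 150); -- 300 characters, 2 distinct

WITH letters AS
(
    -- Anchor member: the first character
    SELECT 1 AS Pos,
           LEN(@TestString) AS MaxPos,
           SUBSTRING(@TestString, 1, 1) AS [Char]
    UNION ALL
    -- Recursive member: move one position to the right
    SELECT Pos + 1,
           MaxPos,
           SUBSTRING(@TestString, Pos + 1, 1)
    FROM letters
    WHERE Pos + 1 <= MaxPos
)
SELECT COUNT(DISTINCT [Char]) AS LetterCount
FROM letters
OPTION (MAXRECURSION 0); -- 0 removes the limit; the default is 100
```

Even with the hint, this walks the string one row per recursion step, so the set-based tally approaches in the other answers are likely to scale better.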
CREATE TABLE #STRINGS(
STRING1 NVARCHAR(4000)
)
INSERT INTO #STRINGS (
STRING1
)
VALUES
(N'1A^'),(N'11'),(N'*'),(N'*A-zz')
;WITH CTE_T AS (
SELECT DISTINCT
S.STRING1
,SUBSTRING(S.STRING1, V.number + 1, 1) AS Val
FROM
#STRINGS S
INNER JOIN
[master]..spt_values V
ON V.number < LEN(S.STRING1)
WHERE
V.[type] = 'P'
)
SELECT
T.STRING1
,COUNT(1) AS CNT
FROM
CTE_T T
GROUP BY
T.STRING1
Here is another alternative using the power of a tally table. It has been called the "Swiss Army Knife of T-SQL". I keep a tally table as a view on my system, which makes it very fast.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
Now we can use that tally any time we need it, such as for this exercise.
declare @Something table
(
String1 nvarchar(4000)
)
insert @Something values
(N'1A^')
, (N'11')
, (N'*')
, (N'*A-zz')
select count(distinct substring(s.String1, t.N, 1))
, s.String1
from @Something s
join cteTally t on t.N <= len(s.String1)
group by s.String1
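Honestly, I don't know whether this would be any faster than Larnu's use of NGrams, but testing on a large table would be interesting.
----- EDIT -----
Thanks to Shnugo for the idea. Using a cross apply to a correlated subquery here is actually quite an improvement.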
select count(distinct substring(s.String1, A.N, 1))
, s.String1
from @Something s
CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)
group by s.String1
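The reason this is so much faster is that it no longer uses a triangular join, which can be really slow. I also switched the view out for an indexed physical tally table. The improvement from that was noticeable on larger datasets, but not nearly as big as the one from using the cross apply.
If you want to read more about triangular joins and why we should avoid them, Jeff Moden has a great article on the topic. https://www.sqlservercentral.com/articles/hidden-rbar-triangular-joins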
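For anyone who wants to try the cross apply version without creating the view first, here is a self-contained sketch that builds the tally inline. The CTE names E1, E2 and E4 mirror the view above; the Tally CTE name and the DistinctChars alias are my own. Because TOP (LEN(...)) stops generating numbers at the string length, each row reads only as many tally rows as it has characters, instead of joining against every number up to that length.

```sql
DECLARE @Something TABLE (String1 NVARCHAR(4000));
INSERT @Something VALUES (N'1A^'), (N'11'), (N'*'), (N'*A-zz');

WITH
E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) dt(n)),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   -- 100 rows
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),   -- 10,000 rows
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4)
SELECT s.String1,
       DistinctChars = COUNT(DISTINCT SUBSTRING(s.String1, A.N, 1))
FROM @Something s
CROSS APPLY (SELECT TOP (LEN(s.String1)) t.N   -- one number per character
             FROM Tally t
             ORDER BY t.N) A(N)
GROUP BY s.String1;
```

The ORDER BY t.N inside the APPLY is a deliberate addition: TOP without ORDER BY is not guaranteed to return the lowest numbers, so this makes the 1..LEN range explicit.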