Find all nvarchar fields in a database and do a replace(<field>, CHAR(10), '') on them
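I receive data through XML files. I use a third-party component for this.
(Zapsys; I have no affiliation with them, but maybe someone knows their product.)
The data in the XML looks like this: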
<customer>
"Johnny"
</customer>
What I end up with in the table (customers) is an nvarchar column (surname) with the following content:
CHAR(10)JohnnyCHAR(10)
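This happens in every nvarchar field read from the XML. The component really does extract exactly what it reads, but these characters break a lot of statements.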
select * from customers where surname = 'Johnny'
returns no results.
select * from customers where surname like '%Johnny%'
or
select * from customers where replace(surname,char(10),'') = 'Johnny'
do.
Not very pretty.
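A small repro of the three queries above, as a sketch: the table variable @customers is a throwaway stand-in for the real customers table, and the column aliases are only there to label the results.

```sql
-- Throwaway stand-in for the real customers table (assumption for this sketch).
DECLARE @customers TABLE (surname nvarchar(100));
INSERT @customers (surname) VALUES (CHAR(10) + N'Johnny' + CHAR(10));

-- Exact comparison: the leading and trailing line feeds prevent a match.
SELECT COUNT(*) AS exact_match
FROM @customers WHERE surname = N'Johnny';

-- Wildcard comparison: matches, but also matches any value that merely contains 'Johnny'.
SELECT COUNT(*) AS like_match
FROM @customers WHERE surname LIKE N'%Johnny%';

-- Stripping the line feeds first: matches exactly.
SELECT COUNT(*) AS replace_match
FROM @customers WHERE REPLACE(surname, CHAR(10), N'') = N'Johnny';
```

The first query counts 0 rows and the other two count 1, which is the behavior described above.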
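One way to solve this would be a view with a lot of REPLACE statements.
But wouldn't it be nice if I could run a procedure that wipes these CHAR(10) characters from every nvarchar field?
It must be possible to write an update statement that finds all nvarchar fields and runs replace(<field>, CHAR(10), '') on them, right?
To be clearer: I know how update statements work. I am looking for a way to avoid writing an update statement for every field of type (n)varchar in the database.
Update:
Based on @matt's suggestion I came up with this code (see the answer marked as the solution):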
-- Collect one UPDATE statement per nvarchar column of every user table in the schema
declare @temptable table (id int identity(1,1), sql nvarchar(4000))
insert into @temptable(sql)
SELECT 'UPDATE '+quotename(i.TABLE_SCHEMA)+'.'+quotename(i.TABLE_NAME)
+' SET '+quotename(i.COLUMN_NAME)+' = REPLACE('+quotename(i.COLUMN_NAME)+', CHAR(10),'''')'
FROM INFORMATION_SCHEMA.COLUMNS i
inner join sys.tables t on i.TABLE_NAME = t.name
and i.TABLE_SCHEMA = schema_name(t.schema_id) -- match the schema too, so same-named tables in other schemas do not duplicate rows
WHERE i.DATA_TYPE = 'NVARCHAR'
and t.type = 'U'
and i.TABLE_SCHEMA = 'myschema'
-- Execute the generated statements one at a time
declare @i as int = 1
declare @sql as nvarchar(max)
declare @max as int = (select max(id) from @temptable)
while @i <= @max
BEGIN
set @sql = (select [sql] from @temptable where id = @i)
exec sp_executesql @sql
--print cast(@i as varchar(5)) + '/'+cast(@max as varchar(5)) + ' done, ' +cast(@max-@i as varchar(5)) + ' to go...'
set @sql = ''
set @i = @i+1
END
Of course, you could run an update on that surname field as part of the import process. Something like this would work for you:
UPDATE customers
SET surname = replace(surname,char(10),'')
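As a variation on the statement above, a WHERE clause can limit the update to rows that actually contain a line feed, so untouched rows are not rewritten. This is only a sketch: the table variable @customers and its sample rows are assumptions standing in for the real table.

```sql
-- Throwaway stand-in for the real customers table (assumption for this sketch).
DECLARE @customers TABLE (surname nvarchar(100));
INSERT @customers (surname)
VALUES (CHAR(10) + N'Johnny' + CHAR(10)), (N'Smith');

-- Only rows that contain a line feed are updated.
UPDATE @customers
SET    surname = REPLACE(surname, CHAR(10), N'')
WHERE  surname LIKE N'%' + CHAR(10) + N'%';

-- Number of rows the UPDATE touched.
SELECT @@ROWCOUNT AS rows_changed;
```

With the two sample rows, only the 'Johnny' row is touched; 'Smith' is left alone.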
Or you could use some dynamic SQL like this to generate the update statements, which you can quickly change so that it executes them:
SELECT 'UPDATE '+TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME+' SET
'+COLUMN_NAME+' = REPLACE('+COLUMN_NAME+', CHAR(10),'''')'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE = 'NVARCHAR'
This should give you a list of columns to build a cursor around:
select COLUMN_NAME
from INFORMATION_SCHEMA.COLUMNS
where DATA_TYPE in ('varchar','nvarchar')
and TABLE_NAME = 'your table name'
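A cursor built around that column list could look roughly like the sketch below. The table name customers, the cursor name col_cursor and the variable names are assumptions for illustration, and the table is assumed to live in the default schema. The statements are only printed; the EXEC line is commented out so nothing changes until you have reviewed the output.

```sql
DECLARE @table  sysname = N'customers';  -- hypothetical table name
DECLARE @column sysname;
DECLARE @sql    nvarchar(max);

-- One row per (n)varchar column of the target table.
DECLARE col_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE DATA_TYPE IN ('varchar', 'nvarchar')
      AND TABLE_NAME = @table;

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @column;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build an UPDATE that strips CHAR(10) and only touches affected rows.
    SET @sql = N'UPDATE ' + QUOTENAME(@table)
             + N' SET ' + QUOTENAME(@column)
             + N' = REPLACE(' + QUOTENAME(@column) + N', CHAR(10), '''')'
             + N' WHERE ' + QUOTENAME(@column) + N' LIKE ''%'' + CHAR(10) + ''%'';';

    PRINT @sql;                        -- review the generated statement first
    -- EXEC sys.sp_executesql @sql;    -- uncomment to apply it

    FETCH NEXT FROM col_cursor INTO @column;
END

CLOSE col_cursor;
DEALLOCATE col_cursor;
```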
This one works more smoothly.
First you need a good N-Grams function such as the one covered here. The version I am including below is the NVARCHAR(4000) version (kudos to Larnu for his contribution). I used NGramsN4K to build an NVARCHAR(4000) PatReplace function. I use a different schema for my functions, but dbo will work fine.
Note that:
SELECT pr.NewString
FROM samd.patReplaceN4K('ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;
Returns: ABC123XYZ
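Every character that matches the pattern [^0-9a-zA-Z] (that is, every character that is not a digit or an ASCII letter) has been removed.
Now let's use the function against records that contain bad characters, remove those characters, and join the result to a table with good values. Note my comments.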
-- Sample data
DECLARE @Customers TABLE (CustomerId INT IDENTITY, Surname NVARCHAR(100));
DECLARE @GoodValues TABLE (Surname NVARCHAR(100));
INSERT @Customers (Surname) VALUES (CHAR(10)+'Johnny'+CHAR(10)),('Smith'),('Jones'+CHAR(160));
INSERT @goodvalues (Surname) VALUES('Johnny'),('Smith'),('Jones'),('James');
-- Fail:
SELECT c.CustomerId, g.Surname
FROM @Customers AS c
JOIN @GoodValues AS g
ON c.Surname = g.Surname;
-- Success:
SELECT c.CustomerId, g.Surname
FROM @Customers AS c
CROSS APPLY samd.patreplaceN4K(c.Surname,'[^0-9a-zA-Z ]','') AS pr
JOIN @GoodValues AS g
ON pr.newString = g.Surname;
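The NGramsN4K function below lists core.rangeAB as a dependency, and that function is not included in this answer. If you do not have it, a minimal stand-in could look like the sketch below. This is an assumption on my part, not the original rangeAB: the parameter names are hypothetical, @Gap and @Row1 are accepted but ignored, and it only covers the call shape used by NGramsN4K, rangeAB(1, <high>, 1, 1), for up to 4096 rows.

```sql
-- Hypothetical minimal stand-in for core.rangeAB; NOT the original function.
-- It only supports the call shape used by samd.NGramsN4K: rangeAB(1, @High, 1, 1).
CREATE SCHEMA core;
GO
CREATE FUNCTION core.rangeAB
(
  @Low  BIGINT, -- first value; this sketch assumes 1
  @High BIGINT, -- last value; at most 4096 in this sketch
  @Gap  BIGINT, -- accepted for signature compatibility, ignored here
  @Row1 BIT     -- accepted for signature compatibility, ignored here
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),
                                     (1),(1),(1),(1),(1),(1),(1),(1)) AS v(N)), -- 16 rows
     L2(N) AS (SELECT 1 FROM L1 AS a CROSS JOIN L1 AS b CROSS JOIN L1 AS c)     -- 16^3 = 4096 rows
SELECT TOP (CASE WHEN @High >= @Low THEN @High - @Low + 1 ELSE 0 END)
       RN = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- 1, 2, 3, ... up to @High
FROM L2;
GO
```

The only column NGramsN4K reads is RN, so that is the only column this sketch returns. Skip the CREATE SCHEMA batch if the core schema already exists.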
samd.NGramsN4K
CREATE FUNCTION samd.NGramsN4K
(
@string NVARCHAR(4000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 NVARCHAR characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.position, ng.token
FROM samd.NGramsN4K(@string,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable AS s
CROSS APPLY samd.NGramsN4K(s.SomeValue,@N) AS ng;
[Parameters]:
@string = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = bigint; the position of the token in the input string
token = NVARCHAR(4000); a @N-sized character-level N-Gram token
[Dependencies]:
1. core.rangeAB (iTVF)
[Developer Notes]:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Turn the string, 'ɰɰXɰɰ' into unigrams, bigrams and trigrams
DECLARE @string NVARCHAR(4000) = N'ɰɰXɰɰ';
BEGIN
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,1) AS ng; -- unigrams (@N=1)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,2) AS ng; -- bigrams (@N=2)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,3) AS ng; -- trigrams (@N=3)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,4) AS ng; -- 4-grams (@N=4)
END
--===== 2. Scenarios where the function would not return rows
SELECT ng.Position, ng.Token FROM samd.NGramsN4K('abcd',5) AS ng; -- 5-grams (@N=5)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', 0) AS ng;
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', NULL) AS ng;
This will fail:
--SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x',-1) AS ng;
--===== 3. How many times the substring "ƒƓ" appears in each record
BEGIN
DECLARE @table TABLE(stringID int identity primary key, string NVARCHAR(100));
INSERT @table(string)
VALUES (N'ƒƓ123ƒƓ'),(N'123ƒƓƒƓƒƓ'),(N'!ƒƓ!ƒƓ!'),(N'ƒƓ-ƒƓ-ƒƓ-ƒƓ-ƒƓ');
SELECT t.String, Occurances = COUNT(*)
FROM @table AS t
CROSS APPLY samd.NGramsN4K(t.string,2) AS ng
WHERE ng.token = N'ƒƓ'
GROUP BY t.string;
END;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20180829 - Changed TOP logic and startup-predicate logic in the WHERE clause
- Alan Burstein
Rev 02 - 20191129 - Redesigned to leverage rangeAB - Alan Burstein
Rev 03 - 20200416 - changed the cast from NCHAR(4000) to NVARCHAR(4000)
- Removed: WHERE @N BETWEEN 1 AND s.Ln; this must now be handled
manually moving forward. - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
Position = r.RN, -- Token Position
Token = CAST(SUBSTRING(@string,r.RN,@N) AS NVARCHAR(4000)) -- @N-Sized Token
FROM (VALUES(DATALENGTH(ISNULL(NULLIF(@string,N''),N'X'))/2)) AS s(Ln)
CROSS APPLY core.rangeAB(1,s.Ln-(ISNULL(@N,1)-1),1,1) AS r
GO
samd.patReplaceN4K
CREATE FUNCTION samd.patReplaceN4K
(
@string NVARCHAR(4000), -- Input String
@pattern NVARCHAR(50), -- Pattern to match/replace
@replace NVARCHAR(20) -- What to replace the matched pattern with
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace)
patReplaceN4K will replace any character in @string that matches the @Pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplaceN4K(@String,@Pattern,@Replace) AS pr;
[Parameters]:
@string = NVARCHAR(4000); The input string to manipulate
@pattern = NVARCHAR(50); The pattern to match/replace
@replace = NVARCHAR(20); What to replace the matched pattern with
[Returns]:
Inline Table Valued Function returns:
NewString = NVARCHAR(4000); The new string with all instances of @Pattern replaced with
The value of @Replace.
[Dependencies]:
samd.NGramsN4K (iTVF)
[Developer Notes]:
1. @Pattern IS case sensitive but can be easily modified to make it case insensitive
2. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
3. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
4. As is the case with functions which leverage samd.ngrams or samd.ngramsN4k,
samd.patReplaceN4K is almost always dramatically faster with a parallel execution
plan. One way to get a parallel query plan (if the optimizer does not choose one) is
to use make_parallel by Adam Machanic found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
On my PC (8 logical CPU, 64GB RAM, SQL 2019) samd.patReplaceN4K is about 4X
faster when executed using all 8 of my logical CPUs.
5. samd.patReplaceN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Remove non alphanumeric characters
SELECT pr.NewString
FROM samd.patReplaceN4K('ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;
--===== 2. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplaceN4K('My phone number is 555-2211','[0-9]','*') AS pr;
--==== 3. Using against a table
DECLARE @table TABLE(OldString varchar(60));
INSERT @table VALUES ('Call me at 555-222-6666'), ('phone number: (312)555-2323'),
('He can be reached at 444.665.4466 on Monday.');
SELECT t.OldString, pr.NewString
FROM @table AS t
CROSS APPLY samd.patReplaceN4K(t.oldstring,'[0-9]','*') AS pr;
[Revision History]:
-----------------------------------------------------------------------------------------
Rev 01 - 20200422 - Created - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT CASE WHEN @string = a.Blank THEN a.Blank ELSE
CASE WHEN PATINDEX(@pattern,a.Token)&0x01=0 THEN ng.token ELSE @replace END END
FROM samd.NGramsN4K(@string,1) AS ng
CROSS APPLY (VALUES(CAST('' AS NVARCHAR(4000)),
ng.token COLLATE Latin1_General_BIN)) AS a(Blank,Token)
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('text()[1]', 'NVARCHAR(4000)');
GO