How to find invalid characters in a SQL table
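This has been a headache for the past few weeks. I have a fairly large table (165 columns x 11,000+ rows). Several comment columns in this table are defined as varchar(max). One in particular keeps receiving invalid characters that various users paste into it. This causes reports in SSRS to fail. I then have to go find the invalid characters and remove them, which is tedious and time-consuming.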
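What I would like is a way to search for these invalid characters automatically and replace them with an empty string. The problem is that I don't know how to search for them directly. The original post showed them in three screenshots (two of the data as displayed, one of the same text pasted into Notepad++).
I'm not sure they will render here the way I see them, but these are the characters: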
㹊潮Ņࢹᖈư㹨ƶ槹鎤⻄ƺ綐ڌ⸀ƺ삸)䀤ƍ샄)Ņᛡ鎤ꗘᖃᒨ쬵Ğᘍ鎤ᐜᏰ>֔υ赸Ƹ쳰డ촜)鉀촜)쮜)Ἡ屰山舰霡ࣆ 耏Аం畠Ư놐ᓜતᏛ֔Ꮫ֨Ꮫᓜƒ
邰厰ఆ邰드)抉鎤듄)繟Ĺ띨)ࢹ䮸ࣉࢹ䮸ࣉ샰)ԌƏŅᕄ홑Ņᛙ鎤ꗘᖃᒨࢹ
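They look like Chinese or something similar, but when I tried Google Translate it detected them as English.
Any help figuring out how to search for these? A function or a stored procedure would be fine, as long as it works!
Update
I tried part of a solution I found here: How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table?
and used this: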
-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)
-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
begin
    -- Uses | to escape, could be any character
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
end
select MEETING_NOTES from pmdb.TrackerData
where MEETING_NOTES like '%[^' + @str + ']%' escape '|'
But it returns far more rows than it should. I currently have only 1 row containing these invalid characters, and it returns 1708.
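A commonly used variant of this search expresses the allowed set as a character range and forces a binary collation, so the range is compared by code point instead of by the column's linguistic collation. The sketch below is only an illustration: the table variable @t, its columns, and the sample rows are made up, Latin1_General_BIN is an assumed choice of collation, and tab, line feed, and carriage return are treated as valid. Whether collation explains the 1708 rows above is not established.

```sql
-- Hypothetical sample data; @t, id and notes are illustrative names only.
DECLARE @t TABLE (id int, notes nvarchar(max));

INSERT INTO @t (id, notes)
VALUES (1, N'plain ASCII text'),
       (2, N'bad ' + NCHAR(9731) + N' char'),               -- arbitrary non-ASCII character
       (3, N'line1' + NCHAR(13) + NCHAR(10) + N'line2');    -- CR/LF only

-- [^ -~...] matches any character outside space (32) through tilde (126),
-- with tab, LF and CR added to the allowed set.
-- The binary collation makes the range follow code point order.
SELECT id, notes
FROM @t
WHERE notes COLLATE Latin1_General_BIN
      LIKE N'%[^ -~' + NCHAR(9) + NCHAR(10) + NCHAR(13) + N']%';
-- Intended to return only id = 2.
```

Against the real table the same predicate would be applied to MEETING_NOTES in pmdb.TrackerData.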
Update 2
I created a function to try to remove all the invalid characters, shown below:
ALTER FUNCTION [dbo].[RemoveNonPrintable]
(
@inputtext nvarchar(max)
)
RETURNS nvarchar(max)
AS
BEGIN
DECLARE @counter int = 1;
DECLARE @colString nvarchar(1000)
set @inputtext = REPLACE(@inputtext, char(0), '') -- 'NULL'
set @inputtext = REPLACE(@inputtext, char(1), '') -- 'Start of Heading'
set @inputtext = REPLACE(@inputtext, char(2), '') -- 'Start of Text'
set @inputtext = REPLACE(@inputtext, char(3), '') -- 'End of Text'
set @inputtext = REPLACE(@inputtext, char(4), '') -- 'End of Transmission'
set @inputtext = REPLACE(@inputtext, char(5), '') -- 'Enquiry'
set @inputtext = REPLACE(@inputtext, char(6), '') -- 'Acknowledgement'
set @inputtext = REPLACE(@inputtext, char(7), '') -- 'Bell'
set @inputtext = REPLACE(@inputtext, char(8), '') -- 'Backspace'
set @inputtext = REPLACE(@inputtext, char(9), '') -- 'Horizontal Tab'
-- replace line feed with blank, so words that were in different lines before are still separated
set @inputtext = REPLACE(@inputtext, char(10), ' ') -- 'Line Feed'
set @inputtext = REPLACE(@inputtext, char(11), '') -- 'Vertical Tab'
set @inputtext = REPLACE(@inputtext, char(12), '') -- 'Form Feed'
-- replace carriage return with blank, so words that were in different lines before are still separated
set @inputtext = REPLACE(@inputtext, char(13), ' ') -- 'Carriage Return'
set @inputtext = REPLACE(@inputtext, char(14), '') -- 'Shift Out'
set @inputtext = REPLACE(@inputtext, char(15), '') -- 'Shift In'
set @inputtext = REPLACE(@inputtext, char(16), '') -- 'Data Link Escape'
set @inputtext = REPLACE(@inputtext, char(17), '') -- 'Device Control 1'
set @inputtext = REPLACE(@inputtext, char(18), '') -- 'Device Control 2'
set @inputtext = REPLACE(@inputtext, char(19), '') -- 'Device Control 3'
set @inputtext = REPLACE(@inputtext, char(20), '') -- 'Device Control 4'
set @inputtext = REPLACE(@inputtext, char(21), '') -- 'Negative Acknowledgment'
set @inputtext = REPLACE(@inputtext, char(22), '') -- 'Synchronous Idle'
set @inputtext = REPLACE(@inputtext, char(23), '') -- 'End of Transmission Block'
set @inputtext = REPLACE(@inputtext, char(24), '') -- 'Cancel'
set @inputtext = REPLACE(@inputtext, char(25), '') -- 'End of Medium'
set @inputtext = REPLACE(@inputtext, char(26), '') -- 'Substitute'
set @inputtext = REPLACE(@inputtext, char(27), '') -- 'Escape'
set @inputtext = REPLACE(@inputtext, char(28), '') -- 'File Separator'
set @inputtext = REPLACE(@inputtext, char(29), '') -- 'Group Separator'
set @inputtext = REPLACE(@inputtext, char(30), '') -- 'Record Separator'
set @inputtext = REPLACE(@inputtext, char(31), '') -- 'Unit Separator'
set @inputtext = REPLACE(@inputtext, char(127), '') -- 'Delete'
set @colString = @inputtext
WHILE @counter <= DATALENGTH(@colString)
BEGIN
set @colString = REPLACE(@colString,isnull(NCHAR(UNICODE(SUBSTRING(@colString, @counter, 1))),'|'),'|')
set @colString = REPLACE(@colString,'|','')
SET @counter = @counter + 1
END
return @inputtext
END
I call it like this:
BEGIN TRAN --COMMIT ROLLBACK
update pmdb.TrackerData
set CIRCUIT_COMMENTS = [dbo].[RemoveNonPrintable](CIRCUIT_COMMENTS),
COE_COMMENTS = [dbo].[RemoveNonPrintable](COE_COMMENTS),
MEETING_NOTES = [dbo].[RemoveNonPrintable](MEETING_NOTES),
OSP_COMMENTS = [dbo].[RemoveNonPrintable](OSP_COMMENTS),
COE_COMMENTS2 = [dbo].[RemoveNonPrintable](COE_COMMENTS2)
Then I reran the query from the previous update to see whether anything had changed. No difference. What gives? Am I doing something wrong?
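Two details of the function above may be worth checking. It returns @inputtext, while the loop only modifies @colString, so nothing the loop does reaches the caller. The loop bound also uses DATALENGTH, which counts bytes rather than characters, and @colString is declared as nvarchar(1000), so longer values are truncated when copied into it. The small sketch below illustrates only the byte-versus-character point, using a made-up three-character value.

```sql
-- Illustrative value only; @s is not part of the original function.
DECLARE @s nvarchar(100) = N'abc';

SELECT LEN(@s)        AS char_count,   -- 3 characters
       DATALENGTH(@s) AS byte_count;   -- 6 bytes: nvarchar stores 2 bytes per character here
```

For a character-by-character loop over an nvarchar value, LEN is usually the bound that matches SUBSTRING positions.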
Edit 3
I have updated my function to contain this:
set @colString = @inputtext
WHILE @counter <= DATALENGTH(@colString)
BEGIN
    --set @colString = REPLACE(@colString,isnull(NCHAR(UNICODE(SUBSTRING(@colString, @counter, 1))),'|'),'|')
    --set @colString = REPLACE(@colString,'|','')
    if (UNICODE(SUBSTRING(@colString, @counter,1)) > 126)
    BEGIN
        SET @colString = REPLACE(@colString, CONVERT(nvarchar(1),(SUBSTRING(@colString, @counter,1))), CHAR(32))
    END
    ELSE IF(UNICODE(SUBSTRING(@colString, @counter, 1)) < 32)
    BEGIN
        SET @colString = REPLACE(@colString, CONVERT(nvarchar(1),(SUBSTRING(@colString, @counter,1))), CHAR(32))
    END
    set @inputtext = @colString
    SET @counter = @counter + 1
END
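It removed most of the invalid characters but left others behind. I called it on a temp table I created, which holds the sample of invalid characters shown above, like this: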
update #Temp
set Notes = [dbo].[RemoveNonPrintable](Notes),
Notes2 = [dbo].[RemoveNonPrintable](Notes2)
I am then left with the following in the two notes columns:
Notes: ????N???u?z?????????)???)?N??????G????>???????)???)?)???????? ????U?????????? ???????)???)?L?)?????????)?????N???N???????
Notes2: ࢹᖈ 㹨 ⻄ ⸀ )䀤 ) ᛡ ꗘᖃᒨ ᘍ ᐜᏰ>֔ ) ) )Ἡ ࣆ ᓜ Ꮫ֔Ꮫ֨Ꮫᓜ ) ) )ࢹ䮸ࣉࢹ䮸ࣉ )Ԍ ᕄ ᛙ ꗘᖃᒨࢹ
That is better than where I started, but still not good enough.
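The question marks in Notes are consistent with how SQL Server converts nvarchar data to varchar: a character with no equivalent in the target code page is replaced with '?'. The sketch below assumes a database whose default collation uses a Latin code page (for example SQL_Latin1_General_CP1_CI_AS); the variable name and the sample character are arbitrary.

```sql
-- @n is an illustrative variable; NCHAR(9731) is an arbitrary non-Latin character.
DECLARE @n nvarchar(10) = NCHAR(9731) + N'A';

-- Under a Latin code page the first character typically cannot be represented,
-- so the conversion yields a literal question mark followed by 'A'.
SELECT CAST(@n AS varchar(10)) AS converted;
```

If that is what happened to the Notes column, the original characters are already gone from it and only the substituted '?' remains to be cleaned up.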
I found a solution in another user's question here, although I modified it slightly. What ultimately worked for me is:
ALTER FUNCTION [dbo].[RemoveNonASCII]
(
    -- Parameters
    @nstring nvarchar(max)
)
RETURNS varchar(max)
AS
BEGIN
    -- Variables
    DECLARE @Result varchar(max) = '', @nchar nvarchar(1), @position int
    -- T-SQL statements to compute the return value
    set @position = 1
    while @position <= LEN(@nstring)
    BEGIN
        set @nchar = SUBSTRING(@nstring, @position, 1)
        if UNICODE(@nchar) between 32 and 127
            set @Result = @Result + @nchar
        set @position = @position + 1
        set @Result = REPLACE(@Result,'))','')
        set @Result = REPLACE(@Result,'?','')
    END
    -- Return the result
    RETURN @Result
END
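A quick way to sanity-check the function before running an UPDATE is to call it on a literal and then preview the affected rows. This is a sketch that assumes dbo.RemoveNonASCII already exists as defined above; the literal and the column alias are made up.

```sql
-- Single-value check: the non-ASCII character in the middle is dropped.
SELECT dbo.RemoveNonASCII(N'abc' + NCHAR(9731) + N'def') AS cleaned;  -- 'abcdef'

-- Preview which rows would change, without modifying anything.
SELECT MEETING_NOTES,
       dbo.RemoveNonASCII(MEETING_NOTES) AS cleaned
FROM pmdb.TrackerData
WHERE MEETING_NOTES <> dbo.RemoveNonASCII(MEETING_NOTES);
```

As written, the function also removes every literal '?' and every '))' pair from otherwise valid text, and because LEN ignores trailing spaces, trailing spaces are dropped as well. A scalar function that loops per character can be slow on large varchar(max) values, so trying it on a copy of the table first seems prudent.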