How to Scrub Characters from Text in SQL Server

One particularly tedious task that folks developing ETL operations in T-SQL face at least once in their careers is the task of scrubbing textual data of unwanted – and sometimes non-displayable – characters.

Somehow these special characters that cannot be displayed end up in the data; few know how or why and obviously they cannot be seen so we find out about them when some process chokes in the dead of night on a weekend because the antiquated “BEL” character somehow showed up in the data.

Perhaps your marketing team copy/pastes product descriptions from a manufacturer’s website (special characters and all) into a product’s description field. The bullet character is preserved and other applications are unsure how to display this alien character.

We don’t want to remove the bullet character, rather we’d like to replace it with an asterisk; maybe we are storing text data in a data-type that cannot handle characters with diacritical marks and we want to make ñ characters just be n.

The task of scrubbing the data has fallen to us and we are not quite sure how best to approach the problem. We are sort of interested in a performant solution, but we want accuracy and versatility. Below we’ll walk through a simple solution to remove or replace characters in text data.

Let’s start by outlining the major components:

1. A scalar function that text data runs through and comes out clean
2. A table that stores our character substitution mappings. Characters to be removed map to an empty value.

The below solution uses PATINDEX to identify characters and STUFF to replace them:

DECLARE @strInput NVARCHAR(4000);

DECLARE @intPosition SMALLINT; –Position of bad character

DECLARE @strPattern NVARCHAR(4000); –Bad characters to look for

SET @strInput = ‘José Calderón’;

DECLARE @tblCharacterMappings TABLE

(

tKey NVARCHAR(2) PRIMARY KEY,

tValue NVARCHAR(2)

);

INSERT INTO @tblCharacterMappings (tKey, tValue) VALUES

(N’é’, N’e’),

(N’ó’, N’o’);

SELECT @strPattern = SUBSTRING

(

SELECT

” + tKey

FROM

@tblCharacterMappings

FOR XML PATH(”)

), 1, 10000

);

SET @strPattern = ‘%[‘+ @strPattern + ‘]%’;

SET @intPosition = PATINDEX(@strPattern, @strInput);

WHILE @intPosition > 0

BEGIN

SELECT

@strInput = STUFF(@strInput, @intPosition, 1, tValue),

@intPosition = PATINDEX(@strPattern, @strInput)

FROM

@tblCharacterMappings

WHERE

tKey = SUBSTRING(@strInput, @intPosition, 1);

END

SELECT @strInput;

Closing Thoughts

This simple solution could easily be made into a scalar function, if you want; however, be cautious of how you use scalar functions. Try to keep them out of the WHERE clause unless you know what you are doing and certainly keep them out of your JOIN logic!

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

How to Scrub Characters from Text in SQL Server

Ryan Geide

Categories

Archives

411 University Ridge Suite 230
Greenville, SC 29601

(864) 421-9247

info@kopisusa.com

Quick Links

Affiliations

How to Scrub Characters from Text in SQL Server

Ryan Geide

Categories

Archives

411 University Ridge Suite 230 Greenville, SC 29601

(864) 421-9247

info@kopisusa.com

Quick Links

Affiliations

Book A Discovery Call

411 University Ridge Suite 230
Greenville, SC 29601