One particularly tedious task that folks developing ETL operations in T-SQL face at least once in their careers is the task of scrubbing textual data of unwanted – and sometimes non-displayable – characters.
Somehow these special characters that cannot be displayed end up in the data; few know how or why and obviously they cannot be seen so we find out about them when some process chokes in the dead of night on a weekend because the antiquated “BEL” character somehow showed up in the data.
Perhaps your marketing team copy/pastes product descriptions from a manufacturer’s website (special characters and all) into a product’s description field. The bullet character is preserved and other applications are unsure how to display this alien character.
We don’t want to remove the bullet character, rather we’d like to replace it with an asterisk; maybe we are storing text data in a data-type that cannot handle characters with diacritical marks and we want to make ñ characters just be n.
The task of scrubbing the data has fallen to us and we are not quite sure how best to approach the problem. We are sort of interested in a performant solution, but we want accuracy and versatility. Below we’ll walk through a simple solution to remove or replace characters in text data.
Let’s start by outlining the major components:
1. A scalar function that text data runs through and comes out clean
2. A table that stores our character substitution mappings. Characters to be removed map to an empty value.
The below solution uses PATINDEX to identify characters and STUFF to replace them:
DECLARE @strInput NVARCHAR(4000);
DECLARE @intPosition SMALLINT; –Position of bad character
DECLARE @strPattern NVARCHAR(4000); –Bad characters to look for
SET @strInput = ‘José Calderón’;
DECLARE @tblCharacterMappings TABLE
(
tKey NVARCHAR(2) PRIMARY KEY,
tValue NVARCHAR(2)
);
INSERT INTO @tblCharacterMappings (tKey, tValue) VALUES
(N’é’, N’e’),
(N’ó’, N’o’);
SELECT @strPattern = SUBSTRING
(
(
SELECT
” + tKey
FROM
@tblCharacterMappings
FOR XML PATH(”)
), 1, 10000
);
SET @strPattern = ‘%[‘+ @strPattern + ‘]%’;
SET @intPosition = PATINDEX(@strPattern, @strInput);
WHILE @intPosition > 0
BEGIN
SELECT
@strInput = STUFF(@strInput, @intPosition, 1, tValue),
@intPosition = PATINDEX(@strPattern, @strInput)
FROM
@tblCharacterMappings
WHERE
tKey = SUBSTRING(@strInput, @intPosition, 1);
END
SELECT @strInput;
Closing Thoughts
This simple solution could easily be made into a scalar function, if you want; however, be cautious of how you use scalar functions. Try to keep them out of the WHERE clause unless you know what you are doing and certainly keep them out of your JOIN logic!