This post isn’t related to the series I’m currently writing on Full-Text search operators , but it is in the same ball-park so I thought I’d throw it up here.

Today I was working on our search term auto-suggest system (i.e. the system that shows a list of suggestions in a drop-down list below the search box as you type your search term in), and needed to improve the way we strip out “illegal” characters from our dictionary terms.

Illegal characters in our case refers to pretty much anything that isn’t alphanumeric, or a space, dollar sign, ampersand, hyphen, or period.  We don’t even want any accented characters (i.e. letters with diacritics ) – although in this case we want to replace accented characters with their un-accented counterparts (e.g. Å should be replaced by A).

The way I did this (and I’m sure there are many, many better ways) is to insert illegal characters into a table variable, along with a replacement character, and then update the input variable for each row in this table variable – replacing any matching illegal characters each time with it’s replacement value.

IF OBJECT_ID('[dbo].[fn_clean_search_term]') IS NULL
 EXEC('CREATE FUNCTION [dbo].[fn_clean_search_term] () RETURNS INT AS BEGIN RETURN(0); END ');
GO
/***************************************************************************************************************************
This accepts a string input, and outputs the same string, but with any invalid characters removed, or replaced if they
are accented characters. Note: not all accented characters are accounted for, only the most common ones in our data.
Also, this does not handle Unicode characters.

Author : David Curlewis
Date : 07/2009

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
http://creativecommons.org/licenses/by-nc/3.0/
****************************************************************************************************************************/
ALTER FUNCTION [dbo].[fn_clean_search_term] (
 @keyword VARCHAR(2048)
) RETURNS VARCHAR(2048) WITH SCHEMABINDING
AS
BEGIN

 -- Only do the heavy lifting if there are non-alphanumeric characters in the string
 IF PATINDEX('%[^A-Za-z0-9 ]%', @keyword) > 0
 BEGIN

 -- if the string contains anything other than "valid" characters, strip them out
 DECLARE @chars TABLE ([ascii] TINYINT NOT NULL, [new_char] VARCHAR(1));

 ;WITH illegal_chars AS
 (
 SELECT 0 AS [ascii]
 UNION ALL
 SELECT [ascii] + 1
 FROM illegal_chars
 WHERE [ascii] 9
 AND [ascii] NOT BETWEEN 65 AND 90 -- A -> Z
 AND [ascii] NOT BETWEEN 97 AND 122 -- a -> z
 AND [ascii] NOT IN ( -- explicitly allow the following characters
 32, -- 
 36, -- $
 38, -- &
 45, -- -
 46 -- .
 )
 OPTION (MAXRECURSION 255);

 UPDATE @chars
 SET @keyword = REPLACE(@keyword, CHAR([ascii]), [new_char])

 END

 -- remove multiple spaces
 WHILE CHARINDEX(' ', @keyword) > 0
 SELECT @keyword = REPLACE(@keyword, ' ', ' ');

 SELECT @keyword = LTRIM(RTRIM(@keyword)); -- trim leading & trailing whitespace
 SELECT @keyword = LOWER(@keyword); -- return lower-case only

 RETURN @keyword;
END
GO

Let me know if you have any questions, or different/better ways of doing this (yes, I know I could do it using a couple of lines of .Net in a CLR function, but I don’t have CLR enabled on this server). 😊

Cheers

DB Dave