A Tip to Optimise LIKE Searches by Erland Sommarskog

This article first appeared in the SQLServerGeeks Magazine.
Author: Erland Sommarskog
Subscribe to get your copy.

Consider this query

SELECT * FROM tbl WHERE col LIKE '%alter%'

The table has some size and the text in the column is long and the query is slow. What options are there to speed it up?

The best option in many situations is to use full-text search, which is a feature that ships with SQL Server (although, at Setup time it is optional to install full-text support). With a full-text index in place the query can be written as:

SELECT * FROM tbl WHERE CONTAINS(col, 'alter')

And the response time will be really good.
Unfortunately, full-text is not an option if users want to find alter also if it is in the middle of a word such as inalterable. This is because full-text builds an index on words. Likewise, full-text is not an option, if users want include punctuation characters in their search strings, because full-text strips those out.
There is a second option to build a fast index-based solution, by using something known as n-grams, but there is no built-in support for this in SQL Server and it is very heavy artillery. Thus, in most cases you will have to let it suffice with LIKE and wildcard patterns. You may have understood that one reason LIKE searches with leading wildcards are slow is that no index can be used, so they result in a scan.
However, there is a second reason why these searches are slow, and I like to share a tip on how you easily can factor out that part to speed things up a little bit.

To understand this second reason, we start by creating this somewhat non-sensical table:

CREATE TABLE guids (
ident    bigint       NOT NULL IDENTITY,
Windows_nvarchar nvarchar(200)
COLLATE Latin1_General_100_CI_AS NOT NULL,
SQL_nvarchar     nvarchar(200)
COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
Windows_varchar  varchar(200)
COLLATE Latin1_General__100_CI_AS NOT NULL,
SQL_varchar      varchar(200)
COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)

You may note that there is one pair of nvarchar columns and one pair of varchar columns, likewise there is one pair of columns with a Windows collation and one pair of columns with an SQL collation.

Next, we fill it up with some data:

INSERT guids (Windows_nvarchar, SQL_nvarchar, Windows_varchar, SQL_varchar)
SELECT TOP (1000000) 
concat_ws(' ', newid(), newid(), newid(), newid()),
concat_ws(' ', newid(), newid(), newid(), newid()),
concat_ws(' ', newid(), newid(), newid(), newid()),
concat_ws(' ', newid(), newid(), newid(), newid())
   FROM   sys.columns a
   CROSS  JOIN sys.columns b

This results in a lot of strings that are 147 characters long.

Let’s now run a LIKE search on the Windows_nvarchar column:

DECLARE @d         datetime2(3) = sysdatetime(),
@searchstr varchar(50)  = '%abc%',
@cnt       int
SELECT @cnt = COUNT(*) FROM guids WHERE Windows_nvarchar LIKE @searchstr
PRINT concat('Windows_nvarchar: ', datediff(ms, @d, sysdatetime()), ' ms. ',
'Count: ', @cnt)

The output I got was:

Windows_nvarchar: 4422 ms. Count: 23070

When I run the same search for the columns SQL_nvarchar and Windows_varchar, the result is similar:

SQL_nvarchar: 3752 ms. Count: 23239
Windows_varchar: 4328 ms. Count: 23073

But when I try the last column, the result is strikingly different:

SQL_varchar: 1062 ms. Count: 22921

Compared to the original search, this is a speed-up with a factor of four.
Why does this happen? With a LIKE search with a leading wildcard, SQL Server does not only have to scan the index; it also has to scan the strings. For a column which does not include the search string, SQL Server has to scan the string from start almost to the end to see if there is a character that is equal to a according to the current collation. Depending on the type of collation, that could be an A, á or Å. Then again, if the collation is accent-sensitive, an a followed by a combining acute accent is not a match, because logically that is the same as á. You may not have heard of combining accents before, but they are a feature of Unicode, and applying the full rules of Unicode is quite complex, and when SQL Server has to do this for every character, it starts to add up.

This obviously happens if you have nvarchar, since that is Unicode. But it also happens if you have a Windows collation and varchar, because with a Windows collation all operations are performed in Unicode also for varchar.

With an SQL collation and varchar, this is different. The definition of varchar in SQL collations goes back to SQL 6.5 and earlier when SQL Server did not support Unicode. For this reason, SQL collations have a completely different library with its own set of rules. And particularly, this library only has to bother about the 255 characters in the code page that the SQL collation supports for varchar and therefore character comparison is a much simpler and faster business.

So is the lesson that we should use SQL collations with varchar for LIKE searches? Not really. If you design a new application, there is no reason not to design it for full international support to be future-proof. This means that you should use nvarchar, or, if you are on SQL 2019 or Azure, use varchar with a UTF8 collation. But varchar with an SQL collation is a poor choice.

However, there is a second way to avoid the complex Unicode rules: cast the search column to a binary collation:

DECLARE @d         datetime2(3) = sysdatetime(),
@searchstr varchar(50)  = '%abc%',
@cnt       int
SELECT @cnt = COUNT(*) FROM guids 
WHERE Windows_nvarchar COLLATE Latin1_General_100_BIN2 LIKE @searchstr
PRINT concat('Binary collation: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
'Count: ', @cnt)

This is even faster that the SQL collation:

Binary collation: 859 ms. Count: 0

With a binary collation, all that matters is the code point. But this also has the side effect that the comparison is case-, accent- and everything-else sensitive. You can see that the count is 0, and this is due to that the search string is lowercase but the column data is all uppercase.

In most applications, users want a case-insensitive search, so this does not seem useful. However, this is a problem we can overcome with a small modification: we can apply the upper function to both the column and the search string:

DECLARE @d         datetime2(3) = sysdatetime(),
@searchstr varchar(50)  = '%abc%',
@cnt       int
SELECT @cnt = COUNT(*) FROM guids 
WHERE upper(Windows_nvarchar) COLLATE Latin1_General_100_BIN2 
LIKE upper(@searchstr)
PRINT concat('Binary collation with upper: ', 
datediff(ms, @d, sysdatetime()), ' ms. ', 
'Count: ', @cnt)

With this change, I got this output:

Binary collation with upper: 890 ms. Count: 23070

It is almost five times faster than the original search! And notice that this works no matter the column is varchar or nvarchar. It is still not going be blazingly fast, because there is still an index or table scan. But at least we have reduced the cost of the string scan – and with very simple measures.

I should add the disclaimer that this is not 100 % faithful to the result you will get with a case-insensitive collation, but there are situations where you will get different results. However, these cases are quite much corner cases, and I would suggest that it will be good enough. On the other hand, if the users also require the search to be accent-insensitive so that resume matches résumé, this trick is not an option for you.

Closing note: above I mentioned n-grams as a solution to this problem. This is actually something I have written about, and it is published as a chapter in the book SQL Server MVP Deep Dives, where a number of us MVPs wrote one or two chapters about our favourite topics. We do not make any money from this book, as our royalties go to War Child International, a network of independent organisations, working across the world to help children affected by war. If you want a really fast implementation for this type of searches, you may want to check out that book. But as I said, it is heavy artillery. By the way, the same book also includes an excellent introduction to full-text indexes, written by Robert C. Cain.