Join syntax when specifying join types.
Hi everybody. This is my first blog, so be indulgent about me as a first time author.
The idea of this article spurred to me about a question i saw in a forum asking to explained the difference between an Inner loop join and a loop join. I also want to point out when it is valuable to use the loop join albeit I recommand to look for some alternatives before using this optimization method.
First some basis about join syntax is necessary. «Inner Join» or «Join» clause specified alone, are synonyms. It is mandatory to specify the type of join for other type of join like «left», «right» or «full» joins which all have in common a specific way to deal with missing data on one side of the join. Cross Join is a way to express a Cartesian join.
I think SQL Server granted to developers the shortcut to omit the type of join for Inner join because this is the most common type of join and because for all other types of join the type of join is required.
Forcing the join algorithm in the syntax.
Aside the fact that two different type of join imply some differences in results expected, there are three methods frenquently used by SQL Server for performing all the join types.
It can be loop join, hash join, or merge join. The hint clauses «loop», «hash» and «merge» allows you to force SQL Server to choose the join algorithm over Inner, Left, Right and Full joins.
The hint clause goes between the type of join and before the «join» keword. I guess that for ease of syntax processing it becomes mandatory to specify the word inner when the type of join method is specified. Here are some examples of expressing joins, without hints and with hints.
declare @t table (i int)
declare @t2 table (i int)
-- samples when Join algorithm is not specified,
-- for all the type of joins
-- let SQL Server guess the best algorithm
select * from @t t join @t2 t2 on t2.i = t.i
select * from @t t inner join @t2 t2 on t2.i = t.i -- synonym to the previous query
select * from @t t left join @t2 t2 on t2.i = t.i
select * from @t t right join @t2 t2 on t2.i = t.i
select * from @t t full join @t2 t2 on t2.i = t.i
select * from @t t cross join @t2 t2
-- Samples with forcing the type of join algorithm (any of loop, hash, merge)
select * from @t t inner loop join @t2 t2 on t2.i = t.i -- inner is then mandatory
select * from @t t left hash join @t2 t2 on t2.i = t.i
select * from @t t right hash join @t2 t2 on t2.i = t.i
select * from @t t left merge join @t2 t2 on t2.i = t.i
An exception to this are Cross joins which are implicitely performed as a nested loop join without index because this is the semantic of the Cross join. For this type of join it is meaningless to specify other join algorithm.
SQL Server determines mostly all of the time, the best type of algorithm to perform for joins, depending on indexes available and information it has on data distribution (data distribution statistics) that he maintains. It compares this info against the query like search arguments, join conditions, and columns returned. He chooses the best algorithm depending on anticipated costs.
Join hints are a possible solution to force SQL Server to do adopt a specific join algorithm. When you force a inner loop join, you mandates SQL in a way that, for each row produced before in the query, he must get right away the matching row that comes from the next table. It is like a nested loop in procedural language where the outer loop deals from rows with the first table, and the inner loop gets rows matching each row of the outer loop. You are better to have a suitable index to do this type of join!
This is my method of choice to force the use of an index without specifying its name in the query, which is not bad from maintainability standpoint. Why? When you force such a join, the query cost becomes way too expensive to avoid the use the available index. So SQL Server uses it! That doesn’t means that this is a great solution. You have to verify that it is indeed cheaper to force the algorithm this way.
I was a great fan at forcing loop join in the past (SQL7 SQL2000). Now since SQL2005, it is better to let SQL Server decides, and watch if it takes good decision. If few rows qualify before the join clause, usually it is better to use the index, if there is too many, another algorithm could be better.
SQL Server is usually good at creating algorithm to solve your queries (called access plans), provided that data distribution statistics are accurate. Join algorithm are amongs operators of the access plans. Distribution statistics are by default automatically (and approximately) maintained but needed to be recalculated periodically (in a maintenance plan or with maintenance script as YourSqlDba). There is an option in SQL Server management studio to display query access plans, and some information in book online about how it undestand them. Search “logical and physical operators reference” in SQL Server books online. It will display a table of all operators used in access plan. From then, have a look at each type of join. Usually join operators are the most expensive part of a query, so it is valuable to understand them.
Loop join are the natural way of procedural programmers to perform a join. In a loop, you get a row of one kind, and then you seek the matching one in another table hoping that the correct index to optimize the search exists. In that case the join in not really performed on SQL Server, it is performed on the client side by a cursor loop and several single queries to get the data to match (inner loop). In fact such a practice produces one query for each source row. Much object-relational mapping tools do that. It is not a great manual join method!
This is better to let SQL Server perform it, because only one query is sent, optimize and the overall cost of the query is more visible. The costly part of a loop join is how SQL Server finds the row in the inner part of the loop. If no index is suitable for seeking the row, it is very costly. I assume that the table upon which the «seek» happen is of reasonable size to make this assertion. This is not the case for a few pages table. Anyway even if an index reduces «seeking» work there is still some processing traversing index tree.
A hash join, is a more complex approach. You store a set or rows in memory and implement a hash key that translate to an index or a pointer to them (build phase). This index is tailored from the join condition. The hash join method is only suitable to join that use euqality condition in the ON clause of the join. The costly part of this join is memory allocation of the build phase. With the second set of row you use this same hash algorithm to find out the matching row in memory (probe phase). Usually, SQL Server does that if no index is suitable, and if memory consumption to perform it is not too great. I won’t enter into some management details of the «magic» of finding a row just by mathematic and table and indices, but many of us know what is a hash table access (for the others look in wikipedia).
The last algorithm is merge join. Merging is familiar to old-timers programmers that were used to match input data by previously sorting it and then doing the match in a parallel a sequential read of both inputs. It was before the first index access method where designed, or when it was less costly that having a permanent index to be managed. The costly part of merge join is sorting. Sometimes some index exists that help getting the data in the right order, without sorting one of the inputs.
Now let me tell you a wise advice. SQL Server is much faster at doing this than trying to implement it yourself, and is save a lot a query parsing and network traffic. Also try to run an update statistics statement against the tables involved in the query before trying to play with join hints. If updating data distributions tatistics solves the problem, ask your DBA if update statistics could be done on a regular basis on the databases. Join hints are there for the rare times SQL Server miss an optimization on a specific query. Also forcing a type of join means the this type of join could be less favorable to perform some other searches with different search arguments with the same query. You need to have an good idea of data distribution values before doing that. Loop join are good if data distribution is uniform and search arguments usually present few rows, in other words, loop join are good when the number of rows selected by search arguments are few in most of the tables of the query (to make a rough approximation).