2/25/2012 11:43:35 AM
Ahmad Osama -
This one is about removing duplicate records in SQL Server 2005 and above. Consider the table below with duplicate records.
Create table tblDup(Sno int identity, Col1 varchar(50))

INSERT INTO tblDup (Col1)
VALUES ('A'),('A'),('B'),('B'),('C'),('C'),('D'),('D'),('E'),('E') -- illustrative sample values; the original value list was not preserved
The above INSERT statement (using row constructors) is valid only from SQL Server 2008 onwards. The code below can be used to delete the duplicate records from the table.
-- Query 1
;WITH Cte AS
(
    SELECT Sno, Col1, ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Sno) AS RowNum
    FROM tblDup
)
DELETE FROM Cte WHERE RowNum > 1
-- Query 2
DELETE tblsub FROM
(
    SELECT ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Sno) AS RowNum
    FROM tblDup
) tblsub
WHERE RowNum > 1
The table tblDup now has 5 unique rows, as shown below.
Query 1 and Query 2 are similar in performance, as shown in the execution plans below.
Removing duplicates using the above queries can be much slower for large tables with many duplicate records. This is because a lot of transaction-log entries are created for the rows being deleted; triggers and indexes can also make the delete statement crawl. The fastest way is to bulk-insert the unique rows into a new table. Bulk insert is a minimally logged operation under the bulk-logged and simple recovery models. The code below bulk-inserts the unique rows from the table tblDup into the table tblUnique.
WITH CTEUnique AS
(
    SELECT Sno, Col1, ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Sno) AS RowNum
    FROM tblDup
)
SELECT Sno, Col1 INTO tblUnique FROM CTEUnique WHERE RowNum = 1
If you liked this post, do like us on Facebook at http://www.FaceBook.com/SQLServerGeeks
Ahmad Osama (Member since: 1/25/2012 1:03:07 PM)
Ahmad Osama is an MCP Database Administrator/Developer, an avid gamer and a chicken lover. Ahmad started his career in the sales industry, working as a database executive responsible for report writing, application development and basic database administration. In 2008 he joined World Fashion Exchange as a Database Administrator. While in this role he focused on troubleshooting and performance tuning. In 2010 he joined The Perfect Future Technologies, where he has been awarded best newcomer and expert of the year, working as a database administrator on one of the largest political databases in the world.
Ahmad has experience in database administration, SSIS, performance tuning, VLDBs and web development.
When not working on SQL Server, he can be found glued to his Xbox.
Thanks Yashpal - for bulky tables I would suggest using bulk-insert.
Yashpal, I prefer CTEs over temp tables because they need not be dropped explicitly at the end. Nevertheless, composing CTEs is an art.
Nice example of CTE, and it is used a lot in data synchronization for exactly this purpose. But it is usually just the beginning. The de-duplication of data is way more complex than just removing rows from a single, isolated table.
Here are a few examples:
1. The table has a child table. In that case we'd have to update the FKs first. Not a big deal.
2. The PK from our table is used as a reference in an external system (e.g. a DW/BI system). If data cleansing wasn't part of the DW project (for this table at least), the ETL has to be modified to handle de-duplication correctly. This can be a big deal, especially if we don't have a natural key or it is not copied to the external system.
3. Some rows are "similar" and we have to decide whether similar means equal, and then which one to delete. We can have this case when we have duplicate records of the same person, but each row points to a different address. One is correct and the others are obsolete or incorrect. Do we just keep the latest updated one (if we are lucky enough to have a last-updated date)? Or do we keep the history of changes?
Food for thought.
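The "keep the latest" policy from scenario 3 can be sketched with the same ROW_NUMBER() technique the article uses, just ordered by the last-updated date instead of the identity column. The table tblPerson and its columns (PersonName, Address, LastUpdated) are hypothetical names for illustration:

```sql
-- Hypothetical table: duplicate person rows that differ only in address/LastUpdated.
-- Rank rows per person, newest first, and delete everything but the newest.
;WITH Ranked AS
(
    SELECT PersonName, Address, LastUpdated,
           ROW_NUMBER() OVER(PARTITION BY PersonName
                             ORDER BY LastUpdated DESC) AS RowNum
    FROM tblPerson
)
DELETE FROM Ranked
WHERE RowNum > 1   -- keep only the most recently updated row per person
```

Note this silently discards the older addresses; if the history of changes matters, those rows would need to be copied to an archive table before the delete.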
In the real world, duplicates often have some logic behind them. For example, duplicate product in-stock lines can be caused by different manufacturers of an item sending the merchandise. In this case you might want to aggregate the data (SUM(Qty)) with a GROUP BY clause, thus eliminating duplication without deleting any rows of relevant data.
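A minimal sketch of this aggregation approach, assuming a hypothetical stock table tblStock with columns ProductId and Qty:

```sql
-- Collapse duplicate in-stock lines per product by summing quantities
-- into a new table, instead of deleting any source rows
SELECT ProductId, SUM(Qty) AS TotalQty
INTO tblStockAggregated
FROM tblStock
GROUP BY ProductId
```

Like the SELECT ... INTO example in the article, this builds a clean copy rather than modifying the original table in place.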
Leave a comment