This one is about SQL Server removing duplicate records in SQL Server 2005 and above. Consider the below table with duplicate records.
Create table tblDup(Sno int identity, Col1 varchar(50)) GO INSERT into tbldup values('Value1'),('Value1'),('Value1'), ('Value2'),('Value2'),('Value2'), ('Value3'),('Value3'),('Value4'),('Value5') GO
The above insert statement is valid only from sql server 2008 onwards. The below code can be used to delete duplicate records from the above table.
- Query 1: Using Row_Number with CTE
;WITH CTE AS ( SELECT sno,col1,ROW_NUMBER() OVER(partition by col1 Order by sno) As RowNum FROM tblDup ) Delete from Cte Where RowNum>1
- Query 2: Using Row_Number in a subquery
Delete tblsub from ( Select ROW_NUMBER() OVER(partition by col1 Order by sno) As RowNum FROM tblDup ) tblsub where rownum>1
The table tbldup now as 5 unique rows has shown below.
Query1 and Query 2 are similar in performance as shown in the below execution plan.
Removing duplicate using above queries can be much slower for large tables having lot of duplicate records. It’s because, lot of T-log entries will be created for the rows being deleted also the triggers and indexes can make delete statement to crawl.The fastest way is to bulk-insert the unique rows into a new table. Bulk-Insert is a minimally logged operation under Bulk-Logged and simple recover model. The code below bulk-inserts the unique rows from the table tbldup into the table tblUnique.
WITH CTEUnique AS ( Select Sno,col1,ROW_NUMBER() OVER(partition by col1 Order by sno) As RowNum FROM tblDup ) SELECT sno,col1 INTO tblunique FROM CTEUnique Where RowNum=1