SQL Server: Removing Duplicate Records

Who is online?  132 guests and 0 members
home  »  blogs  »  Ahmad Osama  »  SQL Server: Removing Duplicate Records
  Rate This Blog Entry:  register  or  login

Author

AhmadOsama Ahmad Osama (Member since: 1/25/2012 1:03:07 PM)
Ahmad Osama is a MCP Database Administrator/Developer, an avid gamer and a chicken lover. Ahmad started his career in the sales industry working as database executive; responsible for report writing, application development and basic database administration. In 2008 he joined World Fashion Exchange as Database Administrator. While in this role he focused on troubleshooting and performance tuning. In 2010 he joined The Perfect Future Technologies and has been awarded as best new comer and expert of the year working as database administrator on one of largest political database in the world. Ahmad has experience in database administration, SSIS, performance tuning, VLDBs and web development. When not working on SQL Server, he can be found glued to his Xbox.

View Ahmad Osama 's profile

Comments (5)

Yashpal Shah
2/26/2012 5:29:37 PM Yashpal Shah said:
Good methodical implementation. I am just thinking how it would be if I used a local temp table instead in CTE. I think it's worth a try for bulky tables. Thanks.
by
AhmadOsama
2/26/2012 6:29:54 PM Ahmad Osama said:

Thanks Yashpal - for bulky tables I would suggest using bulk-insert. 

by
Raunak
3/3/2012 8:04:41 AM Raunak said:

Yashpal, I prefer CTE over temp tables, because they need not be dropped explicitly at the end. Nevertheless, composing CTE is an art.

by
Dragan
3/3/2012 5:47:26 PM Dragan said:

Nice example of CTE and it is used a lot in data synchrnization for exactly this purpose. But it is usually just the beginning. The de-duplication of data is way more complex than just removing rows from a single isolated table. Here are a few examples:

1. The table has a children table. In that case we'd have to update FKs first. Not a big deal.

2. PK from our table is used as a reference in an external system. (E.g. DW/BI system). If the data cleansing wasn't part of the DW project (for this table at least) the ETL has to be modified to handle de-duplication correctly. This can be a big deal, especially if we don't have a natural key or it is not copied to the external system.

3. Some rows are "similar" and we have to decide if similar means equal and then which one to delete. We can have this case when we have duplicated records of the same person, but each row points to a different address. One is correct and the others are obsolete or incorrect. Do we just keep the latest updated? (If we are lucky enough to have last updated date). Or do we keep the history of changes? Food for thought.

by
Yaniv Feder
3/15/2012 8:46:15 PM Yaniv Feder said:

In the real world , duplicates have some logic beyond it ,For example a duplicate product inStock lines caused by different manufacturers of item sending the merchandise. In this case you might want to aggregate the data ( Sum(qty)) with A group by clause , thus, eliminating duplication without deleting any rows of relevant data.

by

Leave a comment

Name:
  *  
Email:  (your email is kept secure)
 *  
Website:

example: "http://www.SQLServerGeeks.com"

 
Type the characters:
 *
 
   

Training on Microsoft Products & Technologies

Email Subscriptions

   Get the Most Recent Blogs in your inbox
Blogs RSS Feed

Ahmad Osama's latest blog posts

  • A few days back I was resolving a memory issue which caused sql server to stop responding. I was able to figure out the issue by looking into DBCC MemoryStatus output. It was the full text search whic...
  • This is an update to the DB-Migrate ( http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/797/db-migrate-a-powershell-module-to-migrate-databases ) powershell module to migrate database between sq...
  • This blog is part of the series The TSQL Classes: Stored Procedure The links to previous post are given below. http://sqlservergeeks.com/blogs/AhmadOsama/personal/762/the-tsql-classes-stored-procedure...
  • Recently I wrote powershell scripts to move databases, logins and SQL Agent Jobs between instances. Another one I wrote was to fix orphan users. I have now combined all of these different functions in...
  • This blog is part of the series The TSQL Classes: Stored Procedure The link to previous post are given below. http://sqlservergeeks.com/blogs/AhmadOsama/personal/762/the-tsql-classes-stored-procedures...
  • This blog is part of the series The TSQL Classes: Stored Procedure The link to previous post are given below. http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/764/the-tsql-classes-stored-proced...
  • Migrating jobs from one server to other can be easily done with powershell. Let’s look into the script. At first, include reference to the relevant SMO assemblies and then create the source and ...
  • Here's a simple powershell script to shrink log files across database. Shrinking T-log isn't a good practise, however at times we might need to release space to OS in development/UAT servers. The scri...
  • One of the major migration activities is to fix orphan users. Though, it can be easily fixed by sp_change_users_login procedure, however what if you need to do this for multiple servers. A powershell ...
  • This blog is part of the series The TSQL Classes: Stored Procedure. http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/764/the-tsql-classes-stored-procedures-nested-proc-recursion-and-plan-cachin...
  • It’s sometime needed to move database between instances as in when migrating to a different instance. Let’s see how this can be achieved with powershell. This script assumes that both the ...
  • Moving logins between instances is one of the critical migration activities. Let’s see how it can be done in powershell with the help of SMO.When transferring logins, two things need to be looke...
  • Recently I was involved in a project where in non-clustered indexes were to be moved to a different file group for the sake of performance benefit. It has to be done for multiple databases. I came up ...
  • This blog is part of the series The TSQL Classes: Stored Procedure. http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/762/the-tsql-classes-stored-procedures-a-primer In this class I’ll tal...
  • Transaction retry is one of the easiest way to get control of deadlock, however it might get tedious depending on number of deadlock an application is experiencing. The deadlock occurs because of inco...
  • This blog is part of series The TSQL Classes: Stored Procedures. Introduction: To put simply, a stored procedures is set of queries put together primarily to reuse a complex/iterative business logic. ...
  • Cyclic deadlock occurs when a process say P 1 is waiting on a process say P 2 to release a lock and process P 2 is waiting on P 1 to release a lock. The two process involved form a cyclic dependency r...
  • I had a scenario where in I had to automate index maintenance in a mirroring environment. Database mirroring requires a database to be in full recovery mode and thus index rebuild results in huge logs...
  • Today, I came across a surprising situation when running sp_spaceused on one of my dummy table “Tblemployee”. The sp_spaceused told me that it has over 200,000 records, however it returned...
  • Recently a developer came to me seeking my help in generating dates between a date range for a report. Being a .Net developer he said that one solution he had is to do row by row processing either by ...
  • When posting my last blog on sequence I realized that sequences can also be used to generate dummy data for test tables. On further analysis I found that it’s faster than the conventional while ...
  • This post compares the performance of adding an identity column to a table using Identity property versus by using sequence objects. The below code creates a table and fills it with demo data. DROP TA...
  • This blog is the continuation of my last blog http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/602/sql-server-2012-adding-not-null-columns-to-an-existing-table where in I wrote about a new feat...
  • Most of you must have come across the pain of adding a not null column with a default value to an existing big table. It takes minutes to add columns. I recently found out that this problem has been r...
  • This is another good feature of DTA. There are lot of different scripts available over internet to find unused indexes, however why not use the SQL Servers very own DTA to figure this out. It’s ...
Blogs RSS Feed

Latest community blog posts

  • A few days back I was resolving a memory issue which caused sql server to stop responding. I was able to figure out the issue by looking into DBCC MemoryStatus output. It was the full text search whic...
  • This is an update to the DB-Migrate ( http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/797/db-migrate-a-powershell-module-to-migrate-databases ) powershell module to migrate database between sq...
  • This blog is part of the series The TSQL Classes: Stored Procedure The links to previous post are given below. http://sqlservergeeks.com/blogs/AhmadOsama/personal/762/the-tsql-classes-stored-procedure...
  • Reporting With SSRS Part 1 : In this serires of blogs, i will try to introduce you with some of the key features of the tool which will help you get on board to become a proficient report developer.If...
  • Hi Folks, I am sure that all of you would be aware about the role of Identity column in a table, i.e., it is column which keeps on incrementing without supplying the value explicitly during insertion....
  • Recently I wrote powershell scripts to move databases, logins and SQL Agent Jobs between instances. Another one I wrote was to fix orphan users. I have now combined all of these different functions in...
  • Hi SQL Geeks, Here are the blog posts by Ahmad Osama for the month of April. We would expect more of his informative blogs this month. Happy Learning! If you like our blogs do like us on Facebook at h...
  • This blog is part of the series The TSQL Classes: Stored Procedure The link to previous post are given below. http://sqlservergeeks.com/blogs/AhmadOsama/personal/762/the-tsql-classes-stored-procedures...
  • As DBAs we are always challenged of unexpected size growth of log or data files and need to shrink files based on available free space in each file. Most of the times we are stuck and do a lot of scri...
  • One of the major migration activities is to fix orphan users. Though, it can be easily fixed by sp_change_users_login procedure, however what if you need to do this for multiple servers. A powershell ...
  • This blog is part of the series The TSQL Classes: Stored Procedure. http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/764/the-tsql-classes-stored-procedures-nested-proc-recursion-and-plan-cachin...