How to Delete Duplicate records from SQL table

Good Evening friends..Today I am going to talk about an interesting topic in SQL world. If you Google this topic you will find multiple ways of deleting duplicate data from the table. I am not going to write something very new but yes I will talk about performance issue while deleting duplicate data using traditional approach.

Deleting duplicate rows from SQL 2000:- I have created a table DuplicateData and inserted few duplicate rows based on EmpId.


Create table DuplicateData(EmpId int,Name varchar(100))-->Table Creation


insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(5,'ankit')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(1,'Neeraj')
insert into DuplicateData values(1,'Neeraj')

insert into DuplicateData values(1,'Neeraj')

Traditional way of deleting duplicate rows from table in SQL 2000 :-If we run the below batch in query analyzer, it will remove all the duplicate values from table DuplicateData. This query is "OK" if you are doing it in test environment or on dummy data. But if you have millions of record or large data, this query would be the worst query in terms of performance. It can take few hours or may be days depending on the volume of the data in the intended table.

Reason:- Below query is a correlated sub query which will execute for each EmpId present in the table and check if count for each EmpId is > 1 then delete each record one by one. That is the reason for its slow performance.


set rowcount 1

delete from DuplicateData where (select count(EmpId) from DuplicateData a where a.EmpId=DuplicateData.EmpId)>1
while @@rowcount>0
delete from DuplicateData where (select count(EmpId) from DuplicateData a where a.EmpId=DuplicateData.EmpId)>1

set rowcount 0

We can create a stored procedure to overcome this performance issue. Below is the example.


declare @tmp table(empid int,cnt int, rowid int identity)--> declare table variable


declare @maxcounter as integer--> Declaration of variables

declare @mincounter as integer
declare @rowcnt as integer
declare @empid as int-->End of Declaration

insert into @tmp(empid,cnt)-->Inserting duplicate empid along with no of duplicate entries

select empid,count(empid) from duplicatedata 
group by empid having count(empid)>1

select @mincounter=min(rowid),@maxcounter=max(rowid) from @tmp -->assigning minimum and maximum rowid to variables.

while @mincounter <=@maxcounter

begin
select @rowcnt=cnt,@empid=empid from @tmp where rowid=@mincounter 
set @rowcnt =@rowcnt-1
set rowcount @rowcnt
delete from duplicatedata where empid=@empid
set rowcount 0
set @mincounter=@mincounter +1
end

lets understand the above while loop, we have all duplicate records in @tmp table with no of duplicate entries. Now we will loop over each record present in @tmp table, for that reason we have assigned minimum and maximum rowid to variables(@maxcounter, @mincounter).


In While loop body, we are assigning "no of duplicate records" values to variable @rowcnt and empid to variable @empid

In the next statement we are setting @rowcnt=@rowcnt-1, we are doing this because this variable contains no of duplicate records for a particular empid, but we want to keep one empid from the duplicate ones.
in next statement we are setting up the rowcount with one value lesser than the no of duplicate records for that particular empid.

Next statement reset the rowcount with 0 and last statement is increasing the @mincounter value to fetch next record from @tmp table.

Deleting duplicate rows from SQL 2005/2008:- From version 2005 onwards, we have few rank functions available which can be used for this purpose. Let me show you an example of a rank function for deleting duplicate data from a table.

withCTE_dupData
as(Select ROW_NUMBER() over(partition by EmpId order by EmpId) as rno,EmpId from DuplicateData
)delete from CTE_dupData where rno>1

You can see that how rank function has reduced the line of code for deleting duplicate data from a table. But this query will not run on SQL server 2000 and its earlier versions. We are using here with clause that is also a new concept "Common table expression" which was introduced in SQL 2005 and later versions.

Row_Number() :- Row number is a rank function that was introduced in SQL 2005 and later versions.
With clause(Common table expression):- which was introduced in SQL 2005 and later versions.
we will talk about rank functions and CTE(common table expression) in my next article so that you can understand the above query easily.

I hope this article will help you in deleting duplicate records from a table in SQL 2000 without any performance issue.

No comments: