Pages

Tuesday, May 19, 2009

TOP 5 ways to delete Duplicate Records in a Table

Today I got an email asking how to delete duplicate records in a table? This is very common and very interesting question. The perfect answer to this question depends on the following points:
1) Volume of the data
2) Downtime of the system
3) Dependency on the table
4) Restrictions on writing the code.

I am sure every database developer deals with “Delete Duplicate Records” issue at least once in a lifetime.

There are so many ways we can eliminate the duplicate data from a table. In this article I would like to explain various ways to delete the duplicate records.

--Create Demo Table
CREATE TABLE #Employee
(
EMP_ID INT,
FIRST_NAME VARCHAR(30),
LAST_NAME VARCHAR(30)
)

--Insert Dummy Data
INSERT INTO #Employee Values(1,'Stefan','Vachev');
INSERT INTO #Employee Values(2,'Michael','Allen');
INSERT INTO #Employee Values(3,'Glenn','Jhonson');
INSERT INTO #Employee Values(4,'Ray','Muran');
INSERT INTO #Employee Values(1,'Stefan','Vachev');
INSERT INTO #Employee Values(2,'Michael','Allen');
INSERT INTO #Employee Values(3,'Glenn','Jhonson');
INSERT INTO #Employee Values(4,'Ray','Muran');
INSERT INTO #Employee Values(5,'Vijaya','Kadiyala');
INSERT INTO #Employee Values(5,'Vijaya','Kadiyala');
INSERT INTO #Employee Values(6,'Alexi','Kal');
INSERT INTO #Employee Values(5,'Vijaya','Kadiyala');


Method 1: Using Temporary Table based on all the columns

select * into #DuplicateEmployee from #Employee where 1 = 0
insert #DuplicateEmployee select distinct * from #Employee
begin tran
delete #Employee
insert #Employee select * from #DuplicateEmployee
commit tran
drop table #DuplicateEmployee

This is method is used when all the columns are duplicated.

Method 2: Using Temporary Table based on set of columns

select * into #DuplicateEmployee from #Employee where 1 = 0
insert #DuplicateEmployee select EMP_ID,FIRST_NAME,LAST_NAME from #Employee group by EMP_ID,FIRST_NAME,LAST_NAME having count(*) > 1
begin tran
delete #Employee from #DuplicateEmployee
where #Employee.EMP_ID = #DuplicateEmployee.EMP_ID
and #Employee.FIRST_NAME = #DuplicateEmployee.FIRST_NAME
and #Employee.LAST_NAME = #DuplicateEmployee.LAST_NAME
insert #Employee select * from #DuplicateEmployee
commit tran
drop table #DuplicateEmployee


Method 3: Using Row Count based on set of columns

set rowcount 1
select 1
while @@rowcount > 0
delete #Employee
where 1 < (select count(*)
from #Employee a2
where #Employee.EMP_ID = a2.EMP_ID
and #Employee.FIRST_NAME = a2.FIRST_NAME
and #Employee.LAST_NAME = a2.LAST_NAME)
set rowcount 0

Method 4: Using Analytical Functions based on set of columns

WITH #DeleteEmployee AS (
SELECT ROW_NUMBER ( )
OVER ( PARTITION BY Emp_ID, First_Name, Last_Name ORDER BY Emp_ID ) AS RNUM FROM #Employee
)
DELETE FROM #DeleteEmployee WHERE RNUM > 1


Method 5: Using New Identity Column based on set of columns

Alter table #Employee add UNIQ_ID int IDENTITY(1,1)

DELETE FROM #Employee
WHERE UNIQ_ID < (SELECT MAX(UNIQ_ID) FROM #Employee a2
where #Employee.EMP_ID = a2.EMP_ID
and #Employee.FIRST_NAME = a2.FIRST_NAME
and #Employee.LAST_NAME = a2.LAST_NAME)

Alter table #Employee DROP column UNIQ_ID

1 comment:

Bob said...

perfect solutions