Saturday, March 14, 2015

Delete duplicate rows using CTE and ROW_NUMBER in SQL

Ever had duplicate rows in your table because you forgot to use a 'Unique' contraint? Finding it hard to delete the duplicate rows as not all of the columns in the rows are the same? Well you can use two of SQL Server 2005's new features to overcome this problem.

I'm going to create a table with duplicate rows to demonstrate this problem:

CREATE TABLE CustomerEmails
(
  ID int IDENTITY (1, 1),
  CustomerID int,
  Email varchar(100),
  CreatedDate smalldatetime
)

INSERT INTO CustomerEmails
SELECT 135, 'apple@spamme.com', '2010-03-31' UNION ALL
SELECT 267, 'orange@spamme.com','2010-03-21' UNION ALL
SELECT 222, 'pear@spamme.com''2010-03-11' UNION ALL
SELECT 333, 'kiwi@spamme.com''2010-03-01' UNION ALL
SELECT 333, 'lemon@spamme.com', '2010-02-28' UNION ALL
SELECT 333, 'lime@spamme.com''2010-02-21' UNION ALL
SELECT 222, 'grape@spamme.com', '2010-02-11' UNION ALL
SELECT 492, 'banana@spamme.com','2010-02-01';

So customers 222 and 333 are in there multiple times and if I do a lookup to find their email I get multiple results. Someone has been inserting new rows without trying to update existing rows first... the lunatic!

I remember a DBA faffing with this problem for ages. After realising you can't DISTINCT a single column, trying to DELETE WHERE and ID exists IN a GROUP SELECT statement HAVING COUNT(*) > 1 except for the TOP row with an ORDER BY on the date DESC... *yawn*

Well, here is the way for cool people to do it: You create a CTE using a column for a ROW_NUMBER() to count the number of duplicates for each customer:

 WITH Duplicates AS
(
  SELECT
    CustomerID,
    CreatedDate,
    Email,
    ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CreatedDate DESC) as cnt
  FROM
    CustomerEmails
)

The ROW_NUMBER() has a partition on CustomerID which means the count will start again for each customer. I have oredered the count by CreatedDate DESC too which means the row with a count of 1 will be the newest.
Now you simply DELETE from the CTE where the count is higher than 1, removing all the duplicates:

DELETE FROM
  Duplicates
WHERE
  cnt > 1;

Have a look:

SELECT * FROM CustomerEmails;

Great! You no longer have any duplicates! Now quickly whack a UNIQUE contraint on that column before that lunatic developer starts messing with your data again.
ALTER TABLE CustomerEmails ADD UNIQUE (CustomerID);
 

0 comments:

Post a Comment

Twitter Delicious Facebook Digg Stumbleupon Favorites More