SQL Find Duplicates

Data entry errors are a common source of duplicate records in a database.

Another source is merging data from multiple sources into a single database.

Finding duplicates is the first step in cleaning up the data.

There are different SQL techniques to locate these duplicates.

Example

#

Find the duplicates in the Order table.

SELECT FirstName, LastName, OrderDate,
       COUNT(O.Id) AS Occurences
  FROM [Order] O
  JOIN Customer C ON C.Id = O.CustomerId
 GROUP BY OrderDate, FirstName, LastName
HAVING COUNT(O.Id) > 1

Note: These are not true duplicates, but it gets the point across. These 'duplicates' are customers that placed more than 1 order on the same day.

Result:  7 records
FirstName LastName OrderDate Occurences
Elizabeth Lincoln 2013-01-10 00:00:00.000 2
Philip Cramer 2013-02-25 00:00:00.000 2
Howard Snyder 2013-07-31 00:00:00.000 2
Jose Pavarotti 2013-10-22 00:00:00.000 2
Felipe Izquierdo 2014-01-19 00:00:00.000 2
Daniel Tonini 2014-03-24 00:00:00.000 2
Jose Pavarotti 2014-04-17 00:00:00.000 2

As you can see, with COUNT, GROUP BY and HAVING we can confirm that duplicates exist.


How to find Duplicates

#

Next, we want the actual rows that are duplicates. This is accomplished by joining the above query with a second one, like so:

SELECT A.Id AS OrderId, FirstName, LastName, 
       A.OrderNumber, A.OrderDate
  FROM [Order] A
  JOIN (

        SELECT C.Id, FirstName, LastName, OrderDate,
               COUNT(O.Id) AS Occurences
          FROM [Order] O
          JOIN Customer C ON C.Id = O.CustomerId
         GROUP BY C.Id, OrderDate, FirstName, LastName
        HAVING COUNT(O.Id) > 1) AS B

    ON A.CustomerId = B.Id AND A.OrderDate = B.OrderDate
Result:  14 records
OrderId FirstName LastName OrderNumber OrderDate
163 Elizabeth Lincoln 542540 2013-01-10 00:00:00.000
164 Elizabeth Lincoln 542541 2013-01-10 00:00:00.000
209 Philip Cramer 542586 2013-02-25 00:00:00.000
210 Philip Cramer 542587 2013-02-25 00:00:00.000
369 Howard Snyder 542746 2013-07-31 00:00:00.000
370 Howard Snyder 542747 2013-07-31 00:00:00.000
466 Jose Pavarotti 542843 2013-10-22 00:00:00.000
467 Jose Pavarotti 542844 2013-10-22 00:00:00.000
591 Felipe Izquierdo 542968 2014-01-19 00:00:00.000
592 Felipe Izquierdo 542969 2014-01-19 00:00:00.000
725 Daniel Tonini 543102 2014-03-24 00:00:00.000
726 Daniel Tonini 543103 2014-03-24 00:00:00.000
783 Jose Pavarotti 543160 2014-04-17 00:00:00.000
784 Jose Pavarotti 543161 2014-04-17 00:00:00.000

Here we have a list with paired duplicates, one after the other. This allows us to make decisions about which rows to keep and which ones to delete.

Preventing duplicates

One way to prevent duplicates from entering a database is to use constraints. Relevant constraints are Identity, UNIQUE columns, and UNIQUE Indexes. Click the respective links for details.

Note: SQL Server creates a UNIQUE index on a column that is declared UNIQUE. So, the end-result of UNIQUE columns and UNIQUE Indexes are the same.


More Examples

SQL Find Duplicates with ROW_NUMBER

Duplicates can also be found by using the ROW_NUMBER Window function.

CUSTOMER
Id
FirstName
LastName
City
Country
Phone
ORDER
Id
OrderDate
OrderNumber
CustomerId
TotalAmount
Problem: Find duplicate rows in the Order table.
WITH Cte(FirstName, LastName, 
         OrderDate, RowNumber) AS (
  SELECT FirstName, LastName, OrderDate,
         ROW_NUMBER() OVER (
		     PARTITION BY OrderDate, FirstName, LastName
             ORDER BY OrderDate, FirstName, LastName) AS RowNumber
    FROM [Order] O
    JOIN Customer C ON C.Id = O.CustomerId
) 

SELECT * 
  FROM Cte 
 WHERE RowNumber > 1

Note: This returns the same results as the first query on this page.

Result:  7 records
FirstName LastName OrderDate Occurences
Elizabeth Lincoln 2013-01-10 00:00:00.000 2
Philip Cramer 2013-02-25 00:00:00.000 2
Howard Snyder 2013-07-31 00:00:00.000 2
Jose Pavarotti 2013-10-22 00:00:00.000 2
Felipe Izquierdo 2014-01-19 00:00:00.000 2
Daniel Tonini 2014-03-24 00:00:00.000 2
Jose Pavarotti 2014-04-17 00:00:00.000 2

You may also like



Guides


vsn 3.1