Dofactory.com
Dofactory.com
Earn income with your data and sql skills
Sign up and we'll send you the best freelance opportunities straight to your inbox.
We're building the largest freelancing marketplace for people like you.
By adding your name & email you agree to our terms, privacy and cookie policies.

SQL Delete Duplicates

De-duplication is the process of deleting duplicate rows from a database.

Duplicate rows is a problem that data analysts frequently deal with.

Example

#

Before deleting duplicates you have to find them.
In this example, duplicates are identified using a CTE on the Order table.

WITH Cte(FirstName, LastName, 
         OrderDate, RowNumber) AS (
  SELECT FirstName, LastName, OrderDate,
         ROW_NUMBER() OVER (
           PARTITION BY OrderDate, FirstName, LastName
           ORDER BY OrderDate, FirstName, LastName) AS RowNumber
   FROM [Order] O
   JOIN Customer C ON C.Id = O.CustomerId
) 

SELECT * 
  FROM Cte 
  WHERE RowNumber > 1

Note: These are not true duplicates, but it gets the point across. This query returns customers that placed more than one order on the same day.

Result:  7 records
FirstName LastName OrderDate RowNumber
Elizabeth Lincoln 2013-01-10 00:00:00.000 2
Philip Cramer 2013-02-25 00:00:00.000 2
Howard Snyder 2013-07-31 00:00:00.000 2
Jose Pavarotti 2013-10-22 00:00:00.000 2
Felipe Izquierdo 2014-01-19 00:00:00.000 2
Daniel Tonini 2014-03-24 00:00:00.000 2
Jose Pavarotti 2014-04-17 00:00:00.000 2

The ROW_NUMBER() function sequentially numbers rows that are grouped by FirstName, LastName, and OrderDate. Those with a RowNumber > 1 are duplicates of another row.


De-duplication

#

Next, the duplicate rows need deletion.
Changing SELECT to DELETE in the above query will accomplish this.

WITH Cte(Id, CustomerId, OrderDate, RowNumber) AS (
  SELECT Id, CustomerId, OrderDate,
         ROW_NUMBER() OVER (
           PARTITION BY CustomerId, OrderDate
           ORDER BY OrderDate, CustomerId) AS RowNumber
   FROM [Order] O
) 

DELETE
  FROM Cte
 WHERE RowNumber > 1

Note: This query always deletes the second row it finds which may, or may not, be desirable. A manual inspection step of the duplicates is often required.

Result:  7 records deleted

More Examples

List original and duplicate rows

The query below lists duplicate pairs of Order records.
This allows a data analyst to decide which row in each pair to delete.
SELECT A.Id AS OrderId, FirstName, LastName, 
       A.OrderNumber, A.OrderDate
  FROM [Order] A
  JOIN (

        SELECT C.Id, FirstName, LastName, OrderDate,
               COUNT(O.Id) AS Occurences
          FROM [Order] O
          JOIN Customer C ON C.Id = O.CustomerId
         GROUP BY C.Id, OrderDate, FirstName, LastName
        HAVING COUNT(O.Id) > 1) AS B

    ON A.CustomerId = B.Id AND A.OrderDate = B.OrderDate
Result:  14 records
OrderId FirstName LastName OrderNumber OrderDate
163 Elizabeth Lincoln 542540 2013-01-10 00:00:00.000
164 Elizabeth Lincoln 542541 2013-01-10 00:00:00.000
209 Philip Cramer 542586 2013-02-25 00:00:00.000
210 Philip Cramer 542587 2013-02-25 00:00:00.000
369 Howard Snyder 542746 2013-07-31 00:00:00.000
370 Howard Snyder 542747 2013-07-31 00:00:00.000
466 Jose Pavarotti 542843 2013-10-22 00:00:00.000
467 Jose Pavarotti 542844 2013-10-22 00:00:00.000
591 Felipe Izquierdo 542968 2014-01-19 00:00:00.000
592 Felipe Izquierdo 542969 2014-01-19 00:00:00.000
725 Daniel Tonini 543102 2014-03-24 00:00:00.000
726 Daniel Tonini 543103 2014-03-24 00:00:00.000
783 Jose Pavarotti 543160 2014-04-17 00:00:00.000
784 Jose Pavarotti 543161 2014-04-17 00:00:00.000

Here we have a list with paired duplicates, one after the other.


You may also like



Last updated on Dec 21, 2023

Earn income with your data and sql skills
Sign up and we'll send you the best freelance opportunities straight to your inbox.
We're building the largest freelancing marketplace for people like you.
By adding your name & email you agree to our terms, privacy and cookie policies.