De-duplication is the process of deleting duplicate rows from a database.
Duplicate rows is a problem that data analysts frequently deal with.
Before deleting duplicates you have to find them.
In this example, duplicates are identified using a CTE on the Order table.
WITH Cte(FirstName, LastName,
OrderDate, RowNumber) AS (
SELECT FirstName, LastName, OrderDate,
ROW_NUMBER() OVER (
PARTITION BY OrderDate, FirstName, LastName
ORDER BY OrderDate, FirstName, LastName) AS RowNumber
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
)
SELECT *
FROM Cte
WHERE RowNumber > 1
Note: These are not true duplicates, but it gets the point across. This query returns customers that placed more than one order on the same day.
FirstName | LastName | OrderDate | RowNumber |
---|---|---|---|
Elizabeth | Lincoln | 2013-01-10 00:00:00.000 | 2 |
Philip | Cramer | 2013-02-25 00:00:00.000 | 2 |
Howard | Snyder | 2013-07-31 00:00:00.000 | 2 |
Jose | Pavarotti | 2013-10-22 00:00:00.000 | 2 |
Felipe | Izquierdo | 2014-01-19 00:00:00.000 | 2 |
Daniel | Tonini | 2014-03-24 00:00:00.000 | 2 |
Jose | Pavarotti | 2014-04-17 00:00:00.000 | 2 |
The ROW_NUMBER() function sequentially numbers rows that are grouped by FirstName, LastName, and OrderDate. Those with a RowNumber > 1 are duplicates of another row.
Next, the duplicate rows need deletion.
Changing SELECT to DELETE in the above query will accomplish this.
WITH Cte(Id, CustomerId, OrderDate, RowNumber) AS (
SELECT Id, CustomerId, OrderDate,
ROW_NUMBER() OVER (
PARTITION BY CustomerId, OrderDate
ORDER BY OrderDate, CustomerId) AS RowNumber
FROM [Order] O
)
DELETE
FROM Cte
WHERE RowNumber > 1
Note: This query always deletes the second row it finds which may, or may not, be desirable. A manual inspection step of the duplicates is often required.
SELECT A.Id AS OrderId, FirstName, LastName,
A.OrderNumber, A.OrderDate
FROM [Order] A
JOIN (
SELECT C.Id, FirstName, LastName, OrderDate,
COUNT(O.Id) AS Occurences
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
GROUP BY C.Id, OrderDate, FirstName, LastName
HAVING COUNT(O.Id) > 1) AS B
ON A.CustomerId = B.Id AND A.OrderDate = B.OrderDate
OrderId | FirstName | LastName | OrderNumber | OrderDate |
---|---|---|---|---|
163 | Elizabeth | Lincoln | 542540 | 2013-01-10 00:00:00.000 |
164 | Elizabeth | Lincoln | 542541 | 2013-01-10 00:00:00.000 |
209 | Philip | Cramer | 542586 | 2013-02-25 00:00:00.000 |
210 | Philip | Cramer | 542587 | 2013-02-25 00:00:00.000 |
369 | Howard | Snyder | 542746 | 2013-07-31 00:00:00.000 |
370 | Howard | Snyder | 542747 | 2013-07-31 00:00:00.000 |
466 | Jose | Pavarotti | 542843 | 2013-10-22 00:00:00.000 |
467 | Jose | Pavarotti | 542844 | 2013-10-22 00:00:00.000 |
591 | Felipe | Izquierdo | 542968 | 2014-01-19 00:00:00.000 |
592 | Felipe | Izquierdo | 542969 | 2014-01-19 00:00:00.000 |
725 | Daniel | Tonini | 543102 | 2014-03-24 00:00:00.000 |
726 | Daniel | Tonini | 543103 | 2014-03-24 00:00:00.000 |
783 | Jose | Pavarotti | 543160 | 2014-04-17 00:00:00.000 |
784 | Jose | Pavarotti | 543161 | 2014-04-17 00:00:00.000 |
Here we have a list with paired duplicates, one after the other.