How To Delete Duplicate Rows in MySQL
In the previous tutorial, we showed you how to find duplicate values in a table. Once the duplicate rows are identified, you may want to delete them to clean up your data.
Prepare sample data
The following script creates the contacts table and inserts sample data into it for the demonstration.
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts (
id INT PRIMARY KEY AUTO_INCREMENT,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(255) NOT NULL
);
INSERT INTO contacts (first_name,last_name,email)
VALUES ('Carine ','Schmitt','carine.schmitt@verizon.net'),
('Jean','King','jean.king@me.com'),
('Peter','Ferguson','peter.ferguson@google.com'),
('Janine ','Labrune','janine.labrune@aol.com'),
('Jonas ','Bergulfsen','jonas.bergulfsen@mac.com'),
('Janine ','Labrune','janine.labrune@aol.com'),
('Susan','Nelson','susan.nelson@comcast.net'),
('Zbyszek ','Piestrzeniewicz','zbyszek.piestrzeniewicz@att.net'),
('Roland','Keitel','roland.keitel@yahoo.com'),
('Julie','Murphy','julie.murphy@yahoo.com'),
('Kwai','Lee','kwai.lee@google.com'),
('Jean','King','jean.king@me.com'),
('Susan','Nelson','susan.nelson@comcast.net'),
('Roland','Keitel','roland.keitel@yahoo.com');
Note that you can execute this script to recreate the test data after you execute a DELETE statement.
This query returns data from the contacts table:
SELECT * FROM contacts
ORDER BY email;
The following query returns the duplicate emails in the contacts table:
SELECT
email, COUNT(email)
FROM
contacts
GROUP BY
email
HAVING
COUNT(email) > 1;
The query returns four rows, one for each email that appears more than once in the table.
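The query above reports only the duplicated email values. To see the full rows that share those emails, including their id values, a sketch along the following lines may help. It assumes the contacts table and sample data created earlier:

```sql
-- List every row whose email appears more than once in contacts.
SELECT
    id,
    first_name,
    last_name,
    email
FROM
    contacts
WHERE
    email IN (
        SELECT email
        FROM contacts
        GROUP BY email
        HAVING COUNT(*) > 1
    )
ORDER BY
    email,
    id;
```

Sorting by email and then by id places the copies of each email next to each other, which makes it easier to decide which id to keep.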
A) Delete duplicate rows using the DELETE JOIN statement
MySQL provides you with the DELETE JOIN statement that allows you to remove duplicate rows quickly. The following statement deletes duplicate rows and keeps the row with the highest id for each email:
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
This query references the contacts table twice; therefore, it uses the table aliases t1 and t2. The output is:
Query OK, 4 rows affected (0.10 sec)
It indicates that four rows have been deleted. You can execute the query that finds duplicate emails again to verify the deletion:
SELECT
email,
COUNT(email)
FROM
contacts
GROUP BY
email
HAVING
COUNT(email) > 1;
The query returns an empty set, which means that the duplicate rows have been deleted. Let's verify the data in the contacts table:
SELECT * FROM contacts;
The rows with id 2, 4, 7, and 9 have been deleted.
In case you want to delete duplicate rows and keep the lowest id, you can use the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
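Because a DELETE cannot be undone once it is committed, it can be useful to preview the rows a DELETE JOIN would remove. One possible approach, sketched here for the keep-the-lowest-id variant, is to run the same self-join as a SELECT first:

```sql
-- Preview: rows that have an earlier row (lower id) with the same email.
-- These are the rows the keep-the-lowest-id DELETE JOIN would remove.
SELECT DISTINCT
    c1.id,
    c1.email
FROM
    contacts c1
INNER JOIN contacts c2
    ON c1.email = c2.email
    AND c1.id > c2.id
ORDER BY
    c1.id;
```

SELECT DISTINCT is used because a row with several earlier copies would otherwise appear once per matching copy.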
Note that you can execute the script that creates the contacts table again to test this DELETE statement. On the freshly created sample data, it deletes the rows with id 6, 12, 13, and 14, leaving one row for each email.
B) Delete duplicate rows using an intermediate table
The following shows the steps for removing duplicate rows using an intermediate table:
- Create a new table with the same structure as the original table from which you want to delete duplicate rows.
- Insert distinct rows from the original table into the intermediate table.
- Drop the original table and rename the intermediate table to the original table's name.
Step 1. Create a new table whose structure is the same as the original table:
CREATE TABLE source_copy LIKE source;
Step 2. Insert distinct rows from the original table into the new table (this form of GROUP BY requires the ONLY_FULL_GROUP_BY SQL mode to be disabled):
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col; -- column that has duplicate values
Step 3. Drop the original table and rename the intermediate table to the original one:
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
For example, the following statements delete rows with duplicate emails from the contacts table:
-- step 1
CREATE TABLE contacts_temp
LIKE contacts;
-- step 2
INSERT INTO contacts_temp
SELECT *
FROM contacts
GROUP BY email;
-- step 3
DROP TABLE contacts;
ALTER TABLE contacts_temp
RENAME TO contacts;
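With the ONLY_FULL_GROUP_BY SQL mode enabled, which is the default in recent MySQL versions, the SELECT * with GROUP BY email in step 2 is rejected because the other columns are neither aggregated nor grouped. A possible alternative for step 2 is sketched below. It keeps the row with the lowest id for each email, and the alias first_rows is only illustrative:

```sql
-- step 2 (alternative): copy one row per email, the one with the lowest id
INSERT INTO contacts_temp
SELECT c.*
FROM contacts c
INNER JOIN (
    -- the lowest id for each distinct email
    SELECT MIN(id) AS id
    FROM contacts
    GROUP BY email
) first_rows
    ON first_rows.id = c.id;
```

Using MAX(id) instead of MIN(id) would keep the most recently inserted row for each email.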
C) Delete duplicate rows using the ROW_NUMBER() function
Note that the ROW_NUMBER() function has been supported since MySQL version 8.0.2, so you should check your MySQL version before using the function.
The following statement uses the ROW_NUMBER() function to assign a sequential integer number to each row. If the email is a duplicate, the row number will be greater than one.
SELECT
id,
email,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY email
) AS row_num
FROM contacts;
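In the statement above, ORDER BY email sorts rows that all share the same email within a partition, so MySQL is free to number them in any order. If it matters which copy gets row number 1, one option is to order each partition by id, as in this sketch:

```sql
-- Number the rows of each email by ascending id,
-- so the row with the lowest id always gets row_num = 1.
SELECT
    id,
    email,
    ROW_NUMBER() OVER (
        PARTITION BY email
        ORDER BY id
    ) AS row_num
FROM
    contacts;
```

With this ordering, filtering on row_num > 1 selects every copy except the one with the lowest id. Using ORDER BY id DESC would keep the highest id instead.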
The following statement returns the list of ids of the duplicate rows:
SELECT
id
FROM (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY email) AS row_num
FROM
contacts
) t
WHERE
row_num > 1;
You can then delete the duplicate rows from the contacts table using the DELETE statement with a subquery in the WHERE clause:
DELETE FROM contacts
WHERE
id IN (
SELECT
id
FROM (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY email) AS row_num
FROM
contacts
) t
WHERE row_num > 1
);
MySQL issued the following message:
4 row(s) affected
In this tutorial, you have learned how to delete duplicate rows in MySQL by using the DELETE JOIN statement, an intermediate table, or the ROW_NUMBER() function.