Removing Duplicate Rows In T-SQL
Been doing some interviewing lately and finding that people think they are really good at T-SQL (claiming they are a 6 out of 10), however almost none of them write their T-SQL by hand. If you are using the WYSIWYG query builder you are a 3 out of 10, or less. Here is one of my favorite interview problems that comes up in real life (my life) a lot.
Given a table like this:
CREATE T1 (ID int PRIMARY KEY, Name varchar(50))
That has duplicate rows in the name column, write a query to "clean" the duplicates leaving only on row per name. So if "Tom Smith" shows up in two rows (ID: 2, and 4), the query would delete either rows with ID 2 or 4 and leave the other one. My favorite answer is:
DELETE T1
WHERE NOT T1.ID IN (SELECT MIN(T1.ID) FROM T1 GROUP BY Name)
This includes an IN clause where is terrible for performance -- however the best solution I know. Most people try it with a cursor which is much harder.
Note: This is how you "clean" a table of duplicates before you add a unique index to the name column. A typical scenario when you have a lookup table that has some extra data and you are normalizing the database.
{6230289B-5BEE-409e-932A-2F01FA407A92}
Comments
Post a Comment