Removing Duplicate Rows In T-SQL

Been doing some interviewing lately and finding that people think they are really good at T-SQL (claiming they are a 6 out of 10), however almost none of them write their T-SQL by hand.  If you are using the WYSIWYG query builder you are a 3 out of 10, or less.  Here is one of my favorite interview problems that comes up in real life (my life) a lot.

Given a table like this:

CREATE T1 (ID int PRIMARY KEY, Name varchar(50))

That has duplicate rows in the name column, write a query to "clean" the duplicates leaving only on row per name.  So if "Tom Smith" shows up in two rows (ID: 2, and 4), the query would delete either rows with ID 2 or 4 and leave the other one.  My favorite answer is:

DELETE T1
WHERE NOT T1.ID IN (SELECT MIN(T1.ID) FROM T1 GROUP BY Name)

This includes an IN clause where is terrible for performance -- however the best solution I know.  Most people try it with a cursor which is much harder.

Note:  This is how you "clean" a table of duplicates before you add a unique index to the name column.  A typical scenario when you have a lookup table that has some extra data and you are normalizing the database.

{6230289B-5BEE-409e-932A-2F01FA407A92}

 

Comments

Popular posts from this blog

Simple WP7 Mango App for Background Tasks, Toast, and Tiles: Code Explanation

Yet once more into the breech (of altered programming logic)

How to convert SVG data to a Png Image file Using InkScape