rowid is not an indexed column - it is a "pseudocolumn'. the in () subselect will read thru the entire select statement's result set each time. When I get back in a while I'll write something that is faster. You may need to add an index.
Matrix,
Your solution is very simple and easy to understand.
Although it works fine for small to medium tables, or for tables with low access/update,
for very large tables, or tables with heavy access/update, it may:
1) Generate an abnormally long transaction.
2) Fill the logs.
3) Be involved with another process in a deadlock.
4) Run for a very long time.
It is also important to note that for large tables, the internals of your query will be
very ineficient as the system will store a rowid for each unique key and loop thru
each one for every row.
The best and optimized solution would be to write a program to loop thru each row
in the table, begin a transaction and commit every number of deleted rows -- usually
one to five thousand is very quick, safe and easy on the database.
I don't know if this is more efficient, but it seems like a positive approach might be better where there are only a few duplicates.
DELETE FROM tableA A1
WHERE column1 in (SELECT column1 FROM tableA GROUP BY column1 having count(*) > 1)
and rowid != (select min(rowid) from tableA A2 where A1.column1 = A2.column1)
Am not able to understand the optimization that you have made to make the query run faster.
Basically, with combination of using selected columns ( column1 ) and the specific rowids ( rowid ), what is the need to specify an extra condition with a separate subquery to extract column1.
Isnt that redundant ? Or how does it make the query optimized and more efficient.
This is really equivalent to for each record with a rowid ' x ', necessarily two subqueries should be executed for all the records.
Yes you are right that it makes 2 sub-selects. If you have many duplicates, then your original post may be better. But suppose you only have a single duplicate. My subquery1 returns only a single value of column1 (the duplicate), where your subquery returns many values (the non duplicates). So now I have a single value to compare against tableA where you have many values. Yes, I still have to make subquery2, but it is against a much smaller set.
Subquery2 makes sure that, for a given value of column1, we don't delete the row with the smallest rowid.
Look - the query has to search the ENTIRE output of the IN (SELECT ..)
for each set of rows it deletes. I know that rowid's are what you get from an index.
Instead of me going on about this - look in Tom Kyte's book 'EXpoert One on One' and look in the analytical function chapter - there is an example of how to optimize a query very like this. Or try the asktom oracle site:
From there:
from
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1224636375004
delete from tableA a
where rowid <> ( select max(rowid)
from tableA b
where b.column1 = a.column1)
/
I made a test table with a single varchar field and no index. I loaded it with 30,000 records and an additional 30 duplicate records. I ran the following queries.
select count(*) FROM tableA
WHERE rowid not in
(SELECT MIN(rowid) FROM tableA GROUP BY column1);
1 hour 12 minutes 58 seconds
select count(*) FROM tableA A1
WHERE column1 in (SELECT column1 FROM tableA GROUP BY column1 having count(*)
> 1)
and rowid != (select min(rowid) from tableA A2 where A1.column1 = A2.column1);
1 second
select count(*) from tableA a
where rowid <> ( select max(rowid)
from tableA b
where b.column1 = a.column1);
6 min 29 sec
Have you decided if anything works better than your original query? If so, can you let us know what worked and some idea of the time difference? Thanks.