remove all duplicate lines from all files in one folder

lowmaster · May 30, 2009, 1:32am

Hi,

Is it possible to remove all duplicate lines from all txt files in a specific folder?

This is too hard for me; maybe someone could help.

Let's say we have a number of text files: 1, 2, 3, ... up to a maximum of 50.
Each text file has lines of text.

I want all lines of all text files together to be unique, but the non-duplicate lines must remain in the txt file where they are.

It does not matter in which txt file the duplicate lines are deleted, but one occurrence has to stay in at least one txt file... An even better solution would delete the duplicate occurrences first in text file 1, then in 2, then in 3, so that the deleted lines are spread across all txt files.

Example with 4 text files (the number can vary, up to 50); we also do not know how many lines each file has.

txt1:
aaaaaaa
bbbbbbb
ccccccc

txt2:
aaaaaaa
ccccccc
ddddddd

txt3:
ccccccc
ddddddd
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

A result could look like this, for example:

txt1:
aaaaaaa
bbbbbbb
ccccccc

txt2:
ddddddd

txt3:
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

A perfect result (if possible) would look like this:

txt1:
aaaaaaa
bbbbbbb

txt2:
ccccccc
ddddddd

txt3:
eeeeeee

txt4:
ggggggg
hhhhhhh
kkkkkkkk

vidyadhar85 · May 30, 2009, 2:14am

Try using commands like sort and uniq to get rid of duplicate lines. Refer to the man pages and give it a try; if that does not work, post back and we will help you.

-vidya
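
As a small illustration of what sort and uniq can do here: on their own they do not rewrite the individual files, but they can show which lines occur more than once across the whole set. The sketch below recreates the four example files from the first post in a scratch directory and lists the duplicated lines. The helper name list_dupes and the scratch directory are made up for illustration; note that a line repeated inside a single file would be reported too.

```shell
#!/bin/sh
# list_dupes FILE...  (hypothetical helper name)
# Print every line that occurs more than once across the given files.
# It only reports; it does not modify any file.
list_dupes() {
    sort -- "$@" | uniq -d
}

# Demo on the example data from the first post, in a throwaway directory.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    printf '%s\n' aaaaaaa bbbbbbb ccccccc  > txt1
    printf '%s\n' aaaaaaa ccccccc ddddddd  > txt2
    printf '%s\n' ccccccc ddddddd eeeeeee  > txt3
    printf '%s\n' ggggggg hhhhhhh kkkkkkkk > txt4
    list_dupes txt*
)
rm -rf "$demo"
```

For the example data this lists aaaaaaa, ccccccc and ddddddd, which are exactly the lines that have to disappear from all but one file.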

devtakh · May 30, 2009, 2:54am

I am not sure why you would need that.

Why don't you just combine all the files, use sort to keep the unique lines, and then finally split the result back into files?

cat *.txt | sort -u > newfile

man split

-Devaraj Takhellambam

ghostdog74 · May 30, 2009, 2:55am

@dev, the requirement is different. The OP still needs to keep his files.

colemar · May 30, 2009, 3:30am

This is a tough one.

Could be done with awk, but to simplify its work I believe it would be a good idea to first combine all the files in one file in such a way that all the original information is retained:

txt1 aaaaaaa
txt1 bbbbbbb
txt1 ccccccc
txt2 aaaaaaa
txt2 ccccccc
txt2 ddddddd
txt3 ccccccc
txt3 ddddddd
txt3 eeeeeee
txt4 ggggggg
txt4 hhhhhhh
txt4 kkkkkkkk

This way you have the original filename in the first column. The file can be sorted on the second column, then you can apply an awk program that appends each field $2 as a line to a file named after field $1, but only if field $2 did not appear on the previous input line.
The delete operations would be automatically spread over the filenames.
Now you are wondering how to combine the files, sort the result, and process it.

awk '{print FILENAME,$0}' * | sort -k 2 > bigfile
awk 'someprogram' bigfile

Of course you don't need to have an intermediate file:

awk '{print FILENAME,$0}' * | sort -k 2 | awk 'someprogram'

someprogram:

$2 != p { print $2 >> $1 }
{ p = $2 }

Perhaps better (to avoid the risk of exceeding the system limit on the number of open files in one process):

awk '{print FILENAME,$0}' * | sort -k 2 | awk '$2!=p;{p=$2}' | sort -k 1 | awk '$1!=p&&p{close(p)}{print $2>>$1;p=$1}'

This could be schematized as:
muxer | sort | filter | sort | splitter
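
The five stages can be put together as a runnable sketch along these lines. It is one possible reading of the pipeline, not a drop-in replacement, and it makes a few assumptions that are not in the post: a tab is used as the separator instead of a space (so that lines containing spaces survive intact), the results go to a separate directory rather than to files named after the inputs (appending to a file named after field $1 in the same folder would land in the original file), and LC_ALL=C is set so that sort groups lines strictly by bytes. The function name dedupe_pipeline and the directory name out are made up for illustration.

```shell
#!/bin/sh
# dedupe_pipeline OUTDIR FILE...   (hypothetical helper, not from the thread)
# Shape: muxer | sort | filter | sort | splitter, results go to OUTDIR.
# Assumptions: run from inside the folder with plain file names (no slashes),
# and neither lines nor file names contain a tab character.
dedupe_pipeline() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    tab=$(printf '\t')
    # muxer: prefix every line with its file name and a tab
    awk -v OFS='\t' '{ print FILENAME, $0 }' "$@" |
    # sort on the text; equal texts end up next to each other
    LC_ALL=C sort -t "$tab" -k 2 |
    # filter: pass only the first line of each run of equal texts
    awk -F '\t' 'NR == 1 || $2 != p { print } { p = $2 }' |
    # sort again so that each file name forms one contiguous run
    LC_ALL=C sort -t "$tab" -k 1,1 |
    # splitter: write one output file at a time, closing the previous one
    awk -F '\t' -v dir="$outdir" '
        $1 != p { if (p != "") close(dir "/" p); p = $1 }
        { print $2 > (dir "/" $1) }'
}

# Demo on the example data from the first post, in a throwaway directory.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    printf '%s\n' aaaaaaa bbbbbbb ccccccc  > txt1
    printf '%s\n' aaaaaaa ccccccc ddddddd  > txt2
    printf '%s\n' ccccccc ddddddd eeeeeee  > txt3
    printf '%s\n' ggggggg hhhhhhh kkkkkkkk > txt4
    dedupe_pipeline out txt*
    for f in out/*; do echo "$f:"; cat "$f"; done
)
rm -rf "$demo"
```

Two things to be aware of with this sketch: the lines inside each output file come out in sorted order rather than in their original order, and a file whose lines were all duplicates gets no output file at all. On the example data, out/txt1 keeps aaaaaaa, bbbbbbb and ccccccc, out/txt2 keeps only ddddddd, out/txt3 keeps only eeeeeee, and out/txt4 keeps all three of its lines.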

devtakh · May 30, 2009, 4:24am

awk '{s=FILENAME}!a[$0]++ && FILENAME == s{print $0>FILENAME }' txt*

-Devaraj Takhellambam

colemar · May 30, 2009, 5:13am

FILENAME==s is always true since it can happen only just after s=FILENAME.

>FILENAME is writing to the same file that awk is reading, I believe this is not a good idea. Plus, to append to a file you need to use >>.

The code can be reworked as:

mkdir tmp
awk '!a[$0]++{print $0>>("tmp/"FILENAME)}' txt*
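
To see what this does on the example from the first post, here is a small self-contained demo. It recreates the four files in a scratch directory and runs the same one-liner, wrapped in a function whose name (first_wins) is made up for illustration. It assumes the tmp directory starts out empty, since the output is appended.

```shell
#!/bin/sh
# first_wins FILE...  (hypothetical helper name)
# Keep each distinct line only in the first file where it is seen and
# write the surviving lines to tmp/<same file name>.
# Assumes it is run from inside the folder and that tmp/ starts out empty.
first_wins() {
    mkdir -p tmp || return 1
    awk '!a[$0]++ { print $0 >> ("tmp/" FILENAME) }' "$@"
}

# Demo on the example data from the first post, in a throwaway directory.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    printf '%s\n' aaaaaaa bbbbbbb ccccccc  > txt1
    printf '%s\n' aaaaaaa ccccccc ddddddd  > txt2
    printf '%s\n' ccccccc ddddddd eeeeeee  > txt3
    printf '%s\n' ggggggg hhhhhhh kkkkkkkk > txt4
    first_wins txt*
    for f in tmp/*; do echo "$f:"; cat "$f"; done
)
rm -rf "$demo"
```

On this data tmp/txt1 keeps all three of its lines, tmp/txt2 keeps only ddddddd, tmp/txt3 keeps only eeeeeee and tmp/txt4 is unchanged, which matches the first of the two results sketched in the opening post.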

devtakh · May 30, 2009, 5:34am

colemar:

FILENAME==s is always true since it can happen only just after s=FILENAME.

>FILENAME is writing to the same file that awk is reading, I believe this is not a good idea. Plus, to append to a file you need to use >>.

The code can be reworked as:
mkdir tmp
awk '!a[$0]++{print $0>>("tmp/"FILENAME)}' txt*

You should not use >> unless you want to preserve what was in the file before the awk script runs. You should use the > operator.

Again, if there are many files, you should close them, or else you may run into errors due to the OS limit on open files. It is always a good idea to close them explicitly. Use

close(filename)

colemar · May 30, 2009, 7:45am

Right. In this case however >> does not hurt, since the files are newly created.

Right. It is funny that I didn't include a close() because you originally didn't provide it.

mkdir tmp
awk 'FILENAME!=s{if(s)close("tmp/"s);s=FILENAME}!a[$0]++{print $0>("tmp/"s)}' txt*

This code tends to delete more lines in files that come later in ASCII order, hence it is not the best solution by the original poster's criteria.
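
One way to spread the deletions more evenly, offered only as a sketch and not as something proposed in this thread: first record every file in which each distinct line occurs, then give each line to the candidate file that has kept the fewest lines so far (ties go to the earlier file). The function name dedupe_balanced and the output directory name are made up for illustration, the whole input is held in memory (which should be fine for 50 text files of moderate size), and the greedy rule is my own assumption about what "spread" should mean.

```shell
#!/bin/sh
# dedupe_balanced OUTDIR FILE...   (hypothetical helper, not from the thread)
# Each distinct line is kept in exactly one of the files that contain it,
# chosen greedily as the candidate that has kept the fewest lines so far.
# Assumes it is run from inside the folder with plain file names.
dedupe_balanced() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        FNR == 1 { fname[++nf] = FILENAME }        # input files, in order
        !seen[FILENAME, $0]++ {                    # first time in this file
            if (!($0 in count)) text[++nt] = $0    # first time anywhere
            where[$0, ++count[$0]] = FILENAME      # candidate file for it
        }
        END {
            for (i = 1; i <= nt; i++) {
                line = text[i]
                best = where[line, 1]
                for (j = 2; j <= count[line]; j++) {
                    f = where[line, j]
                    if (kept[f] + 0 < kept[best] + 0) best = f
                }
                kept[best]++
                body[best] = body[best] line "\n"
            }
            # write every output file once, then close it again
            for (i = 1; i <= nf; i++) {
                out = dir "/" fname[i]
                printf "%s", body[fname[i]] > out
                close(out)
            }
        }' "$@"
}

# Demo on the example data from the first post, in a throwaway directory.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    printf '%s\n' aaaaaaa bbbbbbb ccccccc  > txt1
    printf '%s\n' aaaaaaa ccccccc ddddddd  > txt2
    printf '%s\n' ccccccc ddddddd eeeeeee  > txt3
    printf '%s\n' ggggggg hhhhhhh kkkkkkkk > txt4
    dedupe_balanced balanced txt*
    for f in balanced/*; do echo "$f:"; cat "$f"; done
)
rm -rf "$demo"
```

On the example data this leaves aaaaaaa and bbbbbbb in txt1, ccccccc in txt2, ddddddd and eeeeeee in txt3, and txt4 untouched. That is not line for line the "perfect" result from the first post, but the deletions are spread over the files in the same spirit. Lines are written in the order in which they were first seen across all files, so the order inside a file can differ from the original.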