Bash: keep only duplicate lines in a file

Hello all,

In my bash script I have a file, and I only want to keep the lines that appear twice in it. Is there a way to do this?
Thanks in advance!

To answer this, a bit more information is needed:

1) Is the file sorted, or are the lines you wish to 'keep' adjacent to each other in the file?

2) Is the order of the output important? Do the 'kept' lines need to be in the same order in which they appeared in the input?

3) Do some lines appear more than twice, and should those be kept as well, or do you want to keep only the lines that appear exactly twice?

4) How big is the file in terms of number of lines?

1) Yes, the files are sorted.

2) No, the output order is not at all important.

3) Lines appear either once or twice.

4) That's not known; some files have 500 lines or more, and some only 4.

I found this:

comm -3 file1 file2 > file3

but it stores in file3 the lines that appear in only one of the files and not in the other, the exact opposite of what I want :(

Actually

comm -12 file1 file2 > file3

is what I want, but it makes the execution much slower.
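For anyone reading along, here is a quick sketch of the difference, using made-up file names and contents purely for illustration. With two small sorted files:

$ cat file1
apple
banana
cherry
$ cat file2
banana
cherry
date
$ comm -3 file1 file2
apple
	date
$ comm -12 file1 file2
banana
cherry

comm -3 suppresses column 3 (the lines common to both files), leaving only the lines unique to one file or the other, while comm -12 suppresses columns 1 and 2, leaving only the common lines.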

comm can output lines that are common to both files, but your initial post suggests that you only have one file to work with, and comm won't help with that.

Given that your file is already sorted, this is the easy case, and something like this will probably do what you need:

awk 'p == $0; { p = $0 }' input-file > output-file

It prints a line whenever it matches the previous line, so every line that appears twice in the sorted file is written out exactly once.
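A quick illustration with a made-up sorted file (input-file and its contents are invented for the example):

$ cat input-file
alpha
beta
beta
gamma
gamma
$ awk 'p == $0; { p = $0 }' input-file
beta
gamma

One caveat: awk's uninitialized p is the empty string, so a file that begins with a blank line would have that blank line printed as a spurious "duplicate".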

OK, it looks like we crossed posts: you have two files, not one as your initial question suggested.

Using comm -12 is the easiest, and probably the most efficient, method.

uniq -d is meant for exactly this: displaying duplicate lines. It prints one copy of each line that is repeated in sorted input.
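For the single-file case it would look like this (again using the invented input-file from above):

$ uniq -d input-file
beta
gamma

uniq -d prints one copy of each run of adjacent repeated lines, which on sorted input means exactly one copy of every duplicated line, i.e. the same result as the awk one-liner.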