Remove all but one of each set of similar lines in a file

Hello folks,
I have a question for you sed and grep gurus (maybe awk, but I would prefer the first two).
I have a file (f1) that says:
(Actually, these are not numbers but md5sums, but for simplicity, let's assume numbers.)

1
2
3
4
5

And I have a file (f2) that says:

1|a
1|b
1|c
2|d
2|e
2|f
2|g
3|h
3|i
4|j
4|k
4|l
5|m
5|n

I would like to keep either

  • one line of each group starting with the same number:
1|a
2|d
3|h
4|j
5|m
  • or all the other lines of each group starting with the same number (I'll choose whichever is more efficient):
1|b
1|c
2|e
2|f
2|g
3|i
4|k
4|l
5|n

I already accomplished miracles with sed and grep in previous steps of my final script, so I hope someone can come up with something simple for this problem.

Here is what I have in bash (it works but is slow...). Only f2 is needed in this example:

while IFS= read -r l; do                       # -r: do not mangle backslashes
    n="$md5"; md5="${l%%|*}"                   # n = previous key, md5 = current key
    [ "$n" = "$md5" ] && echo "$l" >> "$TMP1"  # duplicate key: save the line for later
done < f2

In this script, the second and subsequent lines with the same md5 go to the $TMP1 file to be processed later.
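
For what it's worth, opening $TMP1 once for the whole loop instead of appending per line helps a little, but it is still slow:

# Same loop, but redirect once for the whole loop instead of
# reopening $TMP1 with >> on every matching line
md5=
while IFS= read -r l; do
    n="$md5"; md5="${l%%|*}"
    [ "$n" = "$md5" ] && printf '%s\n' "$l"
done < f2 > "$TMP1"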

All data are sorted by the 1st field.

Thank you in advance.

awk -F\| 'A[$1]++==1' f1 f2
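
Spelled out with comments, the one-liner works like this (-F'|' makes the first field the key, and awk prints the line when the bare condition is true):

awk -F'|' '
    # A[$1] counts how many times this key has been seen so far.
    # Reading f1 first primes every wanted key to 1, so the test
    # "the count was exactly 1 before this increment" is true only
    # for the first f2 line of each key listed in f1.
    A[$1]++ == 1
' f1 f2

For the complementary set (everything but the first line of each group), the same idea with A[$1]++ > 1 and the same two files prints the rest.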

Wow I think I'll keep this awk solution after all :smiley:
Thank you Scrutinizer

Assuming both files are presorted, just like in your example,
here is... the hard way...

# cat f1
1
2
3
4
5
# cat f2
1|azoep
1|fskl
1|gjldfk
1|gjldiropez
1|gmlds
2|dsfgk
2|jgkfdls
3|fjsdk
3|jkflsdql
# >output
# ksh mtst
# cat output
1|azoep
2|dsfgk
3|fjsdk
# cat mtst
exec 3<f1        # f1 on fd 3: the list of keys
exec 4<f2        # f2 on fd 4: the data
read -u3 n       # current key from f1
read -u4 l       # current line from f2
while [[ -n "$n" && -n "$l" ]]
do
        if [[ $n = "${l%%\|*}" ]]
        then
                # keys match: keep this first line, advance both files
                echo "$l" >>output
                read -u3 n
                read -u4 l
        elif [[ $n < "${l%%\|*}" ]]; then
                # f1 key sorts before the f2 key: advance f1
                read -u3 n
        else
                # f2 key sorts before the f1 key (a duplicate of a
                # key already matched): skip this line
                read -u4 l
        fi
done
exec 3<&-        # close both descriptors
exec 4<&-
#

If you have any idea to make the code quicker, please suggest it; I am curious to see how fast we can tweak it (keeping it at shell level).
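
One shell-level tweak that should help: open output once for the whole loop instead of reopening it with >> on every write, and use read -r so backslashes in the data pass through untouched. A sketch of the same merge loop with those changes (untimed, ksh93 assumed as in the original):

exec 3<f1
exec 4<f2
read -r -u3 n
read -r -u4 l
while [[ -n "$n" && -n "$l" ]]
do
        if [[ $n = "${l%%\|*}" ]]; then
                print -r -- "$l"        # written to 'output' via the loop redirection
                read -r -u3 n
                read -r -u4 l
        elif [[ $n < "${l%%\|*}" ]]; then
                read -r -u3 n
        else
                read -r -u4 l
        fi
done >output
exec 3<&-
exec 4<&-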