Trying to remove duplicates based on field and row

newbie2010 · December 11, 2013, 1:13pm

I am trying to see if I can use awk to remove duplicates from a file. This is the file:

-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943

However, I am having issues. If a volume in the 2nd field also has an entry under rearrange, I want to drop its earlier "deleting" lines and keep only the rearrange line. So I want the file to look like this:

Correct file:

-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943

I have tried the following but I think it is not acknowledging the 2nd occurrence of the volume.

cat test3 | gawk '{if (Line!=$1$2) print; Line=$1$2}'

cat test3 | gawk 'BEGIN{RS="="} $1=$1' FS='\t'

Also, while searching for similar issues, I found a line that builds an array from the file. I am not sure how it works, but I think it does not consider the "rearrange" part.

 cat test3 |gawk '!arr[$2]++'

The above expression gets rid of the last line, which is NOT what I want. I want only the rearrange line for that volume to be output. In addition, there is a command "tac" that I have seen some people work with, but I don't have it on my distribution.

Does anybody have ideas? I am really a novice at removing duplicates and am not sure how the process works.

Yoda · December 11, 2013, 1:21pm

If it is OK that the order is not preserved:

awk '
        /^-/ {
                print $0
        }
        !/^-/ {
                if ( !(A[$2]) )
                        A[$2] = $1
                else if ( $1 == "rearrange" )
                        A[$2] = $1
        }
        END {
                for ( k in A )
                        print A[k], k
        }
' file
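
If the order does matter, the same idea can be extended by remembering where each volume was first seen. The following is only a sketch under that assumption; dedupe_keep_rearrange is a hypothetical helper name, and each volume is printed at the position of its first occurrence, with a single space between the two fields.

```shell
# Sketch: one line per volume, "rearrange" preferred, in first-seen order.
# dedupe_keep_rearrange is a hypothetical name used for illustration only.
dedupe_keep_rearrange() {
    awk '
        /^-/ { print; next }                    # header line passes straight through
        NF < 2 { next }                         # ignore blank or malformed lines
        !($2 in action) {                       # first sighting of this volume
            order[++n] = $2                     # remember its position
            action[$2] = $1                     # and its action
        }
        $1 == "rearrange" { action[$2] = $1 }   # a rearrange entry always wins
        END {
            for (i = 1; i <= n; i++)
                print action[order[i]], order[i]
        }
    ' "$@"
}

# Demo on the sample data from the first post.
dedupe_keep_rearrange <<'EOF'
-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943
EOF
```

Note that the rearrange line for /vol/eng_rmd_0943 then appears where that volume first showed up, not at the end of the file.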

Akshay_Hegde · December 11, 2013, 1:24pm

Try this, if order doesn't matter:

$ cat <<eof | awk 'NR==1;NR>1{A[$2]=$0}END{for(i in A)print A[i]}'
-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943
eof

-==> Listvol <==
deleting   /vol/eng_rmd_0941
rearrange  /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1012
deleting   /vol/eng_rmd_1006

For a file:

$ awk 'NR==1;NR>1{A[$2]=$0}END{for(i in A)print A[i]}' file

newbie2010 · December 11, 2013, 1:31pm

Akshay:

I am relatively new to awk. This worked great! If you have time, could you briefly explain the syntax?

Also, what is the difference between your statement and

gawk '{if (Line!=$1$2) print; Line=$2}'

Thanks again! This statement was very concise. Just brilliant!

Akshay_Hegde · December 11, 2013, 1:39pm

awk 'NR==1; ---> prints the header (line number 1)

NR>1{A[$2]=$0} ---> for every line after the first, array A, indexed by column 2 ($2), stores the whole line ($0); a later line with the same $2 overwrites the earlier one

END{for(i in A)print A[i]}' ---> in the END block, prints the contents of the array
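
To make the steps above easier to follow, here is the same logic spelled out as a commented function. This is a sketch only; keep_last_per_volume and the array name last are hypothetical names chosen for readability.

```shell
# Sketch: expanded form of the one-liner, one rule per line.
# keep_last_per_volume is a hypothetical name used for illustration only.
keep_last_per_volume() {
    awk '
        NR == 1 { print; next }   # line 1 is the header: print it unchanged
        { last[$2] = $0 }         # later lines overwrite earlier ones for the same volume
        END {
            for (vol in last)     # traversal order of "for (k in array)" is unspecified
                print last[vol]
        }
    ' "$@"
}

# Usage: keep_last_per_volume file
```

Because the traversal order of the array is unspecified, the lines after the header may come out in any order, as in the sample output earlier in the thread.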

gawk '{if (Line!=$1$2) print; Line=$2}'

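Line!=$1$2 --> if Line is not equal to column 1 and column 2 joined together, the line is printed. This is always true for the first line, since Line is not set yet. After printing, Line is assigned the value of $2 (Line=$2), and the same check runs for the 2nd line, and so on. Because Line only ever holds $2, it never equals $1$2, so with this version every line is printed.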

--edit--

Your code will not work because it only compares each line with the previous line; a duplicate that is not directly adjacent still gets printed.

As for awk '!arr[$2]++': it prints only the first line found for each value of field 2 ($2), which is why the rearrange line is not printed.

$ cat <<eof | awk '!arr[$2]++'
-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943
eof

-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012

Yoda's solution keeps track of rearrange in field 1. My solution simply saves the last line found for each volume, so it assumes the rearrange line comes after the deleting lines for that volume. If your file is not ordered that way, I think you should go with Yoda's solution.
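
For a file where the rearrange line may come before or after the deleting lines, and where the input order should be kept, a two-pass approach is one option. The following is a minimal sketch, assuming the input is a regular file that can be read twice; drop_superseded_deletes is a hypothetical helper name.

```shell
# Sketch: two passes over the same file.
# drop_superseded_deletes is a hypothetical name used for illustration only.
drop_superseded_deletes() {
    # $1 must be a regular file, because awk reads it twice.
    awk '
        NR == FNR {                                      # first pass
            if ($1 == "rearrange") rearranged[$2] = 1    # note volumes that get rearranged
            next
        }
        $1 == "deleting" && ($2 in rearranged) { next }  # second pass: skip superseded deletes
        !seen[$0]++                                      # print every other line once, in input order
    ' "$1" "$1"
}

# Usage: drop_superseded_deletes test3
```

Run on the sample file, this should give the "Correct file" listing from the first post, with the original spacing intact. Exact duplicates are detected on the whole line, so two lines that differ only in whitespace would both be kept.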

RudiC · December 11, 2013, 2:47pm

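Try also (a bit lengthy). Note that this removes only repeated action/volume pairs, so the "deleting" line for /vol/eng_rmd_0943 is still there: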

cat -n file | sort -r |awk '!T[$2,$3]++' | sort | awk '{print $2 "\t" $3}'
-==>    Listvol
deleting    /vol/eng_rmd_0941
deleting    /vol/eng_rmd_0943
deleting    /vol/eng_rmd_1006
deleting    /vol/eng_rmd_1012
rearrange    /vol/eng_rmd_0943
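
The number, reverse, filter, restore pattern in this pipeline can also be keyed on the volume alone, so that only the last entry for each volume survives. This is a sketch under that assumption, not the poster's original command; keep_last_entry_via_sort is a hypothetical helper name, and sort -rn stands in for the tac command that is missing on the original poster's system.

```shell
# Sketch: keep only the last line seen for each volume, in original order.
# keep_last_entry_via_sort is a hypothetical name used for illustration only.
keep_last_entry_via_sort() {
    cat -n "$@" |                       # prefix each line with its number and a tab
        sort -rn |                      # reverse the line order
        awk 'NF < 3 || !seen[$3]++' |   # $3 is now the volume: keep its first hit, i.e. the last in the file
        sort -n |                       # restore the original order
        cut -f2-                        # strip the line number again
}

# Usage: keep_last_entry_via_sort file
```

Unlike the pipeline above, this keeps the original spacing and the full header line, because the line number is cut off instead of the fields being reprinted.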

RavinderSingh13 · January 23, 2014, 9:06am

Hello All,

Here is one more approach, using sort and awk.

Input file:

-==>    Listvol
deleting    /vol/eng_rmd_0941
deleting    /vol/eng_rmd_0943
deleting    /vol/eng_rmd_1006
deleting    /vol/eng_rmd_1012
rearrange    /vol/eng_rmd_0943

sort -rk2 check_actual_read_opposite | awk '$2  == g {next} {g=$2} 1'

The output is shown below. The lines come out sorted on the second column in reverse order, and each volume keeps a single entry together with its matching first-column value.

-==>    Listvol
deleting    /vol/eng_rmd_1012
deleting    /vol/eng_rmd_1006
rearrange    /vol/eng_rmd_0943
deleting    /vol/eng_rmd_0941

NOTE: check_actual_read_opposite is the input file name.

Thanks,
R. Singh