Trying to remove duplicates based on field and row

I am trying to see if I can use awk to remove duplicates from a file. This is the file:

-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943

However, I am having issues. I want to remove the deleting entries for a volume (2nd field) if that volume also has an entry under rearrange. So I want the file to look like this:

Correct file:

-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943

I have tried the following but I think it is not acknowledging the 2nd occurrence of the volume.

cat test3 | gawk '{if (Line!=$1$2) print; Line=$1$2}'
cat  test3 |gawk 'BEGIN{RS="="} $1=$1' FS= '\t" 

Also I have found a line that makes an array of the file, by searching for similar issues. I am not sure how this works, but I think it does not consider the "rearrange part"

 cat test3 |gawk '!arr[$2]++'

The above expression gets rid of the last line, which is NOT what I want. I want only the rearrange line for that volume to be output. In addition, there is a command "tac" that I have seen some people work with, but I don't have it on my distribution.

Does anybody have ideas? I am really a novice at removing duplicates and am not sure how the process works.

If it is OK that the order is not preserved:

awk '
        /^-/ {
                print $0
        }
        !/^-/ {
                if ( !(A[$2]) )
                        A[$2] = $1
                else if ( $1 == "rearrange" )
                        A[$2] = $1
        }
        END {
                for ( k in A )
                        print A[k], k
        }
' file
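If the original order does need to be preserved, reading the file twice is one option. This is only a sketch and not part of the solution above: the function name keep_rearrange and the temporary sample file are made up for illustration, and it assumes the action is always in field 1 and the volume in field 2. The first pass notes which volumes have a rearrange line; the second pass prints everything except the deleting lines for those volumes and exact repeats.

```shell
#!/bin/sh
# Hypothetical helper: keep_rearrange FILE
# Pass 1 (NR == FNR) records every volume that has a "rearrange" line.
# Pass 2 skips "deleting" lines for those volumes and any exact repeat.
keep_rearrange() {
    awk '
        NR == FNR { if ($1 == "rearrange") r[$2]; next }
        $1 == "deleting" && ($2 in r) { next }
        !seen[$0]++
    ' "$1" "$1"
}

# Sample data copied from the question, written to a temporary file.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    '-==> Listvol <==' \
    'deleting   /vol/eng_rmd_0941' \
    'deleting   /vol/eng_rmd_0943' \
    'deleting   /vol/eng_rmd_0943' \
    'deleting   /vol/eng_rmd_1006' \
    'deleting   /vol/eng_rmd_1012' \
    'rearrange  /vol/eng_rmd_0943' > "$sample"

keep_rearrange "$sample"
```

The file name is passed to awk twice on purpose, so this form works on a regular file but not on piped input.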

Try this, if order doesn't matter:

$ cat <<eof | awk 'NR==1;NR>1{A[$2]=$0}END{for(i in A)print A[i]}'
-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943

-==> Listvol <==
deleting   /vol/eng_rmd_0941
rearrange  /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1012
deleting   /vol/eng_rmd_1006

For a file:

$ awk 'NR==1;NR>1{A[$2]=$0}END{for(i in A)print A[i]}' file

I am relatively new to awk. This worked great! If you have time, could you briefly explain the syntax?

Also, what is the difference between your statement and

gawk '{if (Line!=$1$2) print; Line=$2}'

Thanks again! This statement was very concise. Just brilliant!

awk 'NR==1; ---> prints your header (line number 1)

NR>1{A[$2]=$0} ---> for every line number greater than 1, array A, indexed by column 2 ($2), holds the whole line ($0); a later line with the same $2 overwrites the earlier one

END{for(i in A)print A[i]}' ---> in the END block, prints the array contents
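The same one-liner can be spread over several lines with comments, which may make the three parts easier to follow. This is just an expanded sketch of the explanation above; the names last_per_volume, sample_input, last and vol are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: expanded form of
#   awk 'NR==1;NR>1{A[$2]=$0}END{for(i in A)print A[i]}'
last_per_volume() {
    awk '
        NR == 1 { print; next }   # header line: print it right away
        { last[$2] = $0 }         # a later line with the same $2 overwrites the earlier one
        END {
            for (vol in last)     # traversal order of "for (x in array)" is unspecified
                print last[vol]
        }
    ' "$@"
}

# Sample data copied from the question.
sample_input() {
    printf '%s\n' \
        '-==> Listvol <==' \
        'deleting   /vol/eng_rmd_0941' \
        'deleting   /vol/eng_rmd_0943' \
        'deleting   /vol/eng_rmd_0943' \
        'deleting   /vol/eng_rmd_1006' \
        'deleting   /vol/eng_rmd_1012' \
        'rearrange  /vol/eng_rmd_0943'
}

sample_input | last_per_volume
```

Because the END loop walks the array in no particular order, the lines after the header can come out in any order, as in the output shown earlier.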

gawk '{if (Line!=$1$2) print; Line=$2}'

Line!=$1$2 --> if Line is not equal to column 1 and column 2 joined together, the line is printed. This always holds for the first line, since Line is not set yet. After printing, the variable Line is assigned the value of $2 (Line=$2), and the check is repeated for the 2nd line.


Your code will not work because it only compares each line with the previous one; a duplicate that is not directly next to its first occurrence still gets printed.
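A small demonstration of that point, as a sketch only: the function name adjacent_only and the two inputs are made up, and it uses the consistent form from the first post (saving $1$2, not just $2, since comparing $1$2 against a saved $2 would not match on this data at all).

```shell
#!/bin/sh
# Hypothetical helper: drop a line only when it repeats the line
# directly before it (same field 1 and field 2), like uniq does.
adjacent_only() {
    awk '{ key = $1 $2; if (prev != key) print; prev = key }' "$@"
}

# Adjacent duplicate: the second "deleting /vol/a" is dropped.
printf 'deleting /vol/a\ndeleting /vol/a\nrearrange /vol/a\n' | adjacent_only

echo '---'

# Same duplicate with another line in between: nothing is dropped.
printf 'deleting /vol/a\ndeleting /vol/b\ndeleting /vol/a\n' | adjacent_only
```

In both runs the rearrange or deleting line for a volume is judged only against its direct neighbour, so this approach cannot decide anything based on a line further down the file.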

And awk '!arr[$2]++' prints only the first line found for each value of field 2 ($2), which is the reason why the rearrange line is not getting printed.

$ cat <<eof | awk '!arr[$2]++'                     
-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
rearrange  /vol/eng_rmd_0943

-==> Listvol <==
deleting   /vol/eng_rmd_0941
deleting   /vol/eng_rmd_0943
deleting   /vol/eng_rmd_1006
deleting   /vol/eng_rmd_1012
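For anyone unsure how '!arr[$2]++' works, it can be written out the long way. This is a sketch of the same logic, assuming nothing beyond standard awk; the names first_per_field2 and sample_input are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: long form of  awk '!arr[$2]++'
first_per_field2() {
    awk '{
        if (arr[$2] == 0)   # counter is still 0: first time this $2 is seen
            print
        arr[$2]++           # count this occurrence so later ones are skipped
    }' "$@"
}

# Sample data copied from the question.
sample_input() {
    printf '%s\n' \
        '-==> Listvol <==' \
        'deleting   /vol/eng_rmd_0941' \
        'deleting   /vol/eng_rmd_0943' \
        'deleting   /vol/eng_rmd_0943' \
        'deleting   /vol/eng_rmd_1006' \
        'deleting   /vol/eng_rmd_1012' \
        'rearrange  /vol/eng_rmd_0943'
}

sample_input | first_per_field2
```

The rearrange line is the third line with /vol/eng_rmd_0943 in field 2, so its counter is already 2 when it is read and it is skipped.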

Yoda's solution keeps track of rearrange in field 1. My solution assumes the file is sorted, so it saves the last line found for each volume. If the file is not sorted, I think you should go with Yoda's solution.

Try also (a bit lengthy)

cat -n file | sort -r |awk '!T[$2,$3]++' | sort | awk '{print $2 "\t" $3}'
-==>    Listvol
deleting    /vol/eng_rmd_0941
deleting    /vol/eng_rmd_0943
deleting    /vol/eng_rmd_1006
deleting    /vol/eng_rmd_1012
rearrange    /vol/eng_rmd_0943

Hello All,

Here is one more approach, using sort and awk as follows.

Input file:

-==>    Listvol
deleting    /vol/eng_rmd_0941
deleting    /vol/eng_rmd_0943
deleting    /vol/eng_rmd_1006
deleting    /vol/eng_rmd_1012
rearrange    /vol/eng_rmd_0943

sort -rk2 check_actual_read_opposite | awk '$2  == g {next} {g=$2} 1'

The output will be as follows. The lines come out in reverse sorted order on column two, but each column two value stays with its column one value.

-==>    Listvol
deleting    /vol/eng_rmd_1012
deleting    /vol/eng_rmd_1006
rearrange    /vol/eng_rmd_0943
deleting    /vol/eng_rmd_0941

NOTE: check_actual_read_opposite is the input file name.

R. Singh