Removing duplicates

gctex · April 25, 2011, 10:30am

I have a test file with the following 2 columns:

Col 1       |     Col 2
T1          |         1    <= remove
T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T4          |         1    <= remove
  T1        |         2    <= remove
  T3        |         2    <= remove
T3          |         1    <= remove
T2          |         1

I need to remove any sub-branches. For example, T4 appears in the left column with a value of 2 in the right column, so any other occurrence of T4 with a lower value in the right column should be removed. Similarly, the T1,1 and T1,2 rows need to be removed because there is a T1,3 row. For each entry, only the row with the highest value in column 2 should be retained.

Expected final list:

T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T2          |         1

Franklin52 · April 25, 2011, 11:27am

awk -F"|" '$2 > a[$1]{a[$1]=$NF} END{for(i in a)print i FS a[i]}' file

gctex · April 25, 2011, 10:23pm

Thanks, it works, but it prints the output this way:

T1 | 3
T2 | 1
T3 | 3
T4 | 2
T5 | 1

Can we print it without altering the original sort order?

Also, entries in the first column whose value in the second column is greater than 1 need to be indented, i.e., T4, T1 and T3.

(The original file had the indentation, but for some reason it gets removed when the code is posted.)

---------- Post updated at 09:23 PM ---------- Previous update was at 12:18 PM ----------

Franklin, thanks for adding code tags to my post. So can we print it the way I want it?
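
On the indentation request: in the sample data, rows with value 1 have no indent, rows with value 2 have two spaces, and rows with value 3 have four. A minimal sketch, assuming that rule (two spaces per level) and the 12- and 10-character column widths seen in the sample, could rebuild the layout from flattened `TAG | VALUE` lines such as the output of the one-liner above. The name `indent_by_level` is made up for illustration.

```shell
# indent_by_level: hypothetical helper, reads "TAG | VALUE" lines on stdin.
indent_by_level() {
  awk -F'|' '{
    tag = $1
    gsub(/[ \t]/, "", tag)            # drop any padding around the tag
    level = $2 + 0                    # column 2 as a number
    pad = ""
    for (i = 1; i < level; i++)       # assumed rule: 2 spaces per level above 1
      pad = pad "  "
    printf "%-12s|%10d\n", (pad tag), level
  }'
}

# Example call on three flattened rows.
printf 'T5 | 1\nT4 | 2\nT1 | 3\n' | indent_by_level
```

This only restores the layout. It does not restore the original row order, which the flattened output has already lost.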

pravin27 · April 26, 2011, 2:49am

Try this,

awk -F"|" 'NR==FNR{if(a[$1]){ if(a[$1]<$2) {a[$1]=$2;b[$1]=NR}} else {a[$1]=$2;b[$1]=NR}}
NR>FNR{if(b[$1]==FNR){print}}' infile infile

palanisvr · April 26, 2011, 7:43am

##--get unique tags
for i in ` cat testfile.txt | awk  '{print $1}'|sort -u`
do
grep $i testfile.txt >temp.txt
cat temp.txt | sort -n |tail -1  >>finaldata.txt
done

gctex · April 26, 2011, 10:59am

Not sure, this is what I am getting:

 
!. srt1.sh
T1          |         1
T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T4          |         1
  T1        |         2
  T3        |         2
T3          |         1
T2          |         1
T1          |         1
T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T4          |         1
  T1        |         2
  T3        |         2
T3          |         1
T2          |         1
 
!cat srt1.sh
awk -F"|" 'NR==FNR{if(a[$1]){ if(a[$1]<$2) {a[$1]=$2;b[$1]=NR}} else {a[$1]=$2;b[$1]=NR}}
NR>FNR{if(b[$1]==FNR){print}}' fp1.txt fp1.txt

---------- Post updated at 09:59 AM ---------- Previous update was at 09:56 AM ----------

This is what I am getting:

!. srt.sh
T1          |         1
T2          |         1
T3          |         1
T4          |         1
T5          |         1
 
!cat srt.sh
for i in `cat fp1.txt | awk  '{print $1}'|sort -u`
do
grep $i fp1.txt >temp.txt
cat temp.txt | sort -n |tail -1  >>finaldata.txt
done
cat finaldata.txt

palanisvr · April 27, 2011, 4:22am

I got the desired output with the script below.

Script:
$ cat test.sh
rm finaldata.txt
##--get unique tags
for i in ` cat tt.txt | awk  '{print $1}'|sort -u`
do
grep $i tt.txt >temp.txt
cat temp.txt | sort -n |tail -1  >>finaldata.txt
done
cat finaldata.txt

I tried it with this test file:

$ cat tt.txt
T1          |         1
T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T4          |         1
  T1        |         2
  T3        |         2
T3          |         1
T2          |         1
T1          |         1
T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T4          |         1
  T1        |         2
  T3        |         2
T3          |         1
T2          |         1

Output:

$ sh test.sh
    T1      |         3
T2          |         1
    T3      |         3
  T4        |         2
T5          |         1
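
A likely reason the same loop gives different results on two machines: `sort -n` compares the leading number of each line, and these lines start with a tag, not a number. Every line therefore counts as 0, and the tie is broken by comparing whole lines, which depends on the locale. Below is a hedged sketch that sorts on the second `|`-separated field instead. `max_per_tag` is a made-up name, and `grep -w` (whole-word match, so T1 would not match T10) is assumed to be available, as it is in GNU and BSD grep.

```shell
# max_per_tag FILE: hypothetical helper, prints the highest-valued row per tag.
max_per_tag() {
  # Unique tags, taken as the first whitespace-separated word of each line.
  for tag in $(awk '{print $1}' "$1" | sort -u); do
    # All rows for this tag, ordered numerically by column 2; keep the last.
    grep -w -- "$tag" "$1" | LC_ALL=C sort -t'|' -k2,2n | tail -n 1
  done
}
```

The rows still come out in tag order (T1, T2, ...), not in the order of the input file, because the loop iterates over the sorted tag list.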

gctex · April 27, 2011, 7:50am

T1 is a sub-branch of T4, just like T3, so it needs to appear along with T3. The final order has to be this:

T5          |         1
  T4        |         2
    T1      |         3
    T3      |         3
T2          |         1
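
One way to get this order is to skip sorting altogether and read the file twice: first to find the highest value for each tag, then to print, in file order, the first row that carries it. The earlier two-pass awk attempt used the padded `$1` as the array key, so `T4` and an indented `T4` count as different tags. The sketch below strips the padding before comparing. It assumes `|` is the only separator, and `keep_max` is a name made up here.

```shell
# keep_max FILE: hypothetical helper, keeps the highest-valued row per tag
# without changing the order of the rows that remain.
keep_max() {
  awk -F'|' '
    # Strip leading and trailing blanks so "  T4  " and "T4  " give the same key.
    function key(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

    # Pass 1: remember the highest column-2 value seen for each tag.
    NR == FNR {
      k = key($1)
      if (!(k in max) || $2 + 0 > max[k]) max[k] = $2 + 0
      next
    }

    # Pass 2: print the first row, in file order, that carries that value.
    {
      k = key($1)
      if ($2 + 0 == max[k] && !(k in seen)) { print; seen[k] = 1 }
    }
  ' "$1" "$1"
}
```

Called as `keep_max fp1.txt`, this prints the matching rows verbatim, so whatever indentation the file has is kept.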