Delete duplicate strings in a line

Hi,

I need help removing duplicates in my file. The problem is that I need to delete only one duplicate on each line. The input file is as follows; it is not tab-delimited:

The output should drop the 2nd word (in red) when it duplicates the 1st word (in blue). Other duplicates should remain unchanged. My output should look like this:

I don't know how to do this. I tried, but my attempt also deleted all the other duplicates on those lines. I searched Google too, but most of the results deal with duplicate lines. Please help. Thanks.

Here is an awk solution. Note that the 3rd record in your output file does not match your input and requirement, because its fields do not match. I am assuming that the '&' at the beginning of field 1 is not included when matching against field 2, even though you highlighted it in blue.

awk '{if (substr($1,2,length($1)-1)==$2) print $1,$3,$4,$5; else print $1,$2,$3,$4,$5 }' file.txt
&aff2g0440 aspl2221 nos:scad1 blablablabla
&aff2g0740 aspl5221 nos:scad1 blablablabla
&aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla
&aff9g0020 aspl3391 nos:scad2 blablablabla
awk '$1~$2{$2=x}1' file
$ sed 's/&\([^ ]* \)\1/\&\1/' file
&aff2g0440 aspl2221 nos:scad1 blablablabla
&aff2g0740 aspl5221 nos:scad1 blablablabla 
&aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla
&aff9g0020 aspl3391 nos:scad2 blablablabla

Hi guys,

Thanks so much for your fast responses. I tried all of your code, and Yoda's solved my problem perfectly. mjf, your code worked too, but it deleted some of the strings in my file. I have huge files with many odd entries, and I tried changing your code to see how it goes; I recovered some strings, but others are still missing. Scrutinizer, I have a problem with your code too. Still, I really appreciate your ideas on this. Thanks a lot, guys!
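
A possible reason for the missing strings, offered as a guess since the real file is not shown: `$1~$2` is a regular-expression match, so it is also true when field 2 is only a substring of field 1, whereas `substr($1,2)==$2` needs an exact match. The sketch below compares the two tests; the function name `compare_matches` and the sample lines are made up for illustration.

```shell
# Hypothetical helper: report, per line, whether the regex test ($1 ~ $2)
# and the exact test (substr($1,2) == $2) consider field 2 a duplicate.
compare_matches() {
  awk '{
         regex = ($1 ~ $2) ? "yes" : "no"               # unanchored regex match
         exact = (substr($1, 2) == $2) ? "yes" : "no"   # exact string comparison
         print $2 ": regex=" regex " exact=" exact
       }' "$@"
}

# Made-up sample lines: the second one has a field 2 that is only a prefix.
printf '%s\n' \
  '&aff2g0440 aff2g0440 aspl2221' \
  '&aff2g0440 aff2g aspl2221' |
  compare_matches
# aff2g0440: regex=yes exact=yes
# aff2g: regex=yes exact=no
```

On the second sample line the regex test would blank out a field 2 that is not a true duplicate.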

Hi Yoda,

If possible, can you explain your code here? Thanks.

OR

$ awk 'substr($1,2) == $2{$2=x;$0=$0;$1=$1}1' file

&aff2g0440 aspl2221 nos:scad1 blablablabla
&aff2g0740 aspl5221 nos:scad1 blablablabla
&aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla
&aff9g0020 aspl3391 nos:scad2 blablablabla

Hi Akshay,

Your code works perfectly, thanks. Can you please explain it?

substr($1,2) ---> returns field 1 from its second character onward, so for &aff2g0440 you get aff2g0440. That value is compared for an exact match against field 2; if the condition is true, the block runs:

$2 = x ---> x is not set, so it is an empty string and field 2 becomes empty

$0 = $0 ---> re-splits the record into fields, so the empty field 2 disappears

$1 = $1 ---> rebuilds the record from its fields joined by single spaces, which removes the leftover extra space

finally

}1 ---> 1 is a pattern that is always true (0 would be false), so every line is printed
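
The same steps, spread over several lines with comments, might look like the sketch below. The function name `dedupe_second_field` is made up for illustration, and `""` is written in place of the unset `x`; it is otherwise meant to behave like the one-liner above.

```shell
# Hypothetical wrapper around the one-liner; reads the named files or stdin.
dedupe_second_field() {
  awk 'substr($1, 2) == $2 {   # field 1 minus its first character equals field 2
         $2 = ""               # empty field 2; the record now has two adjacent spaces
         $0 = $0               # re-split the record, so the empty field is dropped
         $1 = $1               # rebuild the record with single-space separators
       }
       1' "$@"                 # always-true pattern: print every record
}

printf '%s\n' \
  '&aff2g0440 aff2g0440 aspl2221 nos:scad1 blablablabla' \
  '&aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla' |
  dedupe_second_field
# &aff2g0440 aspl2221 nos:scad1 blablablabla
# &aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla
```

Lines where the test fails are never modified, so their original spacing is kept as-is.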


Hi Akshay,

Thanks so much! Your explanation is simple and clear.

One more approach with awk.

cat <<eof | awk '{gsub(/\&/,X)}  $1==$2 {$2=X;$1="&"$1}1'
&aff2g0440 aff2g0440 aspl2221 nos:scad1 blablablabla
&aff2g0740 aff2g0740 aspl5221 nos:scad1 blablablabla
&aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla
&aff9g0020 aff9g0020 aspl3391 nos:scad2 blablablabla
eof

Output will be as follows.

&aff2g0440  aspl2221 nos:scad1 blablablabla
&aff2g0740  aspl5221 nos:scad1 blablablabla
aff4g0160 aff4g01600 aspl2251 nos:scad1 blablablabla
&aff9g0020  aspl3391 nos:scad2 blablablabla

Thanks,
R. Singh

@ravinder, that is a UUOC (useless use of cat); compare:

awk ... <<eof
[...]

--
The awk gsub deletes every ampersand on the line (not just the leading ampersand in $1), which happens to work with the given input.
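
One way around that, as a rough sketch rather than a drop-in fix: strip the leading ampersand from a copy of field 1 with an anchored `sub`, so ampersands elsewhere on the line are left alone. The function name `strip_dup_anchored` and the variable `key` are made up for illustration, and the sample line is invented.

```shell
# Hypothetical variant: compare field 2 against a copy of field 1 that has
# only its leading ampersand removed; the record itself is not touched
# unless the two match.
strip_dup_anchored() {
  awk '{
         key = $1
         sub(/^&/, "", key)      # anchored: removes at most one leading ampersand
         if (key == $2) {
           $2 = ""               # drop the duplicate
           $0 = $0               # re-split so the empty field goes away
           $1 = $1               # rejoin with single spaces
         }
       }
       1' "$@"
}

# Invented sample line with an ampersand in a later field.
printf '%s\n' '&aff2g0440 aff2g0440 nos&scad1 blablablabla' | strip_dup_anchored
# &aff2g0440 nos&scad1 blablablabla
```

Because the comparison works on a copy, lines that do not match keep their leading ampersand.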

Thank you, Scrutinizer, for correcting me.

@RavinderSingh13

The output you have shown is not the desired output. Please read what the thread is about, and all the earlier answers, before you reply.