Delete complete row according to condition

Gents,

Please can you help me.

In character positions 4-24 the values are sometimes duplicated, and I would like to delete the first occurrence and keep only the last one. The file is not sorted and I cannot sort it, because from column 75 to the end each line carries a time value that increases down the file.

I have a file like this

S  21301.00  21481.00  2               0       915802.1 1846679.3  48.1227 23141
S  21083.00  21397.00  1               0       916712.0 1840909.0  55.7227 42035
S  21081.00  21619.00  2               0       921533.2 1843642.2  72.2227 52203
S  21299.00  22041.00  2               0       927954.1 1853627.3  96.7227 65151
S  21309.00  21861.00  2               0       923928.7 1851604.8  77.3227  2105
S  21313.00  21353.00  2               0       912876.9 1845343.2  36.2227 30120
S  21095.00  21469.00  4               0       918111.1 1842071.9  55.0227 44452
S  21309.00  21861.00  2               0       923411.6 1851708.4  79.2227    40
S  21115.00  21869.00  1               0       926530.0 1847499.1  82.3227    58
S  21321.00  21845.00  1               0       923431.7 1851669.1  79.1227   135
S  21115.00  21871.00  1               0       926560.4 1847521.8  83.3227   153
S  21113.00  21871.00  1               0       926596.1 1847485.5  83.3227   251
S  21115.00  21871.00  1               0       923473.9 1851689.8  77.9227   309
S  21113.00  21873.00  1               0       926640.2 1847501.4  83.2227   403
S  21323.00  21847.00  1               0       923455.8 1851729.7  78.0227   439

and I would like to delete the following lines:

S  21309.00  21861.00  2               0       923928.7 1851604.8  77.3227  2105
S  21115.00  21871.00  1               0       926560.4 1847521.8  83.3227   153

So, my output file should be like this.

S  21301.00  21481.00  2               0       915802.1 1846679.3  48.1227 23141
S  21083.00  21397.00  1               0       916712.0 1840909.0  55.7227 42035
S  21081.00  21619.00  2               0       921533.2 1843642.2  72.2227 52203
S  21299.00  22041.00  2               0       927954.1 1853627.3  96.7227 65151
S  21313.00  21353.00  2               0       912876.9 1845343.2  36.2227 30120
S  21095.00  21469.00  4               0       918111.1 1842071.9  55.0227 44452
S  21309.00  21861.00  2               0       923411.6 1851708.4  79.2227    40
S  21115.00  21869.00  1               0       926530.0 1847499.1  82.3227    58
S  21321.00  21845.00  1               0       923431.7 1851669.1  79.1227   135
S  21113.00  21871.00  1               0       926596.1 1847485.5  83.3227   251
S  21115.00  21871.00  1               0       923473.9 1851689.8  77.9227   309
S  21113.00  21873.00  1               0       926640.2 1847501.4  83.2227   403
S  21323.00  21847.00  1               0       923455.8 1851729.7  78.0227   439

Thanks in advance.

Is there anything unique in the records you want to delete? As soon as you have a unique string, you can use grep to remove those lines.
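For instance, assuming the lines to be dropped contain a string that appears nowhere else (here the trailing time values, in a toy file standing in for yours), inverted fixed-string matching removes them:

```shell
# Hypothetical demo: drop lines containing a string unique to them.
# The leading space in each pattern keeps it from matching inside
# longer numbers such as 30120.
printf '%s\n' 'S 21309.00 2105' 'S 21313.00 30120' 'S 21115.00 153' > sample
grep -vF -e ' 2105' -e ' 153' sample > filtered
```

After this, filtered contains only the line ending in 30120.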

Have a go and let us know how you get on so we can assist more if needed.

Robin

Your directions aren't clear as to what is supposed to happen if the 21 characters starting in column 4 appear in more than two lines. Assuming you just want to keep the last one, this seems to do what you want:

awk '
FNR == NR {
	c[substr($0, 4, 21)]++
	next
}
c[substr($0, 4, 21)]-- == 1' file file

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.


Don Cragun
I tried this:

awk 'FNR == NR {c[substr($0, 4, 21)]++; next} c[substr($0, 4, 21)]-- == 1' file newfile

But it didn't give me any output.

You have to supply the original file twice: the proposal needs to run through it once to count the repetitions, and once more to print the lines while skipping the repeated ones. If you want or need to, redirect stdout to a new file.
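A minimal self-contained sketch of that two-pass idea, using a toy key in field 1 instead of the real substr key:

```shell
# Toy demonstration of the two-pass approach: the first pass counts
# each key, the second pass prints a line only when the count being
# decremented is 1, i.e. only the last occurrence of each key.
printf '%s\n' 'k1 first' 'k2 only' 'k1 last' > demo
awk 'FNR == NR {c[$1]++; next} c[$1]-- == 1' demo demo
```

This prints "k2 only" and "k1 last"; the earlier "k1 first" is skipped.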


Slight variation:

awk '{i=substr($0,4,21)} NR==FNR{A[i]=FNR; next} A[i]==FNR' file file

or perhaps:

awk '{i=$2 FS $3} NR==FNR{A[i]=FNR; next} A[i]==FNR' file file

or

awk '{i=$2 FS $3} A[i]==FNR; NR==FNR{A[i]=FNR}' file file
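The idea behind these variants is to remember, on the first pass, the line number of each key's last occurrence, and then print only the lines whose number matches on the second pass. A toy sketch with a field-1 key:

```shell
# Pass one stores the FNR of each key's last occurrence in A[i];
# pass two prints only the lines whose FNR matches that stored value.
printf '%s\n' 'k1 first' 'k2 only' 'k1 last' > demo2
awk '{i=$1} NR==FNR{A[i]=FNR; next} A[i]==FNR' demo2 demo2
```

Again only "k2 only" and "k1 last" survive.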

@Scrutinizer: nice approach! But shouldn't you include $4 as well, since the requester said character positions 4-24 are the key?

Thanks to all, I appreciate your help; it works fine.

---------- Post updated at 01:55 PM ---------- Previous update was at 01:06 PM ----------

Gents,
Is it possible to modify the script to write the rejected lines to a separate file?

S  21309.00  21861.00  2               0       923928.7 1851604.8  77.3227  2105
S  21115.00  21871.00  1               0       926560.4 1847521.8  83.3227   153

I know I can get it using

grep -vFf newfile oldfile

but I would like to get them directly in a separate output file.

Thanks

Try:

awk '
FNR == NR {
	c[substr($0, 4, 21)]++
	next
}
c[substr($0, 4, 21)]-- == 1 {
	print
	next
}
{	print > "oldfile"
}' file file > newfile
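A toy run of the same keep/reject split (file names are placeholders for your own):

```shell
# Duplicates are counted on the first pass; on the second pass the
# last occurrence of each key goes to stdout (here "kept"), while
# earlier occurrences are written to the "rejected" file.
printf '%s\n' 'k1 first' 'k2 only' 'k1 last' > demo3
awk 'FNR == NR {c[$1]++; next}
c[$1]-- == 1 {print; next}
{print > "rejected"}' demo3 demo3 > kept
```

After this, kept holds "k2 only" and "k1 last", and rejected holds "k1 first".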

Thanks Don Cragun

I did something my own way; it works, but it is much more complicated:

# keep only the lines starting with X
awk '/^X/' $jd"g".xps > tmpX10
# keep the last occurrence of each key in character positions 4-15
awk 'FNR == NR {c[substr($0,4,12)]++; next} c[substr($0,4,12)]-- == 1' tmpX10 tmpX10 > tmpX11
# keep the last occurrence of each key in character positions 20-38
awk 'FNR == NR {c[substr($0,20,19)]++; next} c[substr($0,20,19)]-- == 1' tmpX11 tmpX11 > tmpX12
# first 38 characters of the lines rejected in the previous step
grep -vFf tmpX12 tmpX11 | awk '{print substr($0,1,38)}' > tmpX13
# drop those rejected records from the original selection
grep -vFf tmpX13 tmpX10 > tmpX

I will use your command, it is clearer and more efficient. Thanks a lot for your support.

Oh, yes. Best to use i=substr($0,4,21) everywhere...