Delete complete row according to condition

Gents,

Please can you help me.

In character positions 4-24 the values are sometimes duplicated, and I would like to delete the first occurrence and keep only the last one. The file is not sorted and I cannot sort it, because from column 75 to the end each line carries a time value that increases down the file.

I have a file like this

S  21301.00  21481.00  2               0       915802.1 1846679.3  48.1227 23141
S  21083.00  21397.00  1               0       916712.0 1840909.0  55.7227 42035
S  21081.00  21619.00  2               0       921533.2 1843642.2  72.2227 52203
S  21299.00  22041.00  2               0       927954.1 1853627.3  96.7227 65151
S  21309.00  21861.00  2               0       923928.7 1851604.8  77.3227  2105
S  21313.00  21353.00  2               0       912876.9 1845343.2  36.2227 30120
S  21095.00  21469.00  4               0       918111.1 1842071.9  55.0227 44452
S  21309.00  21861.00  2               0       923411.6 1851708.4  79.2227    40
S  21115.00  21869.00  1               0       926530.0 1847499.1  82.3227    58
S  21321.00  21845.00  1               0       923431.7 1851669.1  79.1227   135
S  21115.00  21871.00  1               0       926560.4 1847521.8  83.3227   153
S  21113.00  21871.00  1               0       926596.1 1847485.5  83.3227   251
S  21115.00  21871.00  1               0       923473.9 1851689.8  77.9227   309
S  21113.00  21873.00  1               0       926640.2 1847501.4  83.2227   403
S  21323.00  21847.00  1               0       923455.8 1851729.7  78.0227   439

and I would like to delete the following lines:

S  21309.00  21861.00  2               0       923928.7 1851604.8  77.3227  2105
S  21115.00  21871.00  1               0       926560.4 1847521.8  83.3227   153

So, my output file should be like this.

S  21301.00  21481.00  2               0       915802.1 1846679.3  48.1227 23141
S  21083.00  21397.00  1               0       916712.0 1840909.0  55.7227 42035
S  21081.00  21619.00  2               0       921533.2 1843642.2  72.2227 52203
S  21299.00  22041.00  2               0       927954.1 1853627.3  96.7227 65151
S  21313.00  21353.00  2               0       912876.9 1845343.2  36.2227 30120
S  21095.00  21469.00  4               0       918111.1 1842071.9  55.0227 44452
S  21309.00  21861.00  2               0       923411.6 1851708.4  79.2227    40
S  21115.00  21869.00  1               0       926530.0 1847499.1  82.3227    58
S  21321.00  21845.00  1               0       923431.7 1851669.1  79.1227   135
S  21113.00  21871.00  1               0       926596.1 1847485.5  83.3227   251
S  21115.00  21871.00  1               0       923473.9 1851689.8  77.9227   309
S  21113.00  21873.00  1               0       926640.2 1847501.4  83.2227   403
S  21323.00  21847.00  1               0       923455.8 1851729.7  78.0227   439

Thanks in advance.

Is there anything unique in the records you want to delete? As soon as you have a unique string, you can use grep to remove those lines.
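For instance, assuming the lines to be dropped contain a string that appears nowhere else (here the trailing time values, in a toy file standing in for yours), inverted fixed-string matching removes them:

```shell
# Hypothetical demo: drop lines containing a string unique to them.
# The leading space in each pattern keeps it from matching inside
# longer numbers such as 30120.
printf '%s\n' 'S 21309.00 2105' 'S 21313.00 30120' 'S 21115.00 153' > sample
grep -vF -e ' 2105' -e ' 153' sample > filtered
```

After this, filtered contains only the line ending in 30120.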

Have a go and let us know how you get on so we can assist more if needed.

Robin

Your directions aren't clear as to what is supposed to happen if the 21 characters starting in column 4 appear in more than two lines. Assuming you just want to keep the last one, this seems to do what you want:

awk '
FNR == NR {
	c[substr($0, 4, 21)]++
	next
}
c[substr($0, 4, 21)]-- == 1' file file

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.


Don Cragun
I tried this:

awk 'FNR == NR {c[substr($0, 4, 21)]++; next} c[substr($0, 4, 21)]-- == 1' file newfile

But it didn't give me any output.

You have to supply the original file twice: the proposal needs to run through it once to count the repetitions, and once more to print the lines while skipping the repeated ones. If you want or need to, redirect stdout to a new file.
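A minimal self-contained sketch of that two-pass idea, using a toy key in field 1 instead of the real substr key:

```shell
# Toy demonstration of the two-pass approach: the first pass counts
# each key, the second pass prints a line only when the count being
# decremented is 1, i.e. only the last occurrence of each key.
printf '%s\n' 'k1 first' 'k2 only' 'k1 last' > demo
awk 'FNR == NR {c[$1]++; next} c[$1]-- == 1' demo demo
```

This prints "k2 only" and "k1 last"; the earlier "k1 first" is skipped.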


Slight variation:

awk '{i=substr($0,4,21)} NR==FNR{A[i]=FNR; next} A[i]==FNR' file file

or perhaps:

awk '{i=$2 FS $3} NR==FNR{A[i]=FNR; next} A[i]==FNR' file file

or

awk '{i=$2 FS $3} A[i]==FNR; NR==FNR{A[i]=FNR}' file file
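The idea behind these variants is to remember, on the first pass, the line number of each key's last occurrence, and then print only the lines whose number matches on the second pass. A toy sketch with a field-1 key:

```shell
# Pass one stores the FNR of each key's last occurrence in A[i];
# pass two prints only the lines whose FNR matches that stored value.
printf '%s\n' 'k1 first' 'k2 only' 'k1 last' > demo2
awk '{i=$1} NR==FNR{A[i]=FNR; next} A[i]==FNR' demo2 demo2
```

Again only "k2 only" and "k1 last" survive.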

@Scrutinizer: nice approach! But shouldn't you include $4 as well, since the requester said character positions 4-24 are the key?

Thanks to all, I appreciate your help; it works fine.

---------- Post updated at 01:55 PM ---------- Previous update was at 01:06 PM ----------

Gents,
Is it possible to modify the script to write the rejected lines to a separate file?

S  21309.00  21861.00  2               0       923928.7 1851604.8  77.3227  2105
S  21115.00  21871.00  1               0       926560.4 1847521.8  83.3227   153

I know I can get it using

grep -vFf newfile oldfile

but I would like to get them directly in a separate output file.

Thanks

Try:

awk '
FNR == NR {
	c[substr($0, 4, 21)]++
	next
}
c[substr($0, 4, 21)]-- == 1 {
	print
	next
}
{	print > "oldfile"
}' file file > newfile
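A toy run of the same keep/reject split (file names are placeholders for your own):

```shell
# Duplicates are counted on the first pass; on the second pass the
# last occurrence of each key goes to stdout (here "kept"), while
# earlier occurrences are written to the "rejected" file.
printf '%s\n' 'k1 first' 'k2 only' 'k1 last' > demo3
awk 'FNR == NR {c[$1]++; next}
c[$1]-- == 1 {print; next}
{print > "rejected"}' demo3 demo3 > kept
```

After this, kept holds "k2 only" and "k1 last", and rejected holds "k1 first".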

Thanks Don Cragun

I did something my own way; it works, but it is much more complicated:

# keep only the lines starting with X
awk '/^X/' $jd"g".xps > tmpX10
# keep the last occurrence of each key in character positions 4-15
awk 'FNR == NR {c[substr($0,4,12)]++; next} c[substr($0,4,12)]-- == 1' tmpX10 tmpX10 > tmpX11
# keep the last occurrence of each key in character positions 20-38
awk 'FNR == NR {c[substr($0,20,19)]++; next} c[substr($0,20,19)]-- == 1' tmpX11 tmpX11 > tmpX12
# first 38 characters of the lines rejected in the previous step
grep -vFf tmpX12 tmpX11 | awk '{print substr($0,1,38)}' > tmpX13
# drop those rejected records from the original selection
grep -vFf tmpX13 tmpX10 > tmpX

I will use your command, it is clearer and more efficient. Thanks a lot for your support.

Oh, yes. Best to use i=substr($0,4,21) everywhere...