Skipping rows based on columns

ramky79 · May 1, 2012, 12:13pm

Hi,
Suppose I have the following file, in which certain rows have missing columns.
How do I skip those rows and create an output file containing only the rows that have all the columns?

E/N       Ko_exp   %err  Ko_calc  %err   diff  diff-  diff+  0.95
        ========  =======  ====  =======  ====  =====  =====  =====  ====
     1      4.00   2.8100   3.0   3.9502   0.5  -1.14  -1.31  -0.97    0
     2      8.00   2.8123   3.0   3.9668   0.5  -1.15  -1.32  -0.98    0
     3     12.00   2.8300   3.0   3.9920   0.5  -1.16  -1.33  -0.99    0
     4     16.00            3.0   4.0201   0.5  -1.18  -1.35  -1.00    0
     5     20.00   2.8700   3.0   4.0473   0.5  -1.18  -1.35  -1.00    0
     6     24.00   2.9007   3.0            0.5  -1.17  -1.34  -0.99    0
     7     28.00   2.9437   3.0   4.0807   0.5  -1.14  -1.31  -0.96    0
     8     32.00   2.9983   3.0   4.0833   0.5  -1.08  -1.27  -0.90    0
     9     36.00   3.0567   3.0   4.0778   0.5  -1.02  -1.21  -0.84    0
    10     40.00   3.1100   3.0   4.0656   0.5  -0.96  -1.14  -0.77    0

I want my output file to look like this:

E/N       Ko_exp   %err  Ko_calc  %err   diff  diff-  diff+  0.95
        ========  =======  ====  =======  ====  =====  =====  =====  ====
     1      4.00   2.8100   3.0   3.9502   0.5  -1.14  -1.31  -0.97    0
     2      8.00   2.8123   3.0   3.9668   0.5  -1.15  -1.32  -0.98    0
     3     12.00   2.8300   3.0   3.9920   0.5  -1.16  -1.33  -0.99    0
     5     20.00   2.8700   3.0   4.0473   0.5  -1.18  -1.35  -1.00    0
     7     28.00   2.9437   3.0   4.0807   0.5  -1.14  -1.31  -0.96    0
     8     32.00   2.9983   3.0   4.0833   0.5  -1.08  -1.27  -0.90    0
     9     36.00   3.0567   3.0   4.0778   0.5  -1.02  -1.21  -0.84    0
    10     40.00   3.1100   3.0   4.0656   0.5  -0.96  -1.14  -0.77    0

guruprasadpr · May 1, 2012, 12:17pm

Hi

$ awk 'NR<=2 || NF==10' file

Guru.

neutronscott · May 1, 2012, 12:18pm

With awk you can print the first two lines, and then only lines which have 10 columns:

awk 'NR<3||NF==10' input

Had to include the header separately, since it has only 9 columns.
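To see why the header needs its own condition, it can help to print NF for each line. A minimal sketch, assuming a hypothetical three-line sample held in a shell variable (single-space separated, not the real file):

```shell
# Hypothetical sample: the 9-column header, one complete row,
# and one row with a missing column.
sample='E/N Ko_exp %err Ko_calc %err diff diff- diff+ 0.95
1 4.00 2.8100 3.0 3.9502 0.5 -1.14 -1.31 -0.97 0
4 16.00 3.0 4.0201 0.5 -1.18 -1.35 -1.00 0'

# Field count per line.
counts=$(printf '%s\n' "$sample" | awk '{print NF}')
printf '%s\n' "$counts"

# Keep line 1 unconditionally, then only rows with all 10 fields.
kept=$(printf '%s\n' "$sample" | awk 'NR<2 || NF==10')
printf '%s\n' "$kept"
```

In this sample the header and the incomplete row have the same field count, so a test on NF alone cannot tell them apart; the NR condition is what keeps the header.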

ramky79 · May 2, 2012, 10:29am

Thank you, Guru and neutronscott.
This is working, but I see that it is skipping rows even though they have 5 fields in them...

See my examples below.

Here are the first 10 lines of my input file:

Country  Postal  Admin4  StreetBaseName  StreetType
HUN      2243    Kóka    Dózsa György   út
HUN      5475    Csépa   4511
HUN      9600    Sárvár  Ady Endre      utca
HUN      8705    Somogyszentpál  Kossuth        utca
HUN      7098    Magyarkeszi     Hősök  tere
HUN      2483    Gárdony
HUN      5100    Jászberény
HUN      5100    Jászberény      Lehel vezér    tér
HUN      5811    Végegyháza      Széchenyi István       út

I have used the following code:

awk 'NR<2||NF==5' HUN1.dat >HUN2.dat

Here are the first 10 lines of my output file:

Country  Postal  Admin4  StreetBaseName  StreetType
HUN      8705    Somogyszentpál  Kossuth        utca
HUN      7098    Magyarkeszi     Hősök  tere
HUN      2310    Szigetszentmiklós       Losonczi       utca
HUN      7142    Pörböly         Óvoda  utca
HUN      4025    Debrecen        Barna  utca
HUN      2040    Budaörs         Farkasréti     utca
HUN      2040    Budaörs         Szabadság      út
HUN      9373    Pusztacsalád    Új     utca
HUN      4262    Nyíracsád       Rákóczi        utca

Lines 1, 3, 9 and 10 are skipped even though they have 5 fields in them.

neutronscott · May 2, 2012, 10:56am

The problem there is: what defines a field? Are those tabs? Line 1 has 6 columns if you split on whitespace, because of the space in "Dózsa György".

If they are tabs: awk -F'\t' 'NF==5'
If they are spaces: awk -F'   *' 'NF==5'
That's 3 spaces before the asterisk, so fields are split on 2 or more spaces.
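The effect of the separator on the field count can be checked directly. A small sketch, assuming a hypothetical row padded with two spaces between columns and a single space inside the street name:

```shell
# Hypothetical row: two-space padding between columns,
# one space inside "Ady Endre".
row='HUN  4025  Debrecen  Ady Endre  utca'

# Default splitting: any run of blanks separates fields,
# so "Ady Endre" counts as two fields.
nf_default=$(printf '%s\n' "$row" | awk '{print NF}')

# Separator of two or more spaces: the single space in the name is kept.
nf_spaces=$(printf '%s\n' "$row" | awk -F'   *' '{print NF}')

echo "default: $nf_default, two-or-more spaces: $nf_spaces"
```

Because the -F value is longer than one character, awk treats it as a regular expression: two spaces followed by zero or more further spaces.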

ramky79 · May 2, 2012, 11:11am

Now the problem is back to square one...
I did try it with -F'\t'; now I see lines with four fields, where the fifth field is empty.

I tried the following code:

awk -F'\t' 'NR<2||NF==5' HUN1.dat >HUN4.dat

Here are the first 10 lines from the result file:

Country  Postal  Admin4  StreetBaseName  StreetType
HUN      2243    Kóka    Dózsa György   út
HUN      5475    Csépa   4511
HUN      9600    Sárvár  Ady Endre      utca
HUN      8705    Somogyszentpál  Kossuth        utca
HUN      7098    Magyarkeszi     Hősök  tere
HUN      2483    Gárdony
HUN      5100    Jászberény
HUN      5100    Jászberény      Lehel vezér    tér
HUN      5811    Végegyháza      Széchenyi István       út

Scrutinizer · May 2, 2012, 11:19am

@ramky, this is because your data sample was not representative of your actual data.
Try:

awk 'NF>4' infile

but that will give false positives for streets consisting of two words and a missing street type, so you would need to manually remove records..
Or you can try to tinker with the -F value like neutronscott suggested..
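The false positive mentioned above is easy to reproduce. A sketch with a hypothetical row that has a two-word street name and no street type:

```shell
# Hypothetical row: two-word street name, street type missing.
row='HUN  5100  Jaszbereny  Lehel vezer'

# With default whitespace splitting the row still has 5 fields...
nf=$(printf '%s\n' "$row" | awk '{print NF}')

# ...so it passes the NF>4 test even though a column is missing.
passes=$(printf '%s\n' "$row" | awk 'NF>4' | wc -l | tr -d ' ')

echo "NF=$nf, lines passing NF>4: $passes"
```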

neutronscott · May 2, 2012, 11:40am

$ echo $'a\t\t\tb' | awk -F'\t' '{print NF}'
4

Ooops. Oh, right. If using tab and the tab is there but the field is blank, we'll need a better test. hmm.. best I can think of:

awk -F'\t' '{for (i=1;i<=NF;i++) if (!length($i)) next}1'
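A sketch of the empty-field test on tab-separated input, looping over every field including the last one (i<=NF). The sample rows are hypothetical, and the approach assumes a tab is present even where a value is missing:

```shell
# Hypothetical tab-separated sample: a complete row, a row with an
# empty last field, and a row with an empty middle field.
tab=$(printf '\t')
sample="HUN${tab}4025${tab}Debrecen${tab}Barna${tab}utca
HUN${tab}5475${tab}Csepa${tab}4511${tab}
HUN${tab}2483${tab}${tab}Kossuth${tab}utca"

# Skip any record that contains an empty field; print the rest.
complete=$(printf '%s\n' "$sample" |
  awk -F'\t' '{ for (i = 1; i <= NF; i++) if (!length($i)) next } 1')
printf '%s\n' "$complete"
```

Note that this only catches fields that are empty between tabs. A row that is short because its trailing tabs are absent would still pass, so adding an NF==5 check may be worthwhile if the real data can look like that.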

Scrutinizer · May 2, 2012, 11:51am

How about: -F'[ \t][ \t]+'

awk -F'[ \t][ \t]+' 'NF>4' infile

--
You can check how many fields it finds to verify that the field separator does the right thing:

$ awk -F'[ \t][ \t]+' '{print NF,$0}' infile
5 Country  Postal  Admin4  StreetBaseName  StreetType
5 HUN      2243    Kóka    Dózsa György   út
4 HUN      5475    Csépa   4511
5 HUN      9600    Sárvár  Ady Endre      utca
5 HUN      8705    Somogyszentpál  Kossuth        utca
5 HUN      7098    Magyarkeszi     Hősök  tere
3 HUN      2483    Gárdony
3 HUN      5100    Jászberény
5 HUN      5100    Jászberény      Lehel vezér    tér
5 HUN      5811    Végegyháza      Széchenyi István       út

Corona688 · May 2, 2012, 11:58am

Shouldn't + there end up requiring two or more tabs per field? I don't know why it doesn't, for you.

neutronscott · May 2, 2012, 12:22pm

I think if the input is what I believe it is, with TAB as the delimiter, you can search for a double TAB (or a TAB at the start or end of the line):

awk '!match($0,/(^|\t)($|\t)/)' input

Scrutinizer · May 2, 2012, 12:24pm

The field separator is requiring 2 or more tabs or spaces, or is that not what you mean?

Corona688 · May 2, 2012, 1:39pm

The field separator, yes.

But I think I see what you're getting at now. Your field sep can match a trailing space instead of a tab...
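One way to read that last remark, sketched with hypothetical two-field input: the separator [ \t][ \t]+ needs at least two blank characters, so a single bare tab does not split, but a space-padded value followed by a tab does.

```shell
tab=$(printf '\t')

# One bare tab between the values: the separator needs two blank
# characters in a row, so the line is not split.
nf_bare=$(printf 'Kossuth%sutca\n' "$tab" |
  awk -F'[ \t][ \t]+' '{print NF}')

# Space padding followed by a tab: the padding plus the tab together
# satisfy the separator, so the line splits into two fields.
nf_padded=$(printf 'Kossuth  %sutca\n' "$tab" |
  awk -F'[ \t][ \t]+' '{print NF}')

echo "bare tab: $nf_bare, padded then tab: $nf_padded"
```

If the real file pads values with spaces before each tab, that would explain why the separator works there without needing two tabs per field; this is an assumption about the data, not something the thread confirms.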