Hi,
Suppose I have the following file, in which certain rows have missing columns.
How do I skip these rows and create an output file in which every row has all the columns?
E/N Ko_exp %err Ko_calc %err diff diff- diff+ 0.95
======== ======= ==== ======= ==== ===== ===== ===== ====
1 4.00 2.8100 3.0 3.9502 0.5 -1.14 -1.31 -0.97 0
2 8.00 2.8123 3.0 3.9668 0.5 -1.15 -1.32 -0.98 0
3 12.00 2.8300 3.0 3.9920 0.5 -1.16 -1.33 -0.99 0
4 16.00 3.0 4.0201 0.5 -1.18 -1.35 -1.00 0
5 20.00 2.8700 3.0 4.0473 0.5 -1.18 -1.35 -1.00 0
6 24.00 2.9007 3.0 0.5 -1.17 -1.34 -0.99 0
7 28.00 2.9437 3.0 4.0807 0.5 -1.14 -1.31 -0.96 0
8 32.00 2.9983 3.0 4.0833 0.5 -1.08 -1.27 -0.90 0
9 36.00 3.0567 3.0 4.0778 0.5 -1.02 -1.21 -0.84 0
10 40.00 3.1100 3.0 4.0656 0.5 -0.96 -1.14 -0.77 0
I want my output file to look like this:
E/N Ko_exp %err Ko_calc %err diff diff- diff+ 0.95
======== ======= ==== ======= ==== ===== ===== ===== ====
1 4.00 2.8100 3.0 3.9502 0.5 -1.14 -1.31 -0.97 0
2 8.00 2.8123 3.0 3.9668 0.5 -1.15 -1.32 -0.98 0
3 12.00 2.8300 3.0 3.9920 0.5 -1.16 -1.33 -0.99 0
5 20.00 2.8700 3.0 4.0473 0.5 -1.18 -1.35 -1.00 0
7 28.00 2.9437 3.0 4.0807 0.5 -1.14 -1.31 -0.96 0
8 32.00 2.9983 3.0 4.0833 0.5 -1.08 -1.27 -0.90 0
9 36.00 3.0567 3.0 4.0778 0.5 -1.02 -1.21 -0.84 0
10 40.00 3.1100 3.0 4.0656 0.5 -0.96 -1.14 -0.77 0
Hi
$ awk 'NR<=2 || NF==10' file
Guru.
With awk you can print the first two lines, and then only lines which have 10 columns:
awk 'NR<3||NF==10' input
Had to include the header separately, since it has only 9 columns.
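For anyone who wants to try this on a small scale first, here is a minimal sketch on a cut-down copy of the table, assuming the columns are separated by plain whitespace. The file names sample.txt, clean.txt and skipped.log are made up for illustration. It also reports the dropped line numbers on stderr, which makes it easier to check that nothing was skipped by accident.

```shell
# Build a small sample: a 9-column header, a 9-column ruler and three data rows.
# Data row "4" is missing its Ko_exp value, so it has only 9 fields.
printf '%s\n' \
  'E/N Ko_exp %err Ko_calc %err diff diff- diff+ 0.95' \
  '======== ======= ==== ======= ==== ===== ===== ===== ====' \
  '3 12.00 2.8300 3.0 3.9920 0.5 -1.16 -1.33 -0.99 0' \
  '4 16.00 3.0 4.0201 0.5 -1.18 -1.35 -1.00 0' \
  '5 20.00 2.8700 3.0 4.0473 0.5 -1.18 -1.35 -1.00 0' > sample.txt

# Keep the two header lines unconditionally, then only rows with 10 fields.
# Dropped line numbers go to stderr so they stay out of the output file.
awk 'NR<=2 || NF==10 {print; next}
     {print "skipped line " NR " (" NF " fields)" > "/dev/stderr"}' \
    sample.txt > clean.txt 2> skipped.log

cat clean.txt
cat skipped.log
```

Writing to "/dev/stderr" from awk works with the common awk implementations on Linux; on other systems that part may need adjusting.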
Thank you Guru and neutronscott.
This is working, but I see that it is skipping rows even though they have 5 fields in them.
See my examples below.
Here are the first 10 lines of my input file:
Country Postal Admin4 StreetBaseName StreetType
HUN 2243 Kóka Dózsa György út
HUN 5475 Csépa 4511
HUN 9600 Sárvár Ady Endre utca
HUN 8705 Somogyszentpál Kossuth utca
HUN 7098 Magyarkeszi Hősök tere
HUN 2483 Gárdony
HUN 5100 Jászberény
HUN 5100 Jászberény Lehel vezér tér
HUN 5811 Végegyháza Széchenyi István út
I have used the following code:
awk 'NR<2||NF==5' HUN1.dat >HUN2.dat
Here are the first 10 lines of my output file:
Country Postal Admin4 StreetBaseName StreetType
HUN 8705 Somogyszentpál Kossuth utca
HUN 7098 Magyarkeszi Hősök tere
HUN 2310 Szigetszentmiklós Losonczi utca
HUN 7142 Pörböly Óvoda utca
HUN 4025 Debrecen Barna utca
HUN 2040 Budaörs Farkasréti utca
HUN 2040 Budaörs Szabadság út
HUN 9373 Pusztacsalád Új utca
HUN 4262 Nyíracsád Rákóczi utca
Lines 1, 3, 9 and 10 are skipped even though they have 5 fields in them.
The problem there is: what defines a field? Are those tabs? Line 1 has 6 columns if you use a space delimiter, because of the space in "Dózsa György".
If they are tabs: awk -F'\t' 'NF==5'
If they are spaces: awk -F'   *' 'NF==5'
That's 3 spaces before the asterisk, so fields are split on runs of 2 or more spaces.
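To see how much the choice of separator changes the field count, here is a small sketch. The sample records are hypothetical (the thread does not show which delimiters the real file uses) and the accents are dropped to keep the example independent of the locale.

```shell
# One record with tab-separated fields; the fourth field contains a space.
tab_line=$(printf 'HUN\t2243\tKoka\tDozsa Gyorgy\tut')
# The same record with fields separated by two spaces instead of tabs.
space_line='HUN  2243  Koka  Dozsa Gyorgy  ut'

# Default splitting: any run of blanks separates fields, so the space inside
# "Dozsa Gyorgy" creates an extra field.
nf_default=$(printf '%s\n' "$tab_line" | awk '{print NF}')

# Tab as the only separator: the embedded space stays inside its field.
nf_tab=$(printf '%s\n' "$tab_line" | awk -F'\t' '{print NF}')

# Two or more spaces as separator (three spaces, then the asterisk).
nf_spaces=$(printf '%s\n' "$space_line" | awk -F'   *' '{print NF}')

echo "default: $nf_default, tab: $nf_tab, two-or-more spaces: $nf_spaces"
```

Only the default splitting miscounts here; which of the other two fits depends on what the file actually contains.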
Now the problem is back to square one...
I did try it with -F'\t'; now I see lines with four fields where the fifth field is empty.
I have tried the following code:
awk -F'\t' 'NR<2||NF==5' HUN1.dat >HUN4.dat
Here are the first 10 lines from the result file:
Country Postal Admin4 StreetBaseName StreetType
HUN 2243 Kóka Dózsa György út
HUN 5475 Csépa 4511
HUN 9600 Sárvár Ady Endre utca
HUN 8705 Somogyszentpál Kossuth utca
HUN 7098 Magyarkeszi Hősök tere
HUN 2483 Gárdony
HUN 5100 Jászberény
HUN 5100 Jászberény Lehel vezér tér
HUN 5811 Végegyháza Széchenyi István út
@ramky, this is because your data sample was not representative of your actual data.
Try:
awk 'NF>4' infile
but that will give false positives for streets consisting of two words and a missing street type, so you would need to remove those records manually.
Or you can try to tinker with the -F value as neutronscott suggested.
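The false positive mentioned above is easy to reproduce. A minimal sketch, using a made-up row whose street name has two words and whose street type is missing:

```shell
# Hypothetical row: country, postal code, town, two-word street name, no street type.
row='HUN 9600 Sarvar Ady Endre'

# With default whitespace splitting the row still has 5 fields, so NF>4 keeps it.
kept=$(printf '%s\n' "$row" | awk 'NF>4')

echo "kept: $kept"
```

The row passes the test although one column is missing, which is why the result would still need checking by hand.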
$ echo $'a\t\t\tb' | awk -F'\t' '{print NF}'
4
Oops. Oh, right. If the delimiter is a tab and the tab is there but the field is blank, NF still counts the empty field, so we'll need a better test. Hmm, the best I can think of:
awk -F'\t' '{for (i=1;i<=NF;i++) if (!length($i)) next}1'
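A slightly stricter variant, as a sketch only: it assumes the file really is tab-separated and that a complete row has exactly 5 columns. The file name hun_sample.tsv and its contents are made up for the test.

```shell
# Hypothetical tab-separated sample: a header, one complete row, one row with
# empty 4th and 5th fields, and one row with an empty 4th field.
printf 'Country\tPostal\tAdmin4\tStreetBaseName\tStreetType\n'  > hun_sample.tsv
printf 'HUN\t9600\tSarvar\tAdy Endre\tutca\n'                  >> hun_sample.tsv
printf 'HUN\t2483\tGardony\t\t\n'                               >> hun_sample.tsv
printf 'HUN\t5475\tCsepa\t\t4511\n'                             >> hun_sample.tsv

# Keep a line only if it has exactly 5 fields and none of them is empty.
complete=$(awk -F'\t' '
  NF != 5 { next }                      # wrong number of columns
  { for (i = 1; i <= NF; i++)           # check every field, including the last
      if (length($i) == 0) next         # empty field: drop the whole line
    print }' hun_sample.tsv)

printf '%s\n' "$complete"
```

Only the header and the Sarvar row survive. The explicit NF check also catches rows where trailing tabs are missing altogether.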
How about: -F'[ \t][ \t]+'
awk -F'[ \t][ \t]+' 'NF>4' infile
--
You can check how many fields it finds to verify that the field separator does the right thing:
$ awk -F'[ \t][ \t]+' '{print NF,$0}' infile
5 Country Postal Admin4 StreetBaseName StreetType
5 HUN 2243 Kóka Dózsa György út
4 HUN 5475 Csépa 4511
5 HUN 9600 Sárvár Ady Endre utca
5 HUN 8705 Somogyszentpál Kossuth utca
5 HUN 7098 Magyarkeszi Hősök tere
3 HUN 2483 Gárdony
3 HUN 5100 Jászberény
5 HUN 5100 Jászberény Lehel vezér tér
5 HUN 5811 Végegyháza Széchenyi István út
Shouldn't the + there end up requiring two or more tabs per field? I don't know why it doesn't for you.
I think if the input is as I believe, with TAB as the delimiter, you can search for an empty field, i.e. a double TAB (or a TAB at the start or end of the line):
awk '!match($0,/(^|\t)($|\t)/)' input
The field separator is requiring 2 or more tabs or spaces, or is that not what you mean?
The field separator, yes.
But I think I see what you're getting at now. Your field sep can match a trailing space instead of a tab...
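To make the behaviour of that separator concrete, here is a small sketch with three made-up records. It only shows how the regex splits them; the real file may be laid out differently.

```shell
# Single tabs between fields: the separator needs two blanks, so nothing splits.
single_tab=$(printf 'a\tb\tc\n' | awk -F'[ \t][ \t]+' '{print NF}')

# An empty middle field: the two adjacent tabs are taken as ONE separator,
# so the empty field disappears and the count drops from 3 to 2.
empty_middle=$(printf 'a\t\tc\n' | awk -F'[ \t][ \t]+' '{print NF}')

# Fields padded with trailing spaces before each tab: space plus tab satisfies
# the separator, so the line splits into 3 fields.
padded=$(printf 'a  \tb \tc\n' | awk -F'[ \t][ \t]+' '{print NF}')

echo "single tab: $single_tab, empty middle: $empty_middle, padded: $padded"
```

So with this separator a missing column lowers NF instead of showing up as an empty field, but only if the fields are separated by at least two blanks in the first place.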