Counting number of records with string row delimiter

aksforum · October 12, 2011, 4:40pm

Hi,

I have a file like this:
t.txt

f1|_f2|_
f1|_f2|_
f1|_f2|_

The column delimiter is |_ and the row delimiter is |_\n.

I am trying to count the number of records using awk:

$ awk 'BEGIN{FS="|_" ; RS="~~\n"} {n++}END{print n} ' t.txt
7

How can I get this to count 3?

thx
a

durden_tyler · October 12, 2011, 4:48pm

$
$ cat t.txt
f1|_f2|_
f1|_f2|_
f1|_f2|_
$
$ awk 'BEGIN {FS="|_"} {print "Record Number: ", NR, "No. of fields = ", NF}' t.txt
Record Number:  1 No. of fields =  3
Record Number:  2 No. of fields =  3
Record Number:  3 No. of fields =  3
$
$

tyler_durden

Franklin52 · October 12, 2011, 4:54pm

awk 'END{print NR}' file

or:

awk '{n++}END{print n}' file

aksforum · October 12, 2011, 5:04pm

Okay, I mixed that up. Here is the actual text file:

f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~

You can see that field f3 has a newline character in it, but I want ~~\n as the row delimiter, so it should count to 3.

awk 'BEGIN{FS="|_" ; RS="~~\n"} {print NF, n++}END{print n} ' t.txt
5 0
0 1
6 2
0 3
6 4
0 5
2 6
7

Somehow awk doesn't accept multiple characters as field or row delimiters? How do I do that?

thx

Corona688 · October 12, 2011, 5:12pm

awk definitely supports multiple characters as record separators. I tested with your script and your data, and it works even with a crummy busybox awk version.

I think your data's not what you think it is. Did you edit this text file in Windows?

Franklin52 · October 12, 2011, 5:15pm

Try:

awk -F"|_" '{print NF, n}/~~$/{n++}END{print n} ' t.txt

aksforum · October 12, 2011, 5:15pm

Here is the data, copied and pasted from the vi editor:
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~

I don't know whether a copy/paste is going to add anything extra (like \r).

thx

Corona688 · October 12, 2011, 5:17pm

A copy paste definitely isn't going to show us if it's full of \r's, so I really don't care what you copy-pasted it from.

What the data was edited in originally is important though. Did you edit it in windows or did it originate on a Windows machine?

durden_tyler · October 12, 2011, 6:27pm

If your data is in a file called "myfile.txt", then run the following command:

od -bc myfile.txt

and paste its output over here.
The command prints the octal dump of your file contents and will display "\r" characters if they exist in there.
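To show what to look for in that dump, here is a minimal sketch. The file names crlf_demo.txt and unix_demo.txt are made up for the example, and the data is synthetic, not the poster's actual file. If the file was saved with Windows line endings, od shows \r just before each \n, and a record separator of "~~\n" can never match because each record really ends in ~~\r\n.

```shell
# Hypothetical sample file with Windows (CRLF) line endings.
printf 'f1|_f2|_f3\r\n|_f4~~\r\n' > crlf_demo.txt

# od -c prints each byte; a carriage return shows up as \r.
dump=$(od -c crlf_demo.txt)
echo "$dump"

# Assumed cleanup step: strip the carriage returns into a new file.
tr -d '\r' < crlf_demo.txt > unix_demo.txt
clean=$(od -c unix_demo.txt)
echo "$clean"
```

If the first dump shows \r and the second does not, running awk against the cleaned copy is worth a try.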

tyler_durden

alister · October 13, 2011, 12:57am

POSIX-compliant AWK implementations are not required to support multi-character record separators.

In the Linux world, you can usually count on multi-character RS being treated as a regular expression. Busybox, gawk, and mawk behave this way and that mostly covers the AWK implementations you're likely to find on a Linux system.

nawk (aka New AWK aka BWK AWK aka One True AWK), however, does not support that behavior [1]. When RS is a multi-character string, nawk only uses the first character and it is always used literally (it is never a regular expression).
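A quick way to find out which behavior a given awk has is to probe it with known data. This is only a sketch: rs_probe.txt is a hypothetical file name, and the contents are a reconstruction of the sample posted earlier in the thread.

```shell
# Hypothetical sample file: three logical records, each terminated by "~~\n".
printf 'f1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\n' > rs_probe.txt

# Count records using a multi-character RS.
n=$(awk 'BEGIN { RS = "~~\n" } END { print NR }' rs_probe.txt)
echo "$n"
```

A result of 3 means the whole separator was honored. A result of 7 matches the count in the original post and suggests that only the first character, "~", is being used.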

nawk is quite popular outside of the Linux world. It is used by OS X, FreeBSD, NetBSD, and OpenBSD. nawk is also present on Solaris and I wouldn't be surprised if it's present on other proprietary UNIX systems such as HP-UX and AIX.

Footnote:

Although its man page will not admit it, there is one nawk mutant loose in the wild which does treat a multicharacter RS as a regular expression. For details, see the readrec() portion of the following diff: http://cvsweb.netbsd.org/bsdweb.cgi/src/external/historical/nawk/dist/lib.c.diff?r1=1.1&r2=1.2

aksforum:

Okay, I mixed that up. Here is the actual text file:
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~
You can see that field f3 has a newline character in it, but I want ~~\n as the row delimiter, so it should count to 3.
awk 'BEGIN{FS="|_" ; RS="~~\n"} {print NF, n++}END{print n} ' t.txt
5 0
0 1
6 2
0 3
6 4
0 5
2 6
7
Somehow awk doesn't accept multiple characters as field or row delimiters? How do I do that?

thx

Your AWK implementation appears to be using RS="~". If it does not support a multi-character RS (whether as a regular expression or a literal string), you cannot do it (at least not easily).
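If only the record count is needed, one workaround that does not depend on RS at all is to leave RS at its default and count the lines that end in "~~", much like the pattern Franklin52 suggested above. A minimal sketch, assuming "~~" never appears at the end of a line except as the record terminator (count_demo.txt is a hypothetical file name):

```shell
# Hypothetical sample file: three logical records, each ending in "~~" plus newline.
printf 'f1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\n' > count_demo.txt

# With the default RS, every input line is a record; count the ones ending in "~~".
# The "+ 0" makes awk print 0 instead of an empty string when nothing matches.
count=$(awk '/~~$/ { n++ } END { print n + 0 }' count_demo.txt)
echo "$count"   # prints 3
```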

Further, note that even when considering the unintended RS, the field count is wrong. I suspect this is because your field separator, FS="|_", is treated as a regular expression whose behavior is undefined. AWK implementations use the extended regular expression flavor. In that grammar, the pipe is a metacharacter whose meaning is undefined if it is the first character in the expression (among other contexts). You should backslash-escape the pipe.
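As a rough sketch of how the records and fields could be handled without a multi-character RS: the script below glues physical lines together until one ends in "~~", then splits the rebuilt record. Note that it uses the bracket expression "[|]_" rather than a backslash escape; both make the pipe literal, but the bracket form avoids worrying about how many backslashes survive string processing. The file name fs_demo.txt is hypothetical, and the sketch assumes "~~" only ends a line when it terminates a record.

```shell
# Hypothetical sample file: three logical records, each ending in "~~" plus newline.
printf 'f1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\n' > fs_demo.txt

out=$(awk '
{ rec = rec sep $0; sep = "\n" }      # rebuild the logical record line by line
/~~$/ {
    sub(/~~$/, "", rec)               # drop the record terminator
    nf = split(rec, f, "[|]_")        # literal pipe followed by an underscore
    print ++n, nf                     # record number and field count
    rec = ""; sep = ""                # start the next record from scratch
}' fs_demo.txt)
echo "$out"
```

On the sample data this reports four fields for each of the three records, with the embedded newline kept inside the third field.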

Which AWK implementation are you using?

Regards,
Alister