Counting number of records with string row delimiter

Hi,

I have a file like this:
t.txt

f1|_f2|_
f1|_f2|_
f1|_f2|_

where the column delimiter is |_ and the row delimiter is |_\n

I'm trying to count the number of records using awk:

$ awk 'BEGIN{FS="|_" ; RS="~~\n"} {n++}END{print n} ' t.txt
7

How can I get this to count 3?

thx
a

$
$ cat t.txt
f1|_f2|_
f1|_f2|_
f1|_f2|_
$
$ awk 'BEGIN {FS="|_"} {print "Record Number: ", NR, "No. of fields = ", NF}' t.txt
Record Number:  1 No. of fields =  3
Record Number:  2 No. of fields =  3
Record Number:  3 No. of fields =  3
$
$

tyler_durden

awk 'END{print NR}' file

or:

awk '{n++}END{print n}' file

Okay, I mixed that up. Here is the actual text file:

f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~

You can see that field f3 has a newline character in it, but I want ~~\n as the row delimiter, so it should count 3 records.

awk 'BEGIN{FS="|_" ; RS="~~\n"} {print NF, n++}END{print n} ' t.txt
5 0
0 1
6 2
0 3
6 4
0 5
2 6
7

Somehow awk doesn't seem to accept multiple characters as field or row delimiters. How do I do that?

thx

awk definitely supports multiple characters as record separators. I tested with your script and your data, and it even works with a crummy busybox awk version.

I think your data's not what you think it is. Did you edit this text file in Windows?

Try:

awk -F"|_" '{print NF, n}/~~$/{n++}END{print n} ' t.txt

here is the copy pasted data from vi editor
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~
f1|_f2|_f3
|_f4~~

I don't know whether a copy/paste would add anything extra (like \r).

thx

A copy paste definitely isn't going to show us if it's full of \r's, so I really don't care what you copy-pasted it from.

What the data was originally edited in is important, though. Did you edit it in Windows, or did it originate on a Windows machine?

If your data is in a file called "myfile.txt", then run the following command:

od -bc myfile.txt

and paste its output over here.
The command prints the octal dump of your file contents and will display "\r" characters if they exist in there.
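If the dump does show "\r", something along these lines could confirm it and work around it. This is only a sketch: the temporary file and its contents are made up for illustration, and your real file would take the place of "$tmp".

```shell
# Hypothetical sample with Windows (CRLF) line endings.
tmp=$(mktemp)
printf 'f1|_f2~~\r\nf1|_f2~~\r\n' > "$tmp"

# Count the lines that end in a carriage return.
crlf_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$tmp")
echo "lines ending in CR: $crlf_lines"

# Strip the carriage returns, then count lines that end in ~~.
records=$(tr -d '\r' < "$tmp" | awk '/~~$/ { n++ } END { print n + 0 }')
echo "records: $records"

rm -f "$tmp"
```

A non-zero first count means the "~~" is really followed by "\r\n", so an RS of "~~\n" would never match.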

tyler_durden

POSIX-compliant AWK implementations are not required to support multi-character record separators.

In the Linux world, you can usually count on multi-character RS being treated as a regular expression. Busybox, gawk, and mawk behave this way and that mostly covers the AWK implementations you're likely to find on a Linux system.
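On one of those implementations, a sketch like the following should count the three records from the sample data posted earlier. The bracket expression in FS is my own choice to keep the pipe literal; the rest is the original poster's approach.

```shell
# Sample data from the thread: three records, each ending in "~~" plus a newline.
sample='f1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\n'

# Needs an awk that accepts a multi-character RS (gawk, mawk, busybox awk).
count=$(printf "$sample" | awk 'BEGIN { FS = "[|]_"; RS = "~~\n" } END { print NR }')
echo "$count"
```

With gawk, mawk, or busybox awk this should print 3. An awk that only honours the first character of RS will give a different number.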

nawk (aka New AWK aka BWK AWK aka One True AWK), however, does not support that behavior [1]. When RS is a multi-character string, nawk only uses the first character and it is always used literally (it is never a regular expression).

nawk is quite popular outside of the Linux world. It is used by OS X, FreeBSD, NetBSD, and OpenBSD. nawk is also present on Solaris and I wouldn't be surprised if it's present on other proprietary UNIX systems such as HP-UX and AIX.

Footnote:

  1. Although its man page will not admit it, there is one nawk mutant loose in the wild which does treat a multicharacter RS as a regular expression. For details, see the readrec() portion of the following diff: http://cvsweb.netbsd.org/bsdweb.cgi/src/external/historical/nawk/dist/lib.c.diff?r1=1.1&r2=1.2

Your AWK implementation appears to be using RS=~ . If it does not support multi-character RS (whether as a regular expression or a literal string) you cannot do it (at least not easily).
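One workaround that does not depend on RS at all is to leave RS at its default and glue lines together until one ends in "~~". A rough sketch, assuming the terminator always sits at the end of a line as in the sample data (the variable names are just for illustration):

```shell
# Sample data from the thread: each record spans two lines and ends in "~~".
sample='f1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\nf1|_f2|_f3\n|_f4~~\n'

result=$(printf "$sample" | awk '
{
    # Append the current line to the record being assembled.
    rec = (have ? rec "\n" : "") $0
    have = 1
}
/~~$/ {
    # A line ending in "~~" closes the record.
    sub(/~~$/, "", rec)
    nf = split(rec, f, "[|]_")   # split on a literal "|_"
    n++
    print n, nf
    rec = ""
    have = 0
}
END { print "records:", n + 0 }
')
echo "$result"
```

For the sample data this should report four fields for each of the three records, followed by "records: 3". The embedded newline stays inside the third field.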

Further, note that even when considering the unintended RS, the field count is wrong. I suspect this is because your field separator, FS="|_" is set to a regular expression which yields undefined behavior. AWK implementations use the extended regular expression flavor. In that grammar, the pipe is a metacharacter whose meaning is undefined if it is the first character in the expression (among other contexts). You should backslash escape the pipe.
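For illustration, here are two ways to make the pipe literal: a bracket expression, or a backslash escape (doubled inside an awk string so that a single backslash reaches the regular expression). A minimal sketch on one line of the sample data:

```shell
# Bracket expression: "[|]_" matches a literal pipe followed by an underscore.
nf_bracket=$(printf 'f1|_f2|_f3\n' | awk -F'[|]_' '{ print NF }')

# Backslash escape: the awk string "\\|_" becomes the regular expression \|_
nf_escaped=$(printf 'f1|_f2|_f3\n' | awk 'BEGIN { FS = "\\|_" } { print NF }')

echo "$nf_bracket $nf_escaped"
```

Both should report three fields. The bracket form avoids having to think about how many backslashes survive the shell and awk string parsing.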

Which AWK implementation are you using?

Regards,
Alister