Avoiding external utilities

Under past circumstances, I'd be fine running commands like this:

Case 1:

printf '%s\n' "${MASSIVETEXT}" | egrep "i am on the first line"

Case 2:

printf '%s\n' "${MASSIVETEXT}" | egrep -v "i am on the first line"

This works fine. But it calls the external utilities egrep and printf. Yes, printf is actually a utility; I thought it was a shell function.

Nevertheless, here's what I'm trying to do to help me avoid external commands:

MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

case "${MASSIVETEXT}" in
*${mea}*)
        echo "${MASSIVETEXT}"
        break
;;
*${meb}*)
        echo "${MASSIVETEXT}"
        break
;;
esac

The problem here is that when the string in variable "mea" or "meb" is found, the code outputs everything in the $MASSIVETEXT variable.

I only want it to output the line containing the string in either the "mea" or "meb" variable.

It is a good idea to avoid external utilities when possible, but that doesn't mean you should never use them. In fact, they are quite good at what they are supposed to do, and as long as you do not misuse them you are perfectly entitled to use them in your code.

In your case, where you search for a fixed string, you do not need egrep at all. egrep uses a very powerful (and therefore resource-intensive) regex engine, but to search for a fixed string you don't need that. Use fgrep ("fast grep"), which does the same job but with a very limited (and hence resource-saving) matching engine, because it treats the pattern as a fixed string rather than a regular expression.

What you want to do can be done in the shell, but not economically. It would look similar to this:

echo "${LongText}" | while read -r line ; do
     case "$line" in       # note: case matches shell glob patterns, not regexps
          *RegexpA*)
               echo "$line"
               ;;

          *RegexpB*)
               echo "$line"
               ;;

     esac
done

I expect that to be slower than using (any) grep, though. echo, by the way, is a shell builtin if you use bash.

I hope this helps.

bakunin


Well, first, on the usage of fgrep: I don't think you will notice any time difference compared to egrep. The main advantage of fgrep is not speed but convenience: with fgrep, you don't have to escape characters which have a special meaning inside a regexp.

Also, note that egrep and fgrep are obsolete; the recommended forms are

grep -E

and

grep -F

Of course, especially when typing on the command line, fgrep is faster to type than grep -F.
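To illustrate the practical difference between the two: with grep -E an unescaped . matches any character, while grep -F takes the pattern literally. A quick demonstration (the sample strings here are made up):

printf '%s\n' 'version 1.5' 'version 1x5' | grep -E '1.5'
version 1.5
version 1x5

printf '%s\n' 'version 1.5' 'version 1x5' | grep -F '1.5'
version 1.5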

Now to the problem at hand: while I don't see how you can avoid an external utility (unless you write your own shell version of grep, which is possible, but not necessarily faster when it comes to large input), you can, if you are using bash or zsh, at least get rid of the pipe:

grep -F 'i am on the first line' <<<"$MASSIVETEXT"

Hi SkySmart,
To avoid using any external utilities with most shells written since 1985 (including bash and ksh), I would use something more like:

#!/bin/ksh
MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

found=

printf '%s\n' "$MASSIVETEXT" | while read -r line
do
	case "$line" in
	(*$mea*)
		echo "$line"
		exit;;
	(*$meb*)
		found="$line";;
	esac
done
[ -n "$found" ] && printf '%s\n' "$found"

to do what I think you're trying to do. I.e., to search the entire variable contents for a match for $mea before looking for any match for $meb, and to only print the first line in $MASSIVETEXT that matches the appropriate pattern. (Note that this relies on ksh running the last element of a pipeline in the current shell environment; in bash the while loop would run in a subshell and the value assigned to found would be lost, which is one more reason to prefer the here-string variants shown further down.)

Although echo and printf are almost always provided as built-ins in recently developed (i.e., since the 1970s) shells, they are required by the standards to also be available as stand-alone utilities in one of the directories listed in the POSIX-compliant default setting for the PATH environment variable. This is true for all utilities defined by the standards except for the special built-in utilities: break, :, continue, ., eval, exec, exit, export, readonly, return, set, shift, times, trap, and unset. (Beware, however, that not all systems conform to the POSIX requirements. You may find some systems that don't have all required utilities available as stand-alone utilities; read is an example of a utility that is missing on some non-conforming systems.)

To determine whether a given utility is built into your shell, the standards say you can use:

type utility_name...

With ksh (version 93u+) on macOS 10.13.2, the command:

type echo printf set cat

produces the output:

echo is a shell builtin
printf is a shell builtin
set is a special shell builtin
cat is a tracked alias for /bin/cat

while with bash (version 3.2.57) on the same system, the output produced is:

echo is a shell builtin
printf is a shell builtin
set is a shell builtin
cat is /bin/cat

Note that bash doesn't distinguish between regular built-ins and special built-ins like ksh does. Both meet the required specifications in the standards for the output produced by type .
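If you ever want to be sure you are running the stand-alone version of one of these utilities rather than the built-in, you can bypass the shell's normal lookup. A minimal sketch; the /usr/bin paths are typical but system-dependent (on some systems they live in /bin instead):

# invoking by full path skips the shell's built-in
/usr/bin/printf '%s\n' 'this is the stand-alone printf'
# env performs a PATH search for an executable file
env echo 'echo run as an external command'

Since env can only execute files found via PATH, it can never resolve to a shell built-in.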

Hi bakunin,
Note that SkySmart seems to only want to print the 1st line in $MASSIVETEXT that is matched by $mea and, only if no match for that fixed string is found, then print the 1st line in $MASSIVETEXT that matches $meb. The code you suggested will print every line matching either fixed string. Using only standard grep options, the above code may well be faster than fgrep even for medium-sized contents of the variable MASSIVETEXT, especially if neither fixed string is present in the file or if the 1st fixed string appears early in $MASSIVETEXT.

Since SkySmart hasn't told us what OS and shell are being used, we would need to just use standard options, leading to something like:

MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

{	printf '%s\n' "$MASSIVETEXT" | grep -F "$mea" ||
	printf '%s\n' "$MASSIVETEXT" | grep -F "$meb"
} | {	read -r line
	[ -n "$line" ] && printf '%s\n' "$line"
}

which should work with any POSIX-conforming shell and grep utility. On many systems, the following would frequently be much faster, but it depends on the non-standard -m max_count option being supported by the user's grep utility:

MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

printf '%s\n' "$MASSIVETEXT" | grep -F -m 1 "$mea" ||
    printf '%s\n' "$MASSIVETEXT" | grep -F -m 1 "$meb"

Note that if neither string is present in the file, you have to invoke grep twice and read the entire "massive" text twice.
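If that double scan is a concern, the selection can be done in a single pass at the price of a different external utility. A sketch using awk; it assumes the search strings contain no backslashes, since awk -v assignments mangle those:

printf '%s\n' "$MASSIVETEXT" | awk -v a="$mea" -v b="$meb" '
	index($0, a) { print; found = 1; exit }	# 1st line containing a: print it and stop
	!fb && index($0, b) { fb = 1; lineb = $0 }	# remember the 1st line containing b
	END { if (!found && fb) print lineb }'

index() performs fixed-string matching, mirroring the grep -F semantics, and exit in awk still runs the END block, which is why the found flag is needed.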

And the shell being used hasn't been specified either. With a recent bash or ksh, the above scripts could all avoid most of the invocations of printf by using here-strings:

MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

found=

while read -r line
do
	case "$line" in
	(*$mea*)
		echo "$line"
		exit;;
	(*$meb*)
		found="$line";;
	esac
done <<<"$MASSIVETEXT"
[ -n "$found" ] && printf '%s\n' "$found"
and:

MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

{	grep -F "$mea" <<<"$MASSIVETEXT" ||
	grep -F "$meb" <<<"$MASSIVETEXT"
} | {	read -r line
	[ -n "$line" ] && printf '%s\n' "$line"
}
or, with the non-standard -m option:

MASSIVETEXT="i am on the first line
i am on the second line
I am on the third line"

mea="i am on the first line"
meb="i am on the second line"

grep -Fm1 "$mea" <<<"$MASSIVETEXT" || grep -Fm1 "$meb" <<<"$MASSIVETEXT"

Hi rovf,
On every system I've seen, for a large amount of data (which one might assume from a variable named MASSIVETEXT), there is a noticeable difference in performance between grep -F (fastest), grep without -E and without -F (slower), and grep -E (slower still). However, with fixed strings as REs, I don't usually see much difference between plain grep and grep -E. I don't have any experience with where grep -P fits into the speed spectrum on systems whose grep includes support for perl's RE extensions.


You are right. Only now do I notice that I misread that part of his question. My bad. But the point I wanted to make was to use grep (or any other utility, for that matter) where it makes sense. The code was only added for illustration.

bakunin


Interesting point. I tried it with a 300 MB file, using Cygwin grep, searching for a fixed string with various settings. I repeated each run 3 times. Here are the times:

with -F:

0.27s user 0.08s system 91% cpu 0.372 total

0.30s user 0.05s system 93% cpu 0.367 total

0.28s user 0.06s system 95% cpu 0.358 total

without option:

0.22s user 0.14s system 98% cpu 0.362 total

0.31s user 0.05s system 98% cpu 0.363 total

0.22s user 0.12s system 95% cpu 0.358 total

with -E:

0.25s user 0.09s system 93% cpu 0.367 total

0.27s user 0.08s system 95% cpu 0.359 total

0.25s user 0.11s system 98% cpu 0.365 total

Of course, it could be that the amount of data is not massive enough to show a significant difference, or that the I/O overhead of the Cygwin layer is so heavy that it shadows any performance differences. Or GNU grep may be clever enough to recognize that the pattern does not contain any regexp metacharacters and to apply -F behaviour internally.

To evaluate the last hypothesis, I replaced the pattern by one which really contained an extended regexp instead of a fixed string (using a .+ regexp operation). In this case, the times were systematically higher, but not by much:

0.37s user 0.11s system 99% cpu 0.486 total

0.36s user 0.09s system 93% cpu 0.481 total

0.30s user 0.14s system 89% cpu 0.486 total

Perhaps such a test should really be repeated with a file which is several GB in size.
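In case someone wants to repeat this, a minimal sketch for building a larger test file and timing the three variants; the sizes and filler strings are arbitrary, and head -c with a G suffix is a GNU extension:

# build an ~3 GB file from a repeated filler line, with the target string at the end
yes 'i am on some unremarkable line' | head -c 3G > big.txt
printf '%s\n' 'i am on the first line' >> big.txt

# time each variant on identical input
time grep -F 'i am on the first line' big.txt > /dev/null
time grep 'i am on the first line' big.txt > /dev/null
time grep -E 'i am on the first line' big.txt > /dev/null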

Interestingly, in the OP's problem, the whole content of the file was stored in a shell variable, and this makes me wonder where the practical limit is for the content of a variable in, say, bash or ksh...
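A crude way to probe that limit is to keep doubling a variable until the shell gives up; a sketch only, and expect heavy memory use (the shell may simply be killed by the OS rather than failing cleanly):

v=x
while v="$v$v"		# double the string each iteration
do
	echo "${#v}"	# the last length printed is a rough lower bound on the limit
done

In bash and ksh the variable content itself is limited only by available memory, but passing such a variable as an argument to an external command is additionally limited by the kernel's ARG_MAX (see getconf ARG_MAX); built-ins, pipes, and here-strings don't hit that exec-time limit.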
