Grep lines that match the exact string

r_magoha · March 24, 2023, 9:55am

OS : RHEL 7.9

In the somestrings.txt file (shown below), I am trying to fetch the lines whose fourth field exactly matches the pattern w_sbc_q.
So, patterns like w_sbc_qkh_x and w_sbc_qkh_p seen below shouldn't be matched.

Hence I used the grep command shown under the "This is what I tried" section.

My expected output:
Only Lines 3 and 4 from somestrings.txt should be returned.

$ cat somestrings.txt
paris:branch7:paris1:w_sbc_qkh_x:manhprd187.domain.net
paris:branch8:paris2:w_sbc_qkh_x:manhprd214.domain.net
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net ## Line 3
madrid:branch14:madrid2:w_sbc_q:manhprd182.domain.net ## Line 4 
madrid:branch13:madrid1:w_sbc_qkh_p:manhprd181.domain.net
madrid:branch14:madrid2:w_sbc_qkh_p:manhprd137.domain.net

This is what I tried, but no luck. I want the grep command below to match only lines with w_sbc_q in the 4th field, but it returns lines with w_sbc_qkh_x and w_sbc_qkh_p too.

$ MYPATTERN=w_sbc_q
$ grep "^[^:]*:[^:]*:[^:]*:[^:]*${MYPATTERN}" somestrings.txt
paris:branch7:paris1:w_sbc_qkh_x:manhprd187.domain.net
paris:branch8:paris2:w_sbc_qkh_x:manhprd214.domain.net
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch14:madrid2:w_sbc_q:manhprd182.domain.net
madrid:branch13:madrid1:w_sbc_qkh_p:manhprd181.domain.net
madrid:branch14:madrid2:w_sbc_qkh_p:manhprd137.domain.net
$

## Tried enclosing ${MYPATTERN} in single quotes. No output
$ grep "^[^:]*:[^:]*:[^:]*:[^:]*'${MYPATTERN}'" somestrings.txt
$
## Tried enclosing *${MYPATTERN} in single quotes. No output
$ grep "^[^:]*:[^:]*:[^:]*:[^:]'*${MYPATTERN}'" somestrings.txt

## Tried removing the asterisk. No output (understandably)
$ grep "^[^:]*:[^:]*:[^:]*:[^:]${MYPATTERN}" somestrings.txt

munkeHoller · March 24, 2023, 10:04am

@r_magoha, you're overthinking it; reading the docs on grep will empower you!

does the following help ...

grep -w 'w_sbc_q' somestrings.txt
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net ## Line 3
madrid:branch14:madrid2:w_sbc_q:manhprd182.domain.net ## Line 4

in awk (probably a 'better' solution)

awk -F':' '$4 ~ /^w_sbc_q$/ { print }' somestrings.txt
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net ## Line 3
madrid:branch14:madrid2:w_sbc_q:manhprd182.domain.net ## Line 4

MadeInGermany · March 24, 2023, 10:31am

awk is the most elegant. $4 is field 4, and the trailing $ says the pattern must be at the end (of field 4). Likewise the initial ^ says it must start at the beginning (of field 4).

Using the default print action and passing a parameter:

awk -v s="^$MYPATTERN"'$' -F':' '$4 ~ s' somestrings.txt
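To see what the ^ and $ anchors contribute, here is a small sketch that counts matches with and without them. The three sample lines are copied from the question; the variable names `sample`, `unanchored` and `anchored` are made up for the illustration.

```shell
# Three lines taken from somestrings.txt in the question.
sample='paris:branch7:paris1:w_sbc_qkh_x:manhprd187.domain.net
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch13:madrid1:w_sbc_qkh_p:manhprd181.domain.net'

MYPATTERN=w_sbc_q

# Without anchors the regex may match anywhere inside field 4.
unanchored=$(printf '%s\n' "$sample" |
  awk -v s="$MYPATTERN" -F':' '$4 ~ s { n++ } END { print n+0 }')

# With ^ and $ the regex has to cover the whole of field 4.
anchored=$(printf '%s\n' "$sample" |
  awk -v s="^$MYPATTERN"'$' -F':' '$4 ~ s { n++ } END { print n+0 }')

printf 'unanchored matches: %s\nanchored matches: %s\n' "$unanchored" "$anchored"
```

The unanchored form counts all three lines, the anchored one only the line whose fourth field is exactly w_sbc_q.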

If you must use grep then you can enforce an ending colon

grep "^[^:]*:[^:]*:[^:]*:[^:]*${MYPATTERN}:" somestrings.txt

And not allowing any in-field characters before the pattern:

grep "^[^:]*:[^:]*:[^:]*:${MYPATTERN}:" somestrings.txt

grep -w w_sbc_q would match a w_sbc_q-dh_p
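A quick way to check the -w behaviour is sketched below. The third sample line, with the hyphenated value, is invented for the test and is not in the original file.

```shell
# Hypothetical sample; the third line is invented to show the hyphen case.
sample='madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch13:madrid1:w_sbc_qkh_p:manhprd181.domain.net
madrid:branch15:madrid3:w_sbc_q-dh_p:manhprd190.domain.net'

# -w only requires non-word characters around the match, and "-" is one of them.
word_hits=$(printf '%s\n' "$sample" | grep -w 'w_sbc_q')

# Requiring the field separators on both sides rejects the hyphenated value.
colon_hits=$(printf '%s\n' "$sample" | grep ':w_sbc_q:')

printf 'grep -w:\n%s\n\ngrep with colons:\n%s\n' "$word_hits" "$colon_hits"
```

With this data, grep -w returns the first and third lines, while the colon-delimited pattern returns only the first.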

Paul_Pedant · March 24, 2023, 11:29am

The grep does match the exact string. What you neglected to specify was what you considered as boundaries to that string. : is a boundary, but k (and in fact any other characters) are not.

grep -F ':w_sbc_q:' would be correct. It is actually irrelevant that the lines appear to be several fields delimited by :. We just need to consider one text with adjacent boundaries.

The -F option (aka --fixed-strings) is an optimisation which says that there are no special characters (regular expression operators) in the search string.
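One caveat worth testing, assuming the value could ever turn up in another column: ':w_sbc_q:' is position-independent, so it also matches a line where the string sits in, say, field 2. The third sample line below is invented to show that, and the awk line is only there as a field-aware comparison.

```shell
# Hypothetical sample; the third line is invented (value in field 2, not field 4).
sample='madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch13:madrid1:w_sbc_qkh_p:manhprd181.domain.net
rome:w_sbc_q:rome1:other:manhprd200.domain.net'

# Fixed-string search: any field surrounded by colons can match.
fixed_hits=$(printf '%s\n' "$sample" | grep -F ':w_sbc_q:')

# Field-aware check for comparison: only field 4 is inspected.
field4_hits=$(printf '%s\n' "$sample" | awk -F: '$4 == "w_sbc_q"')

printf 'grep -F:\n%s\n\nawk on field 4:\n%s\n' "$fixed_hits" "$field4_hits"
```

For the data in the question the two give the same result; they only differ when the string can appear in other fields, or when the wanted field is the first or last one on the line (no colon on one side).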

r_magoha · March 24, 2023, 11:35am

Just Brilliant !

Thank You munkeHoller, MadeinGermany
I will go with the awk solution.

I have a few more questions on the solutions that were posted.

Question 1. In the below awk solution, -v flag says, use variables. In the below case, s is the variable. Am I right ?

Question 2. '$4 ~ s' means do pattern matching on 4th field using the awk's special ~ operator based on the pattern stored in variable s.

awk -v s="^$MYPATTERN"'$' -F':' '$4 ~ s' somestrings.txt

Question 3. Following is from the man page of grep.

-w, --word-regexp
              Select only those lines containing matches that form whole words.  The test is that the matching substring must either be at the beginning of the line, or preceded by a
              non-word  constituent  character.  Similarly, it must be either at the end of the line or followed by a non-word constituent character.  Word-constituent characters are
              letters, digits, and the underscore.

If -w is to match a whole word like w_sbc_q. Then why does it match w_sbc_q-dh_p as well as MIG suggested ? Or did I miss some 'Terms & Conditions' bit mentioned in the above man page excerpt ?

grep -w w_sbc_q would match a w_sbc_q-dh_p

Question 4. About the below solution posted by MadeinGermany using grep , how does colon character change the pattern matching behaviour ?
Is this mentioned in the grep man page ?

grep "^[^:]*:[^:]*:[^:]*:[^:]*${MYPATTERN}:" somestrings.txt

MadeInGermany · March 24, 2023, 11:45am

Q1 Yes
Q2 Yes
Q3 The dash is not a word-constituent character.

Q4 It is simple: the trailing colon must occur at the given position.

r_magoha · March 24, 2023, 12:21pm

Thank You Paul !

One last question.

About the last solution which MIG posted with a description "And not allowing any in-field characters before the pattern"

$ grep "^[^:]*:[^:]*:[^:]*:${MYPATTERN}:" somestrings.txt
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch14:madrid2:w_sbc_q:manhprd182.domain.net

What is an in-field character ?

MadeInGermany · March 24, 2023, 12:27pm

My short name for the [^:]
A character that is not a colon i.e. occurs within the field.

Paul_Pedant · March 24, 2023, 3:02pm

The general idea is to ignore completely the first three fields.

[:] would be a pattern element that matches one :.

The ^ (caret) reverses the set of matching characters, so [^:] matches any single character that is not a :. (^ means something different at the very start of a pattern.)

The * makes the previous operation match zero, one or more times, so it now matches an empty substring, or any substring that does not contain a :. So that is exactly what we think of as a field.

So [^:]*: matches a field and its following separator.

This appears three consecutive times, so they will match anything in the first three fields, because we don't care what they are.

The fourth field must match exactly the value in MYPATTERN, and that must be followed by the last : to ensure there are no other stray characters in the fourth field.

Regular expressions look daunting, but they are usually assembled out of smaller units that can be easily understood.
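The "smaller units" idea can be made literal in the shell. This is only a sketch: the variable names `field` and `regex` are made up, and the sample lines are copied from the question.

```shell
MYPATTERN=w_sbc_q

# One unit: zero or more non-colon characters, then the separator.
field='[^:]*:'

# Three skipped fields, then the wanted value, then the closing colon.
regex="^${field}${field}${field}${MYPATTERN}:"
printf 'regex: %s\n' "$regex"

sample='paris:branch7:paris1:w_sbc_qkh_x:manhprd187.domain.net
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch13:madrid1:w_sbc_qkh_p:manhprd181.domain.net'

hits=$(printf '%s\n' "$sample" | grep "$regex")
printf '%s\n' "$hits"
```

The assembled expression is the same one used in the grep command quoted in the question above, and it selects only the middle line here.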

There is actually a serious flaw in this, especially if the value of MYPATTERN can be entered by a user or from a config file.

What if MYPATTERN was w_sb:c_q? Putting the extra : means we would be testing field 4 for w_sb and field 5 for c_q. If any special character (like * [ ] ( ) ) was in MYPATTERN, it would match something unexpected (if it was still a valid pattern), or make grep exit with an error.

For reasons like this, I try not to use grep and sed where field separators are involved. Awk seems to solve most of my problems in a much cleaner way.
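Both failure modes are easy to reproduce. In the sketch below the second sample line and the two MYPATTERN values are invented for the demonstration; the exact error text depends on the grep implementation, so only the exit status is captured.

```shell
# Hypothetical sample; the second line has a literal dot in field 4.
sample='madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch14:madrid2:w_sbc_.:manhprd182.domain.net'

# A dot in the value is a regex wildcard, so both lines match, not just the literal one.
MYPATTERN='w_sbc_.'
dot_count=$(printf '%s\n' "$sample" | grep -c "^[^:]*:[^:]*:[^:]*:${MYPATTERN}:")

# An unbalanced bracket makes the whole expression invalid; grep reports an error.
MYPATTERN='w_sbc_['
status=0
printf '%s\n' "$sample" |
  grep "^[^:]*:[^:]*:[^:]*:${MYPATTERN}:" >/dev/null 2>&1 || status=$?

printf 'lines matched with dot: %s\ngrep exit status with "[": %s\n' "$dot_count" "$status"
```

An exit status greater than 1 is how grep signals an error, as opposed to 1 for "no lines selected".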

Francois · March 24, 2023, 11:39pm

Thank You all.

As a beginner in awk, I have a few questions on MadeInGermany's solution (posted below)

awk -v s="^$MYPATTERN"'$' -F':' '$4 ~ s' somestrings.txt
          |           ^ ^

Shouldn't the double quotes be ONLY around $MYPATTERN in order for MYPATTERN variable to expand ?
But, MIG included caret sign ^ inside the double quotes. Any particular reason ? I put a pipe symbol underneath the caret sign I am talking about.

Why is the $ sign (special character to match the end of line) in single quotes ? I put 2 caret signs beneath it to point it out.

Below is a variant of MIG's solution. Is this fine ? Or, is it flawed ?

$ awk -F':' -v s=^"$MYPATTERN"$ '$4 ~ s' somestrings.txt
madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net ## Line 3
madrid:branch14:madrid2:w_sbc_q:manhprd182.domain.net ## Line 4

MadeInGermany · March 25, 2023, 7:49am

Should work as well.
This part is evaluated+substituted by the shell.
Test it in an echo command:
echo s=^"$MYPATTERN"$
echo s="^$MYPATTERN"'$'

Paul_Pedant · March 25, 2023, 8:35am

As @MadeInGermany notes, the value of s is passed to awk after shell does the substitution.

The $ introduces an expansion of the following variable. As it is at the end of the value, the expansion cannot happen. That is a bit of an edge case -- the $ could be treated as a plain character or as a failed expansion. It happens that Bash does the former, but it may not be so in all shells, and may confuse maintainers.

A pedant (such as myself) would enclose the $ in single quotes, which explicitly prevents expansion.

The ^ does not have the same ambiguity, but a pedant with an eye for symmetry could write this as s='^'"${MYPATTERN}"'$' without censure.

munkeHoller · March 25, 2023, 9:20am

As is being demonstrated, there are a number of valid ways this could be constructed. Below are a couple of others:

x="^${MYPATTERN}\$"
x="^$MYPATTERN$"

It's good to play around and see the behaviour.
Combine that with checking out the documentation: https://www.gnu.org/software/bash/manual/bash.html
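For anyone who wants to compare the quoting variants side by side, a small sketch follows. The variable names a to d are made up; the unquoted trailing $ in `b` relies on the shell treating it as a plain character, which Bash does.

```shell
MYPATTERN=w_sbc_q

a="^$MYPATTERN"'$'        # caret inside double quotes, dollar in single quotes
b=^"$MYPATTERN"$          # only the variable quoted; trailing $ left bare
c='^'"${MYPATTERN}"'$'    # every literal part in single quotes
d="^${MYPATTERN}\$"       # dollar escaped inside double quotes

# Print each result on its own line for comparison.
printf '%s\n' "$a" "$b" "$c" "$d"
```

In Bash all four lines print the same string, ^w_sbc_q$, so the choice is about readability and portability rather than behaviour.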

EmersonPrado · March 27, 2023, 2:22pm

My 2 cents: since we're looking for a fixed string, we can replace the pattern matching with a simple comparison:

awk -F: '$4 == "w_sbc_q"' somestrings.txt

MadeInGermany · March 27, 2023, 2:26pm

Good point. And the parameter passing simplifies, too:

awk -v s="$MYPATTERN" -F':' '$4 == s' somestrings.txt
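The difference between ~ and == shows up as soon as the value contains a regex character. A minimal sketch, with an invented second sample line and an invented MYPATTERN value containing a dot:

```shell
# Hypothetical sample; the second line has a literal dot in field 4.
sample='madrid:branch13:madrid1:w_sbc_q:manhprd179.domain.net
madrid:branch14:madrid2:w_sbc_.:manhprd182.domain.net'

MYPATTERN='w_sbc_.'

# Regex match: the dot stands for any character, so both lines are selected.
regex_hits=$(printf '%s\n' "$sample" | awk -v s="^$MYPATTERN"'$' -F':' '$4 ~ s')

# String comparison: the dot is just a dot, so only the second line is selected.
exact_hits=$(printf '%s\n' "$sample" | awk -v s="$MYPATTERN" -F':' '$4 == s')

printf 'regex match:\n%s\n\nstring comparison:\n%s\n' "$regex_hits" "$exact_hits"
```

One thing to keep in mind: awk still processes backslash escape sequences in a value passed with -v, so a value containing backslashes would need separate handling.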