Grep command does not match the complete pattern

sumit.vedi1988 · June 25, 2013, 6:05am

I am facing a problem while using the grep command in a shell script. I have a file (PCF_STARHUB_20130625_1) which contains the records below.

SH_5.55916.00.00.100029_20130601_0001_NUC.csv.gz|438|3556691115 
SH_5.55916.00.00.100029_20130601_0001_Summary.csv.gz|275|3919504621 
SH_5.55916.00.00.100029_20130601_0001_UI.csv.gz|226|593316831
SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234 
SH_5.55916.00.00.100038_20130601_0001_NUC.csv.gz|368|3553014997
SH_5.55916.00.00.100038_20130601_0001_Summary.csv.gz|276|2625719449 
SH_5.55916.00.00.100038_20130601_0001_UI.csv.gz|226|3825232121 
SH_5.55916.00.00.100038_20130601_0001_US.csv.gz|199|2099616349 
SH_5.75470.00.00.100015_20130601_0001_NUC.csv.gz|425|1627227450

I have a pattern stored in a variable (INPUT_FILE_T) and want to search the file (PCF_STARHUB_20130625_1) for it. For that I used the command below:

INPUT_FILE_T="SH?*???????????????US.*" 
grep -h ${INPUT_FILE_T} PCF_STARHUB_20130625_1

The output of the above command is:

SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234

The problem is that only one entry is shown in the output, but it should contain two. The output should look like this:

SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234
SH_5.55916.00.00.100038_20130601_0001_US.csv.gz|199|2099616349

Is there any technique other than grep? Please tell me.
Please help me with this issue.

Don_Cragun · June 25, 2013, 9:44am

The grep utility evaluates basic regular expressions (BREs). Unfortunately, ( SH?*???????????????US.* ) is a filename matching pattern, not a BRE.

To search for lines in a file that match a pattern matching expression, try the following shell script using any shell that recognizes basic Bourne shell syntax (such as ksh and bash):

INPUT_FILE_T="SH?*???????????????US.*"
while IFS='' read -r f
do      case "$f" in
        ($INPUT_FILE_T) printf "%s\n" "$f";;
        esac
done < PCF_STARHUB_20130625_1

Furthermore, since you didn't quote the expansion of $INPUT_FILE_T in your grep command, the shell expanded that variable into a list of matching filenames in the current directory before calling grep; so (assuming that the file PCF_STARHUB_20130625_1 contained a list of some of the files in the current directory) the command that you ran was expanded by the shell to:

grep -h SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234 SH_5.55916.00.00.100038_20130601_0001_US.csv.gz|199|2099616349 PCF_STARHUB_20130625_1

which treated SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234 as a basic regular expression that happens to match itself when looking in the file PCF_STARHUB_20130625_1 and, fortunately, doesn't seem to have matched any lines in the file named SH_5.55916.00.00.100038_20130601_0001_US.csv.gz|199|2099616349 .
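The effect of leaving the expansion unquoted can be reproduced in a scratch directory. This is only a sketch: the directory comes from mktemp and the two filenames are made up for the demonstration, they are not the poster's real files.

```shell
# Sketch: how the shell treats an unquoted variable that holds a
# filename matching pattern. The files created here are hypothetical.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch SH_0123456789abcdef_US.csv SH_fedcba9876543210_US.csv

INPUT_FILE_T="SH?*???????????????US.*"

# Unquoted: the shell replaces the pattern with the matching filenames.
unquoted=$(printf '%s\n' $INPUT_FILE_T)
# Quoted: the pattern is passed to the command unchanged.
quoted=$(printf '%s\n' "$INPUT_FILE_T")

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

With the unquoted form, grep would receive the first filename as its pattern and the remaining names as files to search.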

To use grep instead of a loop in the shell, you could translate the filename matching pattern SH?*???????????????US.* to a corresponding BRE ( SH..*...............US[.].* or more succinctly SH.\{16,\}US[.].* ) and use:

INPUT_FILE_T_BRE="SH.\{16,\}US[.].*"
grep "$INPUT_FILE_T_BRE" PCF_STARHUB_20130625_1

Note that the double quotes in the above grep command are crucial to keep the shell from trying to expand the BRE as a filename matching pattern.
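If the pattern has to stay in filename matching form, the translation can be done mechanically for simple cases. The function below, glob_to_bre, is a hypothetical helper (not a standard utility) and assumes the pattern contains only ?, *, . and ordinary characters, with no bracket expressions or backslashes. It also anchors the result with ^ and $, because a filename matching pattern in a case statement has to match the whole line, whereas grep matches anywhere in the line. The sample file holds a few of the records from the original post.

```shell
# Hypothetical helper: translate a simple filename matching pattern
# (only ?, *, . and ordinary characters) into an anchored BRE.
glob_to_bre() {
    # Escape literal dots first, then map ? to . and * to .*
    printf '^%s$\n' "$(printf '%s\n' "$1" |
        sed -e 's/\./[.]/g' -e 's/?/./g' -e 's/\*/.*/g')"
}

# Sample records copied from the thread, stored in a temporary file.
records=$(mktemp)
cat > "$records" <<'EOF'
SH_5.55916.00.00.100029_20130601_0001_UI.csv.gz|226|593316831
SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234
SH_5.55916.00.00.100038_20130601_0001_US.csv.gz|199|2099616349
SH_5.75470.00.00.100015_20130601_0001_NUC.csv.gz|425|1627227450
EOF

INPUT_FILE_T="SH?*???????????????US.*"
INPUT_FILE_T_BRE=$(glob_to_bre "$INPUT_FILE_T")
matches=$(grep "$INPUT_FILE_T_BRE" "$records")
printf '%s\n' "$matches"
```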

rbatte1 · June 25, 2013, 9:57am

It could be as simple as quoting your search string:-

grep -h "${INPUT_FILE_T}" PCF_STARHUB_20130625_1

It's an odd search string though. From the man page I have on RHEL 6.1, I have:-

A regular expression may be followed by one of several repetition operators:
      ?      The preceding item is optional and matched at most once.
      *      The preceding item will be matched zero or more times.
      +      The preceding item will be matched one or more times.
      {n}    The preceding item is matched exactly n times.
      {n,}   The preceding item is matched n or more times.
      {,m}   The preceding item is matched at most m times.
      {n,m}  The preceding item is matched at least n times, but not more than m times.

So that would mean you are looking for a record that starts (it doesn't have to be at the beginning of the line) with an S, then the H is optional, and then I get confused.

Are you trying to use the ? as a single character each time?
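One caveat on the operators listed above: ? and + act as repetition operators only in extended syntax (grep -E). In grep's default basic syntax an unescaped ? is an ordinary character. A small sketch with made-up sample lines shows the difference:

```shell
# Sketch: the same pattern under basic (default) and extended (-E) syntax.
# The three sample lines are made up for the demonstration.
sample=$(printf '%s\n' 'S' 'SH' 'SH?')

# BRE: ? is an ordinary character, so only the line containing "SH?" matches.
bre_hits=$(printf '%s\n' "$sample" | grep -c 'SH?')
# ERE: ? makes the H optional, so every line containing an S matches.
ere_hits=$(printf '%s\n' "$sample" | grep -c -E 'SH?')

printf 'BRE matches: %s\nERE matches: %s\n' "$bre_hits" "$ere_hits"
```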

I would think a better search string would be more like:-

INPUT_FILE_T="^SH....................................US"

to represent start of line, SH, then any 36 characters (one per dot), then US. The remainder of the line can be ignored.
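A long run of dots is easy to miscount, so the same idea can be written with an interval expression. This is a sketch, assuming the 36 characters that sit between SH and US in the sample records; the temporary file simply holds a few of those records.

```shell
# Sketch: fixed-width match written with a BRE interval instead of 36 dots.
records=$(mktemp)
cat > "$records" <<'EOF'
SH_5.55916.00.00.100029_20130601_0001_Summary.csv.gz|275|3919504621
SH_5.55916.00.00.100029_20130601_0001_US.csv.gz|349|1700116234
SH_5.55916.00.00.100038_20130601_0001_NUC.csv.gz|368|3553014997
SH_5.55916.00.00.100038_20130601_0001_US.csv.gz|199|2099616349
EOF

# ^SH, exactly 36 arbitrary characters, then US; the quotes keep the
# shell from touching the pattern.
INPUT_FILE_T='^SH.\{36\}US'
fixed_width_hits=$(grep "$INPUT_FILE_T" "$records")
printf '%s\n' "$fixed_width_hits"
```

This only works while the part between SH and US keeps the same length; if it can vary, a minimum-length interval such as \{16,\} is the safer choice.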

Do either of these meet your needs?

Robin
Liverpool/Blackburn
UK