Search for a pattern

DNAx86 · February 21, 2008, 11:58am

I want to write a command that searches a list of files and reports the files that contain a pattern a number of times between Min and Max.

But I don't know how to do it.

Can anyone give me advice?

bobbygsk · February 21, 2008, 12:25pm

Can you please be more precise? What is the input and what is the output?

joeyg · February 21, 2008, 12:29pm

pattern="abcdefg"
for zf in *.dat; do
   # count the lines of this file that contain the pattern
   numpat=$(cat "$zf" | grep "$pattern" | wc -l)

   # not sure what you want to do next
   #    perhaps verify $numpat is in a range min-max ?

done

If you provide some sample text and output, it might be easier to take this further.

DNAx86 · February 21, 2008, 3:31pm

I want to write a script, with these arguments:

the pattern to find in a file (a text file)
the path where the files to analyse are located
the minimum and the maximum number of times the pattern is found.

I know how to handle the arguments, but I don't know how to write the "core" of the script: the part where grep (or some other command) is used in a pipeline with whatever other commands are needed.

The most difficult part of the script is the one in bold.

What I have written is:

egrep  -l '{$min,$max}  $pattern_to_find' $name_of_file

but it doesn't work.

PS: I don't speak English very well, so please be patient.

joeyg · February 21, 2008, 3:58pm

datafiles used:

> cat sctfile1.dat
lalalala good file
lalblala good file
lalalalb good file
lalalbla good file

> cat sctfile2.dat
lblblblb good file
lblblclb good file
lblclblb good file
lblclblb good file
lblclblb good file
lblclblb good file
lclblblb bad file

> cat sctfile3.dat
lblblblb good file
lblblclb good file
lblclblb good file
lblclblb good file
lblclblb good file
lblclblb good file
lblclblb good file
lblclblb good file
lblclblb good file
lclblblb bad file
lblblclb good file
lblblclb good file
lblblclb good file
lblblclb good file

script program is

#! /bin/bash
pattern="good"
rm sctfile.raw 2>/dev/null
for zf in *.dat; do
   numpat=$(cat $zf | grep "$pattern" | wc -l)
   echo $numpat","$zf >>sctfile.raw
done
sort -nr sctfile.raw >sctfile.srt

and the output from the sorted file is

13,sctfile3.dat
6,sctfile2.dat
4,sctfile1.dat

From here, you could use the head or tail command on sctfile.srt to find the file with the maximum or minimum count.
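
Since the original question asks for files whose count lies between a minimum and a maximum, the counts can also be compared directly instead of sorted. A minimal sketch, assuming a hypothetical helper named files_in_range whose name and argument order are invented here for illustration; like the script above, it counts matching lines, not individual occurrences:

```shell
# files_in_range PATTERN MIN MAX FILE...
# Prints "count,file" for every file whose number of matching LINES
# lies between MIN and MAX (inclusive).
# The function name and argument order are illustrative assumptions.
files_in_range() {
    pattern=$1 min=$2 max=$3
    shift 3
    for f in "$@"; do
        # skip anything that is not a regular file (e.g. an unmatched glob)
        [ -f "$f" ] || continue
        # grep -c prints the number of matching lines, like "grep | wc -l"
        n=$(grep -c -e "$pattern" "$f")
        if [ "$n" -ge "$min" ] && [ "$n" -le "$max" ]; then
            echo "$n,$f"
        fi
    done
}
```

It could be called as: files_in_range good 5 10 *.dat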

DNAx86 · February 21, 2008, 4:41pm

Thank you, I tried your code and it works in my script, but I realized that if the pattern appears several times on the same line it is counted only once. Is there a way to improve the script?

joeyg · February 21, 2008, 5:14pm

I am not sure I have the sed command correct, but I think you get the idea. I marked the changed lines with a * character. Because the file will now have only one keyword per line, the wc -l command should give a truer count.

Good luck!

#! /bin/bash
pattern="good"
rm sctfile.raw 2>/dev/null
for zf in *.dat; do

sed 's/$pattern/$pattern \n/g' $zf >tempf
# numpat=$(cat $zf | grep "$pattern" | wc -l)
numpat=$(cat tempf | grep "$pattern" | wc -l)

echo $numpat","$zf >>sctfile.raw
done
sort -nr sctfile.raw >sctfile.srt
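
One likely reason the sed line has no effect: inside single quotes the shell does not expand $pattern, so sed searches for the literal text "$pattern". In addition, \n in the replacement part is not handled the same way by every sed. A minimal sketch that avoids both problems, assuming a grep that supports the -o option (GNU and BSD grep do, but it is not required by POSIX); count_occurrences is a hypothetical helper name:

```shell
# count_occurrences PATTERN FILE
# Prints the total number of non-overlapping matches of PATTERN in FILE,
# so several matches on the same line are each counted.
# Assumes grep supports -o (not guaranteed by POSIX).
count_occurrences() {
    # -o prints every match on its own line; wc -l counts those lines.
    # tr strips the padding that some wc implementations add.
    grep -o -e "$1" "$2" | wc -l | tr -d ' '
}
```

Inside the loop this would replace the sed and grep lines, for example: numpat=$(count_occurrences "$pattern" "$zf")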

DNAx86 · February 22, 2008, 8:35am

Sorry, I forgot to say that the script has to find the pattern not only in the main directory but also in its subdirectories.

joeyg · February 22, 2008, 8:42am

Work with the sample code above for one directory. Once that works, all you need to do is wrap a bigger loop around the existing code so that it looks through the subdirectories.

So, make sure this is doing what you want within one directory first - before making the process more complicated!
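
The "bigger loop" can also be left to find, which walks the subdirectories by itself. A minimal sketch, assuming the same *.dat naming as the sample files and file names without embedded newlines; count_in_tree is a hypothetical helper name, and it counts matching lines just like the one-directory script:

```shell
# count_in_tree PATTERN DIR
# Prints "count,path" for every *.dat file under DIR, subdirectories included.
# Assumes file names contain no newline characters.
count_in_tree() {
    find "$2" -type f -name '*.dat' | while IFS= read -r f; do
        # number of lines in this file that contain the pattern
        n=$(grep -c -e "$1" "$f")
        echo "$n,$f"
    done
}
```

For example: count_in_tree good . | sort -nr >sctfile.srt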

DNAx86 · February 22, 2008, 11:30am

Hi,

I don't understand why the NEW code you posted doesn't work:

* sed 's/$pattern_da_cercare/$pattern_da_cercare \n/g' $zf >tempf
# numpat=$(cat $zf | grep "$pattern_da_cercare" | wc -l)
* numpat=$(cat tempf | grep "$pattern_da_cercare" | wc -l)

The shell returns these error messages:
./conta_occorrenze: line 158: Desktop: command not found
./conta_occorrenze: line 160: Desktop: command not found

joeyg · February 22, 2008, 11:43am

I inserted the * characters to help highlight two lines of code. Delete that character from your script and try again.

sed 's/$pattern_da_cercare/$pattern_da_cercare \n/g' $zf >tempf
# numpat=$(cat $zf | grep "$pattern_da_cercare" | wc -l)
numpat=$(cat tempf | grep "$pattern_da_cercare" | wc -l)

After the script has run, you should be able to look at tempf and see only one pattern on each line. If not, then something is wrong with the sed command and the \n. I have seen some quirky behavior when inserting \n (the newline character) on some systems and shells.

Franklin52 · February 22, 2008, 12:55pm

This should be faster, and it lets you search for files that contain a pattern a number of times between MIN and MAX:

#!/bin/sh

MIN=2
MAX=7
PATTERN="XXX"

for file in *.dat
do
  awk -v min=$MIN -v max=$MAX -v patt=$PATTERN '
  BEGIN {RS=" |\n"}
  $0 == patt {n++}
  END {
  if(n >= min && n <= max) {print FILENAME " : " n}
  }
  ' "$file"
done

Use nawk or /usr/xpg4/bin/awk on Solaris
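
Note that the loop above counts only whole whitespace-separated words equal to the pattern, and that it relies on a regular expression in RS, which not every awk accepts (POSIX leaves a multi-character RS unspecified). A hedged alternative sketch that counts every non-overlapping match, including several on one line, by using gsub; the function name awk_range and its argument order are assumptions made for illustration, and the pattern is treated as a regular expression:

```shell
# awk_range PATTERN MIN MAX FILE...
# Prints "file : count" for each file whose number of matches of PATTERN
# (a regular expression) lies between MIN and MAX, inclusive.
# Name and argument order are illustrative assumptions.
awk_range() {
    patt=$1 min=$2 max=$3
    shift 3
    for file in "$@"; do
        awk -v min="$min" -v max="$max" -v patt="$patt" '
            BEGIN { n = 0 }
            # gsub returns how many replacements it made; replacing each
            # match with itself ("&") counts matches on the current line
            { n += gsub(patt, "&") }
            END { if (n >= min && n <= max) print FILENAME " : " n }
        ' "$file"
    done
}
```

For example: awk_range XXX 2 7 *.dat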

Regards

DNAx86 · February 22, 2008, 4:32pm

The two scripts don't work...

I thought I had to install nawk, so I installed it, but it's the same story: the shell gives me an error.

Franklin52, this is the part of the code that gives the error:

MIN=2
MAX=7
for file in *.txt
do
  awk -v min=$MIN -v max=$MAX -v patt=$pattern_da_cercare '
  BEGIN {RS=" |\n"}
  $0 == patt {n++}
  END {
  if(n >= min && n <= max) {print FILENAME " : " n}
  }
  ' "$file"


exit 0
done

The error messages are:
./conta_occorrenze: line 157: syntax error near unexpected token `exit'
./conta_occorrenze: line 157: `exit 0'

The strange thing is that line 157 is the one in bold, but the line with "exit" is line 167.

@ joeyg
I installed nawk on my Mac OS X system, but your code doesn't work.
So I tried it on Ubuntu, and it doesn't work there either.
Why do you think this happens?

Franklin52 · February 22, 2008, 4:42pm

Place the exit command after the loop, not within it.

MIN=2
MAX=7
for file in *.txt
do
  awk -v min=$MIN -v max=$MAX -v patt=$pattern_da_cercare '
  BEGIN {RS=" |\n"}
  $0 == patt {n++}
  END {
  if(n >= min && n <= max) {print FILENAME " : " n}
  }
  ' "$file"
done

exit 0

Regards

DNAx86 · February 22, 2008, 6:24pm

I'm sorry, I pasted the code incorrectly; exit 0 was already after the loop.

joeyg · February 23, 2008, 8:07am

#! /bin/bash
pattern="good"
rm sctfile.raw 2>/dev/null
for zf in *.dat; do
  sed 's/$pattern/$pattern \n/g' $zf >tempf
  numpat=$(cat tempf | grep "$pattern" | wc -l)
  echo $numpat","$zf >>sctfile.raw
done
sort -nr sctfile.raw >sctfile.srt

I suspect that the sed is not forcing the newline, so the count is not correct. The count only reflects the number of original lines where the $pattern was found. Please confirm.

DNAx86 · February 23, 2008, 8:45am

The code is:

rm sctfile.raw 2>/dev/null	# Suppress error messages

for zf in *.txt; do


sed 's/$pattern_da_cercare/$pattern_da_cercare \n/g' $zf >tempf
# numpat=$(cat $zf | grep "$pattern_da_cercare" | wc -l)
numpat=$(cat tempf | grep "$pattern_da_cercare" | wc -l)



echo $numpat","$zf >>sctfile.raw
done

sort -nr sctfile.raw >sctfile.srt

The error message now is:
cat: tempf: No such file or directory
sed: numpat=: No such file or directory
sed: 13\rnumpat=: No such file or directory
sed: 0: No such file or directory

I don't understand why.

Franklin52 · February 23, 2008, 8:53am

It seems that you have unexpected characters in the script.
Did you transfer the script via FTP in binary mode?
Do you see any characters like ^M in vi?

Regards

DNAx86 · February 23, 2008, 9:11am

I don't understand what you mean in your first question; are you talking about the FTP protocol?

>Do you see some characters like ^M in vi?
I didn't edit the script with vi; I used Xcode and a text editor (Mac OS X software).
But when I opened the script in vi....
YES, I see ^M characters.

What does it mean?
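
For reference, ^M is how vi displays a carriage return (CR, \r). A script saved with Windows (CRLF) or classic Mac (CR only) line endings confuses the shell, which expects a plain line feed (LF) at the end of each line; the "13\rnumpat=" fragment in the sed error above is consistent with that. A minimal sketch of a conversion, where fix_line_endings is a hypothetical helper name and the two-case detection is a simplifying assumption:

```shell
# fix_line_endings FILE
# Prints FILE with Unix (LF) line endings on standard output.
# Hypothetical helper; handles CRLF files and CR-only files.
fix_line_endings() {
    # count the LF characters in the file (tr -cd keeps only newlines)
    n=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    if [ "$n" -gt 0 ]; then
        tr -d '\r' < "$1"       # file has LFs: drop the extra CRs (CRLF -> LF)
    else
        tr '\r' '\n' < "$1"     # no LF at all: CR-only file, turn CR into LF
    fi
}
```

For example: fix_line_endings conta_occorrenze > conta_occorrenze.fixed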

DNAx86 · February 23, 2008, 9:26am

Ladies and gentlemen, now I don't have errors any more!

The code that I use now is:

rm sctfile.raw 2>/dev/null	# Suppress error messages

for zf in *.txt; do


sed 's/$pattern_da_cercare/$pattern_da_cercare \n/g' $zf >tempf
numpat=$(cat tempf | grep "$pattern_da_cercare" | wc -l)


echo $numpat","$zf >>sctfile.raw
done

sort -nr sctfile.raw >sctfile.srt

exit 0

BUT it still does not count more than one occurrence when the pattern has already been found on that line.

How can I modify it?