Grep multiple patterns(file) and replace whole line

wxboo · June 19, 2019, 3:36am

I am able to grep multiple patterns that are stored in a file. However, how can I replace the whole matching line with either the pattern or a new string?

For example:
pattern_file: *The text in () is not part of the pattern file. It is the name that should replace the whole line once the pattern is found, and is listed here for reference.

hot.*aaa.* (H_A)
cold.*bbb.* (C_B)
cold.*aaa.* (C_A)
(.. lots more)

input_file:

hot_temp_aaa_first
hot_temp_bbb_first
cold_temp_aaa_last
cold_temp_bbb_first
hot_bake_aaa_last
hot_bake_bbb_last
cold_bake_aaa_last

Expected Output:

H_A
C_A
C_B
H_A
C_A

The output I get, from which I cannot tell how many patterns were found:

hot_temp_aaa_first
cold_temp_aaa_last
cold_temp_bbb_first
hot_bake_aaa_last
cold_bake_aaa_last

How can I replace them with either the new name or the pattern name? The reason I want to replace them is that later I need to count how many patterns were found, maybe using

sort -u | wc

I am stuck after grepping all the matches: I do not know how many patterns were found.

less input_file | grep -f pattern_file | ... | sort -u | wc

Thank you very much.

krishmaths · June 19, 2019, 4:43am

One solution using awk, without converting the original input lines into intermediate format.

awk -F"_" '{++a[$1$3]} END{for(i in a){print i" "a[i]}}' input_file

Output:

hotbbb 2
coldbbb 1
hotaaa 2
coldaaa 2

bakunin · June 19, 2019, 5:28am

OK, first: if you want to change something, grep is not the right tool for it. You should use sed. grep is for finding things, but only finding, not changing them.

Second: before you start on a solution you should define your problem correctly. For instance, your sample input file has seven lines, but your expected output has five. Were the two missing lines left out on purpose? If yes, say so. If not, how should they be handled? Maybe left unchanged?

So, let us first rephrase your task. I will make some assumptions here which might well be wrong. Don't hesitate to correct them:

You have an input file containing certain text patterns and a pattern file which you want to apply to the input. When a pattern is matched, you want to replace the whole line in the input with a certain marker, which is defined distinctly for each pattern found that way. Lines not matched by any pattern should be deleted from the result set. In a final step you want to count how many markers of each kind are found in the result set.

Is that correct?
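
Under those assumptions, one possible sketch (not necessarily the only or best way) uses awk rather than sed, because awk can load a label/pattern map and print the label of the first pattern that matches each line. The file name pattern_map and its two-column layout are assumptions made for illustration; the patterns, labels and sample input are taken from the first post. awk uses extended regular expressions, which for these particular patterns behave the same as grep's basic ones.

```shell
#!/bin/sh
# Sketch: replace each matching input line with its pattern's label,
# drop lines that match no pattern, then count each label.
workdir=$(mktemp -d)

# Hypothetical map file: column 1 = label, column 2 = regex (layout is an assumption).
cat > "$workdir/pattern_map" <<'EOF'
H_A hot.*aaa.*
C_B cold.*bbb.*
C_A cold.*aaa.*
EOF

cat > "$workdir/input_file" <<'EOF'
hot_temp_aaa_first
hot_temp_bbb_first
cold_temp_aaa_last
cold_temp_bbb_first
hot_bake_aaa_last
hot_bake_bbb_last
cold_bake_aaa_last
EOF

label_lines() {
    # $1 = map file, $2 = input file
    awk '
        NR == FNR { label[++n] = $1; regex[n] = $2; next }    # first file: load the map
        {
            for (i = 1; i <= n; i++)
                if ($0 ~ regex[i]) { print label[i]; break }  # first matching pattern wins
        }
    ' "$1" "$2"
}

labels=$(label_lines "$workdir/pattern_map" "$workdir/input_file")
printf '%s\n' "$labels"

# Count how many lines each label accounts for, as "label count".
counts=$(printf '%s\n' "$labels" | sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"

rm -rf "$workdir"
```

With the sample data, the first part prints H_A, C_A, C_B, H_A, C_A (one per line), which is the expected output from the first post, and the second part prints one "label count" line per marker.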

I hope this helps.

bakunin

MadeInGermany · June 19, 2019, 12:51pm

Another guess at what you might want:

while IFS= read pat
do
  printf "%s match %s times\n" "$pat" $(grep -c "$pat" input_file)
done < pattern_file

hot.*aaa.* match 2 times
cold.*bbb.* match 1 times
cold.*aaa.* match 2 times

wxboo · June 20, 2019, 5:30am

Thanks, everyone, for the input.

--- Post updated at 09:11 AM ---

krishmaths, thank you very much for the input.
It is a useful command that combines the grouping and the counting. After that I can filter out the groups that are not in pattern_file and achieve the purpose.
But the grouping seems to be limited to a certain input format. The input file might look like the lines below, which are quite irregular:

defect_hot_temp_chk_aaa_first
line_chk_hot_temp_bbb_first
cold_temp_aaa_last
cold_temp_bbb_first
hot_bake_aaa_last
hot_bake_bbb_last
cold_bake_aaa_last
cold_bake_10hrs_aaa_last
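
For irregular lines like these, matching by regular expression instead of by field position avoids depending on where "hot" or "aaa" sits in the line. The following is only a sketch under that assumption: a single awk pass that counts, for every pattern in pattern_file, how many input lines it matches. The function name count_per_pattern is made up for illustration, and a line that matches several patterns is counted once for each of them.

```shell
#!/bin/sh
# Sketch: count matching lines per pattern in one pass, independent of field layout.
workdir=$(mktemp -d)

cat > "$workdir/pattern_file" <<'EOF'
hot.*aaa.*
cold.*bbb.*
cold.*aaa.*
EOF

cat > "$workdir/input_file" <<'EOF'
defect_hot_temp_chk_aaa_first
line_chk_hot_temp_bbb_first
cold_temp_aaa_last
cold_temp_bbb_first
hot_bake_aaa_last
hot_bake_bbb_last
cold_bake_aaa_last
cold_bake_10hrs_aaa_last
EOF

count_per_pattern() {
    # $1 = pattern file (one regex per line), $2 = input file
    awk '
        NR == FNR { pat[++n] = $0; next }                     # first file: remember the patterns
        { for (i = 1; i <= n; i++) if ($0 ~ pat[i]) hits[i]++ }
        END { for (i = 1; i <= n; i++) printf "%s %d\n", pat[i], hits[i] }
    ' "$1" "$2"
}

per_pattern=$(count_per_pattern "$workdir/pattern_file" "$workdir/input_file")
printf '%s\n' "$per_pattern"

rm -rf "$workdir"
```

For the eight lines above this reports 2 matches for hot.*aaa.*, 1 for cold.*bbb.* and 3 for cold.*aaa.*; patterns that never match are still listed, with a count of 0.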

--- Post updated at 10:06 AM ---

bakunin, thank you very much for sorting this out.

My initial idea was to identify how many patterns can be found in an input file.
Let's say I have 50 lines of patterns and 1000 lines of input. How many patterns occur in these 1000 lines? Maybe 400 lines match, but only 30 patterns. These 400 lines are all different, so my idea was to group them and count. That is how I came to the grep-and-replace-line workflow.

The focus is not on overwriting the input. I do not need an output file either. It would be best if everything could be done in a pipe that ends with the count.
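
If only the number of distinct patterns that occur at least once is needed, the labels are not strictly required. A minimal sketch of that idea, with no intermediate output file: test each pattern with grep -q and count the successes. The function name count_matched_patterns and the extra pattern warm.*ccc.* (added so that one pattern has no match) are assumptions for illustration; the rest is the sample data from the first post.

```shell
#!/bin/sh
# Sketch: how many patterns from pattern_file occur at least once in input_file?
workdir=$(mktemp -d)

cat > "$workdir/pattern_file" <<'EOF'
hot.*aaa.*
cold.*bbb.*
cold.*aaa.*
warm.*ccc.*
EOF

cat > "$workdir/input_file" <<'EOF'
hot_temp_aaa_first
hot_temp_bbb_first
cold_temp_aaa_last
cold_temp_bbb_first
hot_bake_aaa_last
hot_bake_bbb_last
cold_bake_aaa_last
EOF

count_matched_patterns() {
    # $1 = pattern file, $2 = input file
    matched=0
    while IFS= read -r pat; do
        [ -n "$pat" ] || continue                 # skip empty lines
        if grep -q -e "$pat" "$2"; then           # -q: only the exit status matters
            matched=$((matched + 1))
        fi
    done < "$1"
    printf '%s\n' "$matched"
}

matched_count=$(count_matched_patterns "$workdir/pattern_file" "$workdir/input_file")
total=$(grep -c '' "$workdir/pattern_file")       # number of lines in the pattern file
printf '%s of %s patterns matched\n' "$matched_count" "$total"

rm -rf "$workdir"
```

This runs grep once per pattern, which should be acceptable for something like 50 patterns and 1000 input lines.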

--- Post updated at 10:30 AM ---

MadeInGermany, thank you very much for this. This suits what I want to do.

For those who have a new label to assign, below is my approach:

while IFS= read pat; do printf "%s match %s times\n" $(grep "$pat" pattern_grp | awk '{print $1}') $(grep -c "$pat" input_file); done < pattern_file

Format of pattern_grp:

H_A hot.*aaa.*
C_B cold.*bbb.*
C_A cold.*aaa.*

Output:

H_A match 2 times
C_B match 1 times
C_A match 2 times

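I use grep one more time to count the patterns that matched nothing (searching for 'match 0 times' rather than '0 times', so that a count such as 10 is not picked up by mistake):

while IFS= read pat; do printf "%s match %s times\n" $(grep "$pat" pattern_grp | awk '{print $1}') $(grep -c "$pat" input_file); done < pattern_file | grep -c 'match 0 times'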

*Not a programmer, with very limited knowledge, just trying to use what I have.

MadeInGermany · June 21, 2019, 4:23am

This looks too complicated.
Why three input files?
What does your pattern_grp file look like?
Say it looks like

H_A hot.*aaa.*
C_B cold.*bbb.*
C_A cold.*aaa.*

The value pairs seem to be related?
Then you can read both whitespace-separated columns into two variables:

while read sp pat; do printf "%s alias %s match %s times\n" "$sp" "$pat" "$(grep -c "$pat" input_file)"; done < pattern_grp

But why do all the printing with aliases when in the end you throw the output away in favor of the number of non-matches?
--
BTW, each expression in command arguments should be in "quotes", so that the shell does not attempt further substitutions on it.
So there should be quotes around the $pat argument of the grep command, and another pair around the $( ) argument of the printf command.
The $( ) runs a subshell, so the quotes inside and outside do not conflict. I forgot the outer quotes in my previous post.
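
To illustrate the quoting point, here is a small sketch of what can happen to an unquoted pattern. The file name hot.1aaa.2 is hypothetical, chosen only because the shell glob hot.*aaa.* matches it; everything runs in a temporary directory.

```shell
#!/bin/sh
# Sketch: an unquoted $pat undergoes pathname expansion, a quoted "$pat" does not.
workdir=$(mktemp -d)
: > "$workdir/hot.1aaa.2"      # hypothetical file whose name the glob hot.*aaa.* matches
pat='hot.*aaa.*'

# Deliberately unquoted: the shell replaces the pattern with the matching file name.
unquoted=$(cd "$workdir" && printf '%s\n' $pat)
# Quoted: the pattern is passed through unchanged.
quoted=$(cd "$workdir" && printf '%s\n' "$pat")
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

# Quotes inside $( ) and around it can be nested without escaping.
printf 'hot_temp_aaa_first\nhot_bake_aaa_last\n' > "$workdir/input_file"
count="$(grep -c -e "$pat" "$workdir/input_file")"
printf 'count: %s\n' "$count"

rm -rf "$workdir"
```

In the unquoted case grep would be handed the file name instead of the regular expression, so the result depends on whatever files happen to be in the current directory.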