Help with matching pattern inside a file

Ernst · June 2, 2011, 12:17pm

I have a huge file that has roughly 30304 lines. I need to extract specific info from that file. For example,

Box 1 > *aaaaaaaajjjj*
> hbbvjvj
> jdnnfddllll
> *dgdfhfekwjh*
Box 2 > *aaaaaaa'aj'jjj*
> dse hkjuejef bfdw
> dyeee
> dsewq
> *dgdfhfekwjh*
>feweiuei
Box 3 > *aaaa"aaaaj"jjj*
> fuiurhir
.
.
.
Box 100 >  *aaaa"'aaaajjjj*
> hhdfwiiji
>*dgdfhfekwjh*
>ggjhfhf

I hope you get the idea. I need to scan through the file, select the exact lines/rows that match my pattern, and assign each match to its specific Box, while discarding everything that does not match my pattern.

Can someone help me out with this request?

panyam · June 2, 2011, 12:22pm

What pattern are you looking for in the test data you provided?

A simple "grep" will work for you.
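As a rough illustration of the grep suggestion: assuming the lines of interest share a fixed string (here the `dgdfhfekwjh` marker from the sample data; the file name `sample.txt` and the temporary directory are made up for the demo), `grep -n` prints every matching line together with its line number.

```shell
workdir=$(mktemp -d)
# Hypothetical sample file modelled on the data in the first post.
cat > "$workdir/sample.txt" <<'EOF'
Box 1 > *aaaaaaaajjjj*
> hbbvjvj
> *dgdfhfekwjh*
Box 2 > *aaaaaaa'aj'jjj*
> dyeee
> *dgdfhfekwjh*
EOF

# -n prefixes each match with its line number; -F treats the pattern as a fixed string.
matches=$(grep -n -F 'dgdfhfekwjh' "$workdir/sample.txt")
printf '%s\n' "$matches"
```

This only selects lines; attaching each match to its Box would need something stateful such as awk.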

Ernst · June 3, 2011, 10:16am

Sorry, I did a bad job explaining the issue. Let me try again:
I have a list of files, each one ending with .log. For example:
file1.log
file2.log
file3.log
.
.
.
file100.log

Within each file, I want to search for particular fields. However, I need the output data to be concatenated into one single file, and at the same time I want to be able to tell which data belongs to file1.log, file2.log and so forth.
I wrote this script:

cat *.log > log1
while read line
do
awk  'NR==2{print; exit}' log1 >> /home/usr/Folder/file1
awk  'NR==5{print; exit}' log1 >> /home/usr/Folder/file2
awk  'NR==54{print; exit}' log1 >> /home/usr/Folder/file3
awk  'NR==16{print; exit}' log1 >> /home/usr/Folder/file4
awk  'NR==37{print; exit}' log1 >> /home/usr/Folder/file5
awk  'NR==69{print; exit}' log1 >> /home/usr/Folder/file6
awk  'NR==100{print; exit}' log1 >> /home/usr/Folder/file7
cat file1 file2 file3 file4 file5 file6 file7 > file.cvs
done<log1

and the output of the script is as follows:

file1.log> I am okay today
file2.log> I am okay today
file3.log> I am okay today
file4.log> I am okay today
$ wow the weather is nice outside
$ wow the weather is bad outside
$ wow the weather is terrible
$ wow the weather sucks
I like to eat
go to the movies
play golf
read good books
$ Come back
$ Come back
$ Come back
$ Come back
unix is fun
what a mess
telecom
wireless
$ their history
$ their history
$ their history
$ their history
good job
excellent
fantastic
job well done

Rather, I want the output to show the final data in the following format:

file1.log> I am okay today
$ wow the weather is nice outside
$ I like to eat
$ Come back
$ unix is fun
$ their history
$ good job
file2.log> I am okay today
$ wow the weather is bad outside
$ go to the movies
$ Come back
$ what a mess
$ their history
$ excellent
file3.log> I am okay today
$ wow the weather is terrible
$ play golf
$ telecom
$ their history
$ fantastic
$ Come back

file4.log> I am okay today
$ wow the weather sucks
$ read good books
$ Come back
$ wireless
$ their history
$ job well done

I cannot quite figure out how to do that yet. Any help is appreciated.

Thanks!
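One way to keep the picked lines grouped by file, instead of concatenating the logs first, is to let awk read all the logs itself: `FNR` restarts at 1 for every input file and `FILENAME` holds the name of the current file. The sketch below is only a guess at what the script is after (the same line numbers taken from each log). The line numbers 2 and 5, the demo files and the output name `combined.txt` are illustrative assumptions.

```shell
workdir=$(mktemp -d)
# Two hypothetical five-line logs standing in for file1.log, file2.log, ...
printf 'h1\nI am okay today\nx\ny\nwow the weather is nice outside\n' > "$workdir/file1.log"
printf 'h2\nI am okay today\nx\ny\nwow the weather is bad outside\n'  > "$workdir/file2.log"

# Line 2 of each file opens that file's group, prefixed with the file name;
# line 5 of each file is appended with a "$ " prefix.
awk 'FNR == 2 { n = split(FILENAME, p, "/"); print p[n] "> " $0 }
     FNR == 5 { print "$ " $0 }' "$workdir"/file*.log > "$workdir/combined.txt"

cat "$workdir/combined.txt"
```

Because awk processes the files one after another, each file's lines stay together and no loop or intermediate file is needed.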

panyam · June 3, 2011, 10:32am

Can you please post a sample of the data from file1.log and file2.log, the outcome you are expecting, and the criteria you are using for the search? I cannot see how the script you wrote does what you want.

Regards,
Ravi

Shell_Life · June 3, 2011, 10:38am

What are these fields you want to search for?
Are they located in a file?

Ernst · June 3, 2011, 10:55am

Yes, each file contains a number of lines from which I need to get the fields. All I want to do is to tell the script to go to each file and grab specific fields and print the result into one single file. But when printing the result, I need to be able to know which data belongs to what file.
Note: I did not mention this earlier, but each file has one header inside it that identifies it. For example, file1.log would have, say, A, which tells me this is file A when I look at the data; thus it would be headed by A followed by the fields; file2.log would be headed by B followed by the fields; file21.log would be headed by U followed by the fields.

I hope that helps.

If it does not, how do I move lines around? Let's say that I have:
1) A> reading is fun
2) B> hair spray
3) C> travel abroad
4) A>the sky is blue
5) B> train station
6) C>cup of water
Assuming there are thousands of lines, how can I write a script that looks at the whole data set and prints all the lines related to A consecutively, then everything from B consecutively, and so on down the line until the end of the file?
Thanks!

Shell_Life · June 3, 2011, 11:05am

Ernst, you are still not clear on your requirements.

You are not making it easy for the members to help you.

Please specify:
1) Sample of input data.
2) Sample of strings you want to search for in the input data.
3) Describe in plain English, in as much detail as possible, what you want to do.
4) Display the expected output based on the sample data.

panyam · June 3, 2011, 11:23am

Ernst,

I assume the log_file (cat of log1, log2..) has the data:

A> reading is fun
B> hair spray
C> travel abroad
A>the sky is blue
B> train station
C>cup of water

To get the desired output from this:

awk -F">" '{f=$1;$1="";a[f]=$0"\n"a[f]} END { for(i in a) { print i"."a[i]}}' log_file | sed '/^$/d'
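A variant of the same grouping idea that keeps both the tags and the lines under each tag in their original order might look like the sketch below. It assumes every line has the form `TAG>text`. The file name `log_file` is reused from the post, while the array names `order` and `group` and the temporary directory are arbitrary choices for the demo.

```shell
workdir=$(mktemp -d)
cat > "$workdir/log_file" <<'EOF'
A> reading is fun
B> hair spray
C> travel abroad
A>the sky is blue
B> train station
C>cup of water
EOF

# order[] remembers the tags in first-seen order; group[] collects each tag's lines.
grouped=$(awk -F'>' '
    !($1 in group) { order[++n] = $1 }
    { group[$1] = group[$1] $0 "\n" }
    END { for (i = 1; i <= n; i++) printf "%s", group[order[i]] }
' "$workdir/log_file")
printf '%s\n' "$grouped"
```

Iterating over `order[1..n]` instead of `for (i in a)` matters because awk does not guarantee any particular order for `for ... in` loops.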

Ernst · June 3, 2011, 1:26pm

Let me try to make it easier.
This is my input file:

This is my desired output file:

How do I rearrange the input file to get the desired result?

Shell_Life · June 3, 2011, 2:24pm

Brilliant Ernst, now that I know what you want, here is one possible solution:

#!/usr/bin/ksh
sed -n 's/List\(.*\)>.*/\1/p' inp_file | while read mString
do
  egrep "${mString}" inp_file
  echo ''
done

ctsgnb · June 3, 2011, 2:45pm

Assuming the '>' sign only appears once per block:

nawk -v N="$(( $(grep '>' infile | wc -l) ))" '{A[NR]=$0}END{for(i=0;++i<=N;)for(j=0;++j<=NR;)if(j%N==i%N) print A[j]}' infile

Or, if you want the blocks separated by an empty line, wrap the inner loop in braces and add a trailing print:

nawk -v N="$(( $(grep '>' infile | wc -l) ))"  '{A[NR]=$0}END{for(i=0;++i<=N;){for(j=0;++j<=NR;)if(j%N==i%N) print  A[j]; print}}' infile

$ cat tst
List13> I belong to list 13
List14> I belong to list 14
List23> I belong to list 23
List67> I belong to list 67
I also belong to list 13
I also belong to list 14
I also belong to list 23
I also belong to list 67
list 13 is my home
list 14 is my home
list 23 is my home
list 67 is my home
Put me under list 13
Put me under list 14
Put me under list 23
Put me under list 67
Like to go to 13 for vacation.
Like to go to 14 for vacation
Like to go to 23 for vacation
Like to go to 67 for vacation
sweet home 13
sweet home 14
sweet home 23
sweet home 67
13 could be a good numberr
14 could be a good numbe
23 could be a good number
67 could be a good number

$ nawk -v N="$(( $(grep '>' tst | wc -l) ))" '{A[NR]=$0}END{for(i=0;++i<=N;){for(j=0;++j<=NR;)if(j%N==i%N) print A[j];print}}' tst
List13> I belong to list 13
I also belong to list 13
list 13 is my home
Put me under list 13
Like to go to 13 for vacation.
sweet home 13
13 could be a good numberr

List14> I belong to list 14
I also belong to list 14
list 14 is my home
Put me under list 14
Like to go to 14 for vacation
sweet home 14
14 could be a good numbe

List23> I belong to list 23
I also belong to list 23
list 23 is my home
Put me under list 23
Like to go to 23 for vacation
sweet home 23
23 could be a good number

List67> I belong to list 67
I also belong to list 67
list 67 is my home
Put me under list 67
Like to go to 67 for vacation
sweet home 67
67 could be a good number

$
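The modulo test works because, with N header lines, line j belongs to the same block as header i exactly when j and i leave the same remainder on division by N. Assuming the same layout (all N header lines first, then strictly interleaved rows), the sketch below counts the headers inside awk and walks each block in steps of N rather than scanning every line N times. The demo file is a shortened, hypothetical version of `tst`.

```shell
workdir=$(mktemp -d)
cat > "$workdir/tst" <<'EOF'
List13> I belong to list 13
List14> I belong to list 14
I also belong to list 13
I also belong to list 14
list 13 is my home
list 14 is my home
EOF

# Count header lines (those containing ">") while buffering every line,
# then print lines i, i+n, i+2n, ... for each header i, followed by a blank line.
result=$(awk '
    />/ { n++ }
    { line[NR] = $0 }
    END {
        for (i = 1; i <= n; i++) {
            for (j = i; j <= NR; j += n) print line[j]
            print ""
        }
    }
' "$workdir/tst")
printf '%s\n' "$result"
```

This version also avoids the separate `grep | wc -l` call, but it still relies on '>' appearing only in header lines.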

Ernst · June 3, 2011, 3:02pm

What I really want is the following:
INPUT FILE:

Alpha> lh ru warpA read DL_PM_PA0_C0
Beta> lh ru warpA read DL_PM_PA0_C0
Gamma> lh ru warpA read DL_PM_PA0_C0
Delta> lh ru warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
BXP_0_1: Value 0x01CC9739 (30185273) read from address 0x00000B8F.
BXP_0_1: Value 0x050A2F06 (84553478) read from address 0x00000B8F.
BXP_0_1: Value 0x02563DEF (39206383) read from address 0x00000B8F.
BXP_0_1: Value 0x01CB58B7 (30103735) read from address 0x00000B8F.
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
BXP_1_1: Value 0x05033922 (84097314) read from address 0x00000B8F.
BXP_1_1: Value 0x01CCEFB6 (30207926) read from address 0x00000B8F.
BXP_1_1: Value 0x01CED447 (30331975) read from address 0x00000B8F.
BXP_1_1: Value 0x0218E0BA (35184826) read from address 0x00000B8F.
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
BXP_2_1: Value 0x0236B631 (37140017) read from address 0x00000B8F.
BXP_2_1: Value 0x01CE0AF3 (30280435) read from address 0x00000B8F.
BXP_2_1: Value 0x050FAD30 (84913456) read from address 0x00000B8F.
BXP_2_1: Value 0x01CCCC5A (30198874) read from address 0x00000B8F.

OUTPUT FILE:

Alpha> lh ru warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
BXP_0_1: Value 0x01CC9739 (30185273) read from address 0x00000B8F
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
BXP_1_1: Value 0x05033922 (84097314) read from address 0x00000B8F
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
BXP_2_1: Value 0x0236B631 (37140017) read from address 0x00000B8F
Beta> lh ru warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
BXP_0_1: Value 0x050A2F06 (84553478) read from address 0x00000B8F
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
BXP_1_1: Value 0x01CCEFB6 (30207926) read from address 0x00000B8F
BXP_1_1: Value 0x01CED447 (30331975) read from address 0x00000B8F
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
BXP_2_1: Value 0x01CE0AF3 (30280435) read from address 0x00000B8F
Gamma> lh ru warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
BXP_0_1: Value 0x02563DEF (39206383) read from address 0x00000B8F
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
BXP_2_1: Value 0x050FAD30 (84913456) read from address 0x00000B8F
Delta> lh ru warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
BXP_0_1: Value 0x01CB58B7 (30103735) read from address 0x00000B8F
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
BXP_1_1: Value 0x0218E0BA (35184826) read from address 0x00000B8F.
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
BXP_2_1: Value 0x01CCCC5A (30198874) read from address 0x00000B8F

When you write the code, keep in mind that the input file might contain more info from epsilon, theta, sigma and so on. I just need to know how to rearrange the file.

Thanks.

Shell_Life · June 3, 2011, 3:09pm

Ernst, you specified a sample input data and the expected output.

I wrote a solution based on your sample data.

Now you changed your input data in such a way that my solution will no longer work.

Please, decide on what you want and stick with it.

ctsgnb · June 3, 2011, 3:13pm

---OOOppps ---.

I think I should reread the thread from the beginning ...

Ernst · June 3, 2011, 3:44pm

I am not sure what's going on, but my output file is empty when I run the script. I did not change anything; I just copied and pasted. Any idea?
Thanks!

---------- Post updated at 03:44 PM ---------- Previous update was at 03:32 PM ----------

Shell, all your assumptions are correct and I want to get exactly your output. My only issue so far is that my output file is empty when I use the code. I cannot quite understand why.

Thanks!

ctsgnb · June 3, 2011, 3:52pm

When building your big file from many logfiles,
did you try this :

paste -d"\n" *.log

?

---------- Post updated at 09:52 PM ---------- Previous update was at 09:47 PM ----------

$ cat f1
file 1
1
1
1
1
1
1
$ cat f2
file2
2
2
2
2
2
2
2
2
2
2
2
2
$ cat f3
file3
3
3
3
$ paste -d"\t" f?
file 1  file2   file3
1       2       3
1       2       3
1       2       3
1       2
1       2
1       2
        2
        2
        2
        2
        2
        2
$ paste -d"\n" f?
file 1
file2
file3
1
2
3
1
2
3
1
2
3
1
2

1
2

1
2


2


2


2


2


2


2

$

NOTE: the number of empty lines MATTERS if you want to do some reverse processing based on line numbers, such as the awk code I posted earlier.
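Because the line-number arithmetic only holds when every log contributes the same number of lines, a quick check before pasting can save some confusion. A minimal sketch, using two made-up demo files in a temporary directory:

```shell
workdir=$(mktemp -d)
printf 'file 1\n1\n1\n' > "$workdir/f1"
printf 'file2\n2\n2\n2\n2\n' > "$workdir/f2"

# Count the distinct line counts; 1 means every file has the same length.
distinct=$(for f in "$workdir"/f?; do wc -l < "$f"; done | sort -u | wc -l | tr -d ' ')
if [ "$distinct" -eq 1 ]; then
    echo "all files have the same number of lines"
else
    echo "line counts differ; paste will pad the shorter files with empty lines"
fi
```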

Ernst · June 3, 2011, 3:57pm

ctsgnb,

I meant to send my last reply to you. The output you posted is what I need. But my file is empty when I run your code.

nawk -v N="$(( $(grep '>' tst | wc -l) ))" '{A[NR]=$0}END{for(i=0;++i<=N;){for(j=0;++j<=NR;)if(j%N==i%N) print A[j];print}}' tst

I do not know why this code does not work for me the same way it does for you. I want to get the same output.

ctsgnb · June 3, 2011, 4:10pm

1) Which OS are you on?
2) Make sure you have replaced 'tst' with the name of your own file.

Try "awk" instead of "nawk" (the rest of the syntax should remain the same).

If you are on a SunOS / Solaris platform you can go with 'nawk'; otherwise try 'awk'.

---------- Post updated at 10:10 PM ---------- Previous update was at 10:04 PM ----------

awk -v N="$(( $(grep '>' yourbiginputfile | wc -l) ))" '{A[NR]=$0}END{for(i=0;++i<=N;){for(j=0;++j<=NR;)if(j%N==i%N) print A[j];print}}' yourbiginputfile
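To avoid guessing which name is installed, a script can pick the interpreter itself. This is a sketch that assumes a POSIX shell where `command -v` is available. The variable name `AWK` is arbitrary.

```shell
# Prefer nawk where it exists (for example on Solaris), otherwise fall back to awk.
AWK=$(command -v nawk || command -v awk) || { echo "no awk found" >&2; exit 1; }

# Quick smoke test: the chosen interpreter should print the second field.
second=$(printf 'a b c\n' | "$AWK" '{ print $2 }')
echo "using $AWK, got: $second"
```

The rest of the command line then stays the same, with `"$AWK"` in place of the literal `nawk` or `awk`.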

Ernst · June 3, 2011, 4:11pm

This is the error I get with "awk"

ctsgnb · June 3, 2011, 4:18pm

Which operating system are you on?

uname -a

?
Which shell are you running?

Please post the command exactly as you entered it on your screen (copy and paste it as is).

Sometimes copy/paste can mangle characters. Retype the command manually and make sure you don't leave out any parenthesis (), curly brace {}, or space!
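One way to check for pasted characters that only look like plain quotes is to search the saved command for bytes outside printable ASCII. The sketch below assumes the command was saved to a file. Both file names are hypothetical, and the first file deliberately contains UTF-8 typographic quotes.

```shell
workdir=$(mktemp -d)
# Hypothetical pasted command with typographic quotes (UTF-8 bytes) instead of plain '.
printf 'awk \342\200\230{print}\342\200\231 infile\n' > "$workdir/pasted.sh"
# The same command typed by hand with plain ASCII quotes.
printf "awk '{print}' infile\n" > "$workdir/typed.sh"

# In the C locale every byte above 127 is non-printable, so grep -l lists
# only the files that contain such bytes.
suspect=$(LC_ALL=C grep -l '[^[:print:][:space:]]' "$workdir"/*.sh)
echo "files with suspicious bytes: $suspect"
```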