AWK instead of Shell script

Lokesha · February 5, 2008, 9:56am

I have a list file that contains file names.
For example, the list file "list_file" contains:

data_file1.txt
data_file2.txt
data_file3.txt
:
:
data_filen.txt

Each of the above files has the following layout:

Header1
Header2
*TM*
Data record 1
Datarecord 2
Datarecord n
*TM*
Trailer1
Trailer2
*TM*
*TM*
EOF

I have to read the data file names one by one and extract only the data records into a single file, "complete.txt".

I've written a shell script for this, but my manager has suggested that I simplify the code.
Is there a simple way to do the same thing using awk?

Please help me...

With Regards / Lokesha

jim_mcnamara · February 5, 2008, 10:00am

for file in data_file*.txt
do
   awk '/^Data/' "$file"
done > complete.txt

Lokesha · February 5, 2008, 10:19am

Thanks Jim mcnamara,

But you are not reading "list_file", which contains the names of the only files that need to be selected.

Also, we can't select the data with '/^Data/', because the records do not always start with the same text. We have to extract the data between the two *TM* lines.

Any idea?

radoulov · February 5, 2008, 10:41am

If the filenames contain no spaces:

awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<list_file)

Use nawk or /usr/xpg4/bin/awk on Solaris.
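
For readers new to the idiom, here is the same logic written out one rule per line, as a sketch run on throwaway sample files. The scratch directory and the record contents are made up for illustration, and plain awk is used so that it also runs outside Solaris. The counter f[FILENAME] holds the number of *TM* lines seen so far in the current file, so it equals 1 only between the first and second marker.

```shell
# Build two small sample files in a scratch directory (hypothetical data).
demo=$(mktemp -d)
cd "$demo" || exit 1
for n in 1 2; do
  printf '%s\n' "Header1" "Header2" '*TM*' "rec ${n}a" "rec ${n}b" \
    '*TM*' "Trailer1" '*TM*' '*TM*' "EOF" > "data_file$n.txt"
done
printf '%s\n' data_file1.txt data_file2.txt > list_file

awk '
  # Print lines that fall between the first and second *TM* of this file.
  f[FILENAME] == 1 && !/^\*TM\*/
  # Count every *TM* line, separately for each input file.
  /^\*TM\*/ { f[FILENAME]++ }
' $(cat list_file) > complete.txt

cat complete.txt
```

Only the four "rec" lines should end up in complete.txt; headers, trailers and the markers themselves are dropped.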

Otherwise:

(IFS=$'\n';awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<list_file))

P.S. If your shell doesn't expand $'\n' to a newline, use IFS='
'

jim_mcnamara · February 5, 2008, 10:43am

ok:

while IFS= read -r file
do
   awk '/^Data/' "$file"
done < list_file > complete.txt

bobbygsk · February 5, 2008, 10:53am

Try the below code

radoulov · February 5, 2008, 11:17am

With Awk only:

awk '{ f[NR] = $0
} END {
  for (k = 1; k <= NR; k++) {
    while ((getline < f[k]) > 0) {
      if (p[f[k]] == 1 && $0 !~ /^\*TM\*/)
        print > complete
      if ($0 ~ /^\*TM\*/)
        p[f[k]]++
    }
    close(f[k])  # release the file before opening the next one
  }
}' complete="complete.txt" list_file

Use nawk or /usr/xpg4/bin/awk on Solaris.

Lokesha · February 5, 2008, 11:33am

Thanks a lot,

Two questions here:

1) Radoulov - Your code is amazing, but what is the difference between the two options you have given? "If the filenames contain no spaces" was not clear to me. The first option works for my requirement.

2) bobbygsk - Your code also works for my requirement, but my concern is performance. In the real environment the data files will have millions of records, so speeding up the script is essential.

So please suggest which code is faster.

Thanks again......

With Regards / Lokesha

radoulov · February 5, 2008, 12:02pm

Try the first one with a filename like this:

data file1.txt
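
A sketch of what goes wrong, assuming a POSIX shell and a scratch directory with made-up data: with the default IFS the unquoted command substitution splits that name at the space, while an IFS of newline only keeps each line of list_file as one argument. $(cat list_file) stands in here for $(<list_file), which not every shell supports.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
printf '%s\n' 'Header' '*TM*' 'rec 1' '*TM*' 'Trailer' > 'data file1.txt'
printf '%s\n' 'data file1.txt' > list_file

# Default word splitting: the name breaks at the space into two arguments.
set -- $(cat list_file)
split_count=$#

# Splitting on newlines only keeps each line of list_file as one argument.
old_ifs=$IFS
IFS='
'
set -- $(cat list_file)
IFS=$old_ifs
line_count=$#

echo "default IFS: $split_count arguments, newline IFS: $line_count argument(s)"
awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' "$@"
```

With the default IFS awk would be asked to open "data" and "file1.txt", neither of which exists.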

I'd try both solutions and see.

bobbygsk · February 5, 2008, 3:07pm

Lokesha,

My code looks simple and easy to understand.
I'm at an intermediate stage of Unix scripting.
I do not know how my script performs.
It is better to go with AWK.

shamrock · February 5, 2008, 5:17pm

You need to try out the alternatives before picking one that works efficiently, especially since you need to process millions of records, which will not be an easy feat.

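awk '{
   # Quote the file name so that names containing spaces survive the shell.
   cmd = sprintf("cat \"%s\"", $0)
   re = "*TM*"
   a[re] = 0
   # Test for > 0: getline returns -1 on error, which would loop forever.
   while ((cmd | getline l) > 0) {
     if (l == re)
        a[re]++
     if (a[re] == 1 && l != re)
        print l
   }
   close(cmd)
}' list_file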
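
One way to try out the alternatives, sketched on scratch data under the assumption of a POSIX shell: wrap each candidate in a function, check that they produce identical output, then time them on realistic input. The function names single_pass and per_file_loop and the sample files are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
for n in 1 2 3; do
  printf '%s\n' 'Header' '*TM*' "rec $n" '*TM*' 'Trailer' '*TM*' '*TM*' 'EOF' \
    > "data_file$n.txt"
done
printf '%s\n' data_file1.txt data_file2.txt data_file3.txt > list_file

# Candidate 1: a single awk process reads every file named in list_file.
single_pass() {
  awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(cat list_file)
}

# Candidate 2: one awk process per file, driven by a shell loop.
per_file_loop() {
  while IFS= read -r file; do
    awk 'n==1&&!/^\*TM\*/;/^\*TM\*/{n++}' "$file"
  done < list_file
}

single_pass > out1.txt
per_file_loop > out2.txt
cmp out1.txt out2.txt && echo "outputs match"
```

In bash or ksh, running "time single_pass > /dev/null" and "time per_file_loop > /dev/null" against the real data files gives the comparison. The loop starts one awk process per file; whether that cost matters next to reading millions of records is something to measure rather than assume.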

Lokesha · February 6, 2008, 12:11am

One more problem.

Radoulov - I'm using your code below:

nawk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<list_file) > complete.txt

The data files and the list file will not be in the same location. Also, the list file contains only the names of the data files to be selected, not their location, so we have to specify the location explicitly.
For example: FTPIN/ is the location of the data files, FILES/ is the location of the list file, and FTPOUT/ is the location of the complete.txt file.
I have to mention these locations in your code as below:

nawk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<FILES/list_file) >> FTPOUT/complete.txt

But where do I mention the data file location FTPIN/ so that the data files are selected?

Also, please correct me if I am wrong about the above code.

radoulov · February 6, 2008, 3:39am

You can use something like this:

nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(printf "FTPIN/%s\n" $(<FILES/list_file)) > FTPOUT/complete.txt

P.S. You don't need to append here:

... >> FTPOUT/complete.txt

This should be sufficient:

... > FTPOUT/complete.txt
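
A small sketch of why > is enough: the single awk call writes all records in one go, so >> only makes a difference across repeated runs, where it accumulates duplicates. The directory names follow the thread; the sample data, the extract helper, appended.txt and the use of sed to add the FTPIN/ prefix are assumptions for illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir FTPIN FILES FTPOUT
printf '%s\n' 'Header' '*TM*' 'rec 1' 'rec 2' '*TM*' 'Trailer' > FTPIN/data_file1.txt
printf '%s\n' data_file1.txt > FILES/list_file

extract() {
  awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' \
    $(sed 's|^|FTPIN/|' FILES/list_file)
}

# Two runs with ">": the file is truncated each time, so it holds one copy.
extract > FTPOUT/complete.txt
extract > FTPOUT/complete.txt
truncated=$(grep -c '' FTPOUT/complete.txt)

# Two runs with ">>": the second run appends a duplicate copy.
: > FTPOUT/appended.txt
extract >> FTPOUT/appended.txt
extract >> FTPOUT/appended.txt
appended=$(grep -c '' FTPOUT/appended.txt)

echo "with >: $truncated lines, with >>: $appended lines"
```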

lokiman · February 9, 2008, 8:59am

Thanks Radoulov,

I am using your script below:

nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(printf "FTPIN/%s\n" $(<FILES/list_file)) > FTPOUT/complete.txt

One more question: is it possible to check the read permission of the files whose names are listed in "FILES/list_file"?

Checking this is really challenging for me, and it is mandatory.

radoulov · February 9, 2008, 11:37am

Yes,
use this:

nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(while IFS= read;do
      [ -r "FTPIN/$REPLY" ]&&printf "%s\n" "FTPIN/$REPLY"
   done<FILES/list_file)>FTPOUT/complete.txt
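
If it helps to see which names were left out, the same test can be split into two streams. This is a sketch on scratch data, assuming a POSIX shell and file names without spaces; readable.txt and skipped.txt are hypothetical helper files, and a missing file stands in for an unreadable one because [ -r ] fails for both.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir FTPIN FILES FTPOUT
printf '%s\n' 'Header' '*TM*' 'rec 1' '*TM*' 'Trailer' > FTPIN/data_file1.txt
# data_file2.txt is listed but deliberately not created.
printf '%s\n' data_file1.txt data_file2.txt > FILES/list_file

# Readable paths go to stdout, skipped names to stderr.
while IFS= read -r name; do
  if [ -r "FTPIN/$name" ]; then
    printf '%s\n' "FTPIN/$name"
  else
    printf 'skipped: %s\n' "$name" >&2
  fi
done < FILES/list_file > readable.txt 2> skipped.txt

awk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(cat readable.txt) > FTPOUT/complete.txt

cat skipped.txt
```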

radoulov · February 9, 2008, 12:15pm

Do you want to process only the files that are readable?
(the above code does that)

lokiman · February 10, 2008, 2:01am

Yes, but not completely.

The script should abort if any of the files lacks read permission.
How can I do this?

Thanks

radoulov · February 10, 2008, 6:15am

Use bash, ksh93 (/usr/dt/bin/dtksh on Solaris)
or zsh, do not use ksh88 (ksh on Solaris):

#!/bin/bash

source_dir="FTPIN"
file_list="FILES/list_file"
out="FTPOUT/complete.txt"
unset f

while IFS= read;do
  [ -r "$source_dir/$REPLY" ] && \
  f=("${f[@]}" "$source_dir/$REPLY")||exit 1
done<"$file_list" && \
nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' "${f[@]}" > "$out"

exit

lokiman · February 11, 2008, 7:04am

Oh, unfortunately I am using "ksh" on Solaris!

radoulov · February 11, 2008, 7:23am

...

You shouldn't

#!/bin/ksh

source_dir="FTPIN"
file_list="FILES/list_file"
out="FTPOUT/complete.txt"
unset f

while IFS= read;do
  [ -r "$source_dir/$REPLY" ] && \
  set -A f "${f[@]}" "$source_dir/$REPLY"||exit 1
done<"$file_list" && \
nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' "${f[@]}" > "$out"

exit
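
A shell-neutral variation on the same idea, as a sketch only: instead of an array it collects the paths in the positional parameters with set --, which any POSIX shell provides. The function name extract_records, the file bad_list and the sample data are hypothetical, and plain awk replaces nawk so the demo runs outside Solaris.

```shell
# extract_records SRC_DIR LIST OUT  (hypothetical helper name)
# Returns 1 without touching OUT if any listed file is not readable.
extract_records() {
  src_dir=$1
  list=$2
  out=$3
  set --                                  # empty the positional parameters
  while IFS= read -r name; do
    if [ ! -r "$src_dir/$name" ]; then
      echo "cannot read $src_dir/$name" >&2
      return 1
    fi
    set -- "$@" "$src_dir/$name"          # append one path, spaces preserved
  done < "$list"
  [ "$#" -gt 0 ] || return 1              # an empty list would make awk read stdin
  awk 'f[FILENAME] == 1 && !/^\*TM\*/
       /^\*TM\*/ { f[FILENAME]++ }' "$@" > "$out"
}

# Demo on scratch data; bad_list names a file that does not exist.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir FTPIN FILES FTPOUT
printf '%s\n' 'Header' '*TM*' 'rec 1' '*TM*' 'Trailer' > FTPIN/data_file1.txt
printf '%s\n' data_file1.txt > FILES/list_file
printf '%s\n' data_file1.txt missing.txt > FILES/bad_list

good_rc=0
extract_records FTPIN FILES/list_file FTPOUT/complete.txt || good_rc=$?
bad_rc=0
extract_records FTPIN FILES/bad_list FTPOUT/bad.txt 2>/dev/null || bad_rc=$?
echo "good list: $good_rc, bad list: $bad_rc"
```

Because the readability check finishes before awk starts, a failing list leaves no partial output file behind.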