AWK instead of Shell script

I have a list file that contains some file names.
For example, the list file "list_file" contains:

data_file1.txt
data_file2.txt
data_file3.txt
:
:
data_filen.txt

Each of these files has the following layout:

Header1
Header2
*TM*
Data record 1
Datarecord 2
Datarecord n
*TM*
Trailer1
Trailer2
*TM*
*TM*
EOF

I have to read the data file names one by one and extract only the data records into a single file, "complete.txt".

I've written a shell script for this, but my manager has suggested that I simplify the code.
Is there a simple way to accomplish the same thing using awk?

Please help me...

With Regards / Lokesha

for file in data_file*.txt
do
   awk '/^Data/' "$file"
done > complete.txt

Thanks Jim mcnamara,

But you are not reading "list_file", which contains the names of the only files that need to be selected.

Also, we can't select the data with '/^Data/', because the records do not always start with the same text. So we have to extract the data between the first two *TM* lines.

Any ideas?

If the filenames contain no spaces:

awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<list_file)

Use nawk or /usr/xpg4/bin/awk on Solaris.

Otherwise:

(IFS=$'\n';awk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<list_file))

P.S. If your shell doesn't expand $'\n' to a newline, use IFS='
'
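For anyone puzzling over the one-liner, here is the same logic spread out with comments. It is only a sketch: the scratch directory, the sample file names and their contents are made up for the demonstration, and plain `awk` is used where the thread uses nawk.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not from the original post).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' Header1 Header2 '*TM*' 'Data record 1' 'Datarecord 2' \
              '*TM*' Trailer1 '*TM*' '*TM*' EOF > data_file1.txt
printf '%s\n' Header1 '*TM*' 'Other record A' \
              '*TM*' Trailer1 '*TM*' '*TM*' EOF > data_file2.txt
printf '%s\n' data_file1.txt data_file2.txt > list_file

# f[FILENAME] counts the *TM* markers seen so far in the current file.
# A line is printed only while exactly one marker has been seen,
# that is, between the first and the second *TM* of each file.
awk '
  /^\*TM\*/ { f[FILENAME]++; next }   # marker line: count it, never print it
  f[FILENAME] == 1                    # data section: default action prints the line
' $(cat list_file) > complete.txt

cat complete.txt
```

Because the counter is keyed by FILENAME, it starts from zero again for every file, so a single awk process can handle the whole list.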

ok:

while read file
do
   awk '/^Data/' "$file"
done <  list_file > complete.txt

Try the below code

With Awk only:

awk '{ f[NR] = $0 
} END { 
  for (k = 1; k <= NR; k++) {
    while ((getline < f[k]) > 0) { 
      if (p[f[k]] == 1 && $0 !~ /^\*TM\*/)
        print > complete
      if ($0 ~ /^\*TM\*/)
        p[f[k]]++
    }
    close(f[k])
  }
}' complete="complete.txt" list_file 

Use nawk or /usr/xpg4/bin/awk on Solaris.

Thanks a lot,

Two questions here:

1) Radoulov - Your code is amazing, but what is the difference between the two options you have given? I did not understand "If the filenames contain no spaces". The first option does work for my requirement.

2) bobbygsk - Your code also works for my requirement, but my concern is performance. In the real environment the data files will have millions of records, so the speed of the script is essential.

So please tell me, which code is faster?

Thanks again.

With Regards / Lokesha

Try the first one with a filename like this:

data file1.txt

I'd try both solutions and see.
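To see the difference concretely, here is a small sketch. The file name with a space and its contents are invented for the demonstration, and the `set -f` line is my own addition (it keeps the shell from globbing the names while the list is expanded).

```shell
#!/bin/sh
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' Header '*TM*' 'Data record 1' '*TM*' Trailer > 'data file1.txt'
printf '%s\n' 'data file1.txt' > list_file

# Default IFS: the name is split at the space into "data" and "file1.txt",
# so awk would be handed two file names that do not exist.
set -- $(cat list_file)
n_default=$#
echo "default IFS: $n_default argument(s)"

# Newline-only IFS, set in a subshell so the change does not leak out.
(
  IFS='
'
  set -f
  set -- $(cat list_file)
  echo "newline IFS: $# argument(s)"
  awk 'f[FILENAME]==1 && !/^\*TM\*/; /^\*TM\*/ {f[FILENAME]++}' "$@" > complete.txt
)
cat complete.txt
```

With the default IFS the shell performs field splitting on spaces, tabs and newlines; with the newline-only IFS each line of the list stays one argument.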

Lokesha,

My code looks simple and is easy to understand.
I'm at an intermediate stage of Unix scripting.
I do not know how my script performs.
It is better to go with awk.

You need to try out the alternatives before picking the one that works most efficiently, especially since you need to process millions of records, which is no easy feat.

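awk '{
   cmd = sprintf("cat \"%s\"", $0)    # quote the name so spaces survive
   re = "*TM*"
   n = 0                              # markers seen in the current file
   while ((cmd | getline l) > 0) {    # > 0 stops on EOF and on read errors
     if (l == re)
        n++
     else if (n == 1)
        print l
   }
   close(cmd)
}' list_file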
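One way to try the alternatives is a rough timing harness like the sketch below. Everything in it is an assumption for illustration: the generated file names, the record count, and the use of `date +%s` (not available in every old `date`, and only accurate to the second). It compares the single-process approach with a getline-based approach and checks that both produce the same output.

```shell
#!/bin/sh
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Build three hypothetical data files of 100000 records each.
for i in 1 2 3; do
  awk -v n=100000 'BEGIN {
    print "Header1"; print "*TM*"
    for (k = 1; k <= n; k++) print "record " k
    print "*TM*"; print "Trailer1"; print "*TM*"; print "*TM*"; print "EOF"
  }' > "data_file$i.txt"
  echo "data_file$i.txt"
done > list_file

# Variant 1: one awk process reads every file named on the command line.
t0=$(date +%s)
awk 'f[FILENAME]==1 && !/^\*TM\*/; /^\*TM\*/ {f[FILENAME]++}' \
    $(cat list_file) > out1.txt
t1=$(date +%s)

# Variant 2: awk reads the list and pulls each file in with getline.
awk '{ f[NR] = $0 }
END {
  for (k = 1; k <= NR; k++) {
    n = 0
    while ((getline line < f[k]) > 0) {
      if (line ~ /^\*TM\*/) n++
      else if (n == 1) print line
    }
    close(f[k])
  }
}' list_file > out2.txt
t2=$(date +%s)

echo "variant 1: $((t1 - t0))s, variant 2: $((t2 - t1))s"
cmp -s out1.txt out2.txt && echo "outputs are identical"
```

The numbers depend on the awk implementation and the machine, so it is worth running this against files of a realistic size before deciding.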

One more problem.

Radoulov - I'm using your code below:

nawk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<list_file) > complete.txt

The data files and the list file are not in the same location. Also, the list file contains only the names of the data files to be selected, not their location, so we have to specify the location explicitly.
For example: FTPIN/ is the location of the data files, FILES/ is the location of the list file, and FTPOUT/ is the location of the complete.txt file.
I have to put these locations into your code as below:

nawk 'f[FILENAME]==1&&!/^\*TM\*/;/^\*TM\*/{f[FILENAME]++}' $(<FILES/list_file) >> FTPOUT/complete.txt

But where do I specify the data file location FTPIN/ so that the data files are found?

Also, please correct me if I am wrong about the above code.

You can use something like this:

nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(printf "FTPIN/%s\n" $(<FILES/list_file)) > FTPOUT/complete.txt

P.S. You don't need to append here:

... >> FTPOUT/complete.txt

This should be sufficient:

... > FTPOUT/complete.txt
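The prefixing trick relies on printf reusing its format string for every argument. A minimal sketch, with a scratch directory and two made-up file names standing in for the real list:

```shell
#!/bin/sh
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir FILES FTPIN FTPOUT
printf '%s\n' data_file1.txt data_file2.txt > FILES/list_file

# printf repeats the format until all arguments are consumed,
# so every name from the list gets the directory prefix.
paths=$(printf 'FTPIN/%s\n' $(cat FILES/list_file))
echo "$paths"
```

As with the earlier one-liner, this assumes the names in the list contain no spaces.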

Thanks Radoulov,

I am using your script below:

nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(printf "FTPIN/%s\n" $(<FILES/list_file)) > FTPOUT/complete.txt

One more question. Is it possible to check the "read" permission of the files whose names are listed in "FILES/list_file"?

This check is mandatory, and it is really challenging for me.

Yes,
use this:

nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' $(while IFS= read;do
      [ -r "FTPIN/$REPLY" ]&&printf "%s\n" "FTPIN/$REPLY"
   done<FILES/list_file)>FTPOUT/complete.txt

Do you want to process only the files that are readable?
(the above code does that)

Yes, but not completely.

The script should abort if any one of the files doesn't have read permission.
How can I do this?

Thanks

Use bash, ksh93 (/usr/dt/bin/dtksh on Solaris)
or zsh, do not use ksh88 (ksh on Solaris):

#!/bin/bash

source_dir="FTPIN"
file_list="FILES/list_file"
out="FTPOUT/complete.txt"
unset f

while IFS= read;do
  [ -r "$source_dir/$REPLY" ] && \
  f=("${f[@]}" "$source_dir/$REPLY")||exit 1
done<"$file_list" && \
nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' "${f[@]}" > "$out"

exit

Oh, unfortunately I am using "ksh" on Solaris!

...

You shouldn't :)

#!/bin/ksh

source_dir="FTPIN"
file_list="FILES/list_file"
out="FTPOUT/complete.txt"
unset f

while IFS= read;do
  [ -r "$source_dir/$REPLY" ] && \
  set -A f "${f[@]}" "$source_dir/$REPLY"||exit 1
done<"$file_list" && \
nawk 'f[FILENAME] == 1 && !/^\*TM\*/
/^\*TM\*/ { f[FILENAME]++ }
' "${f[@]}" > "$out"

exit
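If arrays are the sticking point, the same abort-on-unreadable logic can be sketched with only the positional parameters, which should work in any POSIX-style shell. This is a sketch, not a drop-in replacement: the function name `extract_all`, the error message and the demo data are hypothetical, and plain `awk` stands in for nawk.

```shell
#!/bin/sh
# Hypothetical helper: extract_all SOURCE_DIR LIST_FILE OUT_FILE
# Collects the files in the positional parameters and returns non-zero,
# without running awk, as soon as one listed file is not readable.
extract_all() {
  source_dir=$1 file_list=$2 out=$3
  set --
  while IFS= read -r name; do
    [ -r "$source_dir/$name" ] || {
      echo "cannot read $source_dir/$name" >&2
      return 1
    }
    set -- "$@" "$source_dir/$name"
  done < "$file_list"
  [ $# -gt 0 ] || return 1   # empty list: awk would otherwise wait on stdin
  awk 'f[FILENAME] == 1 && !/^\*TM\*/
       /^\*TM\*/ { f[FILENAME]++ }' "$@" > "$out"
}

# Demo with made-up data in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir FTPIN FILES FTPOUT
printf '%s\n' Header '*TM*' 'Data record 1' '*TM*' Trailer > FTPIN/data_file1.txt
printf '%s\n' data_file1.txt > FILES/list_file
extract_all FTPIN FILES/list_file FTPOUT/complete.txt && cat FTPOUT/complete.txt
```

Since the output file is opened only after every name has passed the check, a failed run leaves no partial complete.txt behind.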