Extract information from files

In a particular directory, there can be 1000 files like below.

The filename is job901.ksh:

#!/bin/ksh
cront -x << EOJ
submit file=$PRODPATH/scripts/genReport.sh maxdelay=30
      &node=xnode01
      tname=job901
      &pfile1=/prod/mldata/data/test1.dat
      &pfile2=/prod/mldata/data/test2.dat
      &metafile1=test1.met
      &metafile2=test2.met
      &jobname=job901
      &priority=10;
      EOJ
exit

I want to read all such files and extract the info, so that the output has one row per file in the format below. The expected output for one file:

Is it possible?

File      | Jobname |  node  | pfile               | metafile            | tname | priority | delay
job901.ksh| job901  | xnode01| test1.dat,test2.dat | test1.met,test2.met | job901| 10 | 30

Thanks.

Hello Vedanta,

Could you please try the following and let me know if it helps you.

awk -F'[=/]' 'BEGIN{print "File      | Jobname |  node  | pfile               | metafile            | tname | priority | delay"} /maxdelay/{delay=$NF;next} /node/{node=$NF;next} /tname/{name=$NF;next} /pfile/{file=file?file","$NF:$NF;next} /metafile/{metafile=metafile?metafile","$NF:$NF;next} /jobname/{jobname=$NF;next} /priority/{pri=$NF;next} /exit/{print FILENAME OFS jobname OFS node OFS file OFS metafile OFS name OFS pri OFS delay;jobname=node=file=metafile=name=pri=delay="";}' OFS="|  " *.ksh
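A note on why -F'[=/]' does the trick: splitting on both = and / leaves the bare value (or basename) as the last field, so $NF picks it up directly. A minimal sketch with one sample line from the question:

```shell
# Splitting on both '=' and '/' makes the value/basename the last field.
out=$(echo '      &pfile1=/prod/mldata/data/test1.dat' | awk -F'[=/]' '{print $NF}')
echo "$out"    # test1.dat
```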

I haven't tested it with 1000 or more files; let us know if this helps you.
EDIT: Adding a non-one liner form of solution too here.

awk -F'[=/]' 'BEGIN{
                        print "File      | Jobname |  node  | pfile               | metafile            | tname | priority | delay"
                   }
              /maxdelay/{
                                delay=$NF;
                                next
                        }
              /node/    {
                                node=$NF;
                                next
                        }
              /tname/   {
                                name=$NF;
                                next
                        }
              /pfile/   {
                                file=file?file","$NF:$NF;
                                next
                        }
              /metafile/{
                                metafile=metafile?metafile","$NF:$NF;
                                next
                        }
              /jobname/ {
                                jobname=$NF;
                                next
                        }
              /priority/{
                                pri=$NF;
                                next
                        }
              /exit/    {
                                print FILENAME OFS jobname OFS node OFS file OFS metafile OFS name OFS pri OFS delay;
                                jobname=node=file=metafile=name=pri=delay="";
                        }
             ' OFS="| "   *.ksh
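The file=file?file","$NF:$NF ternary above is the comma-join idiom: on the first match file is empty, so the ternary takes the bare value; on later matches it appends with a comma. A tiny standalone sketch (the sample paths are made up):

```shell
# Accumulate comma-separated basenames with awk's ternary operator.
joined=$(printf '&pfile1=/a/test1.dat\n&pfile2=/a/test2.dat\n' |
  awk -F'[=/]' '/pfile/{file = file ? file "," $NF : $NF} END{print file}')
echo "$joined"    # test1.dat,test2.dat
```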
 

Thanks,
R. Singh

1 Like

Here is a bash solution with associative arrays; it requires bash 4.

#!/bin/bash
cols="node tname pfile metafile"
space=20
declare -A C

# print the header
headersep=""
for col in $cols
do
  printf "${headersep}%${space}s" "$col"
  headersep=" | "
done
printf "\n"

# loop over the files
for jfile in job[0-9]*.ksh
do
  # loop over the lines, collect values in hash C[]
  while IFS="=" read key val
  do
    [ -z "$val" ] && continue   # skip lines without a value
    case $key in
    *"&node") C[node]=$val
    ;;
    *"&tname") C[tname]=$val
    ;;
    *"&pfile"*) C[pfile]=${C[pfile]}${C[pfile]:+,}${val##*/}
    ;;
    *"&metafile"*) C[metafile]=${C[metafile]}${C[metafile]:+,}${val##*/}
    ;;
    esac
  done < "$jfile"

  # print and clear C[]
  sep=""
  for col in $cols
  do
    printf "${sep}%${space}s" "${C[$col]}"
    unset C[$col]
    sep=$headersep
  done
  printf "\n"
done

It is not yet complete, but once you understand how it works it is easy to expand.
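The join in C[pfile]=${C[pfile]}${C[pfile]:+,}${val##*/} combines two expansions: ${var:+,} emits a comma only when var is already set and non-empty, and ${val##*/} strips the directory part. A standalone sketch of the same idiom:

```shell
# Append basenames to a comma-separated list, with no leading comma.
list=""
for val in /prod/mldata/data/test1.dat /prod/mldata/data/test2.dat
do
  list=${list}${list:+,}${val##*/}
done
echo "$list"    # test1.dat,test2.dat
```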

Thanks Guys for the help.
Is there a way to use

awk '/^submit/{print FILENAME;nextfile}' *.sh

instead of

*.sh

with awk in the script above.
There can be different files with different patterns, and I want to pick only those files (1000 in this case) whose content starts with submit.
Can the last line be modified to select only those files that contain a line starting with 'submit'?

' OFS="| "   *.ksh
' OFS="| "   awk '/^submit/{print FILENAME;nextfile}' *.sh # pick only those selected files

@Ravinder, It worked. Many thanks!

Hello Vedanta,

Please always open a new thread for a new question. Coming to your question: if you want to print only the names of the files that contain the string submit at the start of a line, you could make a slight change to your code.

 awk '/^submit/{print FILENAME;nextfile}' *.sh
 

Since you haven't told us about the Input_files and how they look, I am removing the OFS part here, which is not required anyway since you are printing only the file names. In case your Input_files are | delimited, you could add -F"|" after awk in the code above; and if your string submit is on a specific field, you could match only that field.

Thanks,
R. Singh

Hi,

It does not work when I replace the *.sh with

awk '/^submit/{print FILENAME;nextfile}' *.sh

in the original solution you provided. It gives the error: cannot open file awk.

Hello Vedanta,

I just tested with 3 files and it worked for me; could you please paste the exact error here? Also make sure you have at least read permission on the files in which you are trying to search for the keyword.

Thanks,
R. Singh

When I replaced *.sh with that awk command in your original code, I got the error below:

awk: cmd. line:37: fatal: can not open file 'awk' for reading ( no such file or directory )

Hello Vedanta,

Are you running this awk program as a .sh script too? If yes, then consider that the *.sh glob will match this script as well, since it also ends with .sh. If that is not the case, please post a listing of the files present in the directory so we can see what's going on.

Thanks,
R. Singh

I am running the program as a .sh script.

#!/bin/ksh
awk ' NR==1 {
                        print "File      | Jobname |  node  | pfile               | metafile            | tname | priority | delay" # to be replaced by tab
                   }
              /maxdelay/{
                                delay=$NF;
                                next
                        }
              /node/    {
                                node=$NF;
                                next
                        }
              /tname/   {
                                name=$NF;
                                next
                        }
              /pfile/   {
                                file=file?file","$NF:$NF;
                                next
                        }
              /metafile/{
                                metafile=metafile?metafile","$NF:$NF;
                                next
                        }
              /jobname/ {
                                jobname=$NF;
                                next
                        }
              /priority/{
                                pri=$NF;
                                next
                        }
              /exit/    {
                                print FILENAME OFS jobname OFS node OFS file OFS metafile OFS name OFS pri OFS delay;
                                #jobname=node=file=metafile=name=pri=delay="";
                        }
             ' OFS="\t"   awk '/^submit/{print FILENAME;nextfile}' *.sh  # replaced *.sh with awk

Please see last line. I want to pick only selected files ( 1000 files from many files).

I executed like below

ksh test.ksh

Hello Vedanta,

I am shocked; I had already told you not to mix the two codes. My first code was to get the output in your expected shape from an Input_file; the second code, which you posted, I corrected (with fair warning not to mix them, and to open a new thread for it).

I would request that you separate your requirements, as it is very confusing now.

NOTE: Your code above will NOT work in this style; you are using 2 awk invocations while providing the Input_file *.sh only once.

Thanks,
R. Singh

Hi,
I want to pull all the files that match the pattern of the input file given. In a directory there are, say, 5000 files, of which 1000 contain text starting with 'submit file', and from those files I want to get the output I mentioned earlier.

First filter the files that contain a line starting with 'submit file' (out of 5000, I would have 1000 such input files), then read only those files and produce the output in the required format I mentioned earlier. So basically, I want to filter the files, read only those as input, and get the required output. Thanks.

This is very convoluted logic. Using awk to read all 5000 of your files just to get a list of files containing a certain string to use as arguments to another awk script is grossly inefficient: you have to read all 5000 files and then read the selected 1000 files a second time. There is very seldom a need to invoke awk twice, but if you must, you have to actually invoke awk twice (with command substitution) rather than placing the second awk command line directly among the operands of the first, as in:

awk '...
...
             ' OFS="\t"   $(awk '/^submit/{print FILENAME;nextfile}' *.sh)  # replaced *.sh with awk

or:

awk '...
...
             ' OFS="\t"   $(grep -l '^submit' *.sh)  # filter *.sh with grep
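The grep -l selection step can be tried on its own; a.sh and b.sh below are made-up sample files, and only the one with a line starting with submit is listed:

```shell
# -l prints only the names of files containing a match.
tmp=$(mktemp -d) && cd "$tmp"
printf 'submit file=x\nexit\n' > a.sh
printf '#!/bin/ksh\necho hi\n' > b.sh
selected=$(grep -l '^submit' *.sh)
echo "$selected"    # a.sh
```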

But, as I said before, putting it all in a single awk script would be much more efficient:

#!/bin/ksh
awk ' NR==1 {
                        print "File      | Jobname |  node  | pfile               | metafile            | tname | priority | delay" # to be replaced by tab
                   }
              /maxdelay/{
                                delay=$NF;
                                next
                        }
              /node/    {
                                node=$NF;
                                next
                        }
              /tname/   {
                                name=$NF;
                                next
                        }
              /pfile/   {
                                file=file?file","$NF:$NF;
                                next
                        }
              /metafile/{
                                metafile=metafile?metafile","$NF:$NF;
                                next
                        }
              /jobname/ {
                                jobname=$NF;
                                next
                        }
              /priority/{
                                pri=$NF;
                                next
                        }
              /^submit/{
                                printit=1;
                                next
                        }
              /exit/ && printit{
                                print FILENAME OFS jobname OFS node OFS file OFS metafile OFS name OFS pri OFS delay;
                                #jobname=node=file=metafile=name=pri=delay="";
                                printit=0
                                nextfile
                        }
             ' OFS="\t" *.sh
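The printit/nextfile gating can be seen in miniature below (nextfile is a gawk/mawk extension, as in the script above; yes.sh and no.sh are made-up samples): a file is reported only if a line starting with submit was seen before exit.

```shell
# Only files where /^submit/ fired before /exit/ are reported.
tmp=$(mktemp -d) && cd "$tmp"
printf 'submit file=x\nexit\n' > yes.sh
printf 'echo no submit here\nexit\n' > no.sh
picked=$(awk '/^submit/ {printit=1; next}
              /exit/ && printit {print FILENAME; printit=0; nextfile}' *.sh)
echo "$picked"    # yes.sh
```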
1 Like

Many thanks!
One problem I am seeing here is that awk is also reading the comment lines.
Is there a way to ignore all comment lines at the very beginning? Do I need to create a separate post for this? Thanks.

I have used the below as a solution: added !/^#/ to each pattern, e.g.

/pfile/ && !/^#/ { ..


If you are having problems with comments in the shell scripts that are being fed into the script you specified in post #1 in this thread, you don't need to start a new thread; otherwise, you do.

Either way you need to explain what comments need to be removed and exactly how your awk script is supposed to determine what you consider to be a "comment line at the very beginning". The only comment shown in your sample input file(s) is:

#!/bin/ksh

and I don't see why removing that comment from your input files will make any difference in your results. If you don't supply representative sample input files corresponding to the data you want to process, you are wasting time for all of us.

Easiest: at the beginning of the awk script, put

/^#/ { next }

It skips the rest of the script whenever a # comment line is met.

Note that it can't quite be at the beginning of the script... It has to come after:

awk ' NR==1 {
                        print "File      | Jobname |  node  | pfile               | metafile            | tname | priority | delay" # to be replaced by tab
                   }

or you won't get the desired heading line in your output file.

1 Like

Good point. The correct location is here

awk ' NR==1 { ...
  }
/^#/ { next }
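A quick check of the ordering point, reduced to two lines of input: with /^#/{next} placed after the NR==1 rule, the #!/bin/ksh line still triggers the header before being skipped.

```shell
# NR==1 prints the header first; only then does /^#/ discard the line.
out=$(printf '#!/bin/ksh\ndata line\n' |
  awk 'NR==1 {print "header"} /^#/ {next} {print}')
echo "$out"
```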