awk not working for calculating no of lines with criteria

I have a tar.gz file and I want to count the lines that match a criterion as well as the lines that do not. My code is below.

Output Requirement:
Match the input from zcat against the 26th field having the value 02. If a line matches, print it to a file and increase the match counter by 1; if it doesn't match, increase the not-match counter. At the end I should have 2 files: a.txt holding the matching records, and another file holding the match counter and not-match counter values.

But this is not working, please help

zcat filename.tar.gz | awk -v mon="07" '
BEGIN {
 if (( (substr($0,26,2)=="02") && substr($0,84,2) == month  ))
  print $0 >> "a.txt"
  ++matchcounter
 else 
 ++notmatch 
 ;}
END { print matchcounter","notmatch >> "countfile"}
'

Try without BEGIN

zcat filename.tar.gz | awk -v mon="07" '
    {
 if (( (substr($0,26,2)=="02") && substr($0,84,2) == month  ))
  print $0 >> "a.txt"
  ++matchcounter
 else 
 ++notmatch}
    END { print matchcounter","notmatch >> "countfile"}'

Seems to be working, thanks.

One more thing: if I also want to extract each filename from filename.tar.gz and get the match and not-match counts per file, how would I do that?


Hi pamu,
when I tried removing BEGIN it gave an error. The code is below:

 cat filename.tar.gz | awk -v mon="07" '
    {
 if (( (substr($0,26,2)=="02") && substr($0,84,2) == month  ))
  print $0 >> "a.txt"
  ++matchcounter
 else
 ++notmatch};
    END { print matchcounter","notmatch >> "countfile"}'
awk: cmd. line:5:  else
awk: cmd. line:5:  ^ syntax error

Can you please suggest

I don't know what you're trying to do, but there are a few obvious problems. Since I don't know what you're trying to do, I haven't made any attempt to test the following suggestion.

In earlier posts in this thread you were using zcat to unzip a compressed tar file. Your script is now using cat instead of zcat???

Your script sets a variable named "mon" (and never uses it) and uses a variable named "month" (that has never been set).

And as shown by the diagnostic messages you're getting from awk, your if statement is not using the correct syntax.

The following seems to fix the obvious issues above:

zcat filename.tar.gz | awk -v mon="07" '
{       if(substr($0,26,2)=="02" && substr($0,84,2) == mon) {
                print $0 >> "a.txt"
                ++matchcounter
        } else 
                ++notmatch
}
END {   print matchcounter","notmatch >> "countfile"}'

The crucial changes are the braces around the two statements in the if branch and the use of mon instead of month; the other changes are editorial.
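The if/else fix can be checked without touching the archive at all. The following is only a minimal sketch on made-up data: make_record is a hypothetical helper that builds an 85-character line with a type code in columns 26-27 and a month in columns 84-85, and the awk body is the corrected script with the file output replaced by a variable assignment so that nothing is written to disk.

```shell
# Hypothetical helper: build one synthetic 85-character record.
#   columns  1-25  zeros
#   columns 26-27  type code   ($1)
#   columns 28-83  blanks
#   columns 84-85  month       ($2)
make_record() {
    printf '%025d%s%56s%s\n' 0 "$1" '' "$2"
}

counts=$(
    { make_record 02 07; make_record 03 07; make_record 02 08; } |
    awk -v mon="07" '
    {
        if (substr($0, 26, 2) == "02" && substr($0, 84, 2) == mon) {
            # Two statements in the if-branch: without the braces,
            # awk reports a syntax error at the "else".
            matched = $0
            ++matchcounter
        } else
            ++notmatch
    }
    END { print matchcounter "," notmatch }'
)
echo "$counts"
```

Only the first record has both 02 and 07 in the right columns, so this should report one match and two non-matches.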

I don't see how this script can do anything useful with a tar file, but I haven't made any changes to account for that. Your if statement might make sense (although I didn't look at the definition of a tar header block to confirm it) if it was only looking at tar headers, but this script is looking at every line in the tar file (headers and archived file contents).

hey don,
that's correct, the script is looking at each line of the archived files in the tar file. Can you please suggest how to get the name of the archived file that each line being parsed by the code comes from?

Shell scripts aren't particularly well suited to skipping over chunks of data that may contain data that is binary (rather than textual) contents of a file. Even if all of the files that are included in the tar file are text files, the shell and awk aren't necessarily a good fit for this job. Instead of showing us a broken awk script that doesn't do what you want to do; why don't you tell us what you are trying to do and show us output from the command:

tar -tvf filename.tar.gz

or if that doesn't work:

zcat filename.tar.gz | tar -tvf -

and show us exactly what output you want your shell/awk script to produce when given this gzipped tar archive (or the unzipped tar archive) as input?

hi don,
I want to parse a tar file containing multiple files (more than 5000) using the conditions in the awk script, and I also need to print the name of the file in which the condition matched. If the if condition in the awk script is not matched, the notmatch counter should increase, along with the name of the file in which it did not match.

Code 1:

zcat filename.tar.gz

This would read each line of archived files

Code 2:

awk -v mon="07" '
{       if(substr($0,26,2)=="02" && substr($0,84,2) == mon) {
                print $0 >> "a.txt"
                ++matchcounter
        } else 
                ++notmatch
}

This checks each line against the condition. If the line matches, it is written to a text file and the match counter is increased; if it does not match, the notmatch counter is increased.

Code 3:

END {   print matchcounter","notmatch >> "countfile"}'

This prints the values of the match and notmatch counters.

Now the requirement is to also print the archived filename in the condition above (code no. 2).

OK. I have looked up the tar header format. The tar header contains lots of nul bytes, so any attempt to process a tar archive using the shell, awk, sed, or any other Linux or UNIX text processing utilities produces undefined results. The 1st 100 bytes in a tar header may contain the file's name (if it is <= 100 bytes long), may contain one or more directory names from the file's pathname (if they fit along with the file's name in 100 bytes), and may contain complete garbage left over from archiving a previous file. If the pathname is longer than 100 bytes, the ustar format can split it at a slash: the leading directories (up to 155 bytes) are saved in the prefix field in bytes 345-499 (with the 1st byte numbered 0) and the rest of the pathname (including the final component) stays in the 1st 100 bytes. So your awk script seems to be looking for "02" and "07" at specific points in the middle of a pathname that ends with a newline character and that is somewhere between 86 and 100 bytes long. If these conditions are met in the 1st file archived in the tar file, you may get the results you want for that file; otherwise, all bets are off.
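As a rough illustration of the header layout described above, the sketch below builds a throwaway one-member archive and uses dd to pull out the name field, the magic string stored at byte 257, and the number of nul bytes in the first 512-byte header block. The names sample.txt and demo.tar are made up for the example, and it assumes a tar that writes ustar-style headers by default (I believe GNU tar and bsdtar both do, but check yours).

```shell
# Work in a scratch directory so nothing outside it is touched.
work=$(mktemp -d)
cd "$work" || exit 1

printf 'hello\n' > sample.txt     # hypothetical member file
tar -cf demo.tar sample.txt       # uncompressed archive with one member

# Bytes 0-99 of the first header block: the name field, nul padded.
name=$(dd if=demo.tar bs=1 count=100 2>/dev/null | tr -d '\0')

# Bytes 257-261: the magic string that identifies ustar-style headers.
magic=$(dd if=demo.tar bs=1 skip=257 count=5 2>/dev/null)

# Number of nul bytes in the 512-byte header block; these are what make
# the archive unsuitable as input for awk and other text utilities.
nuls=$(dd if=demo.tar bs=512 count=1 2>/dev/null | tr -cd '\0' | wc -c)

echo "name=$name magic=$magic nul_bytes=$nuls"
```

The nul bytes are stripped with tr before the name is stored in a shell variable because shell variables cannot hold them, which is the same limitation that affects awk here.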

If you will show us what I asked for in my last message (or at least the 1st several lines of output from the tar command and the corresponding output you want to be produced for those lines), we may be able to help you parse the output of a tar archive listing command to get what you want. Otherwise, I don't see how we can help.

Hi don,
as requested, the output from zcat and the output from the code are as follows

OUTPUT FROM ZCAT filename.tar.gz

20130701/
0001750020745500000000000082010060000000000                                                USSDlike                                        0000000000000429496704040
5899136999995
000000000000002148063927402YD-MTSBAL               519132008926477227        1120130701074546201307020745460000000001121005060000000001
  405891369335696         MTSCHNAOC               2471                    00000000000004294967040405899136999995
000000000000003148064263403YD-MTSBAL               519131878925724626        1120130701074550201307020745500000000001134005060000000000
                          MTSCHNAOC                                       00000000000004294967040405899136999995

Output from the code

000000000000002155850114502YD-MTSBAL               519132008641092603        1120130714101521201307151015210000000001038005060000000001
  405891743536224         MTSCHNAOC               2458                    00000000000004294967040405899136999995
000000000000003155849253702YD-MTSBAL               519132009153053234        1120130714101512201307151015120000000001122005050820600001
  405891360052922         MTSCHNAOC               2471                    00000000000004294967040405899136999995

OK. This is not what I asked for, but it is informative.

I take back everything I said before. I made the wild assumption that your filename filename.tar.gz followed normal UNIX and Linux conventions (i.e., it was a tar output file that had been compressed using gzip). But the output from the zcat clearly shows that this is not a tar archive. So, exactly what command line was used to create filename.tar.gz?

And, no matter what created this file, the awk script you have been showing us would never produce the four lines of output you have shown above. Two of these lines seem to meet your criteria, although the text I marked in red (that you showed in bold) can't both be from input columns 84 and 85. (Although both lines do contain 07 in columns 84 and 85.) But, the other two lines don't contain the strings "02" or "07" anywhere that I can see.

So. Forget about the awk code. Tell us in English what criteria you used to decide that the four lines of output shown above are the output that you want?

command line used for creating filename.tar.gz is as follows:

tar -zcvf filename.tar.gz file*.*

OUTPUT FROM ZCAT filename.tar.gz

20130701/
0001750020745500000000000082010060000000000                                                USSDlike                                        0000000000000429496704040
5899136999995
000000000000002148063927402YD-MTSBAL               519132008926477227        1120130701074546201307020745460000000001121005060000000001
  405891369335696         MTSCHNAOC               2471                    00000000000004294967040405899136999995
000000000000003148064263403YD-MTSBAL               519131878925724626        1120130701074550201307020745500000000001134005060000000000
                          MTSCHNAOC                                       00000000000004294967040405899136999995

Above is the input

Now for the required output, I have placed a check that prints only those lines which have 02 in the 26th field of the input line and 07 in the 84th field, each with a length of 2.

 if(substr($0,26,2)=="02" && substr($0,84,2) == mon)

So in case it matches, I print the line to a file, count the number of matches, and also record the filename in which the condition matched. I.e., if filename.tar.gz has 10 files with names, say, file1, file2 ... file10, then every line matching the condition above should print something like this:


000000000000002155850114502YD-MTSBAL               519132008641092603        1120130714101521201307151015210000000001038005060000000001
  405891743536224         MTSCHNAOC               2458                    00000000000004294967040405899136999995

000000000000002155850114502YD-MTS               519132008641092603        1120130715101521201307151015210000000001038005060000000001
  405891743536224         MTSCHNAOC               2458                    00000000000004294967040405899136999995

Since the lines above match the condition, the match counter will be increased accordingly.

In the end I need the match and not-match counts for each file, and the lines that matched the condition in a.txt. The content of countfile should look something like this:

file1 match count, notmatch count
file2 match count, notmatch count
file3 match count, notmatch count
.
.
.
file10 match count, notmatch count

The content of a.txt should look as mentioned above.

Since I have space constraints, the archive cannot be untarred :(.
Hope this clarifies.

Since you sent me private mail asking me to help you on this again, I take it that you ignored my previous messages in this thread. The archive files produced by tar contain lots of NUL bytes; so by definition tar archive files are binary, not text, files. The shell and awk utilities are built to work with text files, not binary files, so there is no way to do what you're trying to do with a standard awk. (Some implementations may provide extensions to awk enabling it to work on binary files, but I do not have access to any such implementation. You might also be able to write a perl program to do this, but I am not fluent enough in perl to help you try this.)

It would be easy to extract the files from the archive and walk through the regular files in the extracted file hierarchy to get what you want. But, you say you don't have the room to do that.

The output format produced by tar -t and tar -tv is not standardized (and varies from implementation to implementation). It may be possible for you to use tar -t or tar -tv to get a list of regular files stored in the archive and then use tar -xO pathname in a loop with pathname set to a different regular file in the archive each time through the loop so you can feed the contents of that file through your awk script without saving a copy of the file on disk.

That will require reading the archive n+1 times if there are n regular files in the archive and even this only works if all of the regular files in the archive are text files. I encourage you to play with tar to see if you can make this work. (On some implementations, tar -tf archive will list directories in the archive with a trailing slash on the name and other files without a trailing slash. If the implementation of tar on your system does this; you can use the trailing slash to determine whether to skip that file or to extract it and feed it to your awk script.)
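The list-then-extract loop described above might look roughly like the following. This is an untested-on-your-data sketch, not a definitive solution: it builds its own small archive from made-up records (make_record and the file contents are hypothetical), and it assumes that your tar accepts -O to write a member to standard output, that tar -tzf marks directories with a trailing slash, and that member names contain no newlines or backslashes.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Hypothetical sample data: type code in columns 26-27, month in 84-85.
make_record() { printf '%025d%s%56s%s\n' 0 "$1" '' "$2"; }
mkdir 20130701
{ make_record 02 07; make_record 03 07; } > 20130701/file1
{ make_record 02 07; make_record 02 07; make_record 02 08; } > 20130701/file2
tar -czf filename.tar.gz 20130701
rm -r 20130701                    # from here on only the archive exists

: > a.txt
: > countfile

# The first pass lists the members; each later pass streams one member
# through awk without saving a copy of it on disk.
tar -tzf filename.tar.gz | while IFS= read -r member; do
    case $member in
        */) continue ;;           # trailing slash: a directory, skip it
    esac
    tar -xzOf filename.tar.gz "$member" |
    awk -v mon="07" -v name="$member" '
    {
        if (substr($0, 26, 2) == "02" && substr($0, 84, 2) == mon) {
            print $0 >> "a.txt"
            ++matchcounter
        } else
            ++notmatch
    }
    # Adding 0 prints a count of zero as 0 rather than an empty string.
    END { print name, (matchcounter + 0) "," (notmatch + 0) >> "countfile" }'
done

cat countfile
```

With 5000+ members this rereads and decompresses the archive once per member, so it trades time for disk space; that is the n+1 cost mentioned above.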


hey don,
thanks for the input. When I am in need I don't ignore other remarks. I went through your earlier comments and was looking for ways to crack this on binary files, which is how I learned that the archive I am searching is in ustar format. Anyway, I am working on your comments and will get back to you if any further help is required.