How to put the command to remove duplicate lines in my awk script?

Tim2424 · August 8, 2019, 4:11am

I am creating a CGI in bash/HTML.

My awk script looks like this:

echo "<table>"
for fn in /var/www/cgi-bin/LPAR_MAP/*;
do
echo "<td>"
echo "<PRE>"
awk -F',|;' -v test="$test" '
     NR==1 { 
        split(FILENAME ,a,"[-.]");
      }
      $0 ~ test {
          if(!header++){
              print "DATE ========================== : " a[4] 
          }
          print ""
          print "LPARS :" $2
          print "RAM : " $5
          print "CPU 1 : " $6
          print "CPU 2 : " $7
          print "" 
          print ""
      }' $fn;



echo "</PRE>"
echo "</td>"
done
echo "</table>"

This script analyzes 276 CSV files that look like this:


MO2PPC20;mo2vio20b;Running;VIOS 2.2.5.20;7;1.0;2;DefaultPool;shared;uncap;192 MO2PPC20;mo2vio20a;Running;VIOS 2.2.5.20;7;1.0;2;DefaultPool;shared;uncap;192 MO2PPC21;mplaix0311;Running;AIX 7.1 7100-05-02-1832;35;0.6;4;DefaultPool;shared;uncap;64 MO2PPC21;miaibv194;Running;AIX 6.1 6100-09-11-1810;11;0.2;1;DefaultPool;shared;uncap;64 MO2PPC21;mplaix0032;Running;AIX 6.1 6100-09-11-1810;105;4.0;11;DefaultPool;shared;uncap;128 MO2PPC21;mplaix0190;Running;Unknown;243;4.9;30;DefaultPool;shared;uncap;128 MO2PPC21;mo2vio21b;Running;VIOS 2.2.6.10;6;1.5;3;DefaultPool;shared;uncap;192 MO2PPC21;miaibv238;Running;AIX 7.1 7100-05-02-1810;10;0.5;1;DefaultPool;shared;uncap;64 MO2PPC21;mo2vio21a;Running;VIOS 2.2.6.10;6;1.5;3;DefaultPool;shared;uncap;192 MO2PPC21;miaibv193;Running;AIX 6.1 6100-09-11-1810;12;0.2;1;DefaultPool;shared;uncap;64 MO1PPC17;miaibe03;Running;AIX 5.2 5200-10-08-0930;25;null;3;null;ded;share_idle_procs;null MO1PPC17;miaiba12;Running;AIX 5.2 5200-10-08-0930;17;null;2;null;ded;share_idle_procs;null MO1PPC17;miaibf03;Running;AIX 5.2 5200-10-08-0930;30;null;3;null;ded;share_idle_procs;null MO1PPC17;miaibc05;Running;AIX 5.2 
 5200-10-08-0930;40;null;2;null;ded;share_idle_procs;null

It displays them in my CGI like this:

The number of columns equals the number of CSV files analyzed.

As you can see in the screenshot, some lines are identical across the CSV files. The idea is to remove the lines that are the same in all my CSV files.

I know the command :

awk '!a[$0]++'

But I don't know how to put it in my awk script... Do you have any idea ?

Thank you !

Neo · August 8, 2019, 11:36pm

I always do these types of tasks in PHP ; but that's just me. We have a lot of AWK lovers here who will help with the AWK code.

My comment is only on the HTML :

Regarding the code above, my only comment is that most web developers agree that <table> tags should be avoided and <div> tags should be used instead.

I have mostly eliminated all <table> tags here at unix.com, but there are still a few <table> tags here at unix.com from decades of legacy code I need to obsolete someday......

Don_Cragun · August 9, 2019, 12:14am

So from your description, you seem to be saying that the output you want instead of the output you've shown us would be just the two lines consisting of the one containing the dates and the line that contains the RAM values 99, 99, and .25???

The first line in the image you showed us is not produced by the script you have shown us (so I am not seeing how anything that awk might do will affect that line of output) and every other line in that image has the same value in all three columns. The output described above doesn't seem to be very useful.

Note that the awk code '!a[$0]++' will discard duplicated lines within all of the files processed by a single invocation of awk . But, since you are processing each of your 276 CSV input files in a separate invocation of awk there is no way for any of those invocations of awk to compare any input values from one input file to any input values from any other input file.
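To illustrate the point above, here is a minimal sketch. The two file names and their contents are made up for the demonstration; only the awk idiom comes from this thread.

```shell
#!/bin/sh
# Hypothetical sample data: two small files that share one identical line.
dir=$(mktemp -d)
printf 'A;lpar1;7\nA;lpar2;9\n'  > "$dir/f1.csv"
printf 'A;lpar1;7\nA;lpar2;12\n' > "$dir/f2.csv"

# One awk process per file: the array a starts empty in every process,
# so the line shared by both files is printed twice.
per_file=$(for fn in "$dir"/*.csv; do awk '!a[$0]++' "$fn"; done)

# One awk process for all files: the array a persists across files,
# so the shared line is printed only once.
single=$(awk '!a[$0]++' "$dir"/*.csv)

printf 'per-file runs:\n%s\n\nsingle run:\n%s\n' "$per_file" "$single"
rm -rf "$dir"
```

The per-file loop prints four lines, the single invocation prints three.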

Tim2424 · August 9, 2019, 6:05am

Hello !

Thank you for your answer !

That is why I decided to change my script.

For the moment, I have this:

awk -F",|;" ' {$0=$1","$2","$5","$6","$7 } /'$test'/ { if (!a[$0]++)
{
print ""
printf "LPARS : %s\n", $2
printf "RAM : %s\n", $3
printf "CPU1 : %s\n", $4
printf "CPU2 : %s\n", $5
print ""
}
}' /var/www/cgi-bin/LPAR_MAP/*.csv ;

And it works.

But I don't know how to put :

NR==1 { 
        split(FILENAME ,a,"[-.]");
     }

and

$0 ~ test {
          if(!header++){
              print "DATE ========================== : " a[4] 
          }

Any ideas ?
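One possible way to combine the two pieces in a single awk invocation, as a sketch only: it assumes file names of the form MYCSV-DATE-YYYYMMDD.csv and uses made-up sample data. The names part, seen and key are illustrative. FNR == 1 is used instead of NR == 1 so that the date and the header flag are reset for every input file.

```shell
#!/bin/sh
# Hypothetical sample files; the second one repeats a line from the first.
dir=$(mktemp -d)
printf 'MO2PPC21;lparA;Running;AIX;35;0.6;4\n' > "$dir/MYCSV-DATE-20190811.csv"
printf 'MO2PPC21;lparA;Running;AIX;35;0.6;4\nMO2PPC21;lparB;Running;AIX;11;0.2;1\n' \
    > "$dir/MYCSV-DATE-20190812.csv"

out=$(awk -F';' -v test="MO2PPC21" '
    FNR == 1 {                          # first line of each input file
        n = split(FILENAME, part, "[-.]")
        date = part[n - 1]              # piece just before the "csv" extension
        header = 0                      # allow one DATE line for this file
    }
    $0 ~ test {
        key = $1 "," $2 "," $5 "," $6 "," $7
        if (!seen[key]++) {             # skip lines already seen in any file
            if (!header++) print "DATE ========================== : " date
            printf "LPARS : %s\nRAM : %s\nCPU 1 : %s\nCPU 2 : %s\n\n", $2, $5, $6, $7
        }
    }' "$dir"/*.csv)

printf '%s\n' "$out"
rm -rf "$dir"
```

With this sample data, lparA is printed only under the first date and lparB only under the second.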

Tim2424 · August 9, 2019, 10:33am

I tried something like this:

echo "<table>"
for fn in /var/www/cgi-bin/LPAR_MAP/*.csv;
do
echo "<td>"
echo "<PRE>"

awk -F',|;' '{ $0=$1","$2","$5","$6","$7} /'$test'/ { if (!a[$0]++) 
{
print "DATE ================= : " FILENAME 
printf "LPARS : %s\n", $2
printf "RAM : %s\n", $3
printf "CPU1 : %s\n", $4
printf "CPU2 : %s\n", $5
print ""
}
}' $fn

echo "</PRE>"
echo "</td>"
done
echo "</table>"

The output is:

It's almost what I want. I will keep searching!

Don_Cragun · August 9, 2019, 10:13pm

I guess I still don't understand what output you're trying to produce.

Everything in the second column of the output you showed us in the image you included in post #5 in this thread is identical to the data in the first column of the output in that same image, with the possible exception of the filename you have in the date field (which is chopped off because you didn't show us the entire image). In the first post you said you wanted to delete output lines where everything matched, but the output you've shown us has everything matching with nothing deleted???

Please try once more to clearly explain what you are trying to do. It is hard to help you come up with a solution to your problem if we can't figure out what you're trying to do!

Tim2424 · August 12, 2019, 4:20am

Hello there !

I will try to be clearer (I'm French, so I don't speak English very well, and I think this is why you are having trouble understanding some things :D)

With my first script in post #1, I could display the information in columns. For each CSV file, I create a new column headed by the date of the file (the filename looks like MYCSV-DATE-20190812.csv, which is why I use split(FILENAME, ...) to keep only the date, 20190812), and under the date I display the information from the corresponding CSV file:

On this screenshot, you can see that the information is the same in every column (there are only 3 columns here, but I have 276 CSV files, so there should be 276 columns). That is expected, because some lines are identical across the CSV files.

Sometimes, some columns are empty, so with this piece of code:

$0 ~ test {
    if(!header++){
        print "DATE ========================== : " a[4]
    }

I don't display the empty columns.

But now, I don't want to display the lines that are the same. This would reduce the number of columns. I know this command:

awk '!a[$0]++'

With this command added to my script, and with the same frame as in my screenshot, only 40 columns with one or two pieces of information each are displayed, compared with 276 with my first script.

But (certainly because I'm a beginner and my skills are still pretty bad :D), I didn't manage to put this command in my script. So I decided to start from the beginning. I wrote this script:

awk -F",|;" ' {$0=$1","$2","$5","$6","$7 } /'$test'/ { if (!a[$0]++)
{
print ""
printf "LPARS : %s\n", $2
printf "RAM : %s\n", $3
printf "CPU1 : %s\n", $4
printf "CPU2 : %s\n", $5
print ""
}
}' /var/www/cgi-bin/LPAR_MAP/*.csv ;

This script displays only the lines that differ between files (in my case, the differing lines represent the moments where the RAM and CPU consumption changed, and this is what I want to display). Now I want to display this information in columns, like my first script: one file = one date = one column, with the information from the file under the date.

I changed my script for that:

echo "<table>" 
for fn in /var/www/cgi-bin/LPAR_MAP/*.csv; 
do 
echo "<td>" 
echo "<PRE>"  
awk -F',|;' '{ $0=$1","$2","$5","$6","$7} /'$test'/ { if (!a[$0]++)  
{ 
print "DATE ================= : " FILENAME  
printf "LPARS : %s\n", $2 
printf "RAM : %s\n", $3 
printf "CPU1 : %s\n", $4 
printf "CPU2 : %s\n", $5 
print "" 
} 
}' $fn  
echo "</PRE>" 
echo "</td>" 
done 
echo "</table>"

And the output is:

So yes, there is a mistake somewhere, because all the files are displayed in the same column and the content of this column is repeated in the other columns... But there are some good points:

Only the differing lines are displayed
The information is displayed under the right date

Now I will change that to create one column per date, and change the FILENAME output to keep only the date.

I hope this is clearer!

And thank you for your help !

Don_Cragun · August 12, 2019, 5:06pm

Please show us the exact output you hope to produce from the three sample files that we can assume were used to produce the output shown in the image you showed us in post #1 in this thread (preferably as text in CODE tags rather than as an image).

If you're trying to do what I think you are trying to do, you cannot use !a[$0]++ to filter your input because it will throw away input before you know whether or not you will want to print it. I think you need to read all of your input files and then compare the data for each LPARS value. If and only if all of the entries are identical, then you can decide not to print that row of output.
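A minimal sketch of that read-everything-then-compare idea, under stated assumptions: the file names and data are invented, the frame name is passed in a hypothetical variable frame, and a file that has no entry for an LPARS value is simply ignored (which is only one of the two possible treatments asked about below).

```shell
#!/bin/sh
# Hypothetical sample files: lparA is identical in both, lparB changes its RAM.
dir=$(mktemp -d)
printf 'MO2PPC21;lparA;Running;AIX;35;0.6;4\nMO2PPC21;lparB;Running;AIX;11;0.2;1\n' \
    > "$dir/LPAR-MAP-20190801.csv"
printf 'MO2PPC21;lparA;Running;AIX;35;0.6;4\nMO2PPC21;lparB;Running;AIX;16;0.2;1\n' \
    > "$dir/LPAR-MAP-20190802.csv"

changed=$(awk -F';' -v frame="MO2PPC21" '
    $1 == frame {
        key = $2                        # LPARS name
        val = $5 ";" $6 ";" $7          # RAM, CPU 1, CPU 2
        if (!(key in first))
            first[key] = val            # remember the first values seen
        else if (val != first[key])
            differs[key] = 1            # a later file disagrees
    }
    END {
        # Only LPARS values whose data is not identical everywhere remain.
        for (key in differs) print key
    }' "$dir"/*.csv)

echo "$changed"
rm -rf "$dir"
```

A second pass over the files (or per-file arrays filled in the same pass) could then print the columns only for the names collected in differs.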

Note that in your sample input file shown in post #1 in this thread, you showed us two lines that seem to be in completely different formats. Please explain what the real format is for your input files.

It is also unclear as to whether or not all of the input files will contain an entry for each LPARS value. If a record for a specific LPARS value is not included in a file, should that be treated as a "different" value causing a line to be printed? Or should a file be ignored when determining whether or not to print an LPARS value line if there is no entry for that LPARS value in that file?

Tim2424 · August 13, 2019, 4:06am

Hello there !

I will try to show you what I want :

I would like this output :

Date ======== 201908XX      Date ======== 201908XX      Date ======== 201908XX

LPARS : XX                  LPARS : XX                  LPARS : XX
RAM : XX                    RAM : XX                    RAM : XX
CPU 1 : XX                  CPU 1 : XX                  CPU 1 : XX
CPU 2 : XX                  CPU 2 : XX                  CPU 2 : XX

Each date corresponds to one file.

Oh, I didn't see that... It was just an example of my CSV files. A layout problem, I guess. The correct layout:

MO2PPC20;mo2vio20b;Running;VIOS 2.2.5.20;7;1.0;2;DefaultPool;shared;uncap;192
MO2PPC20;mo2vio20a;Running;VIOS 2.2.5.20;7;1.0;2;DefaultPool;shared;uncap;192 
MO2PPC21;mplaix0311;Running;AIX 7.1 7100-05-02-1832;35;0.6;4;DefaultPool;shared;uncap;64 
MO2PPC21;miaibv194;Running;AIX 6.1 6100-09-11-1810;11;0.2;1;DefaultPool;shared;uncap;64 
MO2PPC21;mplaix0032;Running;AIX 6.1 6100-09-11-1810;105;4.0;11;DefaultPool;shared;uncap;128 
MO2PPC21;mplaix0190;Running;Unknown;243;4.9;30;DefaultPool;shared;uncap;128 
MO2PPC21;mo2vio21b;Running;VIOS 2.2.6.10;6;1.5;3;DefaultPool;shared;uncap;192 
MO2PPC21;miaibv238;Running;AIX 7.1 7100-05-02-1810;10;0.5;1;DefaultPool;shared;uncap;64 
MO2PPC21;mo2vio21a;Running;VIOS 2.2.6.10;6;1.5;3;DefaultPool;shared;uncap;192 
MO2PPC21;miaibv193;Running;AIX 6.1 6100-09-11-1810;12;0.2;1;DefaultPool;shared;uncap;64 
MO1PPC17;miaibe03;Running;AIX 5.2 5200-10-08-0930;25;null;3;null;ded;share_idle_procs;null 
MO1PPC17;miaiba12;Running;AIX 5.2 5200-10-08-0930;17;null;2;null;ded;share_idle_procs;null 
MO1PPC17;miaibf03;Running;AIX 5.2 5200-10-08-0930;30;null;3;null;ded;share_idle_procs;null 
MO1PPC17;miaibc05;Running;AIX 5.2 5200-10-08-0930;40;null;2;null;ded;share_idle_procs;null

In my script, I keep only columns 1, 2, 5, 6 and 7 thanks to awk.

There is a value for each LPARS. And even if there is no value, it's not a problem. For example, if I have no value for the RAM, nothing will be displayed next to "RAM":

 Date ======== 201908XX      
                         
 LPARS : foo                            
 RAM :                                
 CPU 1 : 4                            
 CPU 2 : 2

I just need to put the content of the CSV file next to its "key name" (LPARS, RAM, CPU 1 or CPU 2). So if there is no information, nothing will be displayed.

I don't think it is difficult, but I don't see the solution... I succeeded with my first script, and all I needed was to delete the duplicate lines... Now that I've managed to delete the duplicate lines, I just need to put the output in the right layout...

I hope this is clearer.

Have a nice day !

Don_Cragun · August 14, 2019, 4:08am

Your first script never referenced input file field #1 in any way. Your remaining scripts keep it in reformatted input lines but never reference it.

Your scripts show a variable named test being used to filter input, but give no indication of how it is set, what it is used to match, or why it is there.

Please show us the code that you have hidden from us. Am I correct in guessing that you are setting the shell variable test to a value that will be identical to one of the values that will be found in field #1 in each of your input files?

From the image you supplied in post #1 in this thread I thought the output you wanted would be something like:

DATE ===== 20180122    DATE ===== 20180124    DATE ===== 20180125
RAM : 99               RAM : 99               RAM : 0.25

which are the only two lines in your output that do not have identical values in all three columns. I would have thought that it would be more useful to also show the rest of the information lines in the output related to LPARS value miaibg04 . But, since the data you say you want in post #9 has three input files with the same date ( 201908XX ) and identical values for all of the other fields ( XX ), I am still just guessing at what output you want to produce.

It is after 1:00am here, so I am going to bed. When I get up I will see if I can manufacture some input file data that I can use to test something that might or might not be similar to three of your input files and then see if I can create an awk script that will produce output that I might find useful. Since you are making this so difficult for any of us who are trying to help you, this may take a while and will not be high on my priority list.

Tim2424 · August 14, 2019, 5:52am

Hello !

You're right.

The complete code is :

read a
test=$( echo $a | cut -d'=' -f2)


echo "<p><h2>FRAME : $test</h2></p>"

echo "<table>"
for fn in /var/www/cgi-bin/LPAR_MAP/*;
do
echo "<td>"
echo "<PRE>"
awk -F',|;' -v test="$test" ' 
     NR==1 {
        split(FILENAME ,a,"[-.]");
      }
     $0 ~ test  {
          if(!header++){
              print "DATE ========================== : " a[4] 
          }
          print ""
          print "RAM : " $5
          print "CPU 1 : " $6
          print "CPU 2 : " $7
          print "" 
          print ""
      }' $fn;

echo "</PRE>"
echo "</td>"
done
echo "</table>"

echo "<table>"
echo "<td>"
echo "<PRE>"

read a retrieves the query string, and test=$( echo $a | cut -d'=' -f2) trims it. The raw string is FRAME_NAME=MIAIBYE00 . It is generated from a listbox in my index page which contains the list of my FRAMES. I use the cut command to keep only the right side of the = . My variable $test therefore holds the value from the query string, and I keep only the lines that contain it.
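As a small sketch of that extraction step (the string is hard-coded here instead of being read from standard input, and the variable names with_cut and with_expansion are only illustrative), the cut pipeline and a plain shell parameter expansion give the same result for this input:

```shell
#!/bin/sh
# Simulate the string that "read a" would receive from the form.
a='FRAME_NAME=MIAIBYE00'

# cut keeps the second "="-separated field.
with_cut=$(echo "$a" | cut -d'=' -f2)

# ${a#*=} removes everything up to and including the first "=",
# without starting an extra process.
with_expansion=${a#*=}

echo "cut: $with_cut  expansion: $with_expansion"
```

Quoting "$a" avoids word splitting and globbing if the value ever contains spaces or wildcard characters.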

In my post #1, I made a screenshot of only three columns because... I can't do more. The date comes from the filename, so as I have 276 CSV files, there are 276 columns. The screenshot only shows 3 of them.
Likewise, the lines from my CSV are just there as an example. In reality, I have 226442 lines. You understand that I can't post all these lines as an example.

So, in a nutshell:

I have many CSV files (276 CSV files -> 226442 lines)
I use awk to keep only columns 1, 2, 5, 6 and 7. I would like to keep only the lines that are not the same, so I use if (!a[$0]++) to delete the duplicate lines (by eliminating the duplicate lines, I reduce the number of columns too).
I would like to display this information like this, using an HTML table:

DATE ===== XXXXXXXX    DATE ===== XXXXXXXX    DATE ===== XXXXXXXX
LPARS : XXX            LPARS : XXX            LPARS : XXX
RAM : XX               RAM : XX               RAM : XXX
CPU 1 : XX             CPU 1 : XX             CPU 1 : XX
CPU 2 : XX             CPU 2 : XX             CPU 2 : XX

LPARS : XXX            LPARS : XXX            LPARS : XXX
RAM : XX               RAM : XX               RAM : XXX
CPU 1 : XX             CPU 1 : XX             CPU 1 : XX
CPU 2 : XX             CPU 2 : XX             CPU 2 : XX
 
...

As in my first script, and as you can see in the example on the screenshot.

LPARS : the content of the column 2 kept by the awk command
RAM : the content of the column 5 kept by the awk command
CPU 1 : the content of the column 6 kept by the awk command
CPU 2 : the content of the column 7 kept by the awk command

There is no problem with that. It's just a simple request. If you don't have time to answer me or if you can't find a solution, never mind! I will keep looking for a solution on my side!

Have a nice day !

Don_Cragun · August 14, 2019, 4:42pm

tim2424:

read a retrieves the query string, and test=$( echo $a | cut -d'=' -f2) trims it. The raw string is FRAME_NAME=MIAIBYE00 . It is generated from a listbox in my index page which contains the list of my FRAMES. I use the cut command to keep only the right side of the = . My variable $test therefore holds the value from the query string, and I keep only the lines that contain it.

OK. I know that the value stored in the shell variable test is used to filter the input. You still haven't clearly answered the question: Is the value stored in test a string that is an exact match for a $1 value in your input files? Assuming that it is, the awk test $1 == test would be a better test than using $0 ~ test . The $1 == test will only match exactly the value you want to match. The $0 ~ test will match the lines you do want, but could also match lines that you do not want.
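A short sketch of the difference, using an invented frame name and three invented input lines (none of them come from the real files):

```shell
#!/bin/sh
# Hypothetical frame name and data, for illustration only.
test="MO2PPC2"
data='MO2PPC2;lparA;Running;AIX;35;0.6;4
MO2PPC21;lparB;Running;AIX;11;0.2;1
MO1PPC17;MO2PPC2x;Running;AIX;25;null;3'

# $0 ~ test treats the value as a regular expression matched anywhere in the line.
regex_hits=$(printf '%s\n' "$data" |
    awk -F';' -v test="$test" '$0 ~ test { n++ } END { print n + 0 }')

# $1 == test accepts only lines whose first field is exactly the value.
exact_hits=$(printf '%s\n' "$data" |
    awk -F';' -v test="$test" '$1 == test { n++ } END { print n + 0 }')

echo "regex match: $regex_hits lines, exact match: $exact_hits line"
```

Here the regular expression also selects the frame MO2PPC21 and a line where the string merely appears in field 2, while the equality test selects only the intended frame.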

As in my first script, and as you can see in the example on the screenshot.

LPARS : the content of the column 2 kept by the awk command
RAM : the content of the column 5 kept by the awk command
CPU 1 : the content of the column 6 kept by the awk command
CPU 2 : the content of the column 7 kept by the awk command

No, we cannot see that from your example in post #1. Your example in post #1 shows the output you would have gotten if you had run your script asking it to process three input files. It does not show us the output you want to get when you run your script with those same three input files. And your refusal to show us the output you want to get from those three sample input files makes many of your later statements ambiguous.

PLEASE look at your sample output in post #1 and show us exactly what output you want to produce. (DO NOT use XX to hide the data you want; use the data that is in the image in post #1.) I assume that there will be somewhere between two and seven lines of output and I would have thought that you want three columns of output, but maybe you only want two columns of output. If you are unwilling to do this simple task for me, I don't think I will be able to figure out what you are trying to do. The language barrier is making it difficult for me to determine what you are trying to do. I need to see the actual output you are trying to produce from the three files used in your example.

I want to help, but without a clear example of the output you are trying to produce I can't write the code you need.
