Using input from one file to define scope of other file in Linux

Hi,
I have two files A and B and would like to use A as a filter. Like this:

File A.txt: Contains a list of IP addresses, each one occurring only once and in order:

10.0.0.1
10.0.0.2
10.0.0.4
10.0.0.8

File B.txt: Contains the same IP addresses with a corresponding ping time each (in fact, file A.txt is generated with a simple sort -u command from file B.txt). B.txt could consist of thousands of rows but I have sorted all addresses and ping times in ascending order like this with another filter:

10.0.0.1     2
10.0.0.1    18
10.0.0.2     6
10.0.0.4     8
10.0.0.4    14
10.0.0.8    18

For each IP address in file A, I want to calculate the min/med/max ping time and the packet delay variation (difference in time between min/max) and store that in a separate file. So the result should look like this:

   IP      Min Ave Max PDV
10.0.0.1    2  10  18   16
10.0.0.2    6   6   6    0
10.0.0.4    8  11  14    6
10.0.0.8   18  18  18    0

Have tried quite a few different ways to solve this. One obvious way would probably be to loop through file A, and for each IP address I do a grep in file B and direct the result into a new file which is analysed separately. Something like this:

while read -r LINE ; do cat B.txt | grep $LINE > file$line.txt ; done < A.txt

I know the syntax won't do the trick (not any other similar syntax either, have tried them all), but if I could get one file per IP address I could figure out how to analyze the data.

After monitoring forums, I found out that a better way to do this on such big text files would be to use awk. Have found many examples in this forum, but none that takes entries from one file as input and uses that to define the scope for another file.

I have tried to combine 'awk' and 'read' but as the rookie I am I only manage to get some grey hair. Even this simple one (only trying to read A.txt and see if I can get the IP address out and match it with the left column in B.txt) fails:

while read -r LINE ; do awk '{if ($LINE[1]==$1) print $1}' B.txt ; done < A.txt

Just wonder if someone could give some advise here? Have really scanned after solutions in other threads on many forums but just can't pull it together.

As side note: Would be very interesting to understand if all of this could be done in a single command in some way also (including my first steps where A.txt is created and B.txt is sorted). Generating new files all the time means writing to the flash over and over and also add code to clean up those files as I get new files all the time. Would like to avoid that if possible.

Thanks!
Z

The sample data you have shown has averages that happen to work out to whole numbers. Do you always want the averages displayed as values rounded to whole numbers? If not, how many decimal places do you want in the output?

What formula are you using to calculate the packet delay variation? How many decimal places do you want to display in the output in the PDV column?

Does the output need to be sorted by IP address? (If not, the sort step can probably be eliminated.)

1 Like

Here is a solution that works with the unsorted original file:

awk '
{ T[$1]+=$2
  if(!($1 in M)) M[$1]=X[$1]=$2
  else if($2<M[$1]) M[$1]=$2
  else X[$1]=$2
  C[$1]++
}
END {
  printf "%-17s %4s %4s %4s %4s\n", "IP", "Min", "Ave", "Max", "PDV"
  for (I in T)
     printf "%-17s %4d %4d %4d %4d\n", I, M, T/C,X,X-M
}' infile

output:

IP                 Min  Ave  Max  PDV
10.0.0.1             2   10   18   16
10.0.0.2             6    6    6    0
10.0.0.4             8   11   14    6
10.0.0.8            18   18   18    0
1 Like

Hi Don,
Thanks for reaching out. Made the sample data like that for a simple printout, but I realize it wasn't smart ;). One decimal would be sufficient for all entries in the result file.

The packet delay variation (jitter) is obtained by first finding the max and min ping times for an IP address and then substract the min from the max. Ex: max=10, min=2, PDV=10-2=8.

The output doesn't have to be sorted. If I need that later I can fix it. If more than one row is required that's also fine. What I mainly want is to avoid writing/deleting files if possible.

Thanks!
Z

For 1 decimal value change the printf format to %6.1f instead of %4d , or course the headings should also be changed to %6s

1 Like

Works!! Thanks a million Chubler_XL! :smiley:

If I'm reading Chubbier_XL's code correctly, if the input file is not sorted, the max value ( X[$1] ) may not be saved correctly. The following slightly modified code (adjusted for 1 decimal point in the average output and to save the correct maximum value for each IP address) should work:

awk '
{	if(!($1 in c)) m[$1] = M[$1] = $2
	else	if($2 > M[$1]) M[$1] = $2
	else	if($2 < m[$1]) m[$1] = $2
	s[$1] += $2
	c[$1]++
}
END {	printf("%-17s %3s %-5s %3s %3s\n", "   IP Address", "Min", " Ave",
		"Max", "PDV")
	printf("%17s %3s %5s %3s %3s\n", "=================", "===", "=====",
		"===", "===")
	for(i in c)
		printf("%-17s %3d %5.1f %3d %3d\n", i, m, s / c, M,
			M - m)
}' B.txt

which, with your sample input, produces the output:

   IP Address     Min  Ave  Max PDV
================= === ===== === ===
10.0.0.1            2  10.0  18  16
10.0.0.2            6   6.0   6   0
10.0.0.4            8  11.0  14   6
10.0.0.8           18  18.0  18   0

If someone else wants to try this on a Solaris system, change awk to /usr/xpg4/bin/awk .

2 Likes

Yes, Just because a value isn't a min value doesn't mean its max, don't know what I was thinking there.

If the input file had been sorted with:
sort -o B.txt -k1,1 -k2,2n B.txt
which could be true with the sample input shown, your code would work just fine. In fact, the three lines in your code:

  if(!($1 in M)) M[$1]=X[$1]=$2
  else if($2<M[$1]) M[$1]=$2
  else X[$1]=$2

could be simplified even more to just:

  if(!($1 in M)) M[$1]=$2
  X[$1]=$2
1 Like

Thanks guys. Just got a couple of follow-ups:

  1. Let's say I want to add one more column counting the number of pings per address. That would only be to add c[i], right?

  2. Would it be possible to describe briefly what happens here?

if(!($1 in c)) m[$1] = M[$1] = $2
	else	if($2 > M[$1]) M[$1] = $2
	else	if($2 < m[$1]) m[$1] = $2

Then I can edit the file myself so it fits other types of files I'm working on.

Cheers!
Z

  1. Yes, c in the END clause in this awk program is the number of times IP address i has been seen in the input file(s) that have been processed.
  2. Do the following comments sufficiently explain what this code does?
	if(!($1 in c)) m[$1] = M[$1] = $2	# 1st line with this IP? Set min & Max for this IP
	else	if($2 > M[$1]) M[$1] = $2	# else, if new Max for this IP, save it
	else	if($2 < m[$1]) m[$1] = $2	# else, if new min for this IP, save it
1 Like

Thanks again Don. Just put this code in a bash script and I would like to be able to specify the input file as an argument. So I would like to call:

./stats.sh B.txt

for example. Tried to change "B.txt" in the file to various things (like $1 but that's already in use in the file...). What would do the trick?

What happened when you replaced the filename operand to awk in stats.sh with $1 ?

That should be OK. The $1 in an awk script expands to the contents on the 1st first field in the current input line, but the $1 in a shell script expands to the contents of the 1st command line argument passed to that shell script.

1 Like

You're right of course. The problem was that I used a script editor and got the line ending wrong. Fixed that with:

sed -i 's/\r//' stats.sh

Ok, the topic has derailed a bit, but here is the final one: if I also add a third column with the URL for each IP in the input file (B.txt), how can I print that in the final table? Tried to add:

name[$1] = $3

Assuming that this would set the variable 'name' for a certain IP address to the URL listed in column 3 in the infile. So if this is a row in the infile:

62.147.28.224  12  www.myserver.com

The output file would get one more column to the right with www.myserver.com. But it just prints zero (0) for me in that column.

Promise this is the last one in this thread :wink:

It looks like you're saving it appropriately. How are you printing it? Are you using %d or %f format specifier instead of %s to print a string?

Please show us you current script instead of making us guess at what might be wrong.

1 Like

Right again Don. Tried to print it as a decimal number but now it's fixed. I'm gonna download awk/bash tutorials and spend next week reading! Anyway, thanks a lot for all help. Posting the final script below where I also added ping counts and the URL.

I guess an infile with 10 000 rows containing 10 addresses in total would write URLs the name[$1] array 10 000 times. Could be done with a separate awk command for the URL only but this is good enough for me.

Now the infile syntax is:

<IP Address to ping> <ping time> <URL>

And the code that produces the stats:

awk '
{       if(!($1 in c)) m[$1] = M[$1] = $2
        else    if($2 > M[$1]) M[$1] = $2
        else    if($2 < m[$1]) m[$1] = $2
        s[$1] += $2
        name[$1] = $3
        c[$1]++
}

END {   printf("%-17s %3s %-5s %3s %3s %5s %5s\n", "   IP Address", "Min", " Ave",
                "Max", "PDV", "count", "URL")
        printf("%17s %3s %5s %3s %3s %5s %5s\n", "=================", "===", "=====",
                "===", "===", "=====", "===")
        for(i in c)
                printf("%-17s %3d %5.1f %3d %3d %5d %5s\n", i, m, s / c, M,
                        M - m, c, name)
}' $1

Thanks a lot Don. Appreciated!

If you have big input files, this might run a tiny bit faster (since it only saves the URL once per IP address) and might be a little bit prettier (lining up URL headings with the output assuming an 80 column output device and variable length URLs). It will also allow you to call your script with multiple input files and get a single report for the combined contents of all files you pass in...

awk '
{	if(!($1 in c)) {
		m[$1] = M[$1] = $2
		name[$1] = $3
	}
	else	if($2 > M[$1]) M[$1] = $2
	else	if($2 < m[$1]) m[$1] = $2
	s[$1] += $2
	c[$1]++
}
END {	printf("%-17s %3s %-5s %3s %3s %5s %s\n", "   IP Address", "Min",
		" Ave", "Max", "PDV", "count", "URL")
	printf("%17s %3s %5s %3s %3s %5s %s\n", "=================", "===",
		"=====", "===", "===", "=====",
		"=====================================")
	for(i in c)
		printf("%-17s %3d %5.1f %3d %3d %5d %s\n", i, m, s / c,
			M, M - m, c, name)
}' "$@"
1 Like