Need help improving my script.

Thank you for taking the time to look at this and provide input.
To start, I am not a Linux/UNIX expert, but I muddle through the best I can.
I am also in no way, shape, or form a programmer. Please keep that in mind as you read this script.

This script is designed to find all files in a given directory that begin with "asalog", find lines containing a specific word, and then process those lines to output just the needed information. The files are zipped and stored on remote ZFS storage. Copying all of the files down to the local system at once and then unzipping them is not feasible due to storage limitations. The script works as designed, but it is very slow.

Please look over the code and suggest ways that I could improve its speed. The last run took 238 minutes to complete.

Due to access limitations I have to work within BASH, I do not have the option (nor the knowledge) to utilize perl, python, etc.

Any help is welcome, as well as comments on the script as it sits. It has been cobbled together from programming structure remembered from a high school Turbo Pascal class (many years ago) and lots of Google searches.

echo Search started at:
date +"%m/%d/%Y %T"
# Displays the start up information and the start time

find /var/network_logs/gc/archive/asalog*  -mtime -7 -exec zcat {} \;  |  awk '/Built/&& !/10.10.120.145/{print $10, $11, $15, $18;}' | sed -e 's!/! !g' -e  's!:! !g' | awk '{if ($1 == "inbound") print $1, $2, $3, $4, $6, $7, $8; else if ($1 == "outbound") print $1, $2, $6, $7, $3, $4, $5;}' | awk '!seen[$0]++ {print}' >> /home/kenneth.cramer/asa/GC_ports.txt

# Finds all files that begin with the name asalog and were written in the last 7 days. It then reads the files line by line, looking
# for any lines containing the word Built but not the 10.10.120.145 IP address, and prints out the 10th, 11th, 15th and 18th fields of the line.
# It then looks for any "/" slashes or ":" colons in those four fields and replaces them with spaces.
# The script then prints the needed fields in the desired order and writes only unique lines to the output file.

echo
echo
echo
echo Sorting data into proper files.
# Displays that the script is now sorting the information

awk '{if ($1 == "inbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpinbound.txt"; else if ($1 == "inbound" && $2 == "UDP") print $2, $3, $4, $5,  $6, $7 >> "/home/kenneth.cramer/asa/GC_udpinbound.txt"; else if ($1 == "outbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpoutbound.txt"; else if ($1 == "outbound" && $2 == "UDP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_udpoutbound.txt";}' /home/kenneth.cramer/asa/GC_ports.txt
# The script now reads the file ports2.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.


echo
echo
echo
echo Compressing files for transport

tar -czvf /home/kenneth.cramer/asa/GC_ports.tgz /home/kenneth.cramer/asa/GC_*.txt
# Compresses the output files into a single file for transport off the machine.

echo Process completed for Gold Camp at:
date +"%m/%d/%Y %T"
echo
echo
times

You have the comment:

# The script now reads the file ports2.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.

but there is no ports2.txt anywhere in your script. Do you care about having the file GC_ports.txt, or do you really just need the four GC_(tcp|udp)(in|out)bound.txt files that are created from it?

Assuming you have more than one compressed file that is less than a week old, using -exec zcat {} + will be faster than -exec zcat {} \;, and you could replace all four awk scripts and the sed script with a single awk script. That would considerably reduce the time spent reading and writing data that should only need to be read once and written at most twice, instead of being read five times and written six or seven times.

But, I would guess (hard to make any sound judgements here with no samples of the data being processed) that the bulk of the time being spent in this script is in compressing and recompressing relatively large files for your archives. And if you just need the files that you are splitting out of GC_ports.txt , the time spent creating, compressing, and archiving that unneeded file could be significant.

Can you show us some sample uncompressed data that is being pushed through the pipeline by find ... -exec zcat {} \; ? Figuring out exactly what that pipeline is doing without knowing where the slashes and colons are makes it hard to feel confident about suggesting ways to streamline your awk and sed scripts.

Even though echo is a shell built-in, invoking echo four times in a row instead of calling printf (another shell built-in) once is inefficient. (Depending on what operating system you're using, you could probably produce the same output with a single echo instead of a single printf, but I prefer printf since its formatting options are more portable.) Do you really want/need that many empty lines in the output produced by this script?
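For example, a single printf (just a sketch of the idea; printf is a built-in in both bash and ksh) produces the same three blank lines and message as the four echo calls in your script:

printf '\n\n\nSorting data into proper files.\n'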


Sorry, ports2.txt is incorrect. I pull data from two different archives, so I created one script to test functionality and then copied it into two separate script files, with a third script to run the other two. Since the script itself was running, I pulled the code from the original test file instead and missed that it still referenced the test output file.

Ports.sh only contains lines to run
vacavilleports.sh and goldcampports.sh

The code I posted was from a file named testports.sh, which was the test code copied into vacavilleports.sh and goldcampports.sh; each of those was then modified to reference its proper archive location and output files.

I did not know if opening the script file while it was running would impact it so I chose the safe route of opening the test version.

Here is sample output from GC_tcpinbound.txt:
TCP internal 10.20.114.190 intmgmt 10.20.100.175 258
TCP internal 10.20.114.190 intmgmt 10.20.100.175 6455
TCP internal 10.20.114.190 intmgmt 10.20.100.175 1678
TCP internal 10.20.114.190 intmgmt 10.20.100.162 33923

Here is some sample input:

Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199055 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/53099 (10.20.100.48/53099)
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199056 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/43185 (10.20.100.48/43185)
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199057 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/42319 (10.20.100.48/42319)
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302016: Teardown UDP connection 731198699 for outside:158.96.0.254/53 to internal:10.20.114.124/58504 duration 0:00:00 bytes 179
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199059 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/54069 (10.20.100.48/54069)

Here is the exact code from goldcampports.sh

echo Search started at:
date +"%m/%d/%Y %T"
# Displays the start up information and the start time

find /var/network_logs/gc/archive/asalog*  -mtime -7 -exec zcat {} \;  |  awk '/Built/&& !/10.10.120.145/{print $10, $11, $15, $18;}' | sed -e 's!/! !g' -e  's!:! !g' | awk '{if ($1 == "inbound") print $1, $2, $3, $4, $6, $7, $8; else if ($1 == "outbound") print $1, $2, $6, $7, $3, $4, $5;}' | awk '!seen[$0]++ {print}' >> /home/kenneth.cramer/asa/GC_ports.txt

# Finds all files that begin with the name asalog and were written in the last 7 days. It then reads the files line by line, looking
# for any lines containing the word Built but not the 10.10.120.145 IP address, and prints out the 10th, 11th, 15th and 18th fields of the line.
# It then looks for any "/" slashes or ":" colons in those four fields and replaces them with spaces.
# The script then prints the needed fields in the desired order and writes only unique lines to the output file.

echo
echo
echo
echo Sorting data into proper files.
# Displays that the script is now sorting the information

awk '{if ($1 == "inbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpinbound.txt"; else if ($1 == "inbound" && $2 == "UDP") print $2, $3, $4, $5,  $6, $7 >> "/home/kenneth.cramer/asa/GC_udpinbound.txt"; else if ($1 == "outbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpoutbound.txt"; else if ($1 == "outbound" && $2 == "UDP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_udpoutbound.txt";}' /home/kenneth.cramer/asa/GC_ports.txt
# The script now reads the file ports2.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.
DOH!!! This ^ should read: # The script now reads the file GC_ports.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.

echo
echo
echo
echo Compressing files for transport

tar -czvf /home/kenneth.cramer/asa/GC_ports.tgz /home/kenneth.cramer/asa/GC_*.txt
# Compresses the output files into a single file for transport off the machine.

echo Process completed for Gold Camp at:
date +"%m/%d/%Y %T"
echo
echo
times

---------- Post updated at 09:44 PM ---------- Previous update was at 09:31 PM ----------

In answer to your other questions,

  1. No, I do not care about having the GC_ports.txt file.
  2. My only goal is to produce the output in those 4 files.
  3. The blank lines are just for spacing. I did not spend much time researching the best way to produce blank lines, as this script has minimal output: just enough to let whoever runs it know what it is doing. I am mainly the person who runs it, but I built that in just in case someone else had to run it and got confused by the system not returning immediately to the prompt.

I hope the script is not too hard to follow; I am a network engineer, not a programmer or a unix admin. This is all to assist a client in redoing their firewall.
The input is from the firewall log. We are looking for lines with the word Built in them and capturing the source IP, destination IP, destination port, and protocol of the connections. The four output files are dumped into 4 sheets in Excel so we can see which IPs are talking and what rules we need to build. When a previous company set up the firewall, they left any/any rules in place for internal traffic and only locked down the outside interface. So we have to figure out what rules we need to create before removing those any/any rules and causing massive connectivity issues.

Yes, there are many tools out there to do this for us, but this is all client owned hardware and they don't have those tools installed. So we are left with this.

The log files it pulls are in 1-hour intervals, so 24 files per day times 7 days = 168 compressed log files. I did try copying down the zipped files and then uncompressing them on the local machine, but expanded they are almost 60 gig. (Repetitive text compresses VERY VERY well.)

Thank you again for your suggestions and assistance.

I would suggest also using some of your vertical real estate, since that will greatly improve readability for future maintenance.

I expect Don's suggestion of using the + instead of \; will significantly increase processing speed.

An untested example of what a single awk might look like:

find /var/network_logs/gc/archive/asalog*  -mtime -7 -exec zcat {} +  |
awk '
  !/Built inbound|Built outbound/ || /10\.10\.120\.145/ {
    next
  }
  {
    $0=$10 FS $11 FS $15 FS $18                   # recalculate fields
    gsub("[/:]",FS)
    if ($1 == "inbound")
      $0=$1 FS $2 FS $3 FS $4 FS $6 FS $7 FS $8   # recalculate fields
    else if ($1 == "outbound")
      $0=$1 FS $2 FS $6 FS $7 FS $3 FS $4 FS $5   # recalculate fields
  } 
  !seen[$0]++
' >> /home/kenneth.cramer/asa/GC_ports.txt

What is the difference between "+" and "\;"? What about that would help with the speed? Sorry for my ignorance. I really am trying to learn as I go on this.

First off: no problem! There are no dumb questions, just dumb answers, so don't be shy.

The difference is that "\;" will call the command named in "-exec" once for every file found. For instance, let us suppose the current directory contains 5 files, "a", "b", "c", "d" and "e" (and nothing else):

find . -type f -exec rm {} \;

This will delete all the files, but it will delete each file individually. In fact, this is what will be executed:

rm a
rm b
rm c
rm d
rm e

But "rm" can take a list of files as well and this:

find . -type f -exec rm {} +

would be the same as

rm a b c d e

The difference seems small, but most of the time spent calling a command like "rm" goes into loading and starting the program, not into its actual execution. Therefore, if the command is executed once instead of five times, the speed gain will be quite noticeable.
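If you want to see the batching for yourself, here is a harmless dry run (a sketch using echo as a stand-in, so find prints the command line(s) it would build instead of actually running zcat):

find /var/network_logs/gc/archive/asalog* -mtime -7 -exec echo zcat {} +

With your 168 files, find will typically pack all of the pathnames into one or two zcat invocations, limited only by the system's maximum argument-list length.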

I hope this helps.

bakunin


I was thinking of taking Scrutinizer's suggestions a step further: getting rid of the unneeded GC_ports.txt file completely and just using one awk script to produce the four desired output files (GC_tcpinbound.txt, GC_tcpoutbound.txt, GC_udpinbound.txt, and GC_udpoutbound.txt). As Bakunin explained, find -exec command + instead of find -exec command \; reduces the number of times zcat is invoked. I added the -v option to zcat to get a visible indication that progress is being made while the script runs.

Please remove /home/kenneth.cramer/asa/GC_ports.txt if that file is still present from an earlier run of your script (otherwise the tar step will sweep the stale file into the archive). Then see if something more like:

#!/bin/ksh
InputDir='/var/network_logs/gc/archive'
OutputDir='/home/kenneth.cramer/asa'

# Display the start time...
date +'Search started at: %m/%d/%Y %T%nProcessing asalog files...'

# Find and uncompress asalog* files that are less than a week old...
find "$InputDir"/asalog* -mtime -7 -exec zcat -v {} + |
awk -v OutputDir="$OutputDir" '
!/Built/ || /10\.10\.120\.145/ {
	# Discard lines that do not contain "Built" and lines that contain
	# IP address 10.10.120.145.
	next
}
{	# Throw away unneeded data...
	$0 = $10 OFS $11 OFS $15 OFS $18
	# and change "/"s and ":"s to spaces (recomputing field boundaries).
	gsub("[/:]", " ")
}
$1 == "inbound" {
	# Process inbound records.
	if(seen[$1, $2, $3, $4, $6, $7, $8]++) {
		# Discard duplicates.
		next
	}
	# Following assumes we only have TCP and UDP inbound records.
	# Print to one of two inbound text files.
	print $2, $3, $4, $6, $7, $8 > (OutputDir "/GC_" \
	    (($2 == "TCP") ? "tcp" : "udp") "inbound.txt")
}
$1 == "outbound" {
	# Process outbound records.
	if(seen[$1, $2, $6, $7, $3, $4, $5]++) {
		# Discard duplicates.
		next
	}
	# Following assumes we only have TCP and UDP outbound records.
	# Print to one of two outbound text files.
	print $2, $6, $7, $3, $4, $5 > (OutputDir "/GC_" \
	    (($2 == "TCP") ? "tcp" : "udp") "outbound.txt")
}'

# Compress the output files into a single file for transport off the machine...
printf '\nCompressing files for transport...\n'

tar -czvf "$OutputDir/GC_ports.tgz" "$OutputDir"/GC_*.txt

# Print end time and statistics...
date +'%nProcess completed for Gold Camp at: %m/%d/%Y %T'
times

runs a little faster for you.

I know that you said you wanted to use bash, but I generally find that ksh will run scripts like this a little faster. These shells use different formats for the output of the times built-in utility, but should otherwise produce identical results for this script. (You may want to try both a few times with real data to see how much of a difference in speed there is between bash and ksh on your system.)
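One quick way to compare them (assuming the script is saved as goldcampports.sh; substitute your actual file name):

time bash goldcampports.sh
time ksh goldcampports.sh

Run each a few times and compare the elapsed times, since filesystem caching can skew a single run.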

When run with InputDir and OutputDir set to "." and with six copies of a compressed version of the sample input you provided in post #3, in files named asalog_test1.Z through asalog_test6.Z, it produces the output file GC_udpoutbound.txt containing:

UDP intmgmt 10.20.100.48 internal 10.20.114.120 53

and the compressed tar archive file GC_ports.tgz and writes the following to standard output and standard error output:

Search started at: 04/13/2016 07:37:58
Processing asalog files...
./asalog_test1.Z:	   43.4%
./asalog_test2.Z:	   43.4%
./asalog_test3.Z:	   43.4%
./asalog_test4.Z:	   43.4%
./asalog_test5.Z:	   43.4%
./asalog_test6.Z:	   43.4%

Compressing files for transport...
a ./GC_udpoutbound.txt

Process completed for Gold Camp at: 04/13/2016 07:37:58
user	0m0.00s
sys	0m0.00s

while it runs.

While your script from post #3 in this thread (using bash but converted to use files in the current directory) produces the output:

Search started at:
04/13/2016 07:39:54



Sorting data into proper files.



Compressing files for transport
a ./GC_ports.txt
a ./GC_udpoutbound.txt
Process completed for Gold Camp at:
04/13/2016 07:39:54


0m0.002s 0m0.017s
0m0.011s 0m0.014s

As I said before, I imagine that a good portion of the time in this script is spent decompressing the asalog* files and, depending on the sizes of your four output files, recompressing the data as it creates the compressed archive, but I'm hoping the reduced number of processes running and the reduced number of times the uncompressed data is read and written will make this noticeably faster when you're working with real data.

Note that you provided sample UDP outbound records as sample input data and showed sample output data for TCP inbound records. So, I'm not sure that I produced the correct output formats for inbound or outbound records (since the output format for inbound records is not the same as the output format for outbound records).

Hope this helps,

  • Don

Thank you so much for your time. I will give this a try this evening after business hours. I will also cut the mtime down to 1 so it only processes 24 zipped files at once for the test run.

Sorry for the small snippet of sample data. This is a production firewall generating the raw data, so I was trying to be careful to include only non-identifying data in the sample.

As for the output to the files, I can tweak that to get exactly what I want if any of the fields are incorrect.

I believe I am beginning to understand, at a basic level, how this script interacts with the files. I do think you're correct that the majority of the time is spent repeatedly uncompressing and recompressing data unnecessarily.

While the script runs, I will have a second connection open to the box running the top command, watching processor and memory usage to determine the load being placed on the system before expanding the run to additional files.

I will let you know how it runs. Your script is far more elegant than my cobbled-together one. It just goes to show that just because "my" way works doesn't mean it is the best way to get things done.

---------- Post updated at 11:00 PM ---------- Previous update was at 02:26 PM ----------

The script you provided cuts the time by over 50%!
I do have a few tweaks to do to get the output correct but that is something I can easily handle.

I did do a few tests:

  1. Copied the files down to a local directory (/home/kenneth.cramer/temp) and changed the script to search that directory, to see if the files being located on the ZFS storage was having an impact on speed. I did not see an improvement in the speed of the script decompressing the files.

  2. Copied the files to a local directory and unzipped them first. The speed was improved, but the time taken to copy and uncompress the files balanced it out. There is no real gain unless I create a timed script to copy down and uncompress the files before I need to run the script, so that approach is impractical.

  3. Tested the size of the compressed vs. uncompressed files. Each file represents an hour's worth of data. Compressed, each file averages 54 MB; uncompressed, they average 1 gig per file. 7 days with 24 files per day is 168 files, so taking an hour or so to sift through 168 gig of data is not bad for time. The sheer size also makes it impractical to copy down the files and uncompress them just to do these few operations on them.

Thank you for all the help. I believe I can manage the last few tweaks from here and get the output I need in the format I need.

Thank you again for all your help.

Can you use ZFS filesystem properties to handle the compression?

Then you can use 'regular' commands on those files and let ZFS handle the compression operations.
Perhaps noticeable speed can be gained there for read operations, depending on the ZFS setup and how much memory the ARC/L2ARC occupies.
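Something like this, for example (untested, and tank/network_logs is a hypothetical pool/dataset name; substitute your own):

zfs set compression=lz4 tank/network_logs
zfs get compression,compressratio tank/network_logs

With the property set, newly written logs are compressed transparently, so they could be read with plain cat/grep/awk and no zcat step. (On older ZFS releases without lz4, compression=on or compression=gzip would be the fallback.)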

Also, one might use other ZFS filesystem goodies, such as snapshots and send/recv, to make backups, which would simplify the archive procedure.
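A sketch of that idea too (again with hypothetical names; the receiving pool must already exist on the backup host):

zfs snapshot tank/network_logs@weekly
zfs send tank/network_logs@weekly | ssh backuphost zfs recv backup/network_logs

The snapshot gives a consistent point-in-time copy, and send/recv moves it off the machine without a separate tar step.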

But again, not all operating systems have ZFS, and if you are looking for a portable solution, shell is the way.

Hope that helps
Best regards
Peasant.