Bash script help - removing certain rows from .csv file

Hello Everyone,

I am trying to find a way to take a .csv file with 7 columns and a ton of rows (over 600,000) and remove the entire row if the cell in the fourth column is blank.

Just to give you a little background on why I am doing this (just in case there is an easier way): I am pulling information from a PCAP into a .csv file, and I only want to view the rows from the .csv file that list something in the http.host (fourth column) entry (e.g. google.com). If that entry is blank because the packet has no http.host value, then I would like to remove the row. Doing this would seriously cut down on the number of rows I have to review to make sure my users are not visiting sites that they should not be.

So far my script looks like this:

#!/bin/bash

echo -n "What is the name of your PCAP file? "
read in_pcap

echo -n "What is the name of your CSV file? "
read out_csv

tshark -r "$in_pcap" -T fields -e frame.number -e ip.src -e ip.dst -e http.host -e frame.time -e frame.time_relative -E header=y -E separator=, > "$out_csv"

_____
I ran the script on a current PCAP and it works like a charm getting the information I need from the pcap file into a csv file. Unfortunately, I am running into the aforementioned blank-cell situation, because not every entry lists a value in the http.host cell. In fact, of the 600,000+ rows, I am guessing there are only several hundred that I need. So adding to the script above (or creating a new script if need be) to remove every row with a blank entry in the fourth column would be the perfect solution; however, I am not sure how to do that. Assuming a loop is the solution, the condition for the loop to stop would be a row in which all 7 columns are blank, a.k.a. the row after the last of the 600,000+ entries.

Can anyone help me edit my current script and/or write a new script to loop over (or otherwise remove) the rows with blank entries?

Thanks in advance!

I don't know "tshark" but having done a google search, I think you should look into using the "-R <read/display filter>" option.

Can't be of more help other than to provide this link:

tshark - The Wireshark Network Analyzer 1.8.0

Also adding this link, which talks about the syntax of a filter:

http://www.wireshark.org/docs/man-pages/wireshark-filter.html
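
To make the filter suggestion concrete, here is a minimal sketch, not a tested recipe. It assumes a tshark release that accepts -Y for display filters (1.10 and later, as far as I know); the 1.8.0 man page linked above documents -R for the same purpose. The function name extract_http_hosts is made up for illustration. A display filter that is just a field name, here http.host, matches only packets in which that field is present, so the blank rows should never reach the CSV.

```shell
#!/bin/bash
# Hypothetical wrapper (name invented for this sketch): let tshark itself
# drop packets that carry no http.host field.
# Assumption: this tshark accepts -Y; on 1.8.x try -R instead.
extract_http_hosts() {
    # $1 = input capture file, $2 = output CSV file
    tshark -r "$1" -Y 'http.host' -T fields \
        -e frame.number -e ip.src -e ip.dst -e http.host \
        -e frame.time -e frame.time_relative \
        -E header=y -E separator=, > "$2"
}
```

It would be called as: extract_http_hosts "$in_pcap" "$out_csv"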

tshark -r "$in_pcap" -T fields -e frame.number -e ip.src -e ip.dst -e http.host -e frame.time -e frame.time_relative -E header=y -E separator=, |egrep -v '^[^,]*,[^,]*,[^,]*,,'> "$out_csv"
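
The egrep pattern above matches any line whose first three fields are followed directly by a second comma, i.e. an empty fourth field, and -v drops those lines. As an alternative sketch, awk can test the fourth field directly. The function names and the sample rows below are invented for illustration; the sample only mimics the shape of the tshark output.

```shell
#!/bin/bash
# Hypothetical helper: keep only rows whose 4th comma-separated field is
# non-empty. The header row survives because its 4th field is "http.host".
keep_rows_with_host() {
    awk -F',' '$4 != ""'
}

# Made-up sample data in the same column order as the tshark command.
sample_rows() {
    printf '%s\n' \
        'frame.number,ip.src,ip.dst,http.host,frame.time,frame.time_relative' \
        '1,10.0.0.1,10.0.0.2,,Jun 21, 2012 10:00:00,0.000' \
        '2,10.0.0.1,10.0.0.2,google.com,Jun 21, 2012 10:00:01,1.000' \
        '3,,,,Jun 21, 2012 10:00:02,2.000'
}

sample_rows | keep_rows_with_host
```

In the real pipeline the helper would take the place of the egrep stage: tshark ... | keep_rows_with_host > "$out_csv". Commas inside the frame.time value do not matter here, because they come after the fourth field.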

Thanks a million for this answer!

I went from 600,000+ rows to analyze to less than 3100 and it did everything I wanted it to.

I do have one problem though. If I run this line:
tshark -r test.pcap -T fields -e frame.number -e ip.src -e ip.dst -e http.host -e frame.time -e frame.time_relative -E header=y -E separator=, |egrep -v '^[^,]*,[^,]*,[^,]*,,'> test.csv
Everything works perfectly fine.

However if I run my script mentioned above with the variables $in_pcap and $out_csv, I get the following error message:
./pcapAnalyze: line 9: $out_csv: ambiguous redirect
tshark: Output fields were specified with "-e", but "-Tfields" was not specified.

I wanted to make sure the command ran on its own without the variables to limit the amount of things that could go wrong. After simply replacing the hard coded .pcap and .csv files with variables, I get that error message. The only thing that I changed was implementing the variable... Why is it doing that?

Thanks in advance!
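
A guess at the cause, since the edited script is not shown: both messages are what I would expect if the variables were expanded without double quotes and were empty (or contained spaces). An unquoted empty $out_csv leaves the redirect with no target, hence "ambiguous redirect", and an unquoted empty $in_pcap makes tshark see "-r -T", so -T is taken as the file name and -T fields is never parsed. The sketch below shows the defensive pattern; require_name and the example file names are invented for illustration.

```shell
#!/bin/bash
# Sketch: fail early on empty names and quote every expansion.

# Hypothetical helper, not part of the original script.
require_name() {
    # $1 = description for the error message, $2 = value to check
    if [ -z "$2" ]; then
        echo "error: $1 is empty" >&2
        return 1
    fi
}

# Example values containing spaces (in the real script these come from read).
in_pcap="capture one.pcap"
out_csv="${TMPDIR:-/tmp}/result one.csv"

require_name "PCAP file name" "$in_pcap" || exit 1
require_name "CSV file name" "$out_csv" || exit 1

# Quoted, the redirect target stays one word even with spaces in it.
echo "would read: $in_pcap" > "$out_csv"
```

It is also worth checking that the variable names used on the tshark line are spelled exactly as in the two read lines, since a misspelled name expands to an empty string.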

