Checking number of commas in each line.

Anupam_Halder · April 4, 2013, 8:04am

Hi All,

I need to check whether each line has exactly "n" commas. If any line does not, I need to exit the process.

I tried

cat "$TEMP_FILE" | while read LINE
do
	processing_line=`expr $processing_line + 1`
	no_of_delimiters=`echo "$LINE" | awk -F ',' '{ print NF }'`
	if [ $no_of_delimiters -ne $no_of_expected_fields ]
	then
		echo "Error at line $processing_line"
		exit
	fi
done

It's working fine. However, the file has around 0.5 million records, so it takes too much time to process. Is there any way I can improve the performance?

Thanks in advance.

guruprasadpr · April 4, 2013, 8:14am

You can replace your whole code with one awk statement:

awk -F, 'NF!=3{print "Error at line "NR;exit}' "$TEMP_FILE"

In place of 3, put the count of your no. of expected fields.

Guru.

PikK45 · April 4, 2013, 8:27am

Don't cat a very huge file.

Try something like:

awk -F"," -v l="$no_of_expected_fields" '{if(NF != l){print "Error at line "NR; exit}}' "$TEMP_FILE"

@guru:
You got me on this one

---------- Post updated at 05:57 PM ---------- Previous update was at 05:52 PM ----------

guruprasadpr:

You can replace your whole code with one awk statement:
awk -F, 'NF!=3{print "Error at line "NR;exit}' $TEMP_FILE
In place of 3, put the count of your no. of expected fields.

Guru.

I guess the test should be that NF is not equal to the expected number. You can use

awk -F"," -v l="$no_of_expected_fields" 'NF!=l{ print "Error at line "NR; exit}' "$TEMP_FILE"

if you need a variable to be compared
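A minimal sketch of how the variable form could be wrapped so the calling script can act on the result. The function name check_field_count and the demo data are made up for illustration; the only change to the awk program is "exit 1", so that a mismatch produces a non-zero exit status.

```shell
# check_field_count FILE EXPECTED: hypothetical helper, not from the thread.
# Prints the first line number whose field count differs and returns non-zero.
check_field_count() {
    awk -F, -v l="$2" 'NF != l { print "Error at line " NR; exit 1 }' "$1"
}

# Demo on a throwaway file: line 2 has only two fields.
demo_file=$(mktemp)
printf '%s\n' 'a,b,c' 'd,e' 'f,g,h' > "$demo_file"

if check_field_count "$demo_file" 3; then
    echo "all lines OK"
else
    echo "check failed"
fi
rm -f "$demo_file"
```

awk stops reading at the first bad line, so a mismatch near the top of a large file is reported almost immediately.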

Corona688 · April 4, 2013, 12:34pm

cat-ing small files is the biggest problem, really. For a huge file, the overhead of running cat doesn't matter all that much. But running cat 10,000 times to process 10,000 tiny files will slow it down a lot, the same way it takes longer to say a sentence if you must make a separate phone call for each word.
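As a rough sketch of the many-small-files case, a single awk process can be handed all the files at once instead of running cat once per file. The directory, file names, and the expected count of 3 fields are assumptions made up for this example.

```shell
# Throwaway directory with a few tiny made-up files.
dir=$(mktemp -d)
printf 'a,b,c\n' > "$dir/one.csv"
printf 'd,e\n'   > "$dir/two.csv"
printf 'f,g,h\n' > "$dir/three.csv"

# One awk process reads every file; FILENAME and FNR identify the bad line.
bad=$(awk -F, 'NF != 3 { print FILENAME ": line " FNR }' "$dir"/*.csv)
echo "$bad"
rm -rf "$dir"
```

FNR restarts at 1 for each file, whereas NR keeps counting across all of them.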

Don_Cragun · April 4, 2013, 2:04pm

Note that the variable names $no_of_expected_fields and $no_of_delimiters are not representative of what awk actually does. When you aren't using the default awk field separator (<space>), every occurrence of the field separator separates two fields; it doesn't terminate a field. So for every non-empty line read by awk when the value of FS is a comma (such as by having -F, on the command line), the value of NF (Number of Fields) is the number of delimiters plus 1; not the number of delimiters.
If you want to print an error for any file that does not have $n commas on every line in the file, you need something like:

awk -F, -v n="$n" 'NF!=(n+1){print "Error at line "NR;exit 1}' "$TEMP_FILE"
if [ $? -ne 0 ]
then    exit 1
fi
# Continue processing $TEMP_FILE...

in your script.
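To make the fields-versus-delimiters point concrete, here is a small sketch on a made-up sample line. It relies on awk's gsub() returning the number of substitutions it made, which here equals the number of commas.

```shell
# Sample line with 3 commas, hence 4 comma-separated fields (made-up data).
line='a,b,c,d'

# NF counts fields when the field separator is a comma.
fields=$(printf '%s\n' "$line" | awk -F, '{ print NF }')

# gsub() returns how many commas it replaced, i.e. the delimiter count.
commas=$(printf '%s\n' "$line" | awk '{ print gsub(/,/, "") }')

echo "fields=$fields commas=$commas"
```

This should print fields=4 commas=3, which is why a check for n commas has to compare NF against n+1.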

As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.