Grepping non-alphanumerics from first column only

owwow14 · October 22, 2014, 6:52am

I have data in the following tab-separated format (200 columns altogether; this is just a sample):

</s> 0.001701 0.002025 0.002264 0.001430 -0.001300 
. -0.205240 0.177341 -0.426209 -0.661049 -0.048884 0.027032 
the -0.159145 0.084377 0.056968 0.050934 0.160689 
of -0.230698 0.030112 0.021657 -0.091374 0.069027 
, -0.282318 -0.692638 0.350441 -0.600493 -0.370671 
is -0.074473 -0.245787 0.246335 -0.504011 -0.322308 
in -0.086738 -0.004564 0.163076 -0.114565 -0.156633 
to 0.178787 0.249158 -0.115754 -0.282477 -0.290229 
was -0.293781 -0.435587 -0.142019 -0.624197 -0.103400

I want to remove all lines in which the FIRST column contains a non-alphanumeric character.

The desired result is this:

the -0.159145 0.084377 0.056968 0.050934 0.160689 
of -0.230698 0.030112 0.021657 -0.091374 0.069027 
is -0.074473 -0.245787 0.246335 -0.504011 -0.322308 
in -0.086738 -0.004564 0.163076 -0.114565 -0.156633 
to 0.178787 0.249158 -0.115754 -0.282477 -0.290229 
was -0.293781 -0.435587 -0.142019 -0.624197 -0.103400

I have tried this with grep and awk, with no success:

cat INPUT | cut -f 1 | grep -v "[[:punct:]]"

awk 'NR>1{t=$1;gsub(/[^[:punct:]]/,"");$0=t "\t" $0}1' INPUT

How can I solve this?

RavinderSingh13 · October 22, 2014, 7:10am

Hello owwow14,

Could you please try the following? It is not tested, though.

awk '{if($1 !~  /[[:punct:]]/ && $1 !~ /[[:digit:]]/) {print $0}}'  Input_file
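The same idea can also be written as a single positive test on the whole first field. This is only a minimal sketch: it assumes the fields are separated by single tabs, as stated in the question, and the function name `keep_alnum_first` is made up for illustration. Unlike the command above, it keeps lines whose first field contains digits, since digits are alphanumeric.

```shell
# Hypothetical helper: print only the lines whose first tab-separated field
# consists entirely of alphanumeric characters (and is not empty).
# Reads from the files given as arguments, or from stdin if there are none.
keep_alnum_first() {
    awk -F '\t' '$1 ~ /^[[:alnum:]]+$/' "$@"
}

# Example call (INPUT is the file name used in the question):
# keep_alnum_first INPUT > OUTPUT
```

Anchoring the pattern with both `^` and `$` is what makes the test apply to the whole field rather than to any single character in it.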

Thanks,
R. Singh

owwow14 · October 22, 2014, 7:13am

Great R. Singh! Thanks. As usual, you provided a great solution. It successfully removed all of the unwanted characters.

RudiC · October 22, 2014, 7:39am

How about

grep  "^[[:alnum:]]" file
the -0.159145 0.084377 0.056968 0.050934 0.160689 
of -0.230698 0.030112 0.021657 -0.091374 0.069027 
is -0.074473 -0.245787 0.246335 -0.504011 -0.322308 
in -0.086738 -0.004564 0.163076 -0.114565 -0.156633 
to 0.178787 0.249158 -0.115754 -0.282477 -0.290229 
was -0.293781 -0.435587 -0.142019 -0.624197 -0.103400

RavinderSingh13 · October 22, 2014, 7:47am

Hello Rudy,

The solution above will still keep lines in which punctuation or digits appear in the middle of the first column. The OP wants to remove a line completely if its first column contains digits or punctuation. Here is an example (I just changed the input file to test it):

grep  "^[[:alnum:]]" test24
th<e -0.159145 0.084377 0.056968 0.050934 0.160689
of -0.230698 0.030112 0.021657 -0.091374 0.069027
is -0.074473 -0.245787 0.246335 -0.504011 -0.322308
in -0.086738 -0.004564 0.163076 -0.114565 -0.156633
to 0.178787 0.249158 -0.115754 -0.282477 -0.290229
was -0.293781 -0.435587 -0.142019 -0.624197 -0.103400
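To make the difference concrete, here is a small sketch that counts how many lines of a two-line sample pass each kind of test. The sample data and the variable names are invented for illustration, and the tab separator is assumed from the question.

```shell
# Two sample lines; the first has punctuation inside its first field.
sample=$(printf 'th<e\t-0.159145\nof\t-0.230698\n')

# Test anchored at the first character only: "th<e" starts with a letter,
# so both lines are counted.
first_char=$(printf '%s\n' "$sample" | grep -c '^[[:alnum:]]')

# Test on the whole first field: any non-alphanumeric character anywhere
# in field 1 rejects the line, so only "of" is counted.
whole_field=$(printf '%s\n' "$sample" |
    awk -F '\t' '$1 !~ /[^[:alnum:]]/ { n++ } END { print n + 0 }')

echo "first character only: $first_char, whole field: $whole_field"
```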

Thanks,
R. Singh

Akshay_Hegde · October 22, 2014, 8:05am

I think Ravinder's solution can be further simplified like this

awk 'gsub(/^[[:alnum:]]/,"&",$1)' file

RudiC · October 22, 2014, 8:06am

This depends on what the "first column" is - people often use that to mean the first character position, and the sample gave no hint on how to interpret it. Nevertheless, accepting your argument, try

grep  "^[[:alnum:]]* " file
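Note that the pattern above ends in a literal space, while the question describes the data as tab-separated, and `*` also accepts an empty first column. A hedged variant, assuming the separator may be either a space or a tab and that the first column must not be empty, could look like this (the function name is hypothetical):

```shell
# Hypothetical helper: keep lines that start with one or more alphanumeric
# characters followed directly by a space or a tab.
# Reads from the files given as arguments, or from stdin if there are none.
alnum_first_column() {
    grep -E '^[[:alnum:]]+[[:blank:]]' "$@"
}

# Example call:
# alnum_first_column file
```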

RavinderSingh13 · October 22, 2014, 8:08am

Hello Akshay,

I think this solution will also fail to catch a special character or digit that appears later in the first column; an example follows. Please correct me if I am wrong, but as I understand it the OP requested that no digit or punctuation be present in the first column.

awk '$1~/^[^[:punct:]]/'  test24
th<e -0.159145 0.084377 0.056968 0.050934 0.160689
of -0.230698 0.030112 0.021657 -0.091374 0.069027
is -0.074473 -0.245787 0.246335 -0.504011 -0.322308
in -0.086738 -0.004564 0.163076 -0.114565 -0.156633
to 0.178787 0.249158 -0.115754 -0.282477 -0.290229
was -0.293781 -0.435587 -0.142019 -0.624197 -0.103400

Thanks,
R. Singh