awk substr

_man · March 24, 2013, 2:13pm

Hello life savers!!

Is there any way to use substr in awk to return the part of a string between a declared start and stop point?

I mean I know we have this:

substr(string, start, length)

Is there anything like this that can be used in awk?

substr(string, start, stop)

Example: my string is: Center=128181;Length=461
I want to return 128181

but its length varies from line to line.

So I need substr to start from = and finish at ;

Sorry for the worst explanation ever!!

Thanks for your time.
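
For reference, awk's built-in substr() only takes a length, but a start/stop form can be built on top of it. The following is a minimal sketch, assuming both positions are 1-based and inclusive; the name substr_range is hypothetical and not part of awk.

```shell
# substr_range is a hypothetical helper, not an awk built-in.
# It converts an inclusive start/stop pair into the length substr() expects.
result=$(awk '
function substr_range(s, start, stop) {
    return substr(s, start, stop - start + 1)
}
BEGIN {
    # characters 8 through 13 of the sample string
    print substr_range("Center=128181;Length=461", 8, 13)
}')
printf '%s\n' "$result"
```

This still needs the positions to be known in advance, which is why the replies below locate the "=" and ";" first.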

zaxxon · March 24, 2013, 2:31pm

There are many different ways to retrieve what you are looking for, with awk or other tools, but yes, awk does have a substr(). Maybe just try it out? Check the manual or any of the internet sites that document awk.

RudiC · March 24, 2013, 2:35pm

awk 'match ($0, /=.*;/){print substr($0, RSTART+1, RLENGTH-2)}' file
128181

mjf · March 24, 2013, 2:39pm

Without substr:

echo 'Center=128181;Length=461' | nawk -F '=|;' '{ print $2 }'

_man · March 24, 2013, 2:51pm

my lines are like this:

chr2L	Center=128181;Length=461

I want this at output for each line:

chr2L	128181

I tried this:

awk '{print $1 "\t" substr($2,8)}' BG3_Ash1-4177_Ave_Rto_sorted_3SDcutOff_0fold_500_360_Top6_MinVal0.gff > test.gff

as you see it returns this:

128181;Length=461

Is there any way to modify this with substr to get the desired output?

Thanks

---------- Post updated at 07:51 PM ---------- Previous update was at 07:47 PM ----------

RudiC, this seems to work:

awk 'match ($2, /=.*;/){print $1 "\t" substr($2,RSTART+1, RLENGTH-2)}' input > output

it is correct, right?
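
One point worth checking about that command: /=.*;/ is greedy, so it runs from the first "=" to the last ";" in the field. For the sample line, which has a single ";", that gives the wanted value. The sketch below uses a hypothetical line with a second ";" (the Score=7 part is invented for illustration) to show the difference against a bounded pattern, /=[^;]*;/.

```shell
# Hypothetical input: a third key=value pair adds a second ";"
line=$(printf 'chr2L\tCenter=128181;Length=461;Score=7')

# Greedy pattern: the match extends to the last ";"
greedy=$(printf '%s\n' "$line" |
    awk 'match($2, /=.*;/) {print $1 "\t" substr($2, RSTART+1, RLENGTH-2)}')

# Bounded pattern: the match stops at the first ";"
bounded=$(printf '%s\n' "$line" |
    awk 'match($2, /=[^;]*;/) {print $1 "\t" substr($2, RSTART+1, RLENGTH-2)}')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

If every line is known to contain exactly one ";" after the "=", both patterns behave the same.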

Scrutinizer · March 24, 2013, 3:35pm

Another way:

awk -F'[\t ;=]*' '{print $1,$3}' OFS='\t'

alister · March 24, 2013, 3:49pm

@man:

You can use the index() function twice on $2 to determine the locations of the first "=" and ";". You can then use those indices to calculate the correct arguments for substr().

Regards,
Alister
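
A minimal sketch of the index() approach described above, assuming tab-separated input with the key=value data in the second field. The variable names start and stop are illustrative only.

```shell
result=$(printf 'chr2L\tCenter=128181;Length=461\n' |
awk -F'\t' -v OFS='\t' '{
    start = index($2, "=")      # position of the first "=" in field 2
    stop  = index($2, ";")      # position of the first ";" in field 2
    # print only when both markers exist and appear in the expected order
    if (start > 0 && stop > start)
        print $1, substr($2, start + 1, stop - start - 1)
}')
printf '%s\n' "$result"
```

The length passed to substr() is stop - start - 1 because neither the "=" nor the ";" should appear in the output.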

Jotne · March 24, 2013, 4:22pm

If there are good separators to use, do use them instead of counting characters.
Scrutinizer's approach is the way to go.

alister · March 24, 2013, 7:14pm

That assertion is unjustified and may fill the OP with unwarranted confidence.

To be clear, I am not suggesting that there is something inherently wrong with Scrutinizer's code. What I am saying is that we don't have enough information to unequivocally state which approach is best.

It's easiest to demonstrate by presenting a hypothetical though very plausible scenario which does not contradict anything the OP has said in this thread:

Let's assume the following conditions are true:
1) The application which generates the data (generator) defines its format as a simple, straightforward, tab-delimited file.
2) There are no constraints on the contents of any field except that they cannot contain a tab.
3) There is some information in the second field which a second application (the consumer, i.e. a suggestion in this thread) needs to extract. This information is delimited by an "=" and a ";".
4) As is typically the case, the generator has no special knowledge of the consumer.

If the consumer uses anything but a single tab as the field separator, the door is opened for spectacular failure (hopefully it's nothing mission critical).

The generator, even if expertly coded with thorough error handling and sanity checks, will not hesitate to include a character which the consumer will treat as a non-tab field separator. To the generator, none of "ch=", ";ch", or "c<spc>h" is noteworthy. To a consumer that's in sync with the generator, also using a single tab to delimit, none of these is a problem. However, to a consumer accepting any of <spc>, <tab>, ;, and = as delimiters ... ouch!

While Scrutinizer's approach is clearly insufficiently robust for use under some circumstances, it may very well be a perfectly fine solution under others (e.g. a quick one-off script or for processing data whose fields comply with additional content restrictions). To which category the OP's situation belongs, we do not know.

Regards,
Alister
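
The scenario above can be illustrated with a small sketch. The input line is hypothetical: its first field contains an "=", which a tab-delimited generator would consider perfectly valid. The broad separator class is written with + rather than * here so that the result does not depend on how a given awk treats empty separator matches; that is a change from the command posted earlier in the thread.

```shell
# Hypothetical line: field 1 contains "=", which is legal in a tab-delimited file
line=$(printf 'ch=r2L\tCenter=128181;Length=461')

# Consumer that accepts space, tab, ";" and "=" as separators:
# the "=" inside field 1 shifts every later field by one
broad=$(printf '%s\n' "$line" |
    awk -F'[\t ;=]+' -v OFS='\t' '{print $1, $3}')

# Consumer that splits records on a single tab only,
# then picks field 2 apart separately
strict=$(printf '%s\n' "$line" |
    awk -F'\t' -v OFS='\t' '{split($2, a, /[;=]/); print $1, a[2]}')

printf 'broad:  %s\nstrict: %s\n' "$broad" "$strict"
```

With this input the broad consumer prints "ch" and "Center", while the tab-only consumer keeps "ch=r2L" intact and still extracts 128181.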

Yoda · March 24, 2013, 10:15pm

Using the awk split function:

awk ' { split($2,A,/[;=]/); print $1,A[2] } ' OFS='\t' file