awk regexp to print repetitive pattern

yifangt · July 15, 2016, 1:23pm

How can I use a regexp to print a repetitive pattern in awk?

$ awk '{print $0, "-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-"}'

output:

 -    -    -    -    -    -    -    -    -    -    -    -

I tried the following, which of course does not give what I want.

awk '{print $0, "-\t{11}-"}'

output:

  -    {11}-

Any help, please? Thanks in advance!

vgersh99 · July 15, 2016, 1:27pm

one way:

awk 'BEGIN{$11=OFS="-\t";print}'</dev/null

yifangt · July 15, 2016, 1:39pm

Thanks!
Can you elaborate your script, please?

MadeInGermany · July 15, 2016, 3:02pm

It is a trick: setting the last field forces awk to rebuild the record using OFS. (The previous fields are empty.) The order matters: OFS must be set first. Therefore I think the following is cleaner:

awk 'BEGIN{OFS="-\t"; $12="-"; print}' </dev/null
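To make the mechanism visible, here is a minimal sketch, assuming a POSIX-style awk and a shell with `od` available. The variable name `row` is only for illustration. Assigning to `$12` on an empty record creates eleven empty fields before it, and awk joins all twelve with OFS when it rebuilds `$0`.

```shell
# Assigning to a field beyond NF makes awk rebuild $0, joining the fields with OFS.
# Fields $1..$11 are empty, so the record is 11 copies of "-\t" plus the final "-".
row=$(awk 'BEGIN { OFS = "-\t"; $12 = "-"; print }')

# Dump the result one character at a time so the tabs show up as \t.
printf '%s\n' "$row" | od -c
```

No input redirection is used here because the program consists of a BEGIN block only.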

Perl has the x operator

perl -e '{print "-\t"x11, "-\n"}'

In bash and zsh:

{ for i in {1..11}; do printf -- "-\t"; done; printf -- "-\n"; }

yifangt · July 15, 2016, 5:24pm

Can I say there is no regexp for this case in awk, similar to your perl script? perl -e '{print "-\t"x11, "-\n"}'
Thanks a lot!

Don_Cragun · July 15, 2016, 5:54pm

No. You can say that a string argument in an awk print statement is a string, not a regular expression. If you want an ERE to select lines containing exactly 11 copies of "-\t" followed by a final "-" and only print those lines, you could try:

awk 'BEGIN{print "start"; OFS="-\t"; $12="-"; print; print "end"}'|awk '/^(-\t){11}-$/'

which uses a regex in the 2nd awk script to print just the matching line out of the three lines printed by the 1st awk script.

Note that redirecting the input from /dev/null won't hurt anything, but it isn't needed in an awk script that only contains one or more BEGIN clauses.

yifangt · July 15, 2016, 7:29pm

Thanks Don and all!
What I was doing was combining two files (each half a million lines) by a matching column. If there is no match (0.01% of lines), I fill in the missing fields with "-", which prompted me to ask whether there is a regex for my purpose:

$ cat file2
XLOC_000001 TCONS_00000001 LOC_Os02g39790.2 47.273 55 24 2 3855 4016 335 385 3.60e-04  
XLOC_000001 TCONS_00000002 LOC_Os02g39790.2 47.273 55 24 2 2368 2529 335 385 2.35e-04  
XLOC_000001 TCONS_00000007 LOC_Os02g39790.2 47.273 55 24 2 3553 3714 335 385 3.09e-04  
XLOC_000001 TCONS_00000009 LOC_Os02g39790.2 47.273 55 24 2 5083 5244 335 385 2.83e-04  
XLOC_000001 TCONS_00000011 LOC_Os02g39790.2 47.273 55 24 2 2200 2361 335 385 2.28e-04   
$ cat file1
TCONS_00000001    W5GKA3_WHEAT
TCONS_00000002    W5GKA3_WHEAT
TCONS_00000011    I1IBH3_BRADI
TCONS_00000009    W5GKA3_WHEAT
TCONS_00000005    I1IBH3_BRADI
TCONS_00000006    I1IBH3_BRADI
TCONS_00000007    W5GKA3_WHEAT

$ awk 'NR==FNR {A[$2]=$0; next}; {if (A[$1]) print $0, A[$1]; else print $0, "-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-" }' file2 file1
TCONS_00000001    W5GKA3_WHEAT XLOC_000001 TCONS_00000001 LOC_Os02g39790.2 47.273 55 24 2 3855 4016 335 385 3.60e-04  
TCONS_00000002    W5GKA3_WHEAT XLOC_000001 TCONS_00000002 LOC_Os02g39790.2 47.273 55 24 2 2368 2529 335 385 2.35e-04  
TCONS_00000011    I1IBH3_BRADI XLOC_000001 TCONS_00000011 LOC_Os02g39790.2 47.273 55 24 2 2200 2361 335 385 2.28e-04   
TCONS_00000009    W5GKA3_WHEAT XLOC_000001 TCONS_00000009 LOC_Os02g39790.2 47.273 55 24 2 5083 5244 335 385 2.83e-04  
TCONS_00000005    I1IBH3_BRADI -    -    -    -    -    -    -    -    -    -    -    -
TCONS_00000006    I1IBH3_BRADI -    -    -    -    -    -    -    -    -    -    -    -
TCONS_00000007    W5GKA3_WHEAT XLOC_000001 TCONS_00000007 LOC_Os02g39790.2 47.273 55 24 2 3553 3714 335 385 3.09e-04

Typing that long string of "-\t" repeatedly is dull, and I felt dizzy counting the number of tabs while typing. I made mistakes and needed to re-run a couple of times to get it correct! So I thought that with a regex I could avoid the mistakes.
Thanks again!

Don_Cragun · July 15, 2016, 9:49pm

I'm a little confused since the code you showed us is printing tabs between the hyphens in the output, but the sample output you provided from that command has four spaces at those spots instead of a single tab. And, the output you showed us should have output lines in the order of the lines in file1, but it doesn't. It also seems strange that there are three trailing spaces on each line in your sample file2, but there are zero, two, or three spaces on the corresponding lines in your sample output. I don't know if that matters for the output you hope to produce, but it makes it hard to guess at what you are really trying to do.

So, with lots of unsupported guesswork, the following (combining earlier suggestions with your code to merge files and making the wild assumption that if you want tabs between hyphens, you might also want tab to be your output field separator) might or might not come close to producing the output you want:

awk '
NR==FNR {
	A[$2] = $0
	if(FNR == 1) {
		NFA = NF
		$0 = ""
		OFS = "-\t"
		$NFA = "-"
		dashes = $0
		OFS = "\t"
	}
	next
}
{	print $0, (($1 in A) ? A[$1] : dashes)
}' file2 file1

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

RudiC · July 16, 2016, 4:55am

You seem to be caught in an error: regexes exist for searching/matching purposes ONLY. You can't print them. For printing a repetitive pattern, you need to type it as is, use a loop, or fall back to tricks as posted before.
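The loop option can be wrapped in a small function. The sketch below is only an illustration: `rep` is a hypothetical helper, not an awk built-in, and the shell variable `filler` is likewise made up for this example.

```shell
filler=$(awk '
# rep(s, n, sep): hypothetical helper, returns n copies of s joined by sep.
# i and out are local variables, declared as extra parameters.
function rep(s, n, sep,    i, out) {
    for (i = 1; i <= n; i++)
        out = out (i > 1 ? sep : "") s
    return out
}
BEGIN { print rep("-", 12, "\t") }
')

printf '%s\n' "$filler"
```

Passing the count as an argument avoids counting tabs by hand, which was the source of the typing mistakes mentioned earlier in the thread.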

RudiC · July 17, 2016, 6:02am

Adding a varying number of fields to an incomplete line might be achieved like this:

awk '
NR == FNR       {A[$2] = $0
                  next
                }
!A[$1]          {A[$1] = sprintf ("%*s", 14-NF, "-")
                 gsub (/ /, "-\t", A[$1])
                }
                {print $0, A[$1]
                }
' file2 file1 
TCONS_00000001    W5GKA3_WHEAT XLOC_000001 TCONS_00000001 LOC_Os02g39790.2 47.273 55 24 2 3855 4016 335 385 3.60e-04  
TCONS_00000002    W5GKA3_WHEAT XLOC_000001 TCONS_00000002 LOC_Os02g39790.2 47.273 55 24 2 2368 2529 335 385 2.35e-04  
TCONS_00000011    I1IBH3_BRADI XLOC_000001 TCONS_00000011 LOC_Os02g39790.2 47.273 55 24 2 2200 2361 335 385 2.28e-04   
TCONS_00000009    W5GKA3_WHEAT XLOC_000001 TCONS_00000009 LOC_Os02g39790.2 47.273 55 24 2 5083 5244 335 385 2.83e-04  
TCONS_00000005    I1IBH3_BRADI -	-	-	-	-	-	-	-	-	-	-	-
TCONS_00000006    I1IBH3_BRADI -	-	-	-	-	-	-	-	-	-	-	-
TCONS_00000007    W5GKA3_WHEAT XLOC_000001 TCONS_00000007 LOC_Os02g39790.2 47.273 55 24 2 3553 3714 335 385 3.09e-04

EDIT: Or just use a for loop...
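A for-loop version of the merge might look like the sketch below. It is not taken from any post above: the two tiny input files are hypothetical stand-ins (four fields instead of twelve), and `filler`, `dir`, and `merged` are illustrative names. The filler is sized from the field count of the first line of the first file, on the assumption that every line there has the same number of fields.

```shell
# Two tiny stand-in files (made-up data, 4 fields instead of 12).
dir=$(mktemp -d)
printf '%s\n' 'X1 T1 a b' 'X1 T3 c d' > "$dir/file2"
printf '%s\n' 'T1 P1' 'T2 P2' 'T3 P3' > "$dir/file1"

merged=$(awk '
NR == FNR {                      # first file: index each line by its 2nd field
    A[$2] = $0
    if (FNR == 1) {              # build the filler once, one "-" per field
        filler = "-"
        for (i = 2; i <= NF; i++)
            filler = filler "\t-"
    }
    next
}
{ print $0, (($1 in A) ? A[$1] : filler) }   # second file: append match or filler
' "$dir/file2" "$dir/file1")

printf '%s\n' "$merged"
rm -r "$dir"
```

Using `($1 in A)` rather than `A[$1]` for the lookup avoids creating empty array entries for keys that have no match.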

yifangt · July 18, 2016, 12:13pm

It started to deviate from my original question, so before it goes too far, I accept the answer by RudiC.

I asked: Can I say there is no regexp for this case in awk, similar to your perl script? perl -e '{print "-\t"x11, "-\n"}'
RudiC answered: You seem to be caught in an error: regexes exist for searching/matching purposes ONLY. You can't print them. For printing a repetitive pattern, you need to type it as is, use a loop, or fall back to tricks as posted before.

Thanks again for all your replies.