awk to place specific contents of filename within text file

cmccabe · January 29, 2016, 11:28am

I am trying to use awk to place the contents of the filename in $1 and $2, followed by the data in the text file. Basically, put the filename within the text file. There are over 1000 files in the directory; as of now each file is saved with a unique name, but that name is not within the file. Thank you :).

Text file:

 LastName,FirstName_123456.txt.hg19_multianno

Desired output:

$1                               $2           $3
LastName,FirstName     123456     data in files (the 24 columns)

awk '{f=$1; $1=$2=""; sub("  ", ""); print > f}' file

RudiC · January 29, 2016, 12:42pm

grep "" *
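For readers unfamiliar with the idiom: the empty pattern matches every line, and when grep is given more than one file it prefixes each output line with the file name and a colon. A minimal sketch, assuming a scratch directory and two made-up file names (neither comes from the thread's data):

```shell
# Scratch directory with two hypothetical sample files.
tmp=$(mktemp -d)
printf 'h1\th2\n1\t2\n' > "$tmp/Doe,Jane_111.txt.hg19_multianno"
printf 'h1\th2\n3\t4\n' > "$tmp/Roe,Rick_222.txt.hg19_multianno"

# The empty pattern matches every line; with several files, grep
# prefixes each output line with "filename:".
out=$(cd "$tmp" && grep "" *)
printf '%s\n' "$out"

rm -rf "$tmp"
```

Note that with a single file argument grep prints no prefix by default, and the whole file name (not just the part before the first dot) is what gets prepended.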

cmccabe · January 29, 2016, 5:37pm

If I only wanted the text up to the first . would:

 grep -o '[[.digit.]].*$' | grep "" *

produce:

 $1                   $2           $3
LastName,FirstName     123456     data in files (the 24 columns)

with the input being:

 LastName,FirstName_123456.txt.hg19_multianno

. Thanks :).

Aia · January 29, 2016, 6:44pm

Are you saying that each file contains only 24 columns in one line?

Let's take a shot at it.

perl -07 -ne '@np=$ARGV =~/^([^_]*)_(\d+)\./ and print "@np $_"'

cmccabe · January 29, 2016, 6:56pm

Each text file contains 24 columns with multiple rows in it. I am trying to print the LastName,FirstName in field 1 and the 123456 in field 2 of row 1. Each new filename has a header in row 1. I will post a sample as soon as I can, I am on my blackberry and can not right now. Thank you :).

Don_Cragun · January 29, 2016, 7:58pm

If I understand what you're trying to do (and I am not at all sure that I do), the following seems to do what you want:

awk '
FNR == 1 {
	if((n = split(FILENAME, name, "_")) < 2) {
		print "********************************"
		printf("Filename (%s) does not fit expected pattern.\n",
		    FILENAME)
		print "********************************"
		exit 1
	}
	split(name[2], number, ".")
}
{	print name[1], number[1], $0
}' OFS='\t' *multianno

which, with your attached sample data (stored in a file named LastName,FirstName_123456.txt.hg19_multianno ) produces the following as the 1st five lines of its output:

LastName,FirstName	123456	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference
LastName,FirstName	123456	1	43394661	43394661	A	exonic	SLC2A1		nonsynonymous SNV	SLC2A1:NM_006516.2:exon8:c.T1016C:p.I339T	0.0002	.	.	.	.	.	0.0001	0.0002	unknown	.	.
LastName,FirstName	123456	2	166870221	166870221	A	intronic	SCN1A				0.01	0.01	.	0.0028	.	0.01	0.0072	0.002	0.0099	.	Common			
LastName,FirstName	123456	9	135802555	135802555	C	intronic	TSC1				0.01	0.0046	.	0.01	.	0.01	0.0075	0.002	0.01	.	Common	untested	Tuberous_sclerosis_database_(TSC1)	TSC1_00008
LastName,FirstName	123456	22	40745898	40745898	C	exonic	ADSL		synonymous SNV	ADSL:NM_000026.3:exon2:c.C216T:p.I72I,ADSL:NM_001123378:exon2:c.C216T:p.I72I	0.01	0.0027	0.002	0.01	.	0.0006	0.0011	0.0003	.

Is this what you're trying to do?

Note, as always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
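The file-name handling in the script above can be looked at in isolation. The following is only a sketch of the two split() calls, run on the sample file name as a plain string (the variable names fname and out are assumptions made for this illustration):

```shell
# Sample file name from the thread, used here as a plain string.
fname='LastName,FirstName_123456.txt.hg19_multianno'

# First split on "_", then split the second piece on "." to isolate the number.
out=$(awk -v f="$fname" -v OFS='\t' 'BEGIN {
	split(f, name, "_")          # name[1] = text before the first "_"
	split(name[2], number, ".")  # number[1] = text before the first "."
	print name[1], number[1]
}')
printf '%s\n' "$out"
```

This relies on the first underscore in the file name separating the name from the number; a name that itself contains an underscore would be split in the wrong place.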

Aia · January 29, 2016, 8:00pm

Assuming the following file names:

ls *unw*
one_1234.unwanted_part  three_3214.unwanted_part  two_2314.unwanted_part

With the following content header, rows and columns:

cat *unw*
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Do you want an output like this?:

perl -07 -ne '@np=$ARGV =~/^([^_]*)_(\d+)\./ and print "@np\n$_"' *unw*
one 1234
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three 3214
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two 2314
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Or do you want this output?:

perl -07 -ne '@np=$ARGV =~/^([^_]*)_(\d+)\./ and print "@np $_"' *unw*
one 1234 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three 3214 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two 2314 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Maybe this output?:

perl -pe '@a=$ARGV =~/^([^_]*)_(\d+)\./; $_="@a $_"' *unw*
one 1234 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
one 1234 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
one 1234 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
one 1234 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three 3214 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
three 3214 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three 3214 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three 3214 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two 2314 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
two 2314 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two 2314 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two 2314 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

cmccabe · January 30, 2016, 8:33am

@Aia the perl below is close, except the 1 in the output should go under column a, and the output should be tab-delimited for Excel (that is how the original input files were). Thank you :).

 perl -07 -ne '@np=$ARGV =~/^([^_]*)_(\d+)\./ and print "@np $_"' *unw*
 
one 1234 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three 3214 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two 2314 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

---------- Post updated at 07:33 AM ---------- Previous update was at 07:10 AM ----------

@Don Cragun
The awk is almost perfect, except the first two fields only need to appear in the header row.

 LastName,FirstName    123456    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    1000G2012APR_ALL    1000G2012APR_AFR    1000G2012APR_AMR    1000G2012APR_ASN    1000G2012APR_EUR    ESP6500si_ALL    ESP6500si_AA    ESP6500si_EA    CG46    common    clinvar    clinvarsubmit    clinvarreference
                                   1    43394661    43394661    A    exonic    SLC2A1        nonsynonymous SNV    SLC2A1:NM_006516.2:exon8:c.T1016C:p.I339T

Aia · January 30, 2016, 11:32am

 perl -ne 'BEGIN{$"="\t"}if($.== 1){@np=$ARGV =~/^([^_]*)_(\d+)\./; $sf = length "@np"; print "@np\t$_"}else{ printf "%s\t%s", " "x$sf, $_}; if(eof){$. = 0};' *unw*

Or:

perl -ne 'BEGIN{$,=$"="\t"}if($fname ne $ARGV){@n=$ARGV =~/^([^_]*)_(\d+)\./; $ind = length "@n"; print @n,$_}else{ print " "x$ind,$_}; $fname=$ARGV' *unw*

 
one     1234    a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
three   3214    a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
two     2314    a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Don's awk version can be modified as:

awk '
FNR == 1 {
	if((n = split(FILENAME, name, "_")) < 2) {
		print "********************************"
		printf("Filename (%s) does not fit expected pattern.\n",
		    FILENAME)
		print "********************************"
		exit 1
	}
	split(name[2], number, ".")
	print name[1], number[1], $0
	gsub(".", " ", name[1])
	gsub(".", " ", number[1])
	next
}
{	print name[1], number[1], $0
}' OFS='\t' *multianno

Don_Cragun · January 30, 2016, 5:18pm

Or, without the <space> padding in the 1st two output fields on non-header lines, you could also modify my awk script this way:

awk '
FNR == 1 {
	if((n = split(FILENAME, name, "_")) < 2) {
		print "********************************"
		printf("Filename (%s) does not fit expected pattern.\n",
		    FILENAME)
		print "********************************"
		exit 1
	}
	split(name[2], number, ".")
	print name[1], number[1], $0
	next
}
{	print "", "", $0
}' OFS='\t' *multianno
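As a quick way to check the header-only behaviour, here is a sketch that runs a trimmed copy of the script above (the file-name sanity check is left out for brevity) on a made-up two-column file in a scratch directory; the real files have 24 tab-separated columns:

```shell
tmp=$(mktemp -d)
# Hypothetical two-column sample standing in for the real 24-column data.
printf 'Chr\tStart\n1\t43394661\n' > "$tmp/LastName,FirstName_123456.txt.hg19_multianno"

out=$(cd "$tmp" && awk '
FNR == 1 {
	split(FILENAME, name, "_")
	split(name[2], number, ".")
	print name[1], number[1], $0   # header row gets the two name fields
	next
}
{	print "", "", $0                # data rows get two empty leading fields
}' OFS='\t' *multianno)
printf '%s\n' "$out"

rm -rf "$tmp"
```

The header line should start with the name and number, and every data line should start with two tabs, so the data stays aligned under the original column headings when opened in a spreadsheet.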

cmccabe · January 30, 2016, 10:51pm

Thank you both :).