Search files in directory for keywords using bash

cmccabe · January 11, 2016, 1:49pm

I have ~100 text files in a directory that I am trying to parse, writing the result to a new file. I am looking for the column titles chr, start, stop, ref, alt in each of the files; those fields should appear somewhere in each of them. The first two fields of each new set of rows should also be printed. Since this is on a Windows OS, I used a "path\to\folder" style path in the bash.

example of files to search (each is a separate file)

name1	1111	chr	start	stop	ref	alt	comment		
		1	10	25	a	t	snp		
		1	20	75	t	-	del		
		2	30	120	-	a	ins		
		10	10	80	a	g	snp		
name2	222	id	chr	start	stop	ref	alt	comment	
		1111	1	10	25	a	g	snp	
name3	333333	id	symbol	chr	start	stop	ref	alt	comment
		222	name	1	20	75	c	-	del
		222	name	2	30	120	-	t	ins

desired output

name1	1111	chr	start	stop	ref	alt
		1	10	25	a	t
		1	20	75	t	-
		2	30	120	-	a
		10	10	80	a	g
name2	222	chr	start	stop	ref	alt
		1	10	25	a	g
name3	333333	chr	start	stop	ref	alt
		1	20	75	c	-
		2	30	120	-	t

Thank you :).

bash tried

for f in "C:\Users/test\Desktop\file\folder*.txt" ; do
        bname=${f##*/}
        pref=${bname%%.bam}
        awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt
done

cjcox · January 11, 2016, 3:00pm

Is this a free-form set of input files? Are those spaces? Fixed-width columns? Can columns run into each other? For example, could you have:

start  stop
5500056000

where start is 55000 and stop is 56000

Just trying to clear up the unknowns...

Are column titles arbitrary, can the columns be in any order?

(I'm sure I could think of more things to ask)

cmccabe · January 11, 2016, 3:08pm

The inputs are Excel xlsx files that I converted to text in VBA, so they should all be tab-separated. The input files are 133 individual text files with the column titles in random order. In some it will be chr,start,stop,ref,alt, in others id,chr,start,stop,ref,alt, and in others name,symbol,id,chr,start,stop,ref,alt. Does this help? Thank you :).

cjcox · January 11, 2016, 3:39pm

Are the lines with column titles always lines that begin with a non-whitespace character (e.g. name1)?

sea · January 11, 2016, 4:01pm

Theoretically, you could have saved them as CSV (Comma-Separated Values).
However, the separator (IFS) here seems to be newline/tab.

Your awk code is invalid.

1 ~/tmp $ LC_ALL=C sh ccmbade.sh 
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                      ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                     ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                    ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                      ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                                          ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                                                       ^ unexpected newline or end of string
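
For comparison, a minimal sketch of a loop that awk will accept, under a few assumptions: the files are named *.txt, and the helper name extract_headers and the _edit.out suffix are made up for illustration. It only prints lines containing all five keywords (the header rows), so it is not yet the desired output. What it illustrates is the syntax: the glob sits outside the quotes so the shell expands it, the awk program is in single quotes so the shell leaves it alone, and the file name is passed as a separate argument after the program.

```shell
# Hypothetical helper: for every .txt file in a directory, write the lines
# that contain all five keywords to <name>_edit.out next to the input.
extract_headers() {
    dir=$1
    for f in "$dir"/*.txt; do          # glob is outside the quotes, so it expands
        [ -e "$f" ] || continue        # nothing matched: the pattern stays literal
        pref=${f%.txt}                 # strip the .txt extension
        # Program in single quotes; the input file is an argument, not part of it.
        awk '/chr/ && /start/ && /stop/ && /ref/ && /alt/' "$f" > "${pref}_edit.out"
    done
}
# usage: extract_headers path/to/folder
```

The output suffix is deliberately not .txt, so a second run does not pick up the results of the first.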

Having 3 different kinds of 'columns' doesn't really help.

You know, you don't have to use awk; you could use regular shell scripting.
If that is easier for you, that is.

That said (and it goes for me too), here is something to get you started:

for f in *.dat ; do
	bname=${f##*/}
	#pref=${bname%%.bam}	## don't have that
	#awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt"
	while read content_line
	do
		if echo "$content_line" | grep -q ^name
		then	
			MODE="default"	# Reset parse mode
			echo "$content_line" | grep -v symbol | grep -q id && MODE=id
			echo "$content_line" | grep -q symbol && MODE=symbol
		fi
		
		case $MODE in
		default)	while read chr start stop ref alt comment;do
					line_print="$chr $start $stop $ref $alt"	# comment column is read but not printed
				done<<<"$content_line" ##>> ccmcbabe.output
				;;
		id)		echo "id handling"	;;
		symbol)		echo "symbol handling"	;;
		esac
		
		echo "$MODE :: $line_print"
	done < "$f"
done

hth
EDIT:
Which then outputs as:

sh ccmbade.sh 
default :: name1 1111 chr start stop 
default :: 1 10 25 a t 
default :: 1 20 75 t - 
default :: 2 30 120 - a 
default :: 10 10 80 a g 
id handling
id :: 10 10 80 a g 
id handling
id :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
0 ~/tmp $
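
The id and symbol branches above are still stubs. One hedged way to fill them in, assuming the three layouts always have exactly these column orders, is to hand read a different variable list per mode. pick_fields is a hypothetical helper name, not part of the script above. Note the caveat: with the default IFS, read strips leading tabs and merges consecutive ones, so this only works for data rows in which none of the wanted fields is empty.

```shell
# Hypothetical helper: pick_fields MODE LINE
# Prints chr, start, stop, ref, alt of one data row, tab-separated.
pick_fields() {
    mode=$1
    line=$2
    # The here-document feeds the line to whichever read the case selects.
    case $mode in
        default) read -r chr start stop ref alt rest ;;
        id)      read -r id chr start stop ref alt rest ;;
        symbol)  read -r id symbol chr start stop ref alt rest ;;
        *)       return 1 ;;            # unknown mode
    esac <<EOF
$line
EOF
    printf '%s\t%s\t%s\t%s\t%s\n' "$chr" "$start" "$stop" "$ref" "$alt"
}
# usage inside the loop: pick_fields "$MODE" "$content_line"
```

Header rows would need their first two fields handled separately, since those are not empty there.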

cmccabe · January 11, 2016, 4:02pm

Yes. If the two files below were used, name1 123 would be file 1 and name2 1234 would be file 2. Does this help? Thank you :).

 name1 123  chr  start  stop  ref  alt 
                      1    10     20    a    -
                      1    30    150  -     aaaa
 name2  1234 chr   start  stop  ref  alt
                       2     220   250   t     c

Scrutinizer · January 11, 2016, 4:02pm

Another way you could try:

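BEGIN {
  FS=OFS="\t"
  header="chr,start,stop,ref,alt"
  # H[1] and H[2] are placeholders for the two leading fields; H[3..n] are the wanted titles
  n=split("x,x," header,H,",")
}
{
  split($0,F)                                  # keep a copy of the original fields
  if($1!="") for(i=3; i<=NF; i++) O[$i]=i      # header row: remember each title's position
  $0=x                                         # clear the record (x is unset, so empty)
  $1=F[1]; $2=F[2]
  for(i=3; i<=n; i++) $i=F[O[H[i]]]            # copy the wanted columns in the fixed order
  print
}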

cd path/to/folder
awk -f /path2/to/script *.txt > file.out
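
One thing that may matter here, as an assumption only since the thread does not say: text files exported on Windows often have CRLF line endings. If these do, the last field of every line ends in a carriage return, and a title in the last column (for example alt) would no longer match. A hedged sketch that strips it before the script sees the data; strip_cr is a hypothetical name:

```shell
# Hypothetical helper: remove one trailing carriage return per line, if present.
# Reads the named files, or standard input when no file is given.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}
# usage: strip_cr *.txt | awk -f /path2/to/script > file.out
```

On files that already use plain newlines it passes the data through unchanged.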

If there are too many files for a single command line, you could instead:

for i in *.txt
do
  cat "$i"
done |
awk -f /path2/to/script > file.out
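
If one output per input file is preferred, as in the original attempt, the same kind of loop can run the script once per file. A minimal sketch, assuming the awk script is saved in a file; run_per_file and the _edit.out suffix are hypothetical names (the suffix avoids .txt so that a rerun does not treat earlier results as input):

```shell
# Hypothetical helper: run_per_file SCRIPT DIR
# Runs the awk script on each .txt file separately and writes
# <name>_edit.out beside each input.
run_per_file() {
    script=$1
    dir=$2
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue                 # the glob matched nothing
        awk -f "$script" "$f" > "${f%.txt}_edit.out"
    done
}
# usage: run_per_file /path2/to/script path/to/folder
```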

--
Output with sample:

name1	1111	chr	start	stop	ref	alt
		1	10	25	a	t
		1	20	75	t	-
		2	30	120	-	a
		10	10	80	a	g
name2	222	chr	start	stop	ref	alt
		1	10	25	a	g
name3	333333	chr	start	stop	ref	alt
		1	20	75	c	-
		2	30	120	-	t
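
The script relies on every header row containing all five titles. A hedged sketch of a sanity check that could be run first to report header rows missing any of them. check_headers is a hypothetical name, and header rows are assumed to be the lines with a non-empty first field, as in the samples:

```shell
# Hypothetical helper: report header rows that lack one of the five titles.
# Exits non-zero if anything is missing.
check_headers() {
    awk -F'\t' '
        BEGIN { n = split("chr,start,stop,ref,alt", want, ",") }
        $1 != "" {                               # header row
            split("", seen)                      # clear the array portably
            for (i = 3; i <= NF; i++) seen[$i] = 1
            for (j = 1; j <= n; j++)
                if (!(want[j] in seen)) {
                    printf "%s:%d: missing column %s\n", FILENAME, FNR, want[j]
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$@"
}
# usage: check_headers *.txt
```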

sea · January 11, 2016, 4:02pm

<--REMOVED-->
Accidentally quoted the post instead of editing it... fixed code (DONE) in Search files in directory for keywords using bash, Post: 302964162