I have ~100 text files in a directory that I am trying to parse and output to a new file. I am looking for the columns chr, start, stop, ref, alt in each of the files. Those fields should appear somewhere in each file. The first two fields of the header row that starts each new set of rows should also be printed. Since this is on a Windows OS, I used "path\to\folder" in the bash.
Example of files to search (each is a separate file):
name1 1111 chr start stop ref alt comment
1 10 25 a t snp
1 20 75 t - del
2 30 120 - a ins
10 10 80 a g snp
name2 222 id chr start stop ref alt comment
1111 1 10 25 a g snp
name3 333333 id symbol chr start stop ref alt comment
222 name 1 20 75 c - del
222 name 2 30 120 - t ins
desired output
name1 1111 chr start stop ref alt
1 10 25 a t
1 20 75 t -
2 30 120 - a
10 10 80 a g
name2 222 chr start stop ref alt
1 10 25 a g
name3 333333 chr start stop ref alt
1 20 75 c -
2 30 120 - t
Thank you :).
bash tried
for f in "C:\Users/test\Desktop\file\folder*.txt" ; do
bname=${f##*/}
pref=${bname%%.bam}
awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt
done
The input files are Excel xlsx files that I converted to text in VBA, so they should all be tab-separated. There are 133 individual text files, with the column titles in random order. In some it will be chr,start,stop,ref,alt, in others id,chr,start,stop,ref,alt, and in others name,symbol,id,chr,start,stop,ref,alt. Does this help? Thank you :).
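A side note on the loop in the attempt: two things would stop it before awk ever runs. A glob placed inside double quotes is not expanded, so the loop sees one literal string rather than the files, and `%%.bam` leaves a name ending in `.txt` unchanged. Below is a minimal sketch of the loop skeleton only. The directory `path/to/folder` and the helper name `out_name` are placeholders for illustration, not anything from the original post.

```shell
# Hypothetical helper: derive the output name for one input file.
out_name() {
    bname=${1##*/}          # strip the directory part
    pref=${bname%.txt}      # strip the .txt suffix
    printf '%s_edit.txt\n' "$pref"
}

# Placeholder directory; forward slashes avoid backslash escaping in bash.
dir="path/to/folder"

# The glob stays outside the quotes so the shell can expand it.
for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue    # no match: the pattern is left unexpanded
    echo "$f -> $(out_name "$f")"
done
```

The `echo` is where the real per-file command would go.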
Having 3 different kinds of 'columns' doesn't really help.
You know, you don't have to use awk; you could use regular scripting.
If that is easier for you, that is.
That said, the same goes for me, so here is something to get you started:
for f in *.dat ; do
    bname=${f##*/}
    #pref=${bname%%.bam} ## don't have that
    #awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt"
    while read -r content_line
    do
        # A line starting with "name" is a header; pick the parse mode from its titles
        if echo "$content_line" | grep -q ^name
        then
            MODE="default" # Reset parse mode
            echo "$content_line" | grep -v symbol | grep -q id && MODE=id
            echo "$content_line" | grep -q symbol && MODE=symbol
        fi
        case $MODE in
            default) while read -r chr start stop ref alt comment; do
                    # comment is read but left out of the printed line
                    line_print="$chr $start $stop $ref $alt"
                done <<<"$content_line" ##>> ccmcbabe.output
                ;;
            id) echo "id handling" ;;         # placeholder, line_print is not updated
            symbol) echo "symbol handling" ;; # placeholder, line_print is not updated
        esac
        echo "$MODE :: $line_print"
    done < "$f"
done
hth
EDIT:
Which then outputs as:
sh ccmbade.sh
default :: name1 1111 chr start stop
default :: 1 10 25 a t
default :: 1 20 75 t -
default :: 2 30 120 - a
default :: 10 10 80 a g
id handling
id :: 10 10 80 a g
id handling
id :: 10 10 80 a g
symbol handling
symbol :: 10 10 80 a g
symbol handling
symbol :: 10 10 80 a g
symbol handling
symbol :: 10 10 80 a g
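The repeated `10 10 80 a g` in that output comes from the `id` and `symbol` branches: they only echo a placeholder and never update `line_print`, so the last row parsed in `default` mode is printed again. One way those branches could be filled in is to skip the extra leading columns before picking the five fields. This is a rough sketch for data rows only (header rows are not handled). The function name `parse_line` is made up for illustration, and the skip counts assume the three layouts shown in the question.

```shell
# Hypothetical helper: print chr start stop ref alt from one data row.
# Usage: parse_line MODE "row text"
parse_line() {
    mode=$1
    row=$2
    case $mode in
        default) skip=0 ;;   # chr is the first data column
        id)      skip=1 ;;   # one id column precedes chr
        symbol)  skip=2 ;;   # id and symbol precede chr
        *)       return 1 ;;
    esac
    set -f                   # no filename expansion while splitting
    set -- $row              # split the row on whitespace
    set +f
    # Give up on rows that are too short for this mode
    [ "$#" -ge $((skip + 5)) ] || return 1
    shift "$skip"
    printf '%s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

parse_line default "1 10 25 a t snp"            # prints: 1 10 25 a t
parse_line id "1111 1 10 25 a g snp"            # prints: 1 10 25 a g
parse_line symbol "222 name 1 20 75 c - del"    # prints: 1 20 75 c -
```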
cd path/to/folder
awk -f /path2/to/script *.txt > file.out
If that is too many files for a single command line, you could instead:
for i in *.txt
do
cat "$i"
done |
awk -f /path2/to/script > file.out
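The awk script that `-f` points at is not shown in the thread. As a rough sketch of what such a script might contain, the program below is wrapped in a shell function so it can be tried directly. It assumes that the five titles chr, start, stop, ref, alt always appear next to each other in that order, that header rows carry two extra leading fields (the name and number) that data rows lack, and that fields are split on whitespace. The function name `extract_cols` is made up for illustration.

```shell
# Hypothetical wrapper around a sketch of the awk program.
# Usage: extract_cols file1.txt file2.txt ...   (or feed it on stdin)
extract_cols() {
    awk '
    {
        # Look for the run "chr start stop ref alt" on this line.
        hdr = 0
        for (i = 1; i <= NF - 4; i++)
            if ($i == "chr" && $(i+1) == "start" && $(i+2) == "stop" &&
                $(i+3) == "ref" && $(i+4) == "alt") { hdr = i; break }
    }
    hdr {
        # Header row: keep the first two fields plus the five titles.
        # Data rows lack those two fields, so their columns sit 2 to the left.
        off = hdr - 2
        print $1, $2, "chr", "start", "stop", "ref", "alt"
        next
    }
    off > 0 && NF >= off + 4 {
        # Data row: print the five columns found under the last header.
        print $off, $(off+1), $(off+2), $(off+3), $(off+4)
    }
    ' "$@"
}
```

For real tab-separated files with empty cells, adding `-F'\t'` and `-v OFS='\t'` to the awk call would keep empty fields from collapsing. To use it with `awk -f` as above, save only the part between the single quotes as the script file.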
--
Output with sample:
name1 1111 chr start stop ref alt
1 10 25 a t
1 20 75 t -
2 30 120 - a
10 10 80 a g
name2 222 chr start stop ref alt
1 10 25 a g
name3 333333 chr start stop ref alt
1 20 75 c -
2 30 120 - t