Extracting a portion of a data file with identifier

Lucky_Ali · December 8, 2009, 11:56am

Hi,
I have a tab-delimited text file in the following format.

1 (the identifier of each group; this note is not in the file, only the number is)
1 3 4 65 56 WERTF
2 3 4 56 56 GHTYHU
3 3 5 64 23 VMFKLG
2
1 3 4 65 56 DGTEYDH
2 3 4 56 56 FJJJCKC
3 3 5 64 23 FNNNCHD
3
1 3 4 65 56 JDHJDH
2 3 4 56 56 FDFDJ
3 3 5 64 23 FHDKF

.
.
.
.
50

1 3 4 56 56 GHTYHU
1 33 4 64 76 WERTF
3 3 5 64 23 VMFKLG

I want to search the entire file for a text given as user input, e.g. WERTF, and then output all the lines that have that text in the 6th column, along with the group identifier.

For example, if I search for 'WERTF', I would like this output:
1 1 3 4 65 56 WERTF
50 1 33 4 64 76 WERTF

where the first number on each line is the identifier.

Is there a good way to do this, using either regular expressions in a script or awk?

Please let me know.

jim_mcnamara · December 8, 2009, 12:06pm

pattern='WERTF'
awk -v pat="$pattern" 'NF==1 {first=$1}
     $NF==pat {print first, $0}' infile > outfile

Lucky_Ali · December 8, 2009, 1:40pm

That worked, but only for an exact match.

Sorry, I forgot to mention this earlier.

Is it possible to output a line if only a portion of the text matches?

For example, if the pattern I give is ERT, can the code be adjusted to match every text that contains ERT as its core?

That is, if I specify the pattern ERT, it should output lines whose text is WERTF, SERTE, AERTk and so on,

as with a regular expression.
Please let me know.

---------- Post updated at 01:40 PM ---------- Previous update was at 12:28 PM ----------

Or please let me know how best to write a regular expression in awk for this problem, i.e. matching a core and ignoring the tails.

Please let me know.

LA

momo.reina · December 8, 2009, 2:21pm

AFAIK, awk will not accept variables that need to be expanded, which means that any pattern matching you want to do will have to be hard coded.

This solution doesn't use awk, which gives you more freedom because the pattern doesn't have to be hard-coded. Unfortunately it's not as elegant as the previous solution:

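#!/bin/bash

# the search pattern is read from standard input (typed by the user)
read pattern
while read line; do
	# a line exactly one character long is taken as the identifier,
	# so only single-digit identifiers (1-9) are recognised
	[ ${#line} == 1 ] && identifier="$line"
	pat=$(echo $line | grep "$pattern")
	[ $? == 0 ] && echo $identifier $pat
done <your_file_here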
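
A side note on the awk question, offered as a sketch rather than a definitive answer: a value passed in with -v can be used as a dynamic regular expression through the ~ operator, so the earlier awk one-liner can be loosened from an exact match to a partial one without hard-coding the pattern. The function name find_partial and the sample data below are made up for illustration.

```shell
#!/bin/sh
# find_partial PATTERN FILE   (hypothetical helper name, not from the thread)
# Prints "identifier line" for every data line whose last field contains PATTERN.
find_partial() {
    awk -v pat="$1" '
        NF == 1   { id = $1; next }   # a one-field line is a group identifier
        $NF ~ pat { print id, $0 }    # dynamic regex: last field contains pat
    ' "$2"
}

# Small made-up sample in the layout from the first post (tab-delimited).
sample=$(mktemp)
printf '1\n1\t3\t4\t65\t56\tWERTF\n2\t3\t4\t56\t56\tGHTYHU\n10\n1\t33\t4\t64\t76\tSERTE\n' > "$sample"

find_partial ERT "$sample"
```

With the pattern ERT this prints the WERTF and SERTE lines, each prefixed with its identifier (1 and 10 in the sample). Because the pattern is treated as a regular expression, characters such as . or * in the user input keep their regex meaning.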

Lucky_Ali · December 8, 2009, 3:32pm

Thanks,
How do I run it? Do I save it as a shell script in a .sh file and run that?
Also, for 'read pattern', do I just write read GTKDH?

Please let me know.

LA

---------- Post updated at 03:32 PM ---------- Previous update was at 02:29 PM ----------

I tried the second code (shell) and it worked very well.
But there is a problem with the identifier numbering in the output file.

For identifiers less than 10, the program output the correct number, but when the identifier is 10 or greater, the identifier in the output is always 9.

This is an example of the real output data.

1 3 9 36 281 2.0e+004 ATTGCATGC
2 4 12 50 403 1.3e+005 GCATGCAAATTT
7 8 15 9 90 7.2e+008 TGCATGCAAAAATGC
9 8 7 14 103 3.4e+008 GCATGCA
9 2 7 35 293 1.4e-004 GCATGCA
9 3 11 27 225 1.5e+006 GCATGCAAAAT
9 3 9 31 273 1.8e-004 TTGCATGCA
9 7 7 9 75 4.4e+005 TGCATGC
9 1 9 21 186 4.3e-002 TGCATGCAA
9 1 19 12 165 3.9e-005 TGGCGGGAAATGCATGCAG
9 1 20 49 538 1.4e-036 TTTAAAATTGCATGCATGCA
9 6 7 17 132 1.7e+007 GCATGCA
9 4 11 14 128 2.2e+006 TGCATGCACAC
9 4 7 20 145 6.0e+008 TGCATGC
9 3 9 15 149 5.7e-001 TGCATGCAA
9 1 9 25 231 7.3e-007 GCATGCAAA
9 1 16 34 357 5.9e-014 AAATTTGCATGCAAAC
9 5 11 8 88 2.5e+004 AAATGCATGCA
9 7 7 10 86 1.6e+005 TGCATGC
9 4 9 18 150 6.7e+006 TTTGCATGC
9 1 16 45 480 4.6e-034 GCATGCATTTGGCGCC
9 3 9 45 360 3.0e+002 CTTGCATGC
9 3 9 16 150 6.3e+000 GCATGCAAA
9 5 9 8 80 2.7e+004 TGCATGCAA
9 4 9 16 157 1.5e-001 GCATGCAAA
9 1 14 32 347 1.3e-022 GTTGCATGCATGCA
9 2 9 9 89 2.0e+004 TGCATGCAC
9 3 7 14 116 1.7e+005 CGCATGC
9 1 12 21 223 1.1e-012 TTTTGCATGCAA
9 3 9 16 150 1.0e+001 TGCATGCAA
9 6 9 17 142 3.2e+007 GCATGCACA
9 4 9 6 62 2.1e+005 GCATGCAAA
9 2 9 14 144 2.0e-002 TGCATGCAA
9 3 8 15 121 1.0e+005 GCATGCAA
9 6 9 14 117 2.6e+006 TGCATGCAT
9 2 16 13 163 2.6e-005 ATTTGCATGCATTCAA
9 3 12 42 378 1.8e-007 ATATGCATGCAA
9 2 9 54 468 5.3e-012 TTGCATGCA

Please let me know how to correct this problem.

LA
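
The stuck identifier most likely comes from the length test in the shell script, [ ${#line} == 1 ]: it only treats a line that is exactly one character long as an identifier, so 10, 11 and so on are never picked up and the last single-digit value, 9, stays in place. Below is a minimal sketch of one way around it, assuming identifier lines contain nothing but digits. The function name with_identifier and the sample data are hypothetical.

```shell
#!/bin/sh
# with_identifier PATTERN FILE   (hypothetical helper name, not from the thread)
with_identifier() {
    pattern=$1
    identifier=
    while IFS= read -r line; do
        case $line in
            ''|*[!0-9]*)
                # Not purely digits: a data (or blank) line. Print it if it matches.
                if printf '%s\n' "$line" | grep -q -- "$pattern"; then
                    printf '%s %s\n' "$identifier" "$line"
                fi
                ;;
            *)
                # Only digits: a group identifier, however many digits it has.
                identifier=$line
                ;;
        esac
    done < "$2"
}

# Made-up sample with a two-digit identifier.
sample=$(mktemp)
printf '9\n1\t7\tGCATGCA\n10\n2\t9\tTGCATGCAA\n' > "$sample"

with_identifier GCATGC "$sample"
```

In this sample the second match is prefixed with 10 rather than 9. Unlike the original loop, the matching line is printed with its tabs intact.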

Spartukus · December 8, 2009, 3:38pm

Hey guys, I keep having problems with the script below: syntax error near unexpected token '0' exit 0. I have two directories, Backups and Usr. In Usr I have the subdirectories wp, ss and pic, which I would like to back up (copy to the Backups directory) with user acknowledgement from the command line. I would then like to be able to restore those files from Backups back to Usr. The code is below; if anyone could be so kind as to help me out, it would be much appreciated. Thanks
-----------------------------------------------------------------------------------------
echo "please choose backup or restore"
read bor case "$bor" in
backup )
ls -d Usr/*
echo "please choose directory to backup (example wp,ss,pic)"
read dir
case "$dir" in
wp )
echo "word processor directory backedup"
cp -r Usr/wp Backups/wp;;
ss )
echo "word processor directory backedup"
cp -r Usr/ss Backups/ss;;
pic )
echo "word processor directory backedup"
cp -r Usr/pic Backups/pic;;
restore )
ls -d Backups/*
echo "please choose directory to restore (wp,ss,pic)"
read bdir
case "$bdir" in
wp )
echo "word processor directory restored"
cp -r Backups/wp Usr/wp;;
ss )
echo "word processor directory restored"
cp -r Backups/ss Usr/ss;;
pic )
echo "word processor directory restored"
cp -r Backups/pic Usr/pic;;
exit 0
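
For reference, two things in the script as posted would stop the shell from parsing it: "read bor" and "case" share one line, and none of the case statements is closed with esac. The following is a restructured sketch, not the poster's script: it folds the two nested case blocks into one source/destination choice and wraps everything in a hypothetical function, backup_or_restore, that reads both answers from standard input. The directory names are the ones from the post.

```shell
#!/bin/sh
# backup_or_restore: hypothetical wrapper; expects Usr/ and Backups/ under
# the current directory and reads two answers from standard input.
backup_or_restore() {
    echo "please choose backup or restore"
    read -r bor
    case "$bor" in
        backup)  src=Usr;     dst=Backups ;;
        restore) src=Backups; dst=Usr ;;
        *) echo "unknown choice: $bor" >&2; return 1 ;;
    esac                      # every case needs its closing esac

    ls -d "$src"/*
    echo "please choose directory (wp, ss, pic)"
    read -r dir
    case "$dir" in
        wp|ss|pic)
            # Copy the chosen subdirectory into the destination directory.
            mkdir -p "$dst" && cp -r "$src/$dir" "$dst/" &&
                echo "$dir directory copied from $src to $dst"
            ;;
        *) echo "unknown directory: $dir" >&2; return 1 ;;
    esac
}
```

Calling backup_or_restore and answering "backup" and then "wp" copies Usr/wp to Backups/wp; answering "restore" copies in the other direction.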

Lucky_Ali · December 8, 2009, 3:42pm

Spartukus

Please start a new thread for your question. That way my questions won't be ignored or unattended.

Sorry for the inconvenience.

LA

Spartukus · December 8, 2009, 3:44pm

Sorry for the hijack, but I really don't know how to create a post. The first time I tried, it said I have to buy access?

rdcwayx · December 8, 2009, 7:12pm

A similar post is here: Getting the correct identifier in the output file

# assumes every group is exactly one identifier line followed by three data lines
awk '/GTKDH/ {print int((NR-1)/4)+1,$0}' urfile

summer_cherry · December 8, 2009, 10:24pm

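nawk '{
if (/^[0-9]+$/) {
  id = $0        # a digits-only line is a group identifier
}
else {
  print id, $0   # prefix every other line with the current identifier
}
}' a.txt | grep -- "$1"   # $1 is the search pattern passed to the script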