Delete columns if a pattern is met

zajtat · February 11, 2013, 6:40pm

Hi,

I'd like to ask for some help with the following task, please:

there is a big file with a header (this is file.in):

NAME A_1.X A_1.Y A_1.Z B_1.X B_1.Y B_1.Z
name1 AB 0.11 0.12 BB 0.45 0.67 
name2 BB 0.34 0.56 AA 0.89 0.68

What I need is to recognize a pattern in the header of this file (the patterns are in another file) and delete every column whose header contains it.

for example, the file with the pattern looks like this (this is file.with.patterns)

A_1
A_2
C_4
D_7

So, it would recognize A_1 and delete all the columns whose header contains A_1; thus, the output would look like this (this is file.out):

NAME B_1.X B_1.Y B_1.Z
name1 BB 0.45 0.67 
name2 AA 0.89 0.68

I am not sure I've got the best approach. What I was thinking of doing is to put all the columns whose header does not contain the specified pattern into one output file (so the columns whose header does match the pattern will be left out, i.e. deleted):

while read i
do
awk 'NR==1{for(a=1,a<=NF;a++) if ($a!~/$i/)f[n++]=a}
{for(a=0;a<=n;i++)printf"%s%s",a?":"",$f[a];print''} file.in >> file.out
done < file.with.patterns

One problem is that I would like all the columns whose header does not match any of the patterns in file.with.patterns to end up in file.out, and I am not sure the append operator (>>) would do that. It didn't really work well so far...

Another option I was thinking about is to work out the numbers of the columns whose header contains the pattern and then delete them with cut -f, but I don't know how to do that.

Any ideas will be greatly appreciated!

Many thanks for your time!
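
The cut -f idea mentioned above could be sketched roughly like this in Python. This is only an illustration, not something from the original posts: the function name cut_field_list is made up, and it assumes every header to drop looks like "<pattern>.<suffix>" (e.g. A_1.X) and that the columns are separated by whitespace.

```python
def cut_field_list(header_line, patterns):
    """Return the 1-based numbers of the columns to KEEP, comma-separated,
    in the form that cut -f expects."""
    keep = []
    for number, name in enumerate(header_line.split(), start=1):
        prefix = name.split(".", 1)[0]   # "A_1.X" -> "A_1", "NAME" -> "NAME"
        if prefix not in patterns:
            keep.append(str(number))
    return ",".join(keep)


# Header and patterns taken from the example in the question
header = "NAME A_1.X A_1.Y A_1.Z B_1.X B_1.Y B_1.Z"
patterns = {"A_1", "A_2", "C_4", "D_7"}
print(cut_field_list(header, patterns))   # prints 1,5,6,7
```

The printed list could then be handed to cut, e.g. cut -d' ' -f1,5,6,7 file.in for a space-separated file (drop the -d' ' part if the file is tab-separated, since tab is cut's default delimiter).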

kg_gaurav · February 11, 2013, 6:54pm

From your post, what I understood is that you need a file containing the filtered header, where the filter rules are given by another file.
Let's say the rule file is rule.txt (containing A_1, A_2, etc.) and the file with the header is file.html:

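# put the header fields one per line so grep can filter them individually
head -1 file.html | tr ' \t' '\n\n' > temp

for pat in $(cat rule.txt)
do
  # drop every header field that starts with "<pattern>."
  grep -v "^$pat\." temp > temp1
  cat temp1 > temp
done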

rdrtx1 · February 11, 2013, 6:57pm

try also:

awk '
NR==FNR {p[$0]=$0; next}
FNR==1 {for (i=1; i<=NF; i++) {s=$i; sub("[.].*","",s); if (p[s]) o[i]=s}}
{l=""; for (i=1; i<=NF; i++) if (!o[i]) l=l $i" "; $0=l;}
1
' file.with.patterns file.in > file.out

Chubler_XL · February 11, 2013, 6:59pm

try

awk '
  FNR==NR{P[$1];next}
  FNR==1{
    for(i=1;i<=NF;i++) {
      c=$i
      sub(/\.[XYZ]$/,"",c)
      if(c in P)S[i]
    }
  }
  { a=x
    for(i=1;i<=NF;i++)
    if(!(i in S)) a=a " " $i;
    print substr(a,2)
  }' file.with.patterns file.in > file.out

Edit: Just a bit late with an awk solution, but this one doesn't put a space on the end of each output line.

Scrutinizer · February 11, 2013, 7:21pm

Another awk:

awk '
  NR==FNR{
    A[$1]
    next
  }
  {
    s=$1
    for(i=2; i<=NF; i++){
      if(FNR==1) for(j in A) if($i~j) D[i]
      if( ! (i in D) ) s=s OFS $i
    }
    print s
  }
' file.with.patterns file.in

zajtat · February 12, 2013, 9:17am

Thanks a lot for the scripts! They work perfectly on the example file. However, they do nothing for my big file. I am not sure, but maybe it is the field separator? The big file's columns are separated by a space, not a tab; could that affect the script?

Many thanks in advance!

elixir_sinari · February 12, 2013, 9:22am

Are there any carriage return characters in the 2 files? Post the output of:

head -2 file.with.patterns|od -bc

and

head -2 file.in|od -bc

.
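
If od is not at hand, the same check can be sketched in Python by reading the data as bytes. This is only a rough illustration: the helper name count_crlf_lines is made up, and the sample bytes below are shortened stand-ins for real file contents.

```python
def count_crlf_lines(data):
    """Count the lines in a bytes object that end with CR LF (DOS line endings)."""
    return sum(1 for line in data.splitlines(keepends=True)
               if line.endswith(b"\r\n"))


# Shortened samples: one Unix-style, one DOS-style
unix_sample = b"A_327\nA_328\n"
dos_sample = b"BB\t0.0168\t0.3712777\r\n"

print(count_crlf_lines(unix_sample))   # prints 0
print(count_crlf_lines(dos_sample))    # prints 1

# For a real file, read it in binary mode so nothing gets translated:
#   with open("file.in", "rb") as f:
#       print(count_crlf_lines(f.read()))
```

Any count above zero means the file has DOS line endings somewhere.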

zajtat · February 12, 2013, 9:49am

OK!
so, here is the output of the file.with.patterns

0000000   101 137 063 062 067 012 101 137 063 062 070 012                
           A   _   3   2   7  \n   A   _   3   2   8  \n                
0000014

and this is the output of the file.in

0000000   116 141 155 145 011 103 150 162 011 120 157 163 151 164 151 157
           N   a   m   e  \t   C   h   r  \t   P   o   s   i   t   i   o
0000020   156 011 101 137 062 070 070 056 107 124 171 160 145 011 101 137
           n  \t   A   _   2   8   8   .   G   T   y   p   e  \t   A   _
0000040   062 070 070 056 130 011 101 137 062 070 070 056 131 011 101 137
           2   8   8   .   X  \t   A   _   2   8   8   .   Y  \t   A   _
0000060   062 070 071 056 107 124 171 160 145 011 101 137 062 070 071 056
           2   8   9   .   G   T   y   p   e  \t   A   _   2   8   9   .
0000100   130 011 101 137 062 070 071 056 131 011 101 137 062 071 060 056
           X  \t   A   _   2   8   9   .   Y  \t   A   _   2   9   0   .
0000120   107 124 171 160 145 011 101 137 062 071 060 056 130 011 101 137
           G   T   y   p   e  \t   A   _   2   9   0   .   X  \t   A   _
0000140   062 071 060 056 131 011 101 137 062 071 061 056 107 124 171 160
           2   9   0   .   Y  \t   A   _   2   9   1   .   G   T   y   p
0000160   145 011 101 137 062 071 061 056 130 011 101 137 062 071 061 056
           e  \t   A   _   2   9   1   .   X  \t   A   _   2   9   1   .
0000200   131 011 101 137 062 071 062 056 107 124 171 160 145 011 101 137
           Y  \t   A   _   2   9   2   .   G   T   y   p   e  \t   A   _
0000220   062 071 062 056 130 011 101 137 062 071 062 056 131 011 101 137
           2   9   2   .   X  \t   A   _   2   9   2   .   Y  \t   A   _
0000240   062 071 063 056 107 124 171 160 145 011 101 137 062 071 063 056
           2   9   3   .   G   T   y   p   e  \t   A   _   2   9   3   .
0000260   130 011 101 137 062 071 063 056 131 011 101 137 062 071 064 056
           X  \t   A   _   2   9   3   .   Y  \t   A   _   2   9   4   .
0000300   107 124 171 160 145 011 101 137 062 071 064 056 130 011 101 137
           G   T   y   p   e  \t   A   _   2   9   4   .   X  \t   A   _
0000320   062 071 064 056 131 011 101 137 062 071 065 056 107 124 171 160
           2   9   4   .   Y  \t   A   _   2   9   5   .   G   T   y   p
0000340   145 011 101 137 062 071 065 056 130 011 101 137 062 071 065 056
           e  \t   A   _   2   9   5   .   X  \t   A   _   2   9   5   .
0000360   131 011 101 137 062 071 067 056 107 124 171 160 145 011 101 137
           Y  \t   A   _   2   9   7   .   G   T   y   p   e  \t   A   _
0000400   062 071 067 056 130 011 101 137 062 071 067 056 131 011 101 137
           2   9   7   .   X  \t   A   _   2   9   7   .   Y  \t   A   _

sorry the output for the file.in was too big, hope this will be OK

Many thanks for your kind help!!!

---------- Post updated at 09:49 AM ---------- Previous update was at 09:47 AM ----------

don't know if that would help, but here is the end of the output for the file.in

0227300   062 065 064 061 064 061 011 102 102 011 060 056 060 061 066 070
           2   5   4   1   4   1  \t   B   B  \t   0   .   0   1   6   8
0227320   060 066 064 061 011 060 056 063 067 061 062 067 067 067 015 012
           0   6   4   1  \t   0   .   3   7   1   2   7   7   7  \r  \n
0227340

Many thanks for your time!

Scrutinizer · February 12, 2013, 12:08pm

There is a carriage return character before the line feed in the last sample, which means the file is in DOS format. You can convert it to Unix format like so:

tr -d '\r' < file.in > file.out
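
To see why the stray carriage return matters, here is a small Python sketch of what the tr command above does. The function name strip_carriage_returns and the sample line are made up for illustration; the point is that the \r stays glued to the last field of every line until it is removed.

```python
def strip_carriage_returns(data):
    """Remove every carriage return byte, like tr -d '\\r' does."""
    return data.replace(b"\r", b"")


dos_line = b"name1\tBB\t0.3712777\r\n"

# Before the conversion the last field still carries the \r
print(dos_line.rstrip(b"\n").split(b"\t")[-1])

# After the conversion the last field is clean
unix_line = strip_carriage_returns(dos_line)
print(unix_line.rstrip(b"\n").split(b"\t")[-1])
```

A field that silently ends in \r no longer compares equal to the same text without it, which is the kind of mismatch that makes a script appear to do nothing.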

zajtat · February 13, 2013, 7:35am

yap! worked like a dream!

Many, many thanks for your kind help!

summer_cherry · February 21, 2013, 5:11am

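import re

# b.txt holds the patterns (one per line), a.txt is the data file with the header
with open("b.txt") as f:
    patterns = [line.strip() for line in f if line.strip()]

# a header is dropped when it starts with "<pattern>.", e.g. A_1.X
drop_re = re.compile(r"(?:%s)\." % "|".join(map(re.escape, patterns)))

drop = set()
with open("a.txt") as f:
    for lineno, line in enumerate(f):
        items = line.split()   # splits on spaces or tabs, also strips \r\n
        if lineno == 0:
            # remember the indexes of the columns whose header matches
            drop = {i for i, name in enumerate(items) if drop_re.match(name)}
        print(" ".join(x for i, x in enumerate(items) if i not in drop))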