Help with generating a script

I am a biologist who is new to Linux and I'm having difficulty writing a script to do what I want! I have tried basic grep commands, but even those do not give me back the data I want.

I have many files that are all currently in .xlsx format and I'm not sure if they need to be .csv or .txt for this to work... Each of these files has ~90,000 lines.

Basically I want a script that will ask me what I am looking for:

echo Please input gene name you wish to look for
read GeneName

Then I want to look through all of the lines in all of my files (~200) and see if $GeneName is present in field 13 (it may be present multiple times in the one file, or not at all).

If it is present in field 13, I want the entire line/s it is in, plus the file name/s it is found in to be printed to a .txt file.

I have tried using the grep command and it just gives me back all lines from the file, not just the lines containing $GeneName. I apologise that I do not have a better Unix background, but I have been trying for days and I just cannot get it working!!! I would appreciate any help.

Kelly

Hi Kelly,
Please provide some sample data from the files.

Regards,
Mayur

So the files all come in this format below:

There is a heading line at the top of each file, and each subsequent line starts with chr. For this example, I would be wanting to identify the lines containing WASH7P in file1.txt and file2.txt (or if this would work with .csv files?)

file1.txt

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	14907	14907	A	G	het	108	52	39	snp131	rs6682375	ncRNA	WASH7P	.	.	rs6682375	.	.
chr01	14930	14930	A	G	het	148	62	44	snp131	rs6682385	ncRNA	WASH7P	.	.	rs6682385	1000g2010nov_all	0.71
chr01	761752	761752	C	T	hom	225	69	69	snp131	rs1057213	ncRNA	NCRNA00115	.	.	rs1057213	1000g2010nov_all	0.544
chr01	761800	761800	A	T	hom	42	11	11	snp131	rs1064272	ncRNA	NCRNA00115	.	.	rs1064272	1000g2010nov_all	0.114

file2.txt

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	17556	17556	C	T	het	43	30	9	.	.	ncRNA	WASH7P	.	.	.	.	.
chr01	69511	69511	A	G	hom	225	106	106	snp131	rs2691305	exonic	OR4F5	nonsynonymous SNV	"OR4F5:NM_001005484:exon1:c.A421G:p.T141A,"	rs2691305	1000g2010nov_all	0.789
chr01	761732	761732	C	T	hom	225	103	102	snp131	rs2286139	ncRNA	NCRNA00115	.	.	rs2286139	1000g2010nov_all	0.537

I would like be asked to type in GeneName (i.e. WASH7P) and the output in a .txt file to be something like:

file1.txt:chr01	14907	14907	A	G	het	108	52	39	snp131	rs6682375	ncRNA	WASH7P	.	.	rs6682375	.	.
file1.txt:chr01	14930	14930	A	G	het	148	62	44	snp131	rs6682385	ncRNA	WASH7P	.	.	rs6682385	1000g2010nov_all	0.71
file2.txt:chr01	17556	17556	C	T	het	43	30	9	.	.	ncRNA	WASH7P	.	.	.	.	.

Many thanks,

Kelly

Hi Kelly,
Go through this and let me know if there's any problem. Use code tags when posting your queries. The code below is working fine for me and giving the expected output.

 echo "the text"
         read Genename
         grep "$Genename" file1.txt > somefile.txt

Regards,
Mayur

You will need to convert to .txt or .csv first, as awk/grep work on text files, not binary formats.

Try this:

echo Please input gene name you wish to look for
read GeneName
for file in *.txt
do
    awk -v F=$file -vN=$GeneName '$13 ~ N { print F": "$0 }' $file
done

Hi Kelly,
Missed one detail use following line for grep

 grep -H "$Genename" file1.txt > somefile.txt 

Regards,
Mayur


Thank you so so much Chubler_XL, this is what happened when I ran your script:

Please input gene name you wish to look for
SOD1
awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

Any clue why?

Something I forgot to mention, I would like the output file to be $GeneName_date.txt please

Thank you so much again for your help,

Kelly

---------- Post updated at 09:22 PM ---------- Previous update was at 09:16 PM ----------

Dear Mayur,

Thank you so much for providing me with this script, but I am having the same problem as I have had before

When I ran your script

echo "Please input gene name you wish to look for"
         read GeneName
         grep -H $GeneName *.txt > $GeneName.txt

All it did was concatenate the 6 .txt files I have in that directory. I just want the individual lines from these files that contain $GeneName.

Thank you again for your help.

Kelly

P.S. Maybe I can email two of my data files?

Not sure what that "invalid option" error is. What's your Unix/Linux system and version of awk? In short, what's the output of the following commands?

uname -a
uname --all
awk
awk --version

You may want to try this script:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  awk -vNAME=$GENE '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done

Assuming the files "file1.txt" and "file2.txt" are tab-delimited files in the current directory, the execution of this script is as follows -

$
$
$ cat file1.txt
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	14907	14907	A	G	het	108	52	39	snp131	rs6682375	ncRNA	WASH7P	.	.	rs6682375	.	.
chr01	14930	14930	A	G	het	148	62	44	snp131	rs6682385	ncRNA	WASH7P	.	.	rs6682385	1000g2010nov_all	0.71
chr01	761752	761752	C	T	hom	225	69	69	snp131	rs1057213	ncRNA	NCRNA00115	.	.	rs1057213	1000g2010nov_all	0.544
chr01	761800	761800	A	T	hom	42	11	11	snp131	rs1064272	ncRNA	NCRNA00115	.	.	rs1064272	1000g2010nov_all	0.114
$
$
$ cat file2.txt
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	17556	17556	C	T	het	43	30	9	.	.	ncRNA	WASH7P	.	.	.	.	.
chr01	69511	69511	A	G	hom	225	106	106	snp131	rs2691305	exonic	OR4F5	nonsynonymous SNV	"OR4F5:NM_001005484:exon1:c.A421G:p.T141A,"	rs2691305	1000g2010nov_all	0.789
chr01	761732	761732	C	T	hom	225	103	102	snp131	rs2286139	ncRNA	NCRNA00115	.	.	rs2286139	1000g2010nov_all	0.537
$
$
$ cat search.sh
echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  awk -vNAME=$GENE '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done
$
$
$ # Now run the script
$
$ . search.sh
Please input gene name you wish to look for
WASH7P
The output file is WASH7P20110914.txt
$
$ # Display the content of the output file
$
$ cat WASH7P20110914.txt
file1.txt: chr01	14907	14907	A	G	het	108	52	39	snp131	rs6682375	ncRNA	WASH7P	.	.	rs6682375	.	.
file1.txt: chr01	14930	14930	A	G	het	148	62	44	snp131	rs6682385	ncRNA	WASH7P	.	.	rs6682385	1000g2010nov_all	0.71
file2.txt: chr01	17556	17556	C	T	het	43	30	9	.	.	ncRNA	WASH7P	.	.	.	.	.
$
$
$

Or you could try the following script that uses Perl -

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  perl -F"\t" -lane "print \"\$ARGV:\$_\" if \$F[12] eq $GENE" $file >> $OUT
done

The execution of the script:

$
$
$ rm WASH7P20110914.txt
$
$ # Display the script content
$
$ cat search1.sh
echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  perl -F"\t" -lane "print \"\$ARGV:\$_\" if \$F[12] eq $GENE" $file >> $OUT
done
$
$
$ # Now run the script
$
$ . search1.sh
Please input gene name you wish to look for
WASH7P
The output file is WASH7P20110914.txt
$
$
$ # Display the content of the output file
$
$ cat WASH7P20110914.txt
file1.txt:chr01	14907	14907	A	G	het	108	52	39	snp131	rs6682375	ncRNA	WASH7P	.	.	rs6682375	.	.
file1.txt:chr01	14930	14930	A	G	het	148	62	44	snp131	rs6682385	ncRNA	WASH7P	.	.	rs6682385	1000g2010nov_all	0.71
file2.txt:chr01	17556	17556	C	T	het	43	30	9	.	.	ncRNA	WASH7P	.	.	.	.	.
$
$
$

tyler_durden


I don't believe anyone has addressed this point:

Kelly, to use durden_tyler's solution you do need to export the files as tab-delimited text files. I assume you mean Excel spreadsheet files (.xlsx). The xlsx format is a proprietary binary format (probably a zipped XML document now, but still a proprietary format).

Andrew

Hi Tyler_Durden,

Thank you for your help, unfortunately the script is still not working. I have tried it on the two computers in my laboratory running linux. Here is the command output you suggested from computer 1 (via Terminal on a MacBook Pro):

$ uname -a
Darwin anzac-172-16-75-136.anzac.edu.au 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
$ uname --all
uname: illegal option -- -
usage: uname [-amnprsv]
$ awk
usage: awk [-F fs] [-v var=value] [-f progfile | 'prog'] [file ...]
$ awk --version
awk version 20070501

And computer2 (running RedHat):

$ uname -a
Linux neuro.anzac.edu.au 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39
EST 2011 x86_64 x86_64 x86_64 GNU/Linux
$ uname --all
Linux neuro.anzac.edu.au 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39
EST 2011 x86_64 x86_64 x86_64 GNU/Linux
$ awk
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options:
       -f progfile             --file=progfile
       -F fs                   --field-separator=fs
       -v var=val              --assign=var=val
       -m[fr] val
       -W compat               --compat
       -W copyleft             --copyleft
       -W copyright            --copyright
       -W dump-variables[=file]        --dump-variables[=file]
       -W exec=file            --exec=file
       -W gen-po               --gen-po
       -W help                 --help
       -W lint[=fatal]         --lint[=fatal]
       -W lint-old             --lint-old
       -W non-decimal-data     --non-decimal-data
       -W profile[=file]       --profile[=file]
       -W posix                --posix
       -W re-interval          --re-interval
       -W source=program-text  --source=program-text
       -W traditional          --traditional
       -W usage                --usage
       -W version              --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
       gawk '{ sum += $1 }; END { print sum }' file
       gawk -F: '{ print $1 }' /etc/passwd
$ awk --version
GNU Awk 3.1.5
Copyright (C) 1989, 1991-2005 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.

So when I ran your first script on computer 1, I got the following output again:

$ ./3SNPs_in_gene.sh 
Please input gene name you wish to look for
WASH7P
The output file is WASH7P20110915.txt
awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

$

And when I ran script 2 on computer 1:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  perl -F"\t" -lane "print \"\$ARGV:\$_\" if \$F[12] eq $GENE" $file >> $OUT
done

The WASH7P20110915.txt file was empty.

Similarly, when I ran both scripts on computer 2, the WASH7P20110915.txt file was empty.

If you could help that would be great - thank you so much already for your help. Also, when the script looks through *.txt, will that include the $OUT file itself?

Kelly

There should be a space between "-v" and NAME=$var.

You should also be quoting it so it doesn't split on spaces.

So:

awk -v NAME="${VAR}"
awk -v pattern="WASH7P" '$13 ~ pattern {print FILENAME":"$0}' file1.txt file2.txt > out.txt

If you are using solaris, use nawk

--ahamed

---------- Post updated at 03:50 PM ---------- Previous update was at 03:41 PM ----------

or

grep WASH7P file1.txt file2.txt >> out.txt

--ahamed
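Putting ahamed's points together with the date-stamped output Kelly asked for: the sketch below uses a space after -v, double quotes around the variables, and a .out extension so the result file can never be picked up by the *.txt glob on a later run (which answers Kelly's question about the script re-reading its own output). The demo data and filenames here are made up for illustration.

```shell
# Made-up demo data: the 13th tab-separated field holds the gene name
printf 'chr01\t1\t2\tA\tG\thet\t10\t5\t3\tsnp\trs1\tncRNA\tWASH7P\t.\n'     > demo_a.txt
printf 'chr01\t3\t4\tC\tT\thom\t20\t9\t9\tsnp\trs2\tncRNA\tNCRNA00115\t.\n' > demo_b.txt

GENE=WASH7P                             # the interactive script would use: read GENE
OUT="${GENE}_$(date '+%Y%m%d').out"     # .out so it never matches the *.txt glob
for file in *.txt
do
  awk -v NAME="$GENE" '$13 ~ NAME { print FILENAME": "$0 }' "$file" >> "$OUT"
done
cat "$OUT"
```

Only demo_a.txt contains WASH7P in field 13, so only its line (prefixed with the filename) lands in the output file.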

Thank you Corona688, this stopped the "invalid -v option" error, but the output file is still empty.

This is what I am using:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  awk -v NAME="$GENE" '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done

I have attached 2 example files. A GENE that is in both (thus will give an output) is SOX13.

Many thanks,

Kelly

I think there is some issue with the file type.

#your original file showed this and a normal grep SOX was not working on this
root@bt:/tmp# file file1.txt 
file1.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

#then I opened it in gedit and saved it once again with Character Encoding as "Current Locale UTF-8" and then it started working.
root@bt:/tmp# gedit file1.txt 
root@bt:/tmp# file file1.txt 
file1.txt: ASCII text, with very long lines, with CRLF line terminators

file2.txt has just one single line??

--ahamed

Ok, I think ahamed's observation is true. The files are not true ASCII text files.

So I downloaded both files in my Windows machine and used Cygwin Bash to prod around. Here's what I see:

$
$
$ # print the first line of file1.txt
$
$ head -1 file1.txt
c h r _ n a m e        c h r _ s t a r t       c h r _ e n d   r e f _ b a s e         a l t _ b a s e         h o m _ h e t   s n p _ q u a l i t y   t o t _ d e p t h
$
$
$

The first character looks quite unusual, and there's a space between each character e.g. "c" + space + "h" + space + "r" instead of "chr".
The octal dump of the first line shows this:

$
$ # octal dump of the first line of file1.txt
$
$ head -1 file1.txt | od -bc
0000000 377 376 143 000 150 000 162 000 137 000 156 000 141 000 155 000
        377 376   c  \0   h  \0   r  \0   _  \0   n  \0   a  \0   m  \0
0000020 145 000 011 000 143 000 150 000 162 000 137 000 163 000 164 000
          e  \0  \t  \0   c  \0   h  \0   r  \0   _  \0   s  \0   t  \0
0000040 141 000 162 000 164 000 011 000 143 000 150 000 162 000 137 000
          a  \0   r  \0   t  \0  \t  \0   c  \0   h  \0   r  \0   _  \0
0000060 145 000 156 000 144 000 011 000 162 000 145 000 146 000 137 000
          e  \0   n  \0   d  \0  \t  \0   r  \0   e  \0   f  \0   _  \0
0000100 142 000 141 000 163 000 145 000 011 000 141 000 154 000 164 000
          b  \0   a  \0   s  \0   e  \0  \t  \0   a  \0   l  \0   t  \0
0000120 137 000 142 000 141 000 163 000 145 000 011 000 150 000 157 000
          _  \0   b  \0   a  \0   s  \0   e  \0  \t  \0   h  \0   o  \0
0000140 155 000 137 000 150 000 145 000 164 000 011 000 163 000 156 000
          m  \0   _  \0   h  \0   e  \0   t  \0  \t  \0   s  \0   n  \0
0000160 160 000 137 000 161 000 165 000 141 000 154 000 151 000 164 000
          p  \0   _  \0   q  \0   u  \0   a  \0   l  \0   i  \0   t  \0
0000200 171 000 011 000 164 000 157 000 164 000 137 000 144 000 145 000
          y  \0  \t  \0   t  \0   o  \0   t  \0   _  \0   d  \0   e  \0
0000220 160 000 164 000 150 000 011 000 141 000 154 000 164 000 137 000
          p  \0   t  \0   h  \0  \t  \0   a  \0   l  \0   t  \0   _  \0
0000240 144 000 145 000 160 000 164 000 150 000 011 000 144 000 142 000
          d  \0   e  \0   p  \0   t  \0   h  \0  \t  \0   d  \0   b  \0
0000260 123 000 116 000 120 000 011 000 144 000 142 000 123 000 116 000
          S  \0   N  \0   P  \0  \t  \0   d  \0   b  \0   S  \0   N  \0
0000300 120 000 061 000 063 000 061 000 011 000 162 000 145 000 147 000
          P  \0   1  \0   3  \0   1  \0  \t  \0   r  \0   e  \0   g  \0
0000320 151 000 157 000 156 000 011 000 147 000 145 000 156 000 145 000
          i  \0   o  \0   n  \0  \t  \0   g  \0   e  \0   n  \0   e  \0
0000340 011 000 143 000 150 000 141 000 156 000 147 000 145 000 011 000
         \t  \0   c  \0   h  \0   a  \0   n  \0   g  \0   e  \0  \t  \0
0000360 141 000 156 000 156 000 157 000 164 000 141 000 164 000 151 000
          a  \0   n  \0   n  \0   o  \0   t  \0   a  \0   t  \0   i  \0
0000400 157 000 156 000 011 000 144 000 142 000 123 000 116 000 120 000
          o  \0   n  \0  \t  \0   d  \0   b  \0   S  \0   N  \0   P  \0
0000420 061 000 063 000 062 000 011 000 061 000 060 000 060 000 060 000
          1  \0   3  \0   2  \0  \t  \0   1  \0   0  \0   0  \0   0  \0
0000440 147 000 145 000 156 000 157 000 155 000 145 000 163 000 011 000
          g  \0   e  \0   n  \0   o  \0   m  \0   e  \0   s  \0  \t  \0
0000460 141 000 154 000 154 000 145 000 154 000 145 000 040 000 146 000
          a  \0   l  \0   l  \0   e  \0   l  \0   e  \0      \0   f  \0
0000500 162 000 145 000 161 000 015 000 012
          r  \0   e  \0   q  \0  \r  \0  \n
0000511
$
$

So the first two characters correspond to octal numbers 377 and 376, i.e. decimal 255 and 254; that is the UTF-16 little-endian byte-order mark (BOM). Also, there's the character corresponding to number 0 i.e. chr(0) after each character. It is seen as "\0" in the octal dump above.

The newline or End-of-Line (EOL) character should be "\r\n" for Windows and "\n" for Unix/Linux. (It was "\r" for classic Mac OS and is "\n" for Mac OS X.) Because this file is UTF-16, every character, including the EOL, is interleaved with NUL bytes, so tools like awk and Perl that read it as single-byte text never see a plain "\n", which confuses them.
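One quick way to see which EOL style a file actually uses is the file command plus od -c; the two sample files below are made up for illustration:

```shell
# Two tiny samples with different line terminators
printf 'gene\tWASH7P\r\n' > crlf_sample.txt   # Windows-style CRLF ending
printf 'gene\tWASH7P\n'   > lf_sample.txt     # Unix-style LF ending

# `file` names the terminator style; `od -c` shows the raw bytes
file crlf_sample.txt lf_sample.txt
od -c crlf_sample.txt
```

In the od output the CRLF file ends in \r \n while the LF file ends in \n alone, which is exactly the difference the scripts in this thread are tripping over.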

The other file - "file2.txt" appears to have "\r" characters as EOL.

$
$
$ # does "file2.txt" have any "\n" characters?
$
$ cat file2.txt | perl -lne '$count = s/\n//g; print "Number of \\n characters = $count"'
Number of \n characters =
$
$
$ # does "file2.txt" have any "\r" characters?
$
$ cat file2.txt | perl -lne '$count = s/\r//g; print "Number of \\r characters = $count"'
Number of \r characters = 13421
$
$

The "\r" character is the "Carriage Return" character (from the good ol' days of the typewriter); when printed to a terminal it moves the cursor back to the start of the line, so subsequent text overwrites what was already printed. That is why the file looks like "one single line". The octal dump shows the difference clearly.

$
$
$ # what's the first occurrence of "\r" in file2.txt?
$
$ perl -lne 'print index($_, "\r")' file2.txt
162
$
$ # and the second?
$
$ perl -lne 'print index($_, "\r", 163)' file2.txt
264
$
$ # print the first 160 characters of file2.txt
$
$ perl -lne 'print substr($_,0,160)' file2.txt
chr_name        chr_start       chr_end ref_base        alt_base        hom_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131        region  ge
$
$ # looks good, but print the first 200 characters now
$
$ perl -lne 'print substr($_,0,200)' file2.txt
chr01ame14930   14930tarA       Ghr_end het_base137     65t_base33      som_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131        region  geallele freq
$
$
$ # doesn't look good because "\r" started overwriting the characters printed already
$ # od -bc shows it better; notice the "\r" below
$
$ perl -lne 'print substr($_,0,200)' file2.txt | od -bc
0000000 143 150 162 137 156 141 155 145 011 143 150 162 137 163 164 141
          c   h   r   _   n   a   m   e  \t   c   h   r   _   s   t   a
0000020 162 164 011 143 150 162 137 145 156 144 011 162 145 146 137 142
          r   t  \t   c   h   r   _   e   n   d  \t   r   e   f   _   b
0000040 141 163 145 011 141 154 164 137 142 141 163 145 011 150 157 155
          a   s   e  \t   a   l   t   _   b   a   s   e  \t   h   o   m
0000060 137 150 145 164 011 163 156 160 137 161 165 141 154 151 164 171
          _   h   e   t  \t   s   n   p   _   q   u   a   l   i   t   y
0000100 011 164 157 164 137 144 145 160 164 150 011 141 154 164 137 144
         \t   t   o   t   _   d   e   p   t   h  \t   a   l   t   _   d
0000120 145 160 164 150 011 144 142 123 116 120 011 144 142 123 116 120
          e   p   t   h  \t   d   b   S   N   P  \t   d   b   S   N   P
0000140 061 063 061 011 162 145 147 151 157 156 011 147 145 156 145 011
          1   3   1  \t   r   e   g   i   o   n  \t   g   e   n   e  \t
0000160 143 150 141 156 147 145 011 141 156 156 157 164 141 164 151 157
          c   h   a   n   g   e  \t   a   n   n   o   t   a   t   i   o
0000200 156 011 144 142 123 116 120 061 063 062 011 061 060 060 060 147
          n  \t   d   b   S   N   P   1   3   2  \t   1   0   0   0   g
0000220 145 156 157 155 145 163 011 141 154 154 145 154 145 040 146 162
          e   n   o   m   e   s  \t   a   l   l   e   l   e       f   r
0000240 145 161 015 143 150 162 060 061 011 061 064 071 063 060 011 061
          e   q  \r   c   h   r   0   1  \t   1   4   9   3   0  \t   1
0000260 064 071 063 060 011 101 011 107 011 150 145 164 011 061 063 067
          4   9   3   0  \t   A  \t   G  \t   h   e   t  \t   1   3   7
0000300 011 066 065 011 063 063 011 163 012
         \t   6   5  \t   3   3  \t   s  \n
0000311
$
$
$

Now if you are working with "file2.txt" in Mac OS, then you'd want to use MacPerl for processing, and I'd assume it takes care of EOL characters. I have no experience with any Mac system though, so don't quote me on that.

On the other hand, if you want to work in RedHat Linux, then you may want to ensure that the EOL characters are "\n" only, before running any of those shell scripts.

You mentioned that those files exist as ".xlsx" files, i.e. MS Excel 2007 or higher. In that case, saving them as "tab delimited files" should be pretty straightforward.

tyler_durden


Hi Ahamed and tyler_durden

I think you are right that there is definitely a problem with the files.

Thank you for doing that analysis - way over my beginner unix head!

I performed the following:

$ grep -c chr01 file2.txt 
1
$ grep -c chr01 file1.txt 
0

And there should be ~7000 in each file...

So this has turned into a much worse problem than I thought. I'm guessing that all of the scripts described above in this thread will work properly once I work out how to correctly turn my .xlsx files into tab-delimited text or .csv without all of the data being joined into one line.

So the examples I provided (file1.txt and file2.txt) are only about 10% of the size of the actual files (I couldn't upload bigger files). I generated these modified tab-delimited files using Microsoft Excel and using the Save As feature. I have no idea why it would write all of the data into one line.

I cannot Save As an .xls or .xml (which would then be easy to convert to a .txt or .csv) because there are too many lines (~90,000) in the original file. I also cannot open the .xlsx on RedHat because it only recognises it as a zip file!

I feel stuck :wall:

You may want to clean up the files you posted and then try the scripts on them.

I do hope you have Perl in your RedHat Linux box.
If not, then the following suggestion won't work.
Otherwise, do this:

(1) Download the files "file1.txt" and "file2.txt" you attached in your post, to your RedHat Linux system. Put them in a new/freshly created directory.

(2) Back them up first, using the following commands, which will create copies of the two files:

cp file1.txt file1.txt.orig
cp file2.txt file2.txt.orig

(3) Now clean up file1.txt using the following commands:

perl -lne 'BEGIN {$x = chr(0); $y=chr(254); $z=chr(255)} s/$x//g; s/$y//g; s/$z//g; s/\r//g; print' file1.txt > file1.txt.new
mv file1.txt.new file1.txt

The Perl one-liner shown above strips off all characters with codes 0, 254 and 255, and then removes all "\r" characters as well. Hopefully those were the only offending characters. The output is redirected to "file1.txt.new", which is then renamed back to "file1.txt".

So, we should be left with "file1.txt" that has the Linux EOL character "\n" and no non-printable character.
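If iconv is available, an equivalent cleanup is to decode the UTF-16 explicitly rather than strip individual bytes. This is a sketch on a made-up stand-in file; for the real file1.txt you would substitute that filename (iconv with -f UTF-16 reads the byte-order mark to pick the endianness and consumes it):

```shell
# Build a stand-in for file1.txt: a UTF-16LE file with a BOM and CRLF endings
printf '\377\376' > demo16.txt                                    # FF FE = UTF-16LE BOM
printf 'chr01\tWASH7P\r\n' | iconv -f UTF-8 -t UTF-16LE >> demo16.txt

# Decode to UTF-8 (the BOM is consumed), then drop the carriage returns
iconv -f UTF-16 -t UTF-8 demo16.txt | tr -d '\r' > demo16.clean.txt

grep -c WASH7P demo16.clean.txt
```

The result is a plain single-byte text file with Linux "\n" endings, which grep/awk/Perl can then search normally.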

(4) Next, clean up file2.txt using the following command:

perl -plne 's/\r/\n/g' file2.txt > file2.txt.new
mv file2.txt.new file2.txt

This one simply replaces every "\r" character with "\n", the EOL character for Linux.
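The same Mac-style ("\r"-only) cleanup can also be done with tr; here on a made-up stand-in for file2.txt:

```shell
# Stand-in for file2.txt: one physical line with \r-only separators
printf 'chr01\tWASH7P\rchr01\tNCRNA00115\r' > demo_cr.txt

# Translate every carriage return into a newline
tr '\r' '\n' < demo_cr.txt > demo_cr.clean.txt

grep -c chr01 demo_cr.clean.txt
```

After the translation the two records sit on two real lines, so line-oriented tools see them separately.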

If everything has worked fine till now, then you should be left with the following files in your directory:

file1.txt  <== cleansed file
file1.txt.orig <== original corrupted file
file2.txt <== cleansed file
file2.txt.orig <== original corrupted file

You may now want to go back and create the shell scripts posted earlier and test those. They will process the files "file1.txt" and "file2.txt" and create a new file in the current directory.

tyler_durden
