Problem in extraction when space is a field delimiter

I have more than 1000 files to parse. Each file contains a few lines (the number of lines varies) followed by a header line with all the column names (SPOT, NAME, etc.) and then the values for those columns.
Example file:

sdgafh
dfhaadfha
sfgaf dhah jkthdj
SPOT  NAME    GENE_NAME  CH_MEAN  CHDN_MED  CH2B_MEAN
1     IYPR1   abc        1.5      3         4.5
2     IYPR9   def        3.6                6.3
3     IYPR11  ghi        2.6      4         2.8
4     IYPR13  jkl        1.6                6.7
5     IYPR19  mno        2.5      7         4.3

. . . . . .
. . . . . .
Problems:

1) I need to remove all the lines before the header line (for each file).
2) I need to extract two columns, say the 2nd (NAME) and the 5th (CHDN_MED). Since there are no values in the 5th column for the 2nd and 4th rows, the output file should have a blank for each of those rows. But I'm getting 6.3 and 6.7 respectively, taken from the 6th column (I am trying with AWK; see the small example below).
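
For illustration only (a made-up line, not my real data): with awk's default whitespace splitting, the blank CHDN_MED cell simply collapses away and the CH2B_MEAN value slides into $5:

printf '2  IYPR9  def  3.6      6.3\n' | awk '{print $2, $5}'
# prints "IYPR9 6.3" - but 6.3 is really the 6th column (CH2B_MEAN), not CHDN_MED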

The desired output file is:

IYPR1   3
IYPR9
IYPR11  4
IYPR13
IYPR19  7
.       .
.       .

Note: the actual files have a .xls extension, and each file has more than 50 columns and 9,000 rows.

Please advise

Thanks

Does that mean the files are in Excel format?

This script handles one of your files. You may change the columns to print out (here: 2 and 5 -> RTP="2 5").
I assume that the number of columns is usually six and only column 5 can be empty. If not, we need to enhance the case structure.

You need to run the script with the filename as the first parameter:

    ./script.sh file.txt
#! /bin/bash

# Columns to print (1-based): 2 = NAME, 5 = CHDN_MED
RTP="2 5"

while read -r -a A ; do
    # A line with only 5 fields means column 5 (CHDN_MED) is empty:
    # shift the last value into slot 6 and blank out slot 5.
    case ${#A[@]} in
        5) A[5]=${A[4]} ; A[4]="" ;;
        *) : ;;
    esac
    L=""
    for n in ${RTP} ; do
        L="${L} ${A[n-1]}"      # bash arrays are 0-based
    done
    echo "${L}"
done < <(awk '/SPOT/,/EOF/ {print}' "${1}")   # from the SPOT header line to the end of the file (/EOF/ is not expected to match)

Assuming the format is a flat text file:

awk '
f{printf("%s%s\n", $2, NF>5 ? "\t" $5 : "")}   # after the header: print NAME, and CHDN_MED only if all 6 fields are present
/^SPOT/{f=1}                                   # set the flag once the header line is seen
' file

Franklin52

Thanks guys for the quick replies.
@Franklin52 : yes, all the input files are Excel files. The script you provided works fine with a .txt file but not with an Excel file, I'm afraid.

@elbrand : any value in any column can be empty.

You could save the files as CSV files to process them.

Regards

I simply created an artificial .txt file with the same content I gave in my first post, and your script works fine on it. But when I convert a .xls file into a .txt or .csv file, it doesn't work; again it takes values from other columns. This is what I did:

mv 10042.xls 10042.csv
OR
mv 10042.xls 10042.txt

Also, the above command handles only one file. I need a way to run it on all the files.

Giving the file another name won't work; the file must be saved as a CSV file with Excel.

After processing, you can save the file back in Excel format with Excel.
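
Once the file really is comma-separated, empty cells stay empty fields, so something along these lines should pick out the two columns reliably (just a sketch, assuming the exported file is called 10042.csv):

awk -F',' '
f{print $2 "\t" $5}    # NAME and CHDN_MED; an empty cell stays an empty field
/^SPOT/{f=1}           # start printing on the lines after the header
' 10042.csv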

@Franklin52 :
Sorry for being such a pain here but I really don't know how to do that for more than 1000 files.
I am using Fedora 8.

Maybe you can use Perl to do the job.

I hope someone with knowledge of Perl can put you in the right direction.

Regards

Guys, I am still in a real fix. All I want is to extract columns from the Excel files as they are.
If there are blanks in a column, they should be in the output.

If you want to read an Excel sheet and print cell values, you can try the Perl solution below (read_excel.pl).

input file - myxl.xls

field 1  field 2  field 3
test1    1        6
test2    2
test3    3        7
test4    4
test5    5        8
#!/usr/bin/perl

use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $sourcename   = shift @ARGV;
my $source_excel = Spreadsheet::ParseExcel->new();

my $source_book = $source_excel->Parse($sourcename)
    or die "Could not open source Excel file $sourcename: $!";

foreach my $source_sheet_number (0 .. $source_book->{SheetCount} - 1) {
    my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];

    print "--------- SHEET:", $source_sheet->{Name}, "\n";

    # Skip sheets with no usable rows or columns.
    next unless defined $source_sheet->{MaxRow};
    next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
    next unless defined $source_sheet->{MaxCol};
    next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};

    # Start at row 1 to skip the header row.
    foreach my $row_index (1 .. $source_sheet->{MaxRow}) {
        foreach my $col_index ($source_sheet->{MinCol} .. $source_sheet->{MaxCol}) {
            # Only print the 1st and 3rd columns (0-based indices 0 and 2).
            if ($col_index == 0 or $col_index == 2) {
                my $source_cell = $source_sheet->{Cells}[$row_index][$col_index];
                if ($source_cell) {
                    #print "( $row_index , $col_index ) =>", $source_cell->Value, "\n";
                    print $source_cell->Value, "\t";
                } # end of source_cell check
            }
        } # foreach col_index
        print "\n";
    } # foreach row_index
} # foreach source_sheet_number

invocation

perl read_excel.pl myxl.xls

output

--------- SHEET:Sheet1
test1   6
test2
test3   7
test4
test5   8
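
Since you have more than 1000 files, you could wrap the script in a small shell loop along these lines (just a sketch; it assumes you want one .txt output file per .xls input):

for f in *.xls ; do
    perl read_excel.pl "$f" > "${f%.xls}.txt"
done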