Problem in extraction when space is a field delimiter

I have more than 1000 files to parse. Each file contains a few lines (the number of lines varies) followed by a header line with all the column names (SPOT, NAME, etc.) and then the values for those columns.
Example file:

sdgafh
dfhaadfha
sfgaf dhah jkthdj
SPOT  NAME    GENE_NAME  CH_MEAN  CHDN_MED  CH2B_MEAN
1     IYPR1   abc        1.5      3         4.5
2     IYPR9   def        3.6                6.3
3     IYPR11  ghi        2.6      4         2.8
4     IYPR13  jkl        1.6                6.7
5     IYPR19  mno        2.5      7         4.3

. . . . . .
. . . . . .
Problems:

1) I need to remove all the lines before the header line (for each file).
2) I need to extract two columns, say the 2nd (NAME) and the 5th (CHDN_MED). Since there are no values in the 5th column for the 2nd and 4th rows, the output file should have a blank for each of those rows. But I'm getting 6.3 and 6.7 respectively, taken from the 6th column (I am trying with AWK; see the small example below).
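
For illustration only (a made-up line, not my real data): with awk's default whitespace splitting, the blank CHDN_MED cell simply collapses away and the CH2B_MEAN value slides into $5:

printf '2  IYPR9  def  3.6      6.3\n' | awk '{print $2, $5}'
# prints "IYPR9 6.3" - but 6.3 is really the 6th column (CH2B_MEAN), not CHDN_MED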

The desired output file is:

IYPR1   3
IYPR9
IYPR11  4
IYPR13
IYPR19  7
.       .
.       .

Note: the actual files have a .xls extension, and each file has more than 50 columns and 9,000 rows.

Please advise

Thanks

Does that mean the files are in Excel format?

This script handles one of your files. You may change the columns to print out (here: 2 and 5 -> RTP="2 5").
I assume that the number of columns is usually six and only column 5 can be empty. If not, we need to enhance the case structure.

You need to run the script with the filename as the first parameter:

    ./script.sh file.txt
#! /bin/bash

# Columns to print (1-based): 2 = NAME, 5 = CHDN_MED
RTP="2 5"

while read -r -a A ; do
    # A line with only 5 fields means column 5 (CHDN_MED) is empty:
    # shift the last value into slot 6 and blank out slot 5.
    case ${#A[@]} in
        5) A[5]=${A[4]} ; A[4]="" ;;
        *) : ;;
    esac
    L=""
    for n in ${RTP} ; do
        L="${L} ${A[n-1]}"      # bash arrays are 0-based
    done
    echo "${L}"
done < <(awk '/SPOT/,/EOF/ {print}' "${1}")   # from the SPOT header line to the end of the file (/EOF/ is not expected to match)

Assuming the format is a flat text file:

awk '
f{printf("%s%s\n", $2, NF>5 ? "\t" $5 : "")}   # after the header: print NAME, and CHDN_MED only if all 6 fields are present
/^SPOT/{f=1}                                   # set the flag once the header line is seen
' file

Franklin52

Thanks guys for the quick replies.
@Franklin52 : yes, all the input files are Excel files. The script you provided works fine with a .txt file but not with an Excel file, I'm afraid.

@elbrand : any value in any column can be empty.

You could save the files as CSV files to process them.

Regards

I simply created an artificial .txt file with the same content I gave in my first post, and your script works fine on it. But when I convert a .xls file into a .txt or .csv file, it doesn't work; again it takes values from other columns. This is what I did:

mv 10042.xls 10042.csv
OR
mv 10042.xls 10042.txt

Also, the above command handles only one file. I need a way to run it on all the files.

Giving the file another name won't work; the file must be saved as a CSV file with Excel.

After processing, you can save the file back in Excel format with Excel.
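
Once the file really is comma-separated, empty cells stay empty fields, so something along these lines should pick out the two columns reliably (just a sketch, assuming the exported file is called 10042.csv):

awk -F',' '
f{print $2 "\t" $5}    # NAME and CHDN_MED; an empty cell stays an empty field
/^SPOT/{f=1}           # start printing on the lines after the header
' 10042.csv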

@Franklin52 :
Sorry for being such a pain here but I really don't know how to do that for more than 1000 files.
I am using Fedora 8.

Maybe you can use Perl to do the job.

I hope someone with knowledge of Perl can put you in the right direction.

Regards

Guys, I am still in a real fix. All I want is to extract columns from the Excel files as they are.
If there are blanks in a column, they should be in the output.

If you want to read an Excel sheet and print cell values, you can try the Perl solution below (read_excel.pl).

input file - myxl.xls

field 1  field 2  field 3
test1    1        6
test2    2
test3    3        7
test4    4
test5    5        8
#!/usr/bin/perl

use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $sourcename   = shift @ARGV;
my $source_excel = Spreadsheet::ParseExcel->new();

my $source_book = $source_excel->Parse($sourcename)
    or die "Could not open source Excel file $sourcename: $!";

foreach my $source_sheet_number (0 .. $source_book->{SheetCount} - 1) {
    my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];

    print "--------- SHEET:", $source_sheet->{Name}, "\n";

    # Skip sheets with no usable rows or columns.
    next unless defined $source_sheet->{MaxRow};
    next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
    next unless defined $source_sheet->{MaxCol};
    next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};

    # Start at row 1 to skip the header row.
    foreach my $row_index (1 .. $source_sheet->{MaxRow}) {
        foreach my $col_index ($source_sheet->{MinCol} .. $source_sheet->{MaxCol}) {
            # Only print the 1st and 3rd columns (0-based indices 0 and 2).
            if ($col_index == 0 or $col_index == 2) {
                my $source_cell = $source_sheet->{Cells}[$row_index][$col_index];
                if ($source_cell) {
                    #print "( $row_index , $col_index ) =>", $source_cell->Value, "\n";
                    print $source_cell->Value, "\t";
                } # end of source_cell check
            }
        } # foreach col_index
        print "\n";
    } # foreach row_index
} # foreach source_sheet_number

invocation

perl read_excel.pl myxl.xls

output

--------- SHEET:Sheet1
test1   6
test2
test3   7
test4
test5   8
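
Since you have more than 1000 files, you could wrap the script in a small shell loop along these lines (just a sketch; it assumes you want one .txt output file per .xls input):

for f in *.xls ; do
    perl read_excel.pl "$f" > "${f%.xls}.txt"
done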