Extract string from multiple files based on line count

Hi,

I searched the whole forum, but I cannot find a solution to my problem :frowning:
I have multiple files (5000 files) containing data like this:

FILE 1:

 1195.921  -898.995 0.750312E-02-0.497526E-02 0.195382E-05 0.609417E-05
-2021.287  1305.479-0.819754E-02 0.107572E-01 0.313018E-05 0.885066E-05
    85.928 -1529.405 0.990965E-02 0.224840E-02-0.157472E-04 0.581893E-05

FILE 2:

  1228.633  -924.264 0.174728E-02-0.961339E-03-0.594874E-06 0.177402E-05
 -1988.820  1279.863-0.177465E-02 0.219633E-02 0.309343E-06 0.251814E-05
    118.121 -1554.893 0.216157E-02 0.612947E-03-0.354522E-05 0.183121E-05

FILE 3:

  1195.921  -898.995 0.750312E-02-0.497526E-02 0.195382E-05 0.609417E-05
 -2021.287  1305.479-0.819754E-02 0.107572E-01 0.313018E-05 0.885066E-05
     85.928 -1529.405 0.990965E-02 0.224840E-02-0.157472E-04 0.581893E-05

I need to extract this data so that the result, in another single file (call it result.txt), will be:

 0.750312E-02  0.174728E-02  0.750312E-02
-0.819754E-02 -0.177465E-02 -0.819754E-02
 0.990965E-02  0.216157E-02  0.990965E-02

I am truly a newbie with awk/perl.
Any help is very much appreciated.

Gunk.

From your file 1:

Is the 3rd column really 0.750312E-02-0.497526E-02?
Or is there a space missing, so it should be 0.750312E-02 -0.497526E-02?

I see this is true of all your files, but the joined columns change between rows. Is this correct?

The file is just as it is.
Indeed there is no space; the numbers simply run together.

The original data starts at the same line number in every file (when I check in Notepad).
For example,
the resulting 1st column should come from lines 20 to 33 of the original file 1,
and the 2nd column likewise from lines 20 to 33 of file 2,
and so on.. it is the same for all files.

Need help please.

OK, your data is a mess; the - carries different meanings, and that is what makes the trouble.

Use the command below to fix it first (if your sed supports the -i option):

sed -i 's/\([0-9]\)-/\1 -/g' file*

In this case, file1 will be converted to:

1195.921 -898.995 0.750312E-02 -0.497526E-02 0.195382E-05 0.609417E-05
-2021.287 1305.479 -0.819754E-02 0.107572E-01 0.313018E-05 0.885066E-05
85.928 -1529.405 0.990965E-02 0.224840E-02 -0.157472E-04 0.581893E-05

Then run the awk command below to get the result (a[FNR] collects the 3rd field of line FNR from each file; FNR restarts at 1 for every input file, so each output row gathers one value per file):

awk '{a[FNR]=a[FNR]?a[FNR] FS $3:$3}END{for (i=1;i in a;i++) print a[i]}' file*

0.750312E-02 0.174728E-02 0.750312E-02
-0.819754E-02 -0.177465E-02 -0.819754E-02
0.990965E-02 0.216157E-02 0.990965E-02

Of course, if your sed doesn't support the -i option, I will give you a solution later.:cool:
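
(Until then, a portable fallback is to write to a temp file and move it back over the original; an untested sketch for a POSIX shell:)

# same substitution as above, without relying on sed -i
for f in file*
do
    sed 's/\([0-9]\)-/\1 -/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done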

You can just fix that in the awk script. Fix the record and assign the result to $0. That will force recalculation of NF and reassignment to each field variable.

Regards,
Alister


The simplest way to do that is probably:

gsub(/E-/, "E"); gsub(/-/, " -"); gsub(/E/, "E-")
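
Dropped into the collecting one-liner from above, that would be something like this (an untested sketch; the gsub() calls modify $0, which forces awk to re-split the fields):

awk '{gsub(/E-/,"E"); gsub(/-/," -"); gsub(/E/,"E-")
a[FNR]=a[FNR]?a[FNR] FS $3:$3}END{for (i=1;i in a;i++) print a[i]}' file*

And if only part of each file matters (lines 20 to 33 were mentioned earlier), a condition such as FNR>=20 && FNR<=33 in front of the first block would restrict it.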

If using gawk, then I suppose gensub with backreferences can manage it in one stroke, like the sed suggestion above.

Regards,
Alister

Could this help you?

 perl -nle 'if(/(\-\d+|\d+)(\.|\d+)(\d+E-\d+)(-|\s+)/) {$i=$ARGV eq $prev?++$i:1;if(exists $hash{$i}){$hash{$i}=$hash{$i}." ".$1.$2.$3}else{$hash{$.}=$1.$2.$3;}$prev=$ARGV} END{print $hash{$_} foreach (sort(keys(%hash)))}' files*

I modified rdcwayx's code a little bit, to reformat the input data before processing it.

awk '{gsub(/-/," -");a[FNR]=a[FNR]?a[FNR] FS $3:$3}END{for (i=1;i in a;i++) print a[i]}' FILE* # With gsub()

awk '{$0=gensub(/([0-9])-/,"\\1 -","g")
a[FNR]=a[FNR]?a[FNR] FS $3:$3}END{for (i=1;i in a;i++) print a[i]}' FILE* # More precise, with a gensub() backreference regex.

Hope this helps,

Regards

That won't work because of the "-" used by the exponential notation: gsub(/-/," -") turns 0.750312E-02 into 0.750312E -02. See my post above for a gsub() sequence that should work.

Regards,
Alister

Right!

Thanks for your observation, Alister; I had not considered the "E-[0-9][0-9]" part.
The option would be:

awk '{$0=gensub(/([0-9])-/,"\\1 -","g");a[FNR]=a[FNR]?a[FNR] FS $3:$3}END{for (i=1;i in a;i++) print a[i]}' FILE*

or a little bit shorter:
awk '{$0=gensub(/([0-9])-/,"\\1 -","g");a[FNR]=a[FNR]?a[FNR] FS $3:$3}END{for(i in a) print a[i]}' FILE*

Regards

This will lose the original sequence; you can't shorten it that way, because the order in which for (i in a) visits the keys is unspecified in awk.

That's why I use the code below:

for (i=1;i in a;i++)
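
A toy demonstration of the difference (nothing beyond plain awk assumed): this loop walks the keys 1, 2, 3, ... in order and stops at the first missing one, whereas for (i in a) may visit them in any order.

awk 'BEGIN{a[1]="first"; a[2]="second"; a[3]="third"
for (i=1; i in a; i++) print a[i]}'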

@ rdcwayx : Thanks a million. Your solution works perfectly.

@ alister, cgkmal : Thanks for showing other ways. Now I should learn more. Thank you.

@ pravin27 : Your script gives inconsistent results from line 2 downward. Thanks anyway for offering perl as an option.

THANK YOU VERY MUCH TO ALL OF YOU.

Thanks, I learned something new :b:

Regards

use strict;
use warnings;

my @arr;
my @files = glob("*.txt");
foreach my $f (@files) {
    open my $fh, '<', $f or die "Cannot open $f: $!";
    while (<$fh>) {
        s/^\s+//;    # strip leading blanks so the field indexes are the same on every line
        # split on whitespace, or at a digit immediately followed by a minus sign
        my @tmp = split /(?:(?<=[0-9])(?=-)|\s+)/;
        $arr[$.] = defined $arr[$.] ? "$arr[$.] $tmp[2]" : $tmp[2];
    }
    close $fh;       # closing the handle also resets $. for the next file
}
for my $i (1 .. $#arr) {
    print $arr[$i], "\n";
}
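
(To try it, save it under any name — say collect.pl, a made-up name — in the directory with the files and run: perl collect.pl > result.txt)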

@cgkmal
@rdcwayx
@alister

I just wanted to point out a potentially unexpected behaviour:

If one of the collected $3 fields in the scanned files has the value 0,

the code:

a[FNR]=a[FNR]?a[FNR] FS $3:$3

will behave the wrong way:

a[FNR]? will be evaluated as "false", so a[FNR] will be reset to $3 and the previously concatenated values may be lost, which doesn't fit what we are looking for ...
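
A minimal demonstration of the pitfall (a toy sketch, not from the files above):

awk 'BEGIN{v=0; v = v ? v FS "next" : "next"; print v}'

This prints just "next": the 0 tests false, so the value already collected is silently replaced.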


Thanks, it can be changed to:

a[FNR]=(a[FNR]=="")?$3:a[FNR] FS $3

Nice catch.
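
With that change, the whole pipeline from earlier would read something like this (untested sketch):

sed -i 's/\([0-9]\)-/\1 -/g' file*
awk '{a[FNR]=(a[FNR]=="")?$3:a[FNR] FS $3}END{for (i=1;i in a;i++) print a[i]}' file*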


Alternatively, you can ensure that the contents of a[FNR] are never evaluated as a numeric by casting its first assignment to a string.

a[FNR]=a[FNR]?a[FNR] FS $3:$3""

The string conversion happens implicitly in the ternary's false branch: concatenating $3 with the empty string yields a string value.
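
A quick way to see the effect (toy sketch):

awk 'BEGIN{s = 0 ""; print (s ? "kept" : "lost")}'

The concatenation turns the number 0 into the string "0", and a non-empty string tests true, so this prints "kept".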

Regards,
Alister
