Pattern Matching in a Huge File

senthil.ak · February 10, 2011, 4:29am

Hi Experts,
I have an issue with a huge file.
My requirement is to check the value at positions 155-156 of each line and, if it matches 31 or 39, route that line to a separate new file.
The main file has around 1459328 lines and is 2 GB in size. I tried the code below, which takes around 2 hours to execute.

while read line
do
    record_type=`echo "$line" | cut -c 155-156`
    if [ "$record_type" -eq 31 ] ; then
        print "$line" >> ./31.txt
    elif [ "$record_type" -eq 39 ] ; then
        print "$line" >> ./39.txt
    fi
done < LOAD.txt

I then modified this to use awk, which still takes more than 30 minutes, and the counts do not agree.

 
awk '/839I/ {print $0}' LOAD.txt > record_39.txt &
awk '/831I/ {print $0}' LOAD.txt > record_31.txt &
cat LOAD.txt | cut -c 155-156 > smp.log
grep -c '31' smp.log
 1182483
wc -l record_31.txt
 1182495 record_31.txt

I also tried this:

 
awk '$5 ~ 39{print $0;}' LOAD.txt

but $5 does not always fall at positions 155-156.
Sample records:

14115726     0000000000         00000000000000000000000000000000000000000000000000000000                                                      000         00I201
06485726     0000000000         00000000000000000000000000000000000000000000000000000000                                                      000        805I201
18005726ABCUS0000005726         01002080000000000000000000000000000000000000000000000000370291010381009    20090218                           000 I      839I201
18005726ABCUS0000005726         08009100000000000000000000000000000000000000000000000000370290173421008    20101203                           000I       839I201
18005726ABCUS0000005726         00000020000000000000000000000000000000000000000000000000370282295281006    20060706                           000C       831I201
18005726ABCUS0000005726         01002080000000000000000000000000000000000000000000000000370282010171003    20090216                           000 I      831I201

Is there another way to get the correct results?

Thanks
Senthil.
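
A small sketch of why `$5` does not line up with a fixed column: awk splits on runs of whitespace, so a record with `000 I` has one more field than a record with `000I`, even though the record type sits at the same column in both. The two lines below are shortened versions of the sample records (an assumption for readability), so the type is at columns 41-42 here rather than 155-156.

```shell
tmp=$(mktemp -d)

# Two shortened records of equal length; only the "000 I" / "000I" part differs.
printf '%s\n' \
  '18005726ABCUS 0100 20090218 000 I      839I201' \
  '18005726ABCUS 0800 20101203 000I       839I201' > "$tmp/toy.txt"

# Field count and $5 differ between the two lines.
awk '{print NF, $5}' "$tmp/toy.txt"

# A fixed-column lookup gives the same answer for both lines.
awk '{print substr($0, 41, 2)}' "$tmp/toy.txt"

nf1=$(awk 'NR == 1 {print NF}' "$tmp/toy.txt")
nf2=$(awk 'NR == 2 {print NF}' "$tmp/toy.txt")
types=$(awk '{print substr($0, 41, 2)}' "$tmp/toy.txt" | sort -u)
rm -rf "$tmp"
```

For fixed-width data like this, selecting by character position is generally more reliable than selecting by field number.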

ctsgnb · February 10, 2011, 4:56am

Do you get the same result when running

cat LOAD.txt | cut -c 155-156 > smp.log
grep -c '31' smp.log

and

cat LOAD.txt | cut -c 154-157 > smp.log
grep -c '831I' smp.log

???
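
One possible reason for the higher count from awk, which the thread itself does not confirm: `/831I/` matches the string anywhere on the line, while `cut -c` looks only at fixed columns. A minimal sketch with made-up toy records, where the type is at columns 5-7 instead of 154-157:

```shell
tmp=$(mktemp -d)

# The third line contains "31I" early in the record but not at columns 5-7.
printf '%s\n' 'AAAA31I201' 'AAAA39I201' 'A31IA00I201' > "$tmp/toy.txt"

# Unanchored pattern: counts every line containing "31I" anywhere.
anywhere=$(awk '/31I/ {n++} END {print n+0}' "$tmp/toy.txt")

# Positional test: counts only lines with "31I" at columns 5-7.
positional=$(awk 'substr($0, 5, 3) == "31I" {n++} END {print n+0}' "$tmp/toy.txt")

echo "anywhere=$anywhere positional=$positional"
rm -rf "$tmp"
```

If the two counts differ on the real file, the extra lines are ones where the pattern occurs outside the record-type columns.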

Franklin52 · February 10, 2011, 4:58am

awk '{p=substr($0,155,2)} p ~ "3[19]" {print > p ".txt"}' file
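
To see what the one-liner above does, here is a sketch on toy data. It assumes shortened records with the type at columns 5-6, writes into a temporary directory (the `dir` variable is only for this illustration), and anchors the pattern as `^3[19]$` so that only exactly 31 or 39 is accepted.

```shell
tmp=$(mktemp -d)

# Four toy records; the last one has blanks before the type,
# which does not matter because substr() counts characters, not fields.
printf '%s\n' 'AAAA31I201' 'AAAA39I201' 'AAAA05I201' 'AA  31I201' > "$tmp/toy.txt"

# p holds the 2 characters starting at column 5.
# Matching lines are written to <p>.txt in a single pass over the input.
awk -v dir="$tmp" '
  { p = substr($0, 5, 2) }
  p ~ /^3[19]$/ { print > (dir "/" p ".txt") }
' "$tmp/toy.txt"

n31=$(wc -l < "$tmp/31.txt")
n39=$(wc -l < "$tmp/39.txt")
echo "31.txt: $n31 lines, 39.txt: $n39 lines"
rm -rf "$tmp"
```

The input is read once and each line is handled inside awk, without starting `echo` and `cut` for every line as the shell loop does.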

birei · February 10, 2011, 5:03am

Hi,

Try the following Perl script:

$ perl -ne 'BEGIN { open $f31, ">", "31.txt" or die $!; open $f39, ">", "39.txt" or die $!; } ($a) = unpack "x154 A2", $_; if ($a == 31) { print $f31 $_; } elsif ($a == 39) { print $f39 $_; }' infile

Regards,
Birei

Scrutinizer · February 10, 2011, 5:07am

Perhaps this will go faster:

grep '^.\{154\}31' infile > 31.txt

To just count the records:

grep -c '^.\{154\}31' infile

Likewise for 39.

senthil.ak · February 10, 2011, 5:35am

@ctsgnb

 
cat L*.txt | cut -c 155-156 > smp.log
grep -c '31' smp.log
1182483
grep -c '39' smp.log
32855
cat L*.txt | cut -c 154-157 > smp.log 
grep -c '831I' smp.log
1182483
grep -c '839I' smp.log
32855

@Franklin52 - Many thanks, this deserves a party.

 
time awk '{p=substr($0,155,2)} p ~ "3[19]" {print > p ".txt"}' LOAD.txt &
real    1m50.57s
user    0m23.54s
sys     0m44.26s
wc -l 39.txt 31.txt
   32855 39.txt
 1182483 31.txt

@Birei - I'm sorry, I don't have Perl on this box, so I can't try it.
@Scrutinizer - Could you please explain the command a little? I'm too much of a beginner to follow an expert-level command.

 
time grep '^.\{154\}39' LOAD.txt > 39.txt &
real    2m43.49s
user    0m18.01s
sys     0m17.35s
wc -l 39.txt
32855 39.txt

rdcwayx · February 10, 2011, 5:46am

grep "83[19]I...$" LOAD.txt

senthil.ak · February 10, 2011, 6:16am

The timings are fine now; earlier it took me around an hour, but with your command it takes only minutes.

I like to run the command in the background, since it is not disturbed if I leave it running.

One small request: could you please explain the command you used?

 
grep '^.\{154\}39' LOAD.txt > 39.txt

Scrutinizer · February 10, 2011, 6:24am

Hi, you can of course run your commands in the background, but to determine the fastest solution accurately, the speed tests need to be run in the foreground on a preferably quiet system (or rather with sufficient priority).

My command means: select lines that start ( ^ ) with 154 arbitrary characters ( . ), followed by 39.
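
The same pattern can be tried on a small scale. This sketch uses toy records with 4 leading characters instead of 154, which is an assumption made only to keep the lines readable.

```shell
tmp=$(mktemp -d)

# The second line contains "39" early in the record, but not at columns 5-6.
printf '%s\n' 'AAAA39I201' 'A39AA31I201' 'AAAA31I201' > "$tmp/toy.txt"

# ^        start of the line
# .\{4\}   exactly 4 characters of any kind (154 in the original command)
# 39       the literal text that must come immediately after them
grep '^.\{4\}39' "$tmp/toy.txt"

count=$(grep -c '^.\{4\}39' "$tmp/toy.txt")
rm -rf "$tmp"
```

Only `AAAA39I201` is printed, because the pattern is tied to the start of the line and so ignores a 39 at any other position.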