Want to extract certain lines from big file

I don't think I can do better than what I have already offered. With an input file of three similar records with different transnums, it extracts the desired one:

transnum=ABC160120XYZ0983920
sed -n "/~${transnum}~/{H;g;}; /~${transnum}~/,/EOT/p; h" file
##PAYMNT,ABCDEFGH,        ,        ,TEST01                  ,0000004308,0000004216,1104      ,000000, ,00110,USD,   ,T,TESTST008                    ,2016-01-18T09:30:47                ,pain.001.001.03                    ,00000000000000001200,00000000000000018.00
%%YEDTRN~0000004646~ABC160120XYZ0983920~20160120_085131~20160120_085021~20160120_085021_41222370                              ~20160120_084728_15401168                     ~20160120_084728~0000004644~          ~TEST01                       ~pain.001.001.03                    ~U ~0.02               ~C~FWT       ~          ~SFTS                           ~99999999801                        ~WireCmpIDa                         ~021000018                          ~99999998799002                     ~20020101~Payee Name 1104                    ~TstTrceNbr1104                     ~PR ~USD~   ~   ~Y~US~PmtGrpWire0b                       ~OOXXMXM                            ~TestWirePay002                     ~Y~01~HARDCOPY  ~pain.001.001.03                    ~TESTST008                    ~2016-01-18T09:30:47                ~00000000000000000000000000000000001200~0000000000000018.00~N~                                                                                                    ~00~               ~ ~                                   ~                                   ~   ~               ~                                   ~ ~          ~                                                                                                    ~00~               ~                                                                                                    ~00~ ~               ~               ~                                   ~                                   ~   ~               ~ 
0000ISA00          00          ZZABCDETEST01   ZZABCDEFGHI      1601200849U005010000043080T 
0000GSRA   201601200849    000004216X 005010      
0000ST<>820<>1104<>PmtGrpWire0b<>
0010BPR<>U<>0000000000000000.02C<>FWT<><>01<>043000261<>DA<>99999999801<>WireCmpIDa<><>01<>021000018<>DA<>99999998799002<>20020101<><><><><>
0010TRN<>1<>TstTrceNbr1104<><>OOXXMXM<>
0010CUR<>PR<>USD<>00000000000<><><><>00000000        <>00000000        <>00000000        <>00000000        <>00000000        
0010REF<>TN<>TestPay002<><><><><><><><>
0020N1<>O2<>Test Initiating Party<><><><><>
0020N1<>O1<>Test Debtor Bank<>13<>043000261<><><>
0020N4<><><><>US<><><><>
0020N1<>PR<>Debtor Pyr Nm 0a<><><><><>
0020N3<>Payer Address 00a Line1<><>
0020N3<>Payer Address 00a Line2<><>
0020N4<>Payer City<>PA<>12345<>US<><><><>
0020N1<>BK<>Test Payee Bank 002<><><><><>
0020N1<>PE<>Payee Name 1104<><><><><>
0020N3<>Payee Address 002 Line1<><>
0020N3<>Payee Address 002 Line2<><>
0020N4<>Payee Town 002<>PA<>12345<>US<><><><>
0000SE<>00000000001104<>
0000GE<>000001000004216
0000IEA<>00001000004216
0000EOT<><>000000000000019000000000000000000000000000019<><><>

What else can I do?

Hi RudiC

Let me try this sed in your last post.

Thanks.

You have already been told that shell variables are not expanded inside single quotes! This is true in any shell script. It doesn't matter whether the single quoted string is a sed script inside a shell script or an awk script inside a shell script.
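As a small illustration of the quoting point above, the sketch below runs the same match three ways. The sample line is a shortened, made-up version of the %%YEDTRN record from this thread, and the variable names (single, double, viaawk) are only for the demonstration.

```shell
#!/bin/sh
# Hypothetical sample value and a shortened record in the thread's layout.
transnum=ABC160120XYZ0983920
line='%%YEDTRN~0000004646~ABC160120XYZ0983920~rest'

# Single quotes: sed receives the literal text ${transnum}, so nothing matches.
single=$(printf '%s\n' "$line" | sed -n '/~${transnum}~/p')

# Double quotes: the shell substitutes the value before sed ever runs.
double=$(printf '%s\n' "$line" | sed -n "/~${transnum}~/p")

# awk alternative: keep the program single-quoted and pass the value with -v.
viaawk=$(printf '%s\n' "$line" | awk -F '~' -v tn="$transnum" '$3 == tn')

printf 'single quotes: [%s]\n' "$single"
printf 'double quotes: [%s]\n' "$double"
printf 'awk -v:        [%s]\n' "$viaawk"
```

Only the double-quoted sed and the awk -v form should print the record; the single-quoted sed prints nothing.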

What is in the file named /tmp/transnum ? As stated in my post describing this script, that file must contain a list of one or more transaction numbers to be extracted, with one transaction number per line. IF YOU DO NOT PUT THE TRANSACTION NUMBERS YOU WANT TO EXTRACT IN THAT FILE, MY SCRIPT CANNOT WORK! There is nothing shown in your script that puts any data in /tmp/transnum .

What am I missing? Why is 23962395676 important as the last part of your output filename before the transaction number? (We know this is not a transaction number because you have told us that transaction numbers are 19 characters, not 12.) And the code I provided already included the transaction number as the last part of the output file's pathname. If what you want is the 2nd input file's pathname followed by an underscore followed by the transaction number, just change every occurrence of:

"TX:" transnum

in the script I posted in post #14 in this thread to:

FILENAME "_" transnum

And since the end of a transaction is NOT 0000EOT as you repeatedly told us, it is no wonder that the scripts that have been provided to you do not work. Since the end of a transaction is a line like:

0000EOT<><>000000000000019000000000000000000000000000019<><><>

instead of the exact line:

0000EOT

that you described before, you also need to change the line in my script:

$1 == "0000EOT" {

to:

/^0000EOT/ {

And if the transaction number you're trying to extract is ABC160120XYZ0983921, note that this transaction number does not appear in your latest sample input, so there would be no output. The transaction number has also changed position from where you said it was (from following the 1st tilde to following the 2nd tilde), so you also need to change the line in my script:

$1 == "%%YEDTRN" && $2 in t {

to:

$1 == "%%YEDTRN" && $3 in t {
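To see why both changes matter, here is a minimal sketch run against two shortened, made-up records in the layout shown in this thread. With -F '~' the trailer line contains no tilde, so $1 is the whole line and can never equal the bare string 0000EOT.

```shell
#!/bin/sh
# Two shortened, made-up records; the real lines are far longer.
sample='%%YEDTRN~0000004646~ABC160120XYZ0983920~20160120_085131
0000EOT<><>000000000000019<><><>'

fields=$(printf '%s\n' "$sample" | awk -F '~' '
$1 == "%%YEDTRN" { print "field2=" $2, "field3=" $3 }
$1 == "0000EOT"  { print "exact match hit" }   # never true for this trailer
/^0000EOT/       { print "prefix match hit" }  # true: line starts with 0000EOT
')
printf '%s\n' "$fields"
```

The transaction number shows up as field 3, and only the prefix match fires on the trailer line.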

This is a classic case of what computer scientists refer to as GIGO (Garbage In, Garbage Out). If the specification of the input data does not match the input data provided for processing, please don't blame the scripts that we suggested! You HAVE to give us representative samples of the data you are processing if you need our help in writing your code!

If you don't fully understand how awk is processing this script, you might also want to keep the comments I provided instead of throwing them away. :frowning:

@RudiC,

Thanks for your efforts in trying to help me. But the sed command you provided is not working on my version of Unix (AIX).
Thanks.

I'm sorry to hear that. Did you try any of the suggestions given on an (artificially) simplified data sample?

Hi Don,

Sorry. Since I am new to using websites like this, I am afraid to share a bank's transaction input structure on a public website, as it might get me into trouble. I also apologize for the faulty input. I am learning step by step.

I am going to try out your new suggestions and will update you in 15 minutes.

Thanks.

---------- Post updated at 05:40 PM ---------- Previous update was at 05:01 PM ----------

Hi Don

I am getting the error below after making the changes you suggested.

awk: Cannot divide by zero.

 The input line number is 32042. The file is /tmp/remedixz.20160120_085021_41222370_1.
 The source line number is 18.

The line 32042 is the EOT line of the particular transaction reference number. Please find the code below

big_file='/tmp/remedixz.20160120_085021_41222370_1'
trannum="/tmp/transnum"

/tmp> cat /tmp/transnum
ABC160120XYZ0983921 

##In the above you can see the transnum given

awk -F '~' '
    FNR == NR {
      t[$1]
      tc = FNR
      next
      } 
      {
      l[++lc] = $0
      }
    $1 == "%%YEDTRN" && $3 in t {
        remove t[transnum = $2]
        tc--
    }

    /^0000EOT/ {
        if(transnum) {
            for(i = 1; i <= lc; i++)
                print l[i] > (/tmp/remedixz.20160120_085021_41222370_1_new "_" transnum)
            close(/tmp/remedixz.20160120_085021_41222370_1_new "_" transnum)
            printf("Transaction #%s extracted to file /tmp/remedixz.20160120_085021_41222370_1_new "_" transnum:%s\n", transnum,
                transnum)
        }
        if(tc) {
            lc = 0
            transnum = ""
        } else {
            exit
        }
    }' $trannum $file

This time I gave the output file name directly rather than through a variable.
Kindly let me know what I am missing.

Thanks.

Try this adaptation of RudiC's suggestion and Don's adaptation for proper shell quoting on AIX:

sed -n "
/~$transnum~/ {
H
g
}
/~$transnum~/,/EOT/p
h
" file
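For anyone following along, here is a sketch of that sed script on a made-up miniature of the data (two records, short lines). The hold space always carries the previous line (h), so when the transaction number matches, H and g glue the preceding ##PAYMNT header in front of the matching line before the range prints through to the EOT line. The sample data and the variable name result are assumptions for the demonstration.

```shell
#!/bin/sh
# Made-up miniature of the thread's data: two records, each ending in an EOT line.
transnum=TN2
sample='##PAYMNT,A
%%YEDTRN~1~TN1~x
0000EOT<><>1
##PAYMNT,B
%%YEDTRN~2~TN2~x
0010BPR<>U
0000EOT<><>2'

# Same script as above, fed from a pipe instead of a file.
result=$(printf '%s\n' "$sample" | sed -n "
/~$transnum~/ {
H
g
}
/~$transnum~/,/EOT/p
h
")
printf '%s\n' "$result"
```

Only the second record should be printed, header line included.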

---

Not so much 2047 bytes: in most implementations the limit is much higher or unlimited, and for some there is a much lower limit that is unrelated to LINE_MAX, as I think we worked out before here: http://www.unix.com/shell-programming-and-scripting/259884-sequence-extraction.html#post302951349

hi Don

One more request: I want to give my output file name through a variable rather than directly as a file name. Please suggest how to do that.

Thanks.

---------- Post updated at 05:51 PM ---------- Previous update was at 05:43 PM ----------

Dear Scrutinizer,

Thanks a lot, this time sed worked.

It gave me the desired output.

Thanks.

---------- Post updated at 05:54 PM ---------- Previous update was at 05:51 PM ----------

Hi,

This message is intended for everyone who replied to help me out.
Hats off to you for your efforts. I also request each of you to suggest a link to learning material you consider good, so that I can learn at least the basics of sed and awk.

Thanks.

[..]
You are welcome! Please note that I updated my post and added the tildes to the search string ( ~$transnum~ ), which had fallen off before. This should make it a bit more accurate, as Don also suggested earlier...

Hi,
The same as the sed, but in awk:

awk  "/~$transnum~/{\$0=X\"\n\"\$0};/~$transnum~/,/EOT/;{X=\$0}" file

Regards.

Realize that I have been up all night trying to help you (and it is now almost 6AM where I am), so I may not be thinking clearly. But, could you please explain why you chose to change the code I suggested:

			print l[i] > (FILENAME "_" transnum)

to:

                print l[i] > (/tmp/remedixz.20160120_085021_41222370_1_new "_" transnum)

FILENAME is an awk variable holding the name of the current input file. But, /tmp/remedixz.20160120_085021_41222370_1_new is an attempt to divide nothing by the contents of the variable tmp divided by contents of the variable remedixz followed by a syntax error. And since neither tmp nor remedixz have been defined in this awk script, both are treated as a division by zero.
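If the goal is to supply the output pathname from a shell variable, one common approach (a sketch, not the only way) is to hand it to awk with the standard -v option, so that inside awk it is an ordinary string variable. The directory, the names out and outfile, and the sample value TN1 below are all made up for the illustration. A pathname written directly in the awk program would instead need double quotes around it to be a string.

```shell
#!/bin/sh
# Hypothetical scratch directory and output name, for illustration only.
dir=$(mktemp -d)
outfile="$dir/extract_new"

# -v copies the shell value into the awk variable "out"; concatenating it
# with "_" and the (made-up) transaction number TN1 yields the pathname.
printf 'a\nb\n' | awk -v out="$outfile" -v transnum=TN1 '
{ print > (out "_" transnum) }
END { close(out "_" transnum) }'

cat "${outfile}_TN1"
```

Note that awk interprets backslash escapes in -v values, so this sketch assumes the pathname contains no backslashes.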

Would you PLEASE just try the following script without changing it:

#!/bin/ksh
big_file='/tmp/remedixz.20160120_085021_41222370_1'
transnums='/tmp/transnum'

awk -F '~' '
FNR == NR {
	# Gather transaction numbers...
	t[$1]
	tc = FNR
	next
}
{	# Gather transaction lines.
	l[++lc] = $0
}
$1 == "%%YEDTRN" && $3 in t {
	# We have found a transaction number for a transaction that is to be
	# extracted.  Save the transaction number and remove this transaction
	# from the transaction list.
	delete t[transnum = $2]
	file = FILENAME "_" transnum
	tc--
}
/^0000EOT/ {
	# If we have a transaction that is to be printed, print it.
	if(transnum) {
		# Print the transaction.
		for(i = 1; i <= lc; i++)
			print l[i] > file
		close(file)
		printf("Transaction #%s extracted to file %s\n", transnum, file)
		# Was this the last remaining transaction to be extracted?
		if(tc) {# No.  Reset for next transaction.
			lc = 0
			transnum = ""
		} else {# Yes.  Exit.
			exit
		}
	}
}' "$transnums" "$big_file"

Note that this has a few changes to match your latest description of your transaction format, has a typo fixed, and has some minor performance improvements. It also now includes your filenames (which had not been provided before).

If /tmp/transnum contains the single line:

ABC160120XYZ0983921

and there is a transaction in your big transaction file with that transaction number, it should produce a file named /tmp/remedixz.20160120_085021_41222370_1_ABC160120XYZ0983921 containing that transaction. And, as stated before, if /tmp/transnum contains multiple transaction numbers on separate lines, one invocation of this script will produce an output file for each transaction given.

If this all works, you could also add an END clause to print a list of any transaction numbers that were specified in your transaction numbers file that were not found in your big transactions file.
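A minimal sketch of what such an END clause might look like follows. It is shown on its own, without the extraction logic, and uses tiny made-up files in a scratch directory in place of /tmp/transnum and the big transaction file. The message wording is an assumption.

```shell
#!/bin/sh
# Hypothetical scratch files standing in for /tmp/transnum and the big file.
dir=$(mktemp -d)
printf 'TN2\nTN9\n' > "$dir/transnum"
cat > "$dir/big" <<'EOF'
##PAYMNT,A
%%YEDTRN~1~TN1~x
0000EOT<><>1
##PAYMNT,B
%%YEDTRN~2~TN2~x
0000EOT<><>2
EOF

missing=$(awk -F '~' '
FNR == NR { t[$1]; next }                      # requested transaction numbers
$1 == "%%YEDTRN" && $3 in t { delete t[$3] }   # seen: drop it from the list
END {
	# Whatever is still in t[] was requested but never found.
	for (n in t) print "Transaction #" n " not found in " FILENAME
}' "$dir/transnum" "$dir/big")
printf '%s\n' "$missing"
```

Because an exit in the main rules still runs END, the same clause can be appended to the full script: when every requested transaction has been found, t[] is empty and nothing is printed.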

If I understand you correctly, the command could have been:

export t="ABC160120XYZ0983920"; perl -ne '/^##transaction\b/ and @t=(); if (/^##transaction\b/ .. /EOT$/){push @t, $_; $f = 1 if /$ENV{t}/;  if (/EOT$/ && $f){print @t; last}}' mad_man.example

Here's another script that reads a file with one transaction number per line and writes one output file per match, with the transaction number as the filename suffix.

#!/usr/bin/env perl

use strict;
use warnings;

my $trans = shift || die "No search parameters file given\n";
my $haystack = shift || die "Missing data file\n";

my %trans = ();
my @transaction = ();

open my $fh, '<', $trans or die "open $trans: $!\n";
while(<$fh>){
    chomp;
    $trans{$_} = $_;
}
close $fh;

open $fh, '<', $haystack or die "Could not open $haystack: $!\n";
while(<$fh>){
    if(/^##transaction\b/ .. /EOT$/){
        push @transaction, $_;
        if(/EOT$/){
            process_tran();
            @transaction = ();
        }
    }
}
close $fh;

sub process_tran {
    for my $k (keys %trans){
        my $yes = grep /$k/, @transaction;
        if($yes){
            write_tran ("$haystack.$k", \@transaction);
            delete $trans{$k};
            last;
        }
    }
}

sub write_tran {
    my ($save_tran, $tran_ref) =  @_;
    open my $wfh, '>', $save_tran
        or die "Could not write to $save_tran: $!\n";
    print $wfh @{ $tran_ref };
    close $wfh;
}

Save as mad_man.pl
Run as perl mad_man.pl trans_numbers data_with_trans

Or chmod +x mad_man.pl
/path/to/mad_man.pl /path/to/trans_numbers /path/to/data_with_trans
It will save in /path/to/data_with_trans.<number>

Hi, the sed code which Scrutinizer posted worked for a transaction set that is actually 3455 characters long.

Thanks

---------- Post updated at 12:48 PM ---------- Previous update was at 12:36 PM ----------

Hi

I am going to try all of your new suggestions today and reply you back.

Thanks.

I sincerely apologize. In each case, the output file you got had a filename derived from the 2nd field (i.e., the data between the 1st and 2nd tildes which seems to be a constant for the transactions you selected to print) in a line that contained a transaction number you wanted to print, and the contents of that file was the transactions starting with the transaction after the next to the last transaction number you requested in the big input file through the last transaction number you requested from the big input file.

It comes from me not getting nearly enough sleep, you not providing sample data that matched the actual format of your data, and from me not getting nearly enough sleep. (There were three problems and I'm blaming two of them on not getting enough sleep.) Now that I have cleaned up my test data to match what I believe is your current data format, the following seems to work. Please try this replacement:

#!/bin/ksh
big_file='/tmp/remedixz.20160120_085021_41222370_1'
trannum='/tmp/transnum'

awk -F '~' '
FNR == NR {
	# Gather transaction numbers...
	t[$1]
	tc = FNR
	next
}
{	# Gather transaction lines.
	l[++lc] = $0
}
$1 == "%%YEDTRN" && $3 in t {
	# We have found a transaction number for a transaction that is to be
	# extracted.  Save the transaction number and remove this transaction
	# from the transaction list.
	delete t[transnum = $3]
	file = FILENAME "_" transnum
	tc--
}
/^0000EOT/ {
	# If we have a transaction that is to be printed, print it.
	if(transnum) {
		# Print the transaction.
		for(i = 1; i <= lc; i++)
			print l[i] > file
		close(file)
		printf("Transaction #%s extracted to file %s\n", transnum, file)
		# Did we just print the last transaction requested?
		if(!tc)	{
			# Yes.  We are done.
			exit
		}
		# No.  Clear found transaction number.
		transnum = ""
	}
	# Reset for next transaction.
	lc = 0
}' "$trannum" "$big_file"

Hopefully, this will do what you want.

As stated before, if someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

Hi Don,

Thanks, this worked as expected. It wrote all 3 transactions to separate files. I now want to change the code so that all three transactions are written to a single file. Could you please help me?

Thanks.

I would be happy to help you!

So, exactly what pathname should this single output file have?

What good is this file going to be given that the script that will be reading this file can only handle a single transaction?

Looking at the awk script I provided, what do you think should be changed to produce a single output file instead of one output file per transaction?

My guess would be that one line needs to be removed and one line needs to be changed. And it might make sense (as a minor optimization) to move that changed line from its current location into a BEGIN clause or an FNR==1 clause, depending on whether the desired output file pathname is a constant or is a modification of the second input file's pathname.

Hi Don,

I just changed

delete t[transnum = $3]

to

delete t[transnum = 123456]

and

print l[i] > file

to

print l[i] >> file

Now it writes all the transactions into the same output file:

/tmp/remedixz.20160120_085021_41222370_1_123456

I will make the (hopefully not too wild) guess from this that the pathname of the output file you want is the pathname of the input file with the string _123456 appended.

The variable transnum in that awk script is intended to be the transaction number of the transaction that is being copied from the input file to the output file. And, since your transaction numbers are 19 character alphanumeric strings (not six digit decimal strings), setting transnum = 123456 is NOT appropriate.

Changing the:

print l[i] > file

to:

print l[i] >> file

means that instead of creating a new output file each time you run this script, it will append all of the transactions requested on the latest run to the output produced on any earlier runs. This would not seem to be a desirable side effect.
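The difference between the two redirections can be seen in a small experiment. This is a sketch using a made-up scratch directory and file names a, b and c. In awk, > truncates a file only when it is opened, so repeated prints to a file that stays open accumulate, while closing and reopening with > starts over. >> never truncates, even across separate runs.

```shell
#!/bin/sh
dir=$(mktemp -d)   # hypothetical scratch directory

# ">" with the file left open: the second print is appended.
awk -v f="$dir/a" 'BEGIN { print "one" > f; print "two" > f }'

# ">" after close(): reopening truncates, so only the last print survives.
awk -v f="$dir/b" 'BEGIN { print "one" > f; close(f); print "two" > f }'

# ">>" in two separate runs: output from the earlier run is kept.
awk -v f="$dir/c" 'BEGIN { print "one" >> f }'
awk -v f="$dir/c" 'BEGIN { print "two" >> f }'

for n in a b c; do
	echo "file $n:"
	cat "$dir/$n"
done
```

File a should hold both lines, file b only the second, and file c both lines across the two runs.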

Please undo the changes you made and make the following changes instead:
First, change the line:

	file = FILENAME "_" transnum

to:

	file = FILENAME "_123456"

and, second, delete the line:

		close(file)

With these changes, the transaction number printed when a transaction is copied to the output file will again be printed correctly and a single output file will be produced each time the script is run (and will contain only the transactions extracted on that execution of the script). Later executions of the script will replace the contents of that file (if it still exists from an earlier run) or create that file (if it had been removed).
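Put together, the single-output-file variant might look like the sketch below. It is a condensed copy of the script above with those two changes applied, run against tiny made-up files in a scratch directory instead of the real /tmp paths. The progress message is left out for brevity.

```shell
#!/bin/sh
# Hypothetical scratch files standing in for /tmp/transnum and the big file.
dir=$(mktemp -d)
printf 'TN1\nTN3\n' > "$dir/transnum"
cat > "$dir/big" <<'EOF'
##PAYMNT,A
%%YEDTRN~1~TN1~x
0000EOT<><>1
##PAYMNT,B
%%YEDTRN~2~TN2~x
0000EOT<><>2
##PAYMNT,C
%%YEDTRN~3~TN3~x
0000EOT<><>3
EOF

awk -F '~' '
FNR == NR { t[$1]; tc = FNR; next }   # gather requested transaction numbers
{ l[++lc] = $0 }                      # gather lines of the current transaction
$1 == "%%YEDTRN" && $3 in t {
	delete t[transnum = $3]
	file = FILENAME "_123456"         # one fixed output name for every match
	tc--
}
/^0000EOT/ {
	if (transnum) {
		for (i = 1; i <= lc; i++)
			print l[i] > file         # no close(file): later prints append
		if (!tc) exit
		transnum = ""
	}
	lc = 0
}' "$dir/transnum" "$dir/big"

cat "$dir/big_123456"
```

With the sample above, the output file should contain the TN1 and TN3 records and skip TN2. Running the script again replaces the file rather than growing it.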