Want to extract certain lines from a big file

mad_man · January 22, 2016, 10:25am

Hi All,

I am trying to extract some lines from a file. I did it with a while-do loop, but since the files are huge it takes a long time. Now I want to make it faster.

The file will have 1 million lines.
The format is like below.

##transaction, , , ,blah, blah
%%blah~trannum~blah~blah~blah
0000content01
0001content02
.
.
0010contentnn
0000EOT
##transaction, , , ,blah, blah
%%blah~trannum~blah~blah~blah
0000content01
0001content02
.
.
0010contentnn
0000EOT
##transaction, , , ,blah, blah
%%blah~transnum~blah~blah~blah
0000content01
0001content02
.
.
0010contentnn
0000EOT

What I know from the file is the transnum of a set. I want to copy everything from the ##transaction line through the next EOT line for that particular transnum.
I only want to copy one set from the file, because my process will know only one transnum.
So my output file will have only 10 to 15 lines (only 1 transaction).
Please help me. Thanks.

RudiC · January 22, 2016, 10:46am

Please use code tags as required by forum rules, and please show your attempts so far as well. Is the above the exact file structure? No empty lines? Is "EOT" exactly this string? Or a token/control char?

Don_Cragun · January 22, 2016, 10:09pm

In addition to what RudiC already said, if every transaction contains the literal line:

%%blah~transnum~blah~blah~blah

how do you know which transnum set you want? We might guess that blah isn't literal and we might guess that transnum isn't literal and that transnum is different in each set, but you haven't given us enough information to make a reasonable guess at a BRE that will match the transnum set you want.

Please give us:

1. some more realistic sample data,
2. a description of any file(s) that your script is expected to read,
3. a description of any file(s) that your script is expected to write,
4. a description of any arguments you intend to pass to your script,
5. the operating system and shell you're using, and
6. the exact output you want your script to produce with the sample data provided in #1 above and the sample arguments provided in #4 above.

(And, don't forget to use CODE tags when showing us your sample input, sample output, and your attempts at writing a script to perform these tasks.)

mad_man · January 23, 2016, 8:29am

@RudiC

The above sample is the exact structure of my input file. This set of lines, from ##transaction to 0000EOT, will be repeated. There are no empty lines in between, and 0000EOT is the exact string. There are no other token/control characters in the file.
Thanks.

---------- Post updated at 06:59 PM ---------- Previous update was at 06:40 PM ----------

@Don:
The transnum will be read from another file. The extraction is one part of a long script. After extracting the transaction from the file, we process it, and the extraction step is what causes the performance issue. The transnum will be in a variable. While reading the file line by line, I cut out the "transnum" field using the tilde delimiter and then use an if condition to check whether it matches. If it matches, I copy the current line (the preceding line has already been saved in a variable) and the subsequent lines up to the next EOT into a new file.

Scrutinizer · January 23, 2016, 8:55am

Try something like:

awk '{p=p $0 RS} /EOT/{if(p~s)printf "%s",p; p=x}' s='~trannum~' file

mad_man · January 23, 2016, 11:04am

Just some added info:
I am using the AIX version of Unix.

---------- Post updated at 09:21 PM ---------- Previous update was at 08:42 PM ----------

@Don
Please give us:
some more realistic sample data,

##PAYMNT, , , ,blah, blah
%%YEDTRN~trannum~blah~blah~blah
0000content01
0001content02
.
.
0010contentnn
0000EOT

In my input sample above, the tags "##PAYMNT", "%%YEDTRN", and "0000EOT" are constant values; all the other values vary between transactions.

a description of any file(s) that your script is expected to read

The input file is a transaction details file (a flat file).

a description of any file(s) that your script is expected to write,

The output file should contain the one desired transaction record.

a description of any arguments you intend to pass to your script,

The argument to this part of the script is a transaction number value, which will be set in the script.

the operating system and shell you're using

AIX 6 and the Korn shell.

So kindly give me your suggestions.
Thanks

---------- Post updated at 09:34 PM ---------- Previous update was at 09:21 PM ----------

@Scrutinizer - Thanks for your reply. Please go through my explanation to Don above and kindly suggest any other possible commands. I will try them when I reach the office tomorrow.

Thanks

Don_Cragun · January 23, 2016, 9:30pm

It looks like Scrutinizer's suggestion should work just fine as long as:

trannum does not contain any characters that are special in an ERE, and
the number of bytes in a single transaction (from ##PAYMNT through 0000EOT) is not more than 2047 bytes.

So:

What is the format of trannum ? Is it all alphanumeric characters? (If it isn't all alphanumeric characters, what characters can be included in a trannum ?) How many characters are in a trannum ? (Is it always the same number of characters or does it vary? If it varies, what are the minimum and maximum number of characters in a trannum ?)
What is the maximum number of bytes (not characters; bytes) in a transaction? If that number is larger than 2047, what is the maximum number of bytes in a single line in a transaction? (As long as the number of bytes in a line, including the terminating <newline> character, is no larger than 2048 bytes, we can easily do that. If it is more than 2048 bytes, it takes more work to get what you want on AIX.)

I would do it slightly differently (to quit after the desired transaction is found):

awk '{p=p $0 RS} /EOT/{if(p~s){printf "%s",p;exit}else p=x}' s="~$trannum~" file

which should cut the time awk spends reading your large file about in half, on average.
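
For readers who want to see the buffer-and-quit idea spelled out, here is a minimal sketch of the same approach, written one statement per line. The sample data, the file name in `sample`, and the transaction number `BBB222` are made up for illustration; the sketch also assumes the end marker starts the line as `0000EOT` and uses `index()` for a literal substring test instead of a regular expression match. The per-transaction size caveat mentioned above would still apply to the buffer.

```shell
# Hypothetical sample file in the layout from the first post.
sample="${TMPDIR:-/tmp}/tx_sample.$$"
cat > "$sample" <<'EOF'
##transaction, , , ,blah, blah
%%blah~AAA111~blah~blah~blah
0000content01
0000EOT
##transaction, , , ,blah, blah
%%blah~BBB222~blah~blah~blah
0000content02
0000EOT
##transaction, , , ,blah, blah
%%blah~CCC333~blah~blah~blah
0000content03
0000EOT
EOF

trannum=BBB222   # hypothetical transaction number held in a shell variable

extracted=$(awk -v s="~$trannum~" '
    { buf = buf $0 "\n" }          # append every line to the current buffer
    /^0000EOT/ {                   # end of one transaction
        if (index(buf, s)) {       # literal substring test, no regex involved
            printf "%s", buf       # print the whole buffered transaction
            exit                   # stop reading the rest of the file
        }
        buf = ""                   # not the wanted one: start a fresh buffer
    }
' "$sample")

printf '%s\n' "$extracted"
rm -f "$sample"
```

With this sample, only the four lines of the `BBB222` transaction are printed, and awk never reads the third transaction.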

But, the way to make big gains here would be to search for and extract multiple transactions in a single pass through your large file. If you could, for example, extract 10 transactions at a time, you would only have to read the large file once instead of 10 times and you would only have to invoke awk once instead of 10 times; both of which would be big wins for performance.

Note that extracting 10 transactions at a time does not mean that the extracted transactions would all be saved in a single file; each transaction could easily be extracted into a separate file. And, 10 is just an example; an awk script could easily extract thousands of transactions into separate files in a single pass through your large transaction file increasing your script's processing speed immensely if your script is being used to process thousands of transactions. Note also that this is why we want details about what you are doing instead of vague statements about a tiny piece of the script you are writing. The more we know, the better chance we have of making a suggestion that will significantly improve your script.

Aia · January 23, 2016, 11:21pm

Extracting the first transaction only.

perl -ne 'print if /^##transaction\b/ .. /EOT$/; last if /EOT$/' mad_man.example

##transaction, , , ,blah, blah
%%blah~trannum~blah~blah~blah
0000content01
0001content02
.
.
0010contentnn
0000EOT

---------- Post updated at 09:21 PM ---------- Previous update was at 08:54 PM ----------

Extract the first two transactions:

perl -ne 'print if /^##transaction\b/ .. /EOT$/; /EOT$/ and ++$n; last if $n==2' mad_man.example

##transaction, , , ,blah, blah
%%blah~trannum~blah~blah~blah
0000content01
0001content02
.
.
0010contentnn
0000EOT
##transaction, , , ,blah, blah
%%blah~trannum~blah~blah~blah
0000content03
0001content04
.
.
0010contentnn
0000EOT

Extract only the second transaction:

perl -ne 'if(/^##transaction\b/ .. /EOT$/){ print if $n==1; /EOT$/ and ++$n }; last if $n==2' mad_man.example

##transaction, , , ,blah, blah
%%blah~trannum~blah~blah~blah
0000content03
0001content04
.
.
0010contentnn
0000EOT

Extracting any transaction by using a variable:

export t=2; perl -ne 'if(/^##transaction\b/ .. /EOT$/){ print if $n==$ENV{t}; /EOT$/ and ++$n }; last if $n==$ENV{t}+1' mad_man.example

##transaction, , , ,blah, blah
%%blah~transnum~blah~blah~blah
0000content05
0001content06
.
.
0010contentnn
0000EOT

Extract last transaction:

perl -ne '/^##transaction\b/ and @t=(); push @t, $_ if /^##transaction\b/ .. /EOT$/; END{print @t}' mad_man.example

##transaction, , , ,blah, blah
%%blah~transnum~blah~blah~blah
0000content05
0001content06
.
.
0010contentnn
0000EOT

mad_man · January 23, 2016, 11:48pm

Hi Don,

The transnum is alphanumeric, with no special characters, and it is always 19 characters long.

The maximum number of characters in a line is limited to 1600, and a transaction set can be up to roughly 4000 to 5000 characters.

---------- Post updated at 10:18 AM ---------- Previous update was at 10:11 AM ----------

Hi Aia,

Thanks, I am going to try all your solutions too. I will post an update here.

Thanks.

Aia · January 23, 2016, 11:49pm

Hello mad_man,

In your first post you mentioned:

However, all the transnum values are the same in your example. How would you choose a particular transnum? In what way are they different in your file?

mad_man · January 24, 2016, 3:22am

Hi Aia,

The transnum values are alphanumeric and unique for each transaction set.

Thanks.

---------- Post updated at 11:58 AM ---------- Previous update was at 11:36 AM ----------

Hi Aia,

The required transaction set is decided by the transaction reference number 'transnum', which I extract from another file. As I already explained to Don, this is a large script and the transaction extraction is one part of it. When the script reaches this section, a variable holds the transnum value, and I use it to pull out the particular transaction set. Let me know if you have any other questions.

Thanks

---------- Post updated at 01:44 PM ---------- Previous update was at 11:58 AM ----------

Hi Don,

Thanks for your command

awk '{p=p $0 RS} /EOT/{if(p~s){printf "%s",p;exit}else p=x}' s="$transnum" $file > $file_new

This worked the way I required. But when I discussed it with my superior, he said awk commands are not allowed by our onsite counterparts, since they cause issues when AIX is upgraded and we then have to fix them again. So any sed or perl equivalent to the above awk would be helpful for me. Kindly help me out.

Thanks.

---------- Post updated at 01:52 PM ---------- Previous update was at 01:44 PM ----------

Hi Aia,

Thanks for your command

export t=2; perl -ne 'if(/^##transaction\b/ .. /EOT$/){ print if $n==$ENV{t}; /EOT$/ and ++$n }; last if $n==$ENV{t}+1' mad_man.example

This is not working. I exported the value of transnum to the variable t, and the output file doesn't have the required output.

Please find below one of the existing inline perl commands we use. If you give me your command in the same format, it will be helpful.

/usr/local/perl/bin/perl -e '$record = $ENV{"record"};' -e '@fields=split(/~/,$record);' -e '$req_flag=uc $fields[29];' -e 'print "$req_flag\n";' > /tmp/$file_name

The above code reads a tilde-separated value from the exported variable record, takes element 29 of the split array (the 30th field, since Perl arrays are zero-based), upper-cases it, and writes it to a temp file. This is just sample code; I pasted it here to show you the style of our existing code. I now depend entirely on perl or sed, so kindly help me.

Thanks

RudiC · January 24, 2016, 3:40am

How about

sed -n '/~transnum~/ {H;g}; /~transnum~/,/EOT/p;h' file
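
As a reading aid, the one-liner above can be unpacked into one command per line with comments. This is only a sketch on made-up sample data: the file name in `sample` and the literal transaction number `BBB222` are hypothetical, and the number is written directly into the script here so that no shell quoting is involved.

```shell
# Hypothetical sample file in the layout from the first post.
sample="${TMPDIR:-/tmp}/tx_sample.$$"
cat > "$sample" <<'EOF'
##transaction, , , ,blah, blah
%%blah~AAA111~blah~blah~blah
0000content01
0000EOT
##transaction, , , ,blah, blah
%%blah~BBB222~blah~blah~blah
0000content02
0000EOT
##transaction, , , ,blah, blah
%%blah~CCC333~blah~blah~blah
0000content03
0000EOT
EOF

out=$(sed -n '
# On the line carrying the transaction number, append it to the hold space
# (which still holds the previous line, the ## header) and copy both back
# into the pattern space.
/~BBB222~/ {
    H
    g
}
# Print from the matching line through the next line containing EOT.
/~BBB222~/,/EOT/p
# Save the current pattern space so it is the "previous line" next cycle.
h
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

The `h` at the end of every cycle is what makes the `##` header line available when the `%%` line with the transaction number is reached one line later. Unlike the awk variants, this sketch keeps reading to the end of the file after the transaction has been printed.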

mad_man · January 24, 2016, 3:55am

Hi RudiC

I am getting a "cannot be parsed" error for this sed command.
Please find below how I used it.
transnum="ABC160120XYZ0983921"

sed -n '/"$transnum"/ {H;g}; /"$transnum"/,/EOT/p;h' $file > $file_new

Please suggest.

Don_Cragun · January 24, 2016, 4:08am

It sounds like I have wasted the last hour of my life trying to help you, but maybe this will help someone else. The following awk script only uses POSIX specified awk features and should work on any system (although you would need to change awk to /usr/xpg4/bin/awk or nawk if and only if you want to run this on a Solaris/SunOS system). It takes two files as inputs (which is what you said you had earlier). The first file (named trannums in this script) contains one or more lines with each line containing a transaction number to be extracted from your big file. The second file (named bigfile in this script) contains your big file containing transactions. It extracts each transaction listed in trannums into a separate output file with a name that is the string TX: followed by the transaction number:

#!/bin/ksh
awk -F '~' '
FNR == NR {
	# Gather transaction numbers...
	t[$1]
	tc = FNR
	next
}
{	# Gather transaction lines.
	l[++lc] = $0
}
$1 == "%%YEDTRN" && $2 in t {
	# We have found a transaction number for a transaction that is to be
	# extracted.  Save the transaction number and remove this transaction
	# from the remaining transaction list.
	delete t[transnum = $2]
	tc--
}
$1 == "0000EOT" {
	# If we have a transaction that is to be printed, print it.
	if(transnum) {
		# Print the transaction.
		for(i = 1; i <= lc; i++)
			print l[i] > ("TX:" transnum)
		close("TX:" transnum)
		printf("Transaction #%s extracted to file TX:%s\n", transnum,
		    transnum)
	}
	# Was this the last remaining transaction to be extracted?
	if(tc) {# No.  Reset for next transaction.
		lc = 0
		transnum = ""
	} else {# Yes.  Exit.
		exit
	}
}' trannums bigfile

Don_Cragun · January 24, 2016, 4:24am

mad man:

Hi RudiC

I am getting the error cannot be parsed. for this sed command.
Please find below how i used.
transnum="ABC160120XYZ0983921"
sed -n '/"$transnum"/ {H;g}; /"$transnum"/,/EOT/p;h' $file > $file_new
please suggest

Shell variables are not expanded within single quotes. You didn't correctly copy the script RudiC provided. And, this code depends on a feature that is not supported by some (standards-conforming) versions of sed . (I'm not sure if it will work without the semicolon I added on AIX or not; but it won't work on OS X without the semicolon I added.) Try:

sed -n "/~$transnum~/ {H;g;}; /~$transnum~/,/EOT/p;h" "$file" > "$file_new"
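
The quoting difference is easy to check without running sed at all. The sketch below just builds the script text both ways and prints it; the variable names `single` and `double` are only for illustration.

```shell
transnum="ABC160120XYZ0983921"

# Single quotes: the shell passes the text through untouched,
# so sed would search for the literal characters $transnum.
single='/~$transnum~/p'

# Double quotes: the shell substitutes the variable first,
# so sed would receive the actual transaction number.
double="/~$transnum~/p"

printf '%s\n' "$single"   # prints /~$transnum~/p
printf '%s\n' "$double"   # prints /~ABC160120XYZ0983921~/p
```

Printing the script text this way before handing it to sed is a quick check that the pattern sed receives is the one intended.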

mad_man · January 24, 2016, 4:55am

Hi Don,

Sorry for the inconvenience.

The code you posted last is not working for me. Please find below how I used it.

 
big_file='/tmp/remedixz.20160120_085021_41222370_1'
trannum="/tmp/transnum"
file_new="${big_file}_23962395676"
awk -F '~' '
FNR == NR {
	t[$1]
	tc = FNR
	next
}
{
	l[++lc] = $0
}
$1 == "%%YEDTRN" && $2 in t {
	delete t[transnum = $2]
	tc--
}
$1 == "0000EOT" {

	if(transnum) {
		for(i = 1; i <= lc; i++)
			print l[i] > ("$file_new:" transnum)
		close("$file_new:" transnum)
		printf("Transaction #%s extracted to file $file_new:%s\n", transnum,
		    transnum)
	}
	if(tc) {
		lc = 0
		transnum = ""
	} else {
		exit
	}
}' $trannum $big_file

---------- Post updated at 03:25 PM ---------- Previous update was at 03:13 PM ----------

Hi Don,

The sed command you modified and posted did not throw any error this time, but the output file is empty.

Thanks

RudiC · January 24, 2016, 5:09am

The sed script as provided was tested and worked for me, as is on Linux, with Don Cragun's semicolon inserted on FreeBSD.
So, how about

supplying meaningful samples of input data (as requested several times in this thread)?
trying to solve the issues yourself by some playing around (modifying and testing) with the solution offered?

mad_man · January 24, 2016, 5:20am

Hi RudiC

Will post the actual input here now in 5 mins

thanks

---------- Post updated at 03:50 PM ---------- Previous update was at 03:42 PM ----------

Hi RudiC,

Please find the actual input below. This is a single transaction; the set is repeated once per transaction in the file, so for 1000 transactions it is repeated 1000 times. The value ~ABC160120XYZ0983920~ is unique for each transaction.
Tags like ##PAYMNT, %%YEDTRN, and 0000EOT are constant for every transaction.

##PAYMNT,ABCDEFGH,        ,        ,TEST01                  ,0000004308,0000004216,1104      ,000000, ,00110,USD,   ,T,TESTST008                    ,2016-01-18T09:30:47                ,pain.001.001.03                    ,00000000000000001200,00000000000000018.00
%%YEDTRN~0000004646~ABC160120XYZ0983920~20160120_085131~20160120_085021~20160120_085021_41222370                              ~20160120_084728_15401168                     ~20160120_084728~0000004644~          ~TEST01                       ~pain.001.001.03                    ~U ~0.02               ~C~FWT       ~          ~SFTS                           ~99999999801                        ~WireCmpIDa                         ~021000018                          ~99999998799002                     ~20020101~Payee Name 1104                    ~TstTrceNbr1104                     ~PR ~USD~   ~   ~Y~US~PmtGrpWire0b                       ~OOXXMXM                            ~TestWirePay002                     ~Y~01~HARDCOPY  ~pain.001.001.03                    ~TESTST008                    ~2016-01-18T09:30:47                ~00000000000000000000000000000000001200~0000000000000018.00~N~                                                                                                    ~00~               ~ ~                                   ~                                   ~   ~               ~                                   ~ ~          ~                                                                                                    ~00~               ~                                                                                                    ~00~ ~               ~               ~                                   ~                                   ~   ~               ~ 
0000ISA00          00          ZZABCDETEST01   ZZABCDEFGHI      1601200849U005010000043080T 
0000GSRA   201601200849    000004216X 005010      
0000ST<>820<>1104<>PmtGrpWire0b<>
0010BPR<>U<>0000000000000000.02C<>FWT<><>01<>043000261<>DA<>99999999801<>WireCmpIDa<><>01<>021000018<>DA<>99999998799002<>20020101<><><><><>
0010TRN<>1<>TstTrceNbr1104<><>OOXXMXM<>
0010CUR<>PR<>USD<>00000000000<><><><>00000000        <>00000000        <>00000000        <>00000000        <>00000000        
0010REF<>TN<>TestPay002<><><><><><><><>
0020N1<>O2<>Test Initiating Party<><><><><>
0020N1<>O1<>Test Debtor Bank<>13<>043000261<><><>
0020N4<><><><>US<><><><>
0020N1<>PR<>Debtor Pyr Nm 0a<><><><><>
0020N3<>Payer Address 00a Line1<><>
0020N3<>Payer Address 00a Line2<><>
0020N4<>Payer City<>PA<>12345<>US<><><><>
0020N1<>BK<>Test Payee Bank 002<><><><><>
0020N1<>PE<>Payee Name 1104<><><><><>
0020N3<>Payee Address 002 Line1<><>
0020N3<>Payee Address 002 Line2<><>
0020N4<>Payee Town 002<>PA<>12345<>US<><><><>
0000SE<>00000000001104<>
0000GE<>000001000004216
0000IEA<>00001000004216
0000EOT<><>000000000000019000000000000000000000000000019<><><>

Thanks.

RudiC · January 24, 2016, 5:40am

That sed works with your sample data as specified, it extracts exactly that record from a set of different records.
The data structure is NOT what you posted earlier. E.g. the transnum is not the second ~delimited field but the third, EOT is not the last element in the final line, ... That's why other solutions offered may have failed.

mad_man · January 24, 2016, 5:54am

Hi RudiC,

Can you please suggest a sed solution for the above input?
Meanwhile, I am also playing around with the solutions offered.

Thanks.