Merging of rows

Hi guys,

Wish you all a very Happy New Year!!!.

Thanks in advance.

I want to read a file and merge the rows which have '\n' in it.
The rows could be > 50,000 bytes. The script should merge all the rows till the next row starts with word 'Type|'.
ex.
Type|Ticket|TradeID|42224A|SystemID|DDDD61266|SourceSystem|RMS|Version|5|LatestVersionYN|Y|Counterparty|OB|DBLegalEntity|London|Notes|cal events added|Salesperson|DHX|LastUpdated|DEC 2008 18:31:40|Action|Exp|DealCaptureSystem|Sales|CustomerSourceSystem|RMS|Customer|AUGS|ActionEffectiveDate|DEC 2008 00:00:00|BookingLocation|London|EnteredBy|GONU|MarkupAmt|100|MarkupCcy|KEX|PremiumCurrency|EUR|PremiumAmount|000|PremiumDate|DEC 2008|PremiumHasPaidYN|N|BookingDetails|9000/mihl
Client sells KEZUSD
st ref 1.3724
Leveraged Sele is at ST
52 sele, starting

Type|FxDiscreteDNT|TradeID|SSA32553H|SystemID|GGT04481|SourceSystem|RMS|SourceSystemBook|OT|Book|OT|BookBranchParent|Go|Version|8|LatestVersionYN|Y|Counterparty|DOB|Status|KnockedOut|IsLiveYN|N|DBLegalEntity|London|LastUpdated|JAN 2009 00:26:35|Action|KnockedOut|CounterpartySourcesystem|RMS|TradeDate|JUL 2006|ProductID|KKI99106|StructureID|DDS32553|DealCaptureSystem|RMS|CustomerSourceSystem|RMS|Customer|ESL|PricingSite|Option|UnderCurrency1|JPY|AccCurrency1|US|UpBarrier1|1.5|LowBarrier1|1|UnderCurrency2|JPY|AccCurrency2|CH|UpBarrier2|1.64|LowBarrier2|1.49|UnderCurrency3|EUR|AccCurrency3|JPY|UpBarrier3|175|LowBarrier3|105|LastFixed

The outout should be
The below record should be in one line.

Type|Ticket|TradeID|42224A|SystemID|DDDD61266|SourceSystem|RMS|Version|5|LatestVersionYN|Y|Counterparty|OB|DBLegalEntity|London|Notes|cal events added|Salesperson|DHX|LastUpdated|DEC 2008 18:31:40|Action|Exp|DealCaptureSystem|Sales|CustomerSourceSystem|RMS|Customer|AUGS|ActionEffectiveDate|DEC 2008 00:00:00|BookingLocation|London|EnteredBy|GONU|MarkupAmt|100|MarkupCcy|KEX|PremiumCurrency|EUR|PremiumAmount|000|PremiumDate|DEC 2008|PremiumHasPaidYN|N|BookingDetails|9000/mihl Client sells KEZUSD st ref 1.3724 Leveraged Sele is at ST 52 sele, starting

Type|FxDiscreteDNT|TradeID|SSA32553H|SystemID|GGT04481|SourceSystem|RMS|Version|8|LatestVersionYN|Y|Counterparty|DOB|DBLegalEntity|London|LastUpdated|JAN 2009 00:26:35|Action|KnockedOut|DealCaptureSystem|RMS|CustomerSourceSystem|RMS|Customer|ESL|PricingSite|Option|UnderCurrency1|JPY|AccCurrency1|US|UnderCurrency3|EUR|AccCurrency3|JPY|UpBarrier3|175|LowBarrier3|105|LastFixed

I tried using below mentioned script but AWK can't handle > 20,000 bytes. Also, using String as 'LINE_BREAK' for identification rows got merged.

FileName=LINE_TEST.TXT
FileName1=O_LINE_TEST.TXT
OUTPUT_FILE=OUTPUT_WITHOUT_LINE_BREAK.tmp
if [[ -e $FileName ]]; then
sed "s/%/percentage/g" $FileName > $FileName1
## To remove '%' as 'awk' can not handle it.

/usr/xpg4/bin/awk 'NR==1\{printf $0;next\}
!/^TradeType/ \{printf "LINE_BREAK" $0;next\}
\{printf "\\n" $0\}
END\{print ""\}
' $FileName1 > $OUTPUT_FILE

fi

Try with nawk:

nawk '$1=$1' FS='\n' RS= ORS='\n\n' infile

I am getting the below mentioned error :

13460 Segmentation Fault(coredump). Is it related Memory ?

The Actual File size is : 95532469 bytes.

Try with Perl:

perl -00ple'tr.\n. .' infile

How to identfy which lines were merged while processing the file ?

Just append some string to identify them:

perl -00ple'tr.\n. . and $_.=" __FIXED__"' infile

Hi Radoulov,

It's working !!! :slight_smile:

Thank you very much for quick reply. Appriciated !!! :b:

One last request

Could you please confirm how this script works ?

Have a doubt for the below code --

nawk '$1=$1' FS='\n' RS= ORS='\n\n' infile

In this code there is no print statement. Even then we get the output after running this script. How is that possible ?
Also did not get the meaning of the code '$1=$1' . What does it do exactly ?

I believe you meant explain.
It just transliterates the newlines to spaces inside the paragraphs:
the 0 switch with the 0 argument (-00) cause Perl to slurp files in paragraph mode
(one or more lines separated by one or more blank lines),
tr transliterates the \n to a space.

It's because in AWK the default action is print the current record:

% print OK|awk 42
OK

Quoting Ed Morton:

frankly speaking, I never worked in 'perl', hence unable to understand the code.
Is it possible for you to explain the code step by step?

Thanks.

Well,
in this case I believe you need something different.
You could start here.

Hi,
I forgot to ask, How this script will check whether the next line starts with word 'Type|' and if it is then should not merge with first row.

*When I ran the script, its concating both the lines mentioned below which should not happen. (these lines should not merge)

Type|Cash|TradeID|Funding Cash|SystemID|AA6192|SourceSystem|
Type|Cash|TradeID|Funding Cash|SystemID|BTB59029|SourceSystem|

OK,
please post a bigger sample of your real data that includes both cases.

Hi,

Please find the sample data. The file contains 3 rows. First row breaks into many lines, row 2nd and 3rd are without any line breaks. So, when the script runs, row starts with word 'Type|', should not merge with previous row but if not then should merge with previous line.

Row 1

Type|GenericSalesTicket|TradeID|103242224A|SystemID|179461266|SourceSystem|RMS|SourceSystemBook|GENERIC|Book|GENERIC|BookBranchParent|Sales|Version|5|LatestVersionYN|Y|Counterparty|DBOB|Status|Expired|IsLiveYN|N|DBLegalEntity|London|Notes|LDN - IRATE - TPF - cal events added|Salesperson|HUNDHX|LastUpdated|18 DEC 2008 18:31:40|Action|Expired|CounterpartySourcesystem|RMS|TradeDate|16 DEC 2008|ProductID|173898893|StructureID|103242224|DealCaptureSystem|RMSSales|CustomerSourceSystem|RMS|Customer|BMK/AUGS|ActionEffectiveDate|18 DEC 2008 00:00:00|BookingLocation|London|EnteredBy|GONZMANU|MarkupAmt|12000|MarkupCcy|EUR|MarkupLocation|Branch|AtRiskAmt|5000|AtRiskCcy|EUR|AtRiskLocation|Branch|PremiumCurrency|EUR|PremiumAmount|-12000|PremiumDate|18 DEC 2008|PremiumHasPaidYN|N|BookingDetails|079000/mihl
Client sells EURUSD
spot ref 1.3724
Leveraged Settlement is at SPOT
52 settlements, starting : see table
Dates see table

Each cashflow : notional 0.0500 million

Row 2
Type|CashBalance|TradeID|Funding CashBalance|SystemID|2126192|SourceSystem|RMS|SourceSystemBook|LNFEXOT|Book|LNFEXOT|BookBranchParent|Forward|Version|1652|LatestVersionYN|Y|Counterparty|FUNDING CASHBALANCE|Status|Open|IsLiveYN|Y|DBLegalEntity|London|LastUpdated|02 JAN 2009 05:04:48|Reason|TREASURY|Action|Funding|CounterpartySourcesystem|RMS|TradeDate|23 AUG 2002|ProductID|1194926|StructureID|0|DealCaptureSystem|RMS|CustomerSourceSystem|RMS|Customer|FUNDING CASHBALANCE|Currency|BND|Amount|161796.5913075|OpenAmount|0|ActionEffectiveDate|31 DEC 2008 00:00:00|SourceSystemProfitCentre|Funding|BookingLocation|London|EnteredBy|RMS_OWNER|SourceSystemID|2126192|DealCaptureSystemID|2126192|DateCreated|23 AUG 2002 00:04:06|ProductVersion|1|TradeClassification|0|PublicationTime|01 Jan 2009 23:59:49

Row 3
Type|FxGeneralAccrualOpt|TradeID|103667640A|SystemID|180131956|SourceSystem|RMS|SourceSystemBook|EURGEN|Book|EURGEN|BookBranchParent|Global|Version|5|LatestVersionYN|Y|Counterparty|PUTZ|Status|Open|IsLiveYN|Y|DBLegalEntity|London|Notes|Amended fixing ref to WMR 11 per David|Salesperson|WEISDAV|LastUpdated|29 DEC 2008 21:38:45|Action|Fixed|CounterpartySourcesystem|RMS|TradeDate|18 DEC 2008|ProductID|174561397|StructureID|103667640|DealCaptureSystem|FXPricer(Sales)|CustomerSourceSystem|RMS|Customer|PUTZ|UnderCurrency|EUR|AccCurrency|USD|Notional|4000000|Strike|1.395|CP|C|PayoffType|O|ExpiryDate|26 MAR 2009|SettlementDate|30 MAR 2009|FixingReference|WMRSPOT11|FractionalNotionalSoFar|0.25|PayCurrency|USD|AccrualStartDate|18 DEC 2008|AccrualEndDate|26 MAR 2009|FixingHolidays|EUR,USD|WeightHolidays|EUR,USD|LastFixedDate|26 DEC 2008|BarrierStartDate|18 DEC 2008|BarrierEndDate|26 MAR 2009|MinNotionalLeverage|0|MaxNotionalLeverage|700|FixingDate|26 DEC 2008|FixingWeight|1|FixingValue|1.4054|FixingDate|28 JAN 2009|FixingWeight|1|FixingDate|25 FEB 2009|FixingWeight|1|FixingDate|26 MAR 2009|FixingWeight|1|AccrualRegionStartDate|18 DEC 2008|AccrualRegionEndDate|26 MAR 2009|UpperExtinguishingLevel|1.58|UpperResurrectingLevel|1.58|LowerResurrectingLevel|1.395|LowerExtinguishingLevel|0.00001|UpperNotionalLeverage|0|MidNotionalLeverage|1|LowerNotionalLeverage|7|ActionEffectiveDate|26 DEC 2008 00:00:00|BookingLocation|London|EnteredBy|ROLLMIC|MarkupAmt|259220|MarkupCcy|EUR|MarkupLocation|London|SourceSystemID|180131956|DealCaptureSystemID|180131956|DateCreated|19 DEC 2008 00:39:20|ProductVersion|2|ProductLegId|1|ReferenceSpot|1.4235|ReferenceForwardPoints|-0.003740976|ReferenceVolatility|24.23007467|ReferenceDepoRate|1.87784704|ReferenceDepoCurrency|USD|ReferenceDelta|-0.2971194766996|ReferenceDeltaCurrency|EUR|TradeExtraDetailID|0|TradeExtraDetailVersion|0|TradeClassification|2|DealCaptureSalesUser|FXPRICERSALES_WEISDAV|FixingsAccruedSoFar|1|TotalFixings|4|AccruedNotionalSoFar|1000000|PublicationTime|29 Dec 2008 14:56:11

The Output Should be

Row 1

Type|GenericSalesTicket|TradeID|103242224A|SystemID|179461266|SourceSystem|RMS|SourceSystemBook|GENERIC|Book|GENERIC|BookBranchParent|Sales|Version|5|LatestVersionYN|Y|Counterparty|DBOB|Status|Expired|IsLiveYN|N|DBLegalEntity|London|Notes|LDN - IRATE - TPF - cal events added|Salesperson|HUNDHX|LastUpdated|18 DEC 2008 18:31:40|Action|Expired|CounterpartySourcesystem|RMS|TradeDate|16 DEC 2008|ProductID|173898893|StructureID|103242224|DealCaptureSystem|RMSSales|CustomerSourceSystem|RMS|Customer|BMK/AUGS|ActionEffectiveDate|18 DEC 2008 00:00:00|BookingLocation|London|EnteredBy|GONZMANU|MarkupAmt|12000|MarkupCcy|EUR|MarkupLocation|Branch|AtRiskAmt|5000|AtRiskCcy|EUR|AtRiskLocation|Branch|PremiumCurrency|EUR|PremiumAmount|-12000|PremiumDate|18 DEC 2008|PremiumHasPaidYN|N|BookingDetails|079000/mihl Client sells EURUSD spot ref 1.3724 Leveraged Settlement is at SPOT 52 settlements, starting : see table Dates see table Each cashflow : notional USD 0.0500 million

Row 2

Type|CashBalance|TradeID|Funding CashBalance|SystemID|2126192|SourceSystem|RMS|SourceSystemBook|LNFEXOT|Book|LNFEXOT|BookBranchParent|Forward|Version|1652|LatestVersionYN|Y|Counterparty|FUNDING CASHBALANCE|Status|Open|IsLiveYN|Y|DBLegalEntity|London|LastUpdated|02 JAN 2009 05:04:48|Reason|TREASURY|Action|Funding|CounterpartySourcesystem|RMS|TradeDate|23 AUG 2002|ProductID|1194926|StructureID|0|DealCaptureSystem|RMS|CustomerSourceSystem|RMS|Customer|FUNDING CASHBALANCE|Currency|BND|Amount|161796.5913075|OpenAmount|0|ActionEffectiveDate|31 DEC 2008 00:00:00|SourceSystemProfitCentre|Funding|BookingLocation|London|EnteredBy|RMS_OWNER|SourceSystemID|2126192|DealCaptureSystemID|2126192|DateCreated|23 AUG 2002 00:04:06|ProductVersion|1|TradeClassification|0|PublicationTime|01 Jan 2009 23:59:49

Row 3
Type|FxGeneralAccrualOpt|TradeID|103667640A|SystemID|180131956|SourceSystem|RMS|SourceSystemBook|EURGEN|Book|EURGEN|BookBranchParent|Global|Version|4|LatestVersionYN|N|Counterparty|PUTZ|Status|Open|IsLiveYN|Y|DBLegalEntity|London|Notes|Amended fixing ref to WMR 11 per David|Salesperson|WEISDAV|LastUpdated|29 DEC 2008 21:37:22|Action|NonEconomic|CounterpartySourcesystem|RMS|TradeDate|18 DEC 2008|ProductID|174561397|StructureID|103667640|DealCaptureSystem|FXPricer(Sales)|CustomerSourceSystem|RMS|Customer|PUTZ|UnderCurrency|EUR|AccCurrency|USD|Notional|4000000|Strike|1.395|CP|C|PayoffType|O|ExpiryDate|26 MAR 2009|SettlementDate|30 MAR 2009|FixingReference|WMRSPOT11|PayCurrency|USD|AccrualStartDate|18 DEC 2008|AccrualEndDate|26 MAR 2009|FixingHolidays|EUR,USD|WeightHolidays|EUR,USD|BarrierStartDate|18 DEC 2008|BarrierEndDate|26 MAR 2009|MinNotionalLeverage|0|MaxNotionalLeverage|700|FixingDate|26 DEC 2008|FixingWeight|1|FixingDate|28 JAN 2009|FixingWeight|1|FixingDate|25 FEB 2009|FixingWeight|1|FixingDate|26 MAR 2009|FixingWeight|1|AccrualRegionStartDate|18 DEC 2008|AccrualRegionEndDate|26 MAR 2009|UpperExtinguishingLevel|1.58|UpperResurrectingLevel|1.58|LowerResurrectingLevel|1.395|LowerExtinguishingLevel|0.00001|UpperNotionalLeverage|0|MidNotionalLeverage|1|LowerNotionalLeverage|7|ActionEffectiveDate|29 DEC 2008 00:00:00|BookingLocation|London|EnteredBy|ROLLMIC|MarkupAmt|259220|MarkupCcy|EUR|MarkupLocation|London|SourceSystemID|180131956|DealCaptureSystemID|180131956|DateCreated|19 DEC 2008 00:39:20|ProductVersion|2|ProductLegId|1|ReferenceSpot|1.4235|ReferenceForwardPoints|-0.003740976|ReferenceVolatility|24.23007467|ReferenceDepoRate|1.87784704|ReferenceDepoCurrency|USD|ReferenceDelta|-0.2971194766996|ReferenceDeltaCurrency|EUR|TradeExtraDetailID|0|TradeExtraDetailVersion|0|TradeClassification|2|DealCaptureSalesUser|FXPRICERSALES_WEISDAV|PublicationTime|29 Dec 2008 14:56:11

Try this:

perl -nle'
  print $x and undef $x 
    if $x and /^Type/ or eof;
   ($x .= $_) =~ tr/\n/ /s;
  ' infile

How to identify which rows got Merged ?

It's not working if I modify the script as below

($x .= $_) =~ tr/\n/LINE_BREAK/s;

Actually tr is not needed because of the -l switch.

Try this:

perl -nle'
  print $x, $f > 1 ? "_FIXED_" : "" 
    and ($x, $f) = undef 
    if $x and /^Type/ or eof;
  $f++ if $x .= " ". $_ and !/^$/
  ' infile

You may add another linefeed if you need an extra blank line between the records.

Hi,
if I execute the script, there will be one space(:() at the begaining of every line, the word '_FIXED_' comes only at the end of file but not where the line was break (_FIXED_) .

Row 1

:(Type|GenericSalesTicket|TradeID|103242224A|SystemID|179461266|SourceSystem|RMS|SourceSystemBook|GENERIC|B ook|GENERIC|BookBranchParent|Sales|Version|5|LatestVersionYN|Y|Counterparty|DBOB|Status|Expired|IsLi veYN|N|DBLegalEntity|London|Notes|LDN - IRATE - TPF - cal events added|Salesperson|HUNDHX|LastUpdated|18 DEC 2008 18:31:40|Action|Expired|CounterpartySourcesystem|RMS|TradeDate|16 DEC 2008|ProductID|173898893|StructureID|103242224|DealCaptureSystem|RMSSales|CustomerSourceSystem|RMS|C ustomer|BMK/AUGS|ActionEffectiveDate|18 DEC 2008 00:00:00|BookingLocation|London|EnteredBy|GONZMANU|MarkupAmt|12000|MarkupCcy|EUR|MarkupLocation|Bran ch|AtRiskAmt|5000|AtRiskCcy|EUR|AtRiskLocation|Branch|PremiumCurrency|EUR|PremiumAmount|-12000|PremiumDate|18 DEC 2008|PremiumHasPaidYN|N|BookingDetails|079000/mihl _FIXED_ Client sells EURUSD _FIXED_ spot ref 1.3724 _FIXED_ Leveraged Settlement is at SPOT _FIXED_ 52 settlements, starting : see table _FIXED_ Dates see table _FIXED_ Each cashflow : notional USD 0.0500 million_FIXED_

sed '/^Type/{:a;N;/\nType/!{s/\n/ /;ba};s/\n/&&/}'