Proper column-wise matching

nikhil_jain · August 11, 2016, 8:46am

My code below works fine as long as none of the columns contains a pipe in its content. If a value does contain a pipe, the rest of that value spills over into the next column.

I want the code to work even when a column contains a pipe that is not a delimiter.

NOTE: A pipe that is part of the content rather than a delimiter is escaped with a backslash (\|).

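#set -x
awk '
# Header line: remember the column names and keep the line for the report.
# After the loop, cc holds the number of columns plus one.
NR==1 {for (cc=1; cc<=NF; cc++) n[$cc]=$cc; t=$0; next;}
{
   # Count the populated values per column.
   # Double quotes here: single quotes would end the shell-quoted awk program.
   if ($1 != "0") c[1]++;
   for (i=2; i<=NF; i++) if ($i != "NA" && $i != "null" && $i != "") c[i]++;
}
END {
   print t;
   --NR;   # do not count the header as a data row
   r="";
   for (i=1 ; i<cc; i++) {
      p=(c[i]/NR)*100;   # fill percentage of column i
      r=(i == 1) ? "" p : r OFS p;
   }
   print r
}
' FS="|" OFS="|" "$1"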


RudiC · August 11, 2016, 8:57am

Not sure I understand what you are up to. How about a decent input sample, the desired result, and the logic connecting them?

To ignore escaped delimiters, replace them with a token up front, work on the modified file, and then reverse the replacement.
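
A minimal sketch of that token idea, assuming the escape sequence is always exactly \| and that the byte with octal code 037 never occurs in the data. The function name split_escaped is hypothetical and only serves the illustration. Here the swap is done per record inside awk instead of on a modified copy of the file:

```shell
# Hypothetical helper: print every field of pipe-delimited input on its own
# line, treating \| as literal text rather than as a delimiter.
split_escaped() {
  awk -F'|' '
  BEGIN { tok = "\037" }          # placeholder byte, assumed absent from the data
  {
     gsub(/\\\|/, tok)            # 1. hide escaped pipes; changing $0 re-splits it
     for (i = 1; i <= NF; i++) {
        f = $i
        gsub(tok, "\\|", f)       # 2. restore the escaped pipe in the field copy
        print i ": " f
     }
  }' "$@"
}

# One row taken from the c10.txt sample in this thread.
printf '%s\n' '100441566|CHICK-\|FIL-A|Chick||TP' | split_escaped
```

With the escaped pipe hidden during splitting, CHICK-\|FIL-A stays together as field 2 instead of being cut into two columns.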

nikhil_jain · August 12, 2016, 4:27am

[sdp@blr-qe101 .nikhil]$ sh filler.sh c10.txt 
unique_bank_transaction_id|merchant name_GT|MERCHANT_NAME_TDE|output
100|100|100|100
[sdp@blr-qe101 .nikhil]$ sh filler.sh 10.txt 
unique_bank_transaction_id|merchant name_GT|MERCHANT_NAME_TDE|output
100|100|100|100

cat 10.txt 
unique_bank_transaction_id|merchant name_GT|MERCHANT_NAME_TDE|output
076679010|WALMART|Walmart|TP
2242937867|PUBLIX SUPER MARKETS INC|Publix Super Markets|TP
100441566|CHICK-FIL-A|Chick|jacke|TP
1000549208|BURLINGTON - BURLINGTON COAT FACTORY|Burlington Coat Factory|TP
1000146040284|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
1000146428873|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
1000539406|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
10005847326|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
100056070|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP

[sdp@blr-qe101 .nikhil]$ cat c10.txt  
unique_bank_transaction_id|merchant name_GT|MERCHANT_NAME_TDE|output
076679010|WALMART|Walmart|TP
2242937867|PUBLIX SUPER MARKETS INC|Publix Super Markets|TP
100441566|CHICK-\|FIL-A|Chick||TP
1000549208|BURLINGTON - BURLINGTON COAT FACTORY|Burlington Coat Factory|TP
1000146040284|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
1000146428873|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
1000539406|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
10005847326|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP
100056070|ABERCROMBIE & FITCH|Abercrombie & Fitch|TP

---------- Post updated 08-12-16 at 01:57 PM ---------- Previous update was 08-11-16 at 06:48 PM ----------

Can anyone please help? In the content above, the 100441566 row (the one I highlighted) has an extra pipe in it.

My question is this: if there is an extra pipe preceded by a backslash (\|), it should be ignored and not treated as the start of the next column.

RudiC · August 12, 2016, 5:05am

Did you try the hint given?

nikhil_jain · August 12, 2016, 8:14am

Rudi,

It is a huge file of about 8 GB, and the problem is that we are constrained for space, so I can't try it.

zaxxon · August 12, 2016, 8:34am

Uhm, what is the problem with trying it on a short example like the one you have already given? This is about the principle of the problem, not about processing an 8 GB file.
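
For a short example, something along these lines could serve as a starting point. It is only a sketch under stated assumptions: the escape is always exactly \|, the byte with octal code 037 does not occur in the data, and the header line contains no escaped pipes. The name fill_rate is hypothetical, and the counting rules are taken from the script in the first post. Because the replacement happens per record in a single pass, no modified copy of the input is written:

```shell
# Hypothetical helper: per-column fill percentage for pipe-delimited input,
# with \| treated as part of the value instead of as a delimiter.
fill_rate() {
  awk '
  BEGIN { FS = OFS = "|"; tok = "\037" }   # tok: placeholder assumed absent from data
  NR == 1 { ncol = NF; print; next }       # header: remember the column count
  {
     gsub(/\\\|/, tok)                     # hide escaped pipes; $0 is re-split
     if ($1 != "0") c[1]++
     for (i = 2; i <= ncol; i++)
        if ($i != "NA" && $i != "null" && $i != "") c[i]++
  }
  END {
     rows = NR - 1                         # data rows, header excluded
     if (rows < 1) exit                    # nothing to report
     r = ""
     for (i = 1; i <= ncol; i++) {
        p = (c[i] / rows) * 100
        r = (i == 1) ? "" p : r OFS p
     }
     print r
  }' "$@"
}

# Header plus two rows from the c10.txt sample.
printf '%s\n' \
  'unique_bank_transaction_id|merchant name_GT|MERCHANT_NAME_TDE|output' \
  '076679010|WALMART|Walmart|TP' \
  '100441566|CHICK-\|FIL-A|Chick||TP' | fill_rate
```

In this sample the second data row leaves column 4 empty once the escaped pipe is no longer counted as a delimiter, so the report line reads 100|100|100|50. The placeholder is never written out, because only the header and the percentages are printed.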

nikhil_jain · August 17, 2016, 4:43am

Zaxxon,

I'll try implementing it if you give me the solution for a small file as well.
Please help.