Parsing a Linux file and formatting it

charithainfadev · January 28, 2011, 1:50pm

Hi, I have a Linux file that has data like this:

REQUEST_ID|text^Ctext^Ctext^C
REQUEST_ID|text^Ctext^C
REQUEST_ID|
REQUEST_ID|
REQUEST_ID|text^Ctext^Ctext^Ctext^Ctext^Ctext^C....

Wherever I see a ^C character, I need to copy the corresponding REQUEST_ID and that part of the text to a new line, so my destination file should look like this:

REQUEST_ID|text
REQUEST_ID|text
REQUEST_ID|text

.
.
How should I do this? I have no idea where to start because I have no knowledge of Unix or Linux so far.

---------- Post updated at 01:50 PM ---------- Previous update was at 11:49 AM ----------

Hi, could someone please reply? I need help quickly. I guess it could be done using sed or awk, but I have no idea how. Please let me know how this could be done using sed or awk.

DGPickett · January 28, 2011, 2:15pm

Narrative: Delete lines with no text (those ending in |), remove any trailing ^C, then branch conditionally to the next line just to clear the t flag. The big substitute splits off a leading line holding the request ID and the first text, and copies the request ID onto the second line in front of the second text. If that substitute fired, print the first line, remove it from the buffer, and try the substitute again. If not, branch to the end, which prints the buffer.

sed '
  /|$/d
  s/^C$//
  t n
  :n
  s/^\(\([^|]*|\)[^^C]*\)^C/\1\
\2/
  t p
  b
  :p
  P
  s/.*\n//
  t n
 '

charithainfadev · January 28, 2011, 4:36pm

Hi, thanks for the reply. Actually, I did not understand the full command. I am sorry!
I tried to run this command on the command line, but it did not work. I would appreciate your help in understanding it better before I try again, so that I can do it correctly.
Please correct me if I am wrong somewhere.

"sed '/|$/d" this is for removing empty lines (i was thinking sed '/^$/d')

"s/^C$//" this is for removing trailing control C s.

I did not understand the rest of the command:
"s/^\(\([^|]*|\)[^^C]*\)^C/\1\"

While typing this command, do I have to type the "^" character followed by "C", or the actual ^C character? I am confused! Please help.

DGPickett · January 28, 2011, 4:48pm

The ^C's have to be real control-C characters,
a line with no text was not shown in your example output, so it is treated as virtually empty and deleted, and

the big substitute says pick up:

starting at the beginning of the line: ^
string 1: \( ... \) => \1
and string 2: \( ... \) => \2
string 2 runs from the beginning of the line through the first | (not | = [^|], any number = *, then |)
string 1 is longer; it runs up to but not including the first control-C (not ^C = [^^C], any number = *)
then the first ^C itself

but lay back down in their place:

string 1 \1
a line feed (a backslash followed by a real line feed)
string 2 \2

If that is possible, then a line with N texts is now a line of 1 text followed by a line of N-1 texts.
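
The loop described above can be sketched as follows for anyone who cannot type the control character. This is a minimal sketch, not the exact script from the earlier post: it assumes a POSIX shell, builds the control-C with printf (octal 003) so nothing has to be typed with Ctrl+V, and leaves out the /|$/d line so the loop itself is easier to see. The names split_texts and cc are only illustrative.

```shell
#!/bin/sh
# cc holds one real control-C character (octal 003), not a caret plus a C.
cc=$(printf '\003')

# split_texts (illustrative name): reads "ID|text^Ctext^C" lines on stdin
# and writes one "ID|text" line per text.
split_texts() {
  sed "
    s/${cc}\$//
    t n
    :n
    s/^\\(\\([^|]*|\\)[^${cc}]*\\)${cc}/\\1\\
\\2/
    t p
    b
    :p
    P
    s/.*\\n//
    t n
  "
}

# Example: one request with three texts; each output line starts with REQ1|.
printf 'REQ1|first\003second\003third\003\n' | split_texts
```

The script is in double quotes only so that the shell can paste the control character into it; that is why every sed backslash is doubled.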

charithainfadev · January 28, 2011, 5:26pm

But that is not working. I can post some sample data for you:

|111111111|
|222222222|1292251978 Prad Bu Development^C
|333333333|
|444444444|1294070403 Joe De Pete Keny is currently testing and will let me know shortly.^C1294416965 Joe De Revised the WF tigger and sent back for re-testing.^C1295020527 Joe De No update this week.^C
|555555555|

I am getting the output as

222222222|1292251978 Prad Bu Development
444444444|1294070403 Joe De Pete Keny is currently testing and will let me know shortly.^C1294416965 Joe De Revised the WF tigger and sent back for re-testing.^C1295020527 Joe De No update this week.

But I need the output as this..

111111111|
222222222|1292251978 Prad Bu Development
333333333|
444444444|1294070403 Joe De Pete Keny is currently testing and will let me know shortly.
444444444|1294416965 Joe De Revised the WF tigger and sent back for re-testing.
444444444|1295020527 Joe De No update this week.
555555555|

Please help! Thanks for all your patience.

vgersh99 · January 28, 2011, 6:11pm

Assuming '^C' is actually a SINGLE character (CTRL-C) - not 2 characters...

nawk -f char.awk myFile

char.awk:

BEGIN {
  FS=OFS="|"
  ctrlC=sprintf("%c", 003)
}
{
   sub("^[|]","")
   sub(ctrlC "$","")
   while (n=index($0,ctrlC))
      $0=substr($0,1,n-1) ORS $1 OFS substr($0,n+1)
   print
}

charithainfadev · January 28, 2011, 8:00pm

Hi, can you please explain the code to me? When I saved this code and ran the nawk command on the command line, it said "command not found".
The code doesn't seem to be Unix or Linux code! I am not sure, but I think "sprintf" is C or C++ code (sorry if I am wrong). It's not working! Can we do this using shell scripting? Control-C is a single character. It can be typed on the command line by pressing Ctrl+V first, followed by Ctrl+C. Just FYI, I am working on a Linux platform.

vgersh99 · January 29, 2011, 3:38am

Try using awk or gawk (instead of nawk).
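
Here is the same logic again with comments, wrapped in a small shell function so it can be tried without a separate char.awk file. Treat it as a sketch: the function name split_requests and the two sample lines are made up for illustration, and it assumes awk is on your PATH. Note that sprintf here is one of awk's own built-in functions, and ctrlC is just a variable name that must be typed as those five letters.

```shell
#!/bin/sh
# split_requests (illustrative name): reads "|ID|text^Ctext^C" lines on stdin.
split_requests() {
  awk '
    BEGIN {
      FS = OFS = "|"               # fields are separated by | on input and output
      ctrlC = sprintf("%c", 003)   # variable holding the control-C character
    }
    {
      sub("^[|]", "")              # drop the leading |
      sub(ctrlC "$", "")           # drop the trailing control-C, if any
      # Replace each remaining control-C with: newline, request ID, |
      while (n = index($0, ctrlC))
        $0 = substr($0, 1, n - 1) ORS $1 OFS substr($0, n + 1)
      print
    }
  '
}

# Example: one record without text and one with two texts.
printf '|111|\n|444|first\003second\003\n' | split_requests
```

To run it on a real file instead of the example, redirect the file into the function: split_requests < myFile.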

ghostdog74 · January 29, 2011, 6:22am

Nice. However, this guy is totally clueless about Unix/Linux scripting. I don't think he will understand sed's terse syntax even after your explanation of what it does. The closest one can get to at least knowing what is going on is to use a tool or language with syntax meant for "human readability" (awk, Python, Ruby, etc. come to mind).

charithainfadev · January 29, 2011, 8:57am

It throws the following error with both awk and gawk. Is it Unix code only? Sorry if the question sounds stupid! I am wondering whether it is C++ code or Unix code.

awk: char.awk:3:   =sprintf("%c", 003)
awk: char.awk:3:   ^ invalid char '' in expression

Somehow I need this to be working by today. Please help!

vgersh99 · January 29, 2011, 9:17am

Please repost (using the code tags) the content of your script exactly as-is AND the way you call/execute it.
Make sure you don't have any ^M characters (carriage returns) in your script.

The script works fine under Solaris and AIX.
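
A quick way to check a script for ^M characters and remove them, as a sketch only: it assumes a POSIX shell with grep and tr available, and the function names has_cr and strip_cr are just illustrative.

```shell
#!/bin/sh
# has_cr (illustrative name): succeeds if the named file contains a carriage return.
has_cr() {
  grep -q "$(printf '\r')" "$1"
}

# strip_cr (illustrative name): copies stdin to stdout without carriage returns.
strip_cr() {
  tr -d '\r'
}

# Typical use, assuming the script is saved as char.awk:
#   has_cr char.awk && strip_cr < char.awk > char.clean.awk
```

Writing the cleaned copy to a new file rather than over the original keeps the original around for comparison.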

charithainfadev · January 29, 2011, 9:43am

BEGIN {
  FS=OFS="|"
  ^C=sprintf("%c", 003)
}
{
  sub("^[|]","")
  sub(^C "$","")
  while (n=index($0, ^C))
     $0=substr($0, 1, n-1) ORS $1 OFS substr($0, n+1)
  print
}

 gawk -f char.awk test8.out

or

 awk -f char.awk test8.out

And it throws the following error

gawk: char.awk:3:   =sprintf("%c", 003)
gawk: char.awk:3:   ^ invalid char '' in expression

---------- Post updated at 09:43 AM ---------- Previous update was at 09:42 AM ----------

Maybe on Red Hat Linux some changes have to be made?

vgersh99 · January 29, 2011, 9:44am

This is not the code I posted.
Repaste what's been posted originally without any modifications.

charithainfadev · January 29, 2011, 10:05am

nawk -f char.awk myFile

nawk is not recognized, so I replaced it with gawk or awk.

char.awk: 
BEGIN {
  FS=OFS="|"
  ctrlC=sprintf("%c", 003)
}
{
   sub("^[|]","")
   sub(ctrlC "$","")
   while (n=index($0,ctrlC))
      $0=substr($0,1,n-1) ORS $1 OFS substr($0,n+1)
   print
}

I replaced ctrlC with the control-C character. That's it. I did not make any other changes to your code.

vgersh99 · January 29, 2011, 10:12am

As I said - do NOT make any modifications to the script itself.
Simply call it with 'awk' or 'gawk'.

charithainfadev · January 29, 2011, 10:13am

Oh! WAIT!! Now I got it... I should not replace ctrlC with control c character. Silly me!!! AWESOME! AWESOME!! THANKS THANKS THANKS...THANKS A TON!!

---------- Post updated at 10:13 AM ---------- Previous update was at 10:12 AM ----------

Yes! it worked! I did that!

vgersh99 · January 29, 2011, 10:44am

I'm glad it worked out for you!

charithainfadev · February 1, 2011, 4:47pm

Hi, now I have a new problem. If the file is small, this works fine. But if I get all the data from production, it does not work.
There are just 1300 records that get loaded into this flat file from production, with 2 fields: one is VARCHAR2(15 BYTE) and the other is a CLOB field. But this awk command truncates the file so that it contains only 140-150 records instead of 1300. Please let me know what could be done about this.

---------- Post updated at 04:16 PM ---------- Previous update was at 04:11 PM ----------

Sorry! Small correction.
I get 40,000 records into this file from production, out of which I have to capture only those records that contain data in the CLOB field. I can filter out the records for which the CLOB field is null, so that would be 1300 records out of 40,000. I tried to filter like this:

awk 'length > 16' <in-file> out-file

But any kind of awk command truncates the file so that it contains just 140 records. I need to push this to production today, but unfortunately I am stuck here. I need help really quickly, please!

---------- Post updated at 04:47 PM ---------- Previous update was at 04:16 PM ----------

Hey, I solved this problem for now. I added a SQL override in Informatica to fetch only those records from the database for which the CLOB field is not null. But in the future, if I have to fetch thousands and thousands of records, I will face this problem again. So I would appreciate any suggestions on this situation. Thanks in advance.
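
One way to narrow this down, as a sketch only: count how many records awk reads and how many of them carry text after the request ID, then compare the first number with what wc -l reports for the same file. The function name count_records is hypothetical, and the check assumes the |ID|text layout from the sample data above with no | inside the text. If the counts already disagree at this stage, the cause may lie in the file itself rather than in the splitting script; that is a guess, not something established in this thread.

```shell
#!/bin/sh
# count_records (hypothetical name): prints "<lines read> <lines with text>".
count_records() {
  awk -F'|' '
    { total++ }
    $NF != "" { kept++ }     # a non-empty last field means text follows the ID
    END { printf "%d %d\n", total, kept }
  '
}

# Example with three records, one of which carries text.
printf '|111|\n|222|some text\003\n|333|\n' | count_records
```

Testing the field itself, instead of the line length, avoids depending on how long the request ID happens to be.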