Split a single record to multiple records & add folder name to each line

ram2581 · July 5, 2011, 12:25pm

Hi Gurus,

I need to cut a single record in the file (asdf) into multiple records based on the number of bytes (44 characters), so that every record has 44 characters. All the records should stay in the same file. To each of these lines I need to add the folder (<date>) name.
I have a directory containing subdirectories (the structure remains the same). I need to add the main directory name (example below) as a column in a fixed-width file that sits in a subdirectory of that particular folder. Can you please help me out with this, as I am new to scripting?
Ex: /abc/<date>/def/asdf.dat
/abc/20010125/def/asdf.dat
/abc/20020502/def/asdf.dat

Here I need to take the <date> directory name and put it in the asdf file after each row as a column.

Then concatenate all the data from the similar files into a single file.
Please help me out!

Regards,
Ram

Corona688 · July 5, 2011, 12:54pm

Show an example of the input you have and the output you want.

ram2581 · July 5, 2011, 1:01pm

File: /abc/20010125/def/asdf.dat
Input file data:

OA        200909C+0000000.5600000091.0000 OA        200909  P+0000000.5600000090.0000 PA        200909  C+0000000.5800000091.0000

Output file name : /abc/20010125/def/asdf_split
Output file data

OA        200909C+0000000.5600000091.0000 20010125
OA        200909P+0000000.5600000090.0000 20010125
PA        200909C+0000000.5800000091.0000 20010125

Please let me know.

Corona688 · July 5, 2011, 1:11pm

Where is this date coming from? Is it given to your program?

---------- Post updated at 11:11 AM ---------- Previous update was at 11:04 AM ----------

OA 200909C+0000000.5600000091.0000

Is there supposed to be a space between 200909 and C+?

Morgan_Greywolf · July 5, 2011, 1:16pm

If you know that each of the records is exactly 44 characters, then you can just use the fold command (assuming you're using a POSIX compliant shell):

fold -w44 /abc/$(date +%Y%m%d)/def/asdf.dat | sed "s/$/ $(date +%Y%m%d)/" >/abc/$(date +%Y%m%d)/def/asdf_split

but that assumes each record is always exactly 44 characters with no embedded newlines, and that the directory is named after today's date, since $(date +%Y%m%d) expands to the current date.
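
A minimal sketch of the same fold-and-sed idea, assuming the /abc/<date>/def/asdf.dat layout from the first post and taking the date from the file's own path. The function name split_with_date is hypothetical, and the blank-line filter is a precaution that is not part of the command above.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread.
# Assumes a path of the form .../<date>/def/asdf.dat whose content is
# one long line made of fixed 44-character records.
split_with_date() {
    file=$1
    # The <date> directory sits two levels above the file.
    date_dir=$(basename "$(dirname "$(dirname "$file")")")
    # fold cuts the line every 44 columns; the first sed expression
    # drops any blank line, the second replaces trailing spaces with
    # a single space followed by the directory name.
    fold -w 44 "$file" | sed -e '/^ *$/d' -e "s/ *\$/ $date_dir/"
}
```

It could be called as: split_with_date /abc/20010125/def/asdf.dat > /abc/20010125/def/asdf_split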

---------- Post updated at 01:16 PM ---------- Previous update was at 01:14 PM ----------

BTW -- that looks almost exactly like HPGL.

Corona688 · July 5, 2011, 1:17pm

If there is, then this will work:

awk -v D=20010125 '{
         for(N=1; N<=NF; N++)
         if($N ~ /[OP]A/)
         {
                 print $N, $(N+1) $(N+2), D;
                 N++;
         }
 }' < datafile > datafile-out

Name datafile and datafile-out whatever you please.

---------- Post updated at 11:17 AM ---------- Previous update was at 11:16 AM ----------

That will wrap the lines, but won't add the new string onto the end of them...

Morgan_Greywolf · July 5, 2011, 1:28pm

That's why the 'sed' command on the end.

Corona688 · July 5, 2011, 1:28pm

There's also this sed method, which won't care whether OA 200909 is followed by a space or not (it relies on GNU sed for \? and for \n in the replacement):

$ echo "OA 200909C+0000000.5600000091.0000 OA 200909 P+0000000.5600000090.0000 PA 200909 C+0000000.5800000091.0000
" | sed 's/ \?\([OP]A\) \([0-9]*\) \?\([^ ]*\)/\1 \2 \3 20010125\n/g'
OA 200909 C+0000000.5600000091.0000 20010125
OA 200909 P+0000000.5600000090.0000 20010125
PA 200909 C+0000000.5800000091.0000 20010125

$

ctsgnb · July 5, 2011, 1:47pm

You mean something like this?

myfile=/abc/20010125/def/asdf.dat
fold -w 44 "$myfile" | awk -v f="$myfile" '{match(f,"[0-9]+");print $0 OFS substr(f,RSTART,RLENGTH)}'

ram2581 · July 5, 2011, 2:23pm

I need to do this for all the folders in 'abc'. The script needs to go into every subfolder of abc (the <date> folder), then into the subfolder below it, and pick up the file asdf. It should split the long line every 44 characters, then add the <date> folder name as a separate column.
Finally, I need to combine the data from the different files into a single file.
The date folder name changes, but the subdirectories below it do not.

ctsgnb · July 5, 2011, 2:58pm

find /abc -type f -name "*.dat" | while read -r f
do
p=$(echo "$f" | awk -F/ '{print $(NF-2)}')   # <date> directory: two levels above the file in .../<date>/def/asdf.dat
sed 's/ \(.+\)/\1/g' "$f" | fold -w 35 | sed "s/ *$/ $p/" >"$f.output"
done
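
For the last step in the request, combining every split file into one, a sketch along the same lines might look like this. It assumes the fixed /abc/<date>/def/asdf.dat layout and the 44-character records from the first post. The function name split_all and the combined output path are hypothetical. It uses a shell glob rather than find and leaves out the first sed.

```shell
#!/bin/sh
# Hypothetical function, not from the thread.
# Assumes the layout <base>/<date>/def/asdf.dat with 44-character records.
split_all() {
    base=$1        # e.g. /abc
    combined=$2    # single output file for all dates
    : > "$combined"                       # start with an empty file
    for f in "$base"/*/def/asdf.dat; do
        [ -f "$f" ] || continue           # the glob matched nothing
        d=${f%/def/asdf.dat}              # strip the fixed tail of the path
        d=${d##*/}                        # keep the last component: <date>
        # Write .../def/asdf_split next to the input file.
        fold -w 44 "$f" | sed "s/ *\$/ $d/" > "${f%.dat}_split"
        cat "${f%.dat}_split" >> "$combined"
    done
}
```

It could be called as: split_all /abc /tmp/asdf_all.txt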

Morgan_Greywolf · July 5, 2011, 3:24pm

@ctsgnb:
I think you forgot a closing paren on line 3, and I think he wants '-w 44', not '-w 35', but other than that I think it should work.

ctsgnb · July 5, 2011, 3:53pm

ooops , thx!
... fixed

Regarding the -w 35 or 44 or whatever, of course it must be adjusted depending on the input (i just based my formatting on the given sample)

ram2581 · July 5, 2011, 4:19pm

Thank you "ctsgnb" and "Morgan". The code works without the folder name (if I remove /abc after find), but when I include /abc it says the directory does not exist.
This might be a silly question to you. Can you please let me know why?

Morgan_Greywolf · July 5, 2011, 4:34pm

What's the error you get with '/abc'? Is '/abc' a mount point? If it's a mount point, is it mounted manually or with automount? Is /abc a symlink?

ram2581 · July 5, 2011, 5:54pm

I can see that the folder exists. I run the script from the folder above abc (the file is, say, /root/abc/19990702/asdf.dat, and I am running the command in /root), and then it gives me the error. abc is not a mount point; I created this directory. I replicated the actual scenario that I need to implement.

Thanks,
Ram

Morgan_Greywolf · July 5, 2011, 9:40pm

Well, then you obviously need /root/abc, not /abc.

ctsgnb · July 6, 2011, 3:05am

This is how the find command's syntax works: its first argument is the starting directory from which it begins the search.

By specifying /abc, it searches /abc and its whole subtree for files that match the search criteria. /abc is an absolute path, so it refers to a directory at the top of the filesystem, not to /root/abc.
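
The difference between an absolute and a relative starting point can be tried out in a scratch directory. This is only an illustration: the temporary tree below stands in for /root/abc, and the variable names are made up for the example.

```shell
#!/bin/sh
# Scratch tree standing in for /root/abc; the paths are illustrative only.
top=$(mktemp -d)
trap 'rm -rf "$top"' EXIT
mkdir -p "$top/abc/19990702/def"
: > "$top/abc/19990702/def/asdf.dat"

# Absolute starting point: works from any current directory.
abs=$(find "$top/abc" -type f -name '*.dat')

# Relative starting point: resolved against the current directory,
# so it only works after changing to the parent of abc.
rel=$(cd "$top" && find abc -type f -name '*.dat')

printf '%s\n%s\n' "$abs" "$rel"
```

find prints each path with the starting point as its prefix, so the position of the date directory among the /-separated fields depends on how find was invoked.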

ram2581 · July 6, 2011, 2:31pm

It's working fine if I put the full path. Thank you once again, ctsgnb & Morgan.

---------- Post updated at 01:28 PM ---------- Previous update was at 11:43 AM ----------

Hi,

I just noticed that after the second column there is only one space instead of two.
Example:
Input:

LO 201612   P+0000133.0000000120.4642 20090511
LO 201612   C+0000150.0000000120.2220 20090511

output:

LO 201612 P+0000133.0000000120.4642 20090511
LO 201612 C+0000150.0000000120.2220 20090511

Code used:

find /abc -type f -name "*.dat" | while read f
do
p=$(echo $f | awk -F/ '{print $2}')
sed 's/ \(.+\)/\1/g' $f | fold -w 35 | sed "s/ *$/ $p/" >$f.output
done

Can you please let me know why?

---------- Post updated at 01:31 PM ---------- Previous update was at 01:28 PM ----------

I don't know why, but if I add an extra space in the post above, it is not kept.

ram2581 · July 8, 2011, 5:15pm

ctsgnb,
Can you please tell me how to get two spaces after the second column? I am unable to figure that out.
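
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in a basic regular expression the + is a literal plus sign, so the first sed in the loop matches a space, any one character and a +, and then drops that space. The record below is adapted from the sample in this thread, on the assumption that it originally had two spaces before P+.

```shell
#!/bin/sh
# Sample record with two spaces before "P+" (assumed original spacing).
rec='LO        201612  P+0000133.0000000120.4642 '

# As posted: " \(.+\)" matches the space directly before "P+" and the
# replacement keeps only "P+", so one of the two spaces is removed.
with_strip=$(printf '%s\n' "$rec" | sed 's/ \(.+\)/\1/g' | sed 's/ *$/ 20090511/')

# Without the first sed the spacing inside the record is left alone.
without_strip=$(printf '%s\n' "$rec" | sed 's/ *$/ 20090511/')

printf '%s\n%s\n' "$with_strip" "$without_strip"
```

If the first sed is dropped, each record keeps its full length, so the fold width would need to match the real record length (44 characters according to the first post).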