Printing string from last field of the nth line of file to start (or end) of each line (awk I think)

samonl · February 6, 2018, 4:27am

My file (the output of an experiment) starts off looking like this,

_____________________________________________________________
Subjects incorporated to date: 001
Data file started on machine PKSHS260-05CP

**********************************************************************
Subject 1, 11/30/2017 16:07:17 on PKSHS260-05CP, DMDX 5.1.5.3, Windows 6.1.7601, refresh 16.67ms, ID ik60607
!  DMDX is running in auto mode (automatically determined raster sync)
!  Video Mode 1280,1024,32,60
!  Item File <C:\Users\XXXXXXX\Desktop\Experiment\Version2.rtf>
Item 11213, 1372.44
 1372.44,+Right Ctrl
Item 11213, 1052.90
 1052.90,+CTRL
Item 114109, -1102.03
 1102.03,+Right Ctrl
Item 11131, 721.06
 721.06,+Right Ctrl
Item 111325, 1075.30
 1075.30,+Right Ctrl

I used the following:

egrep '^(Item|Subject|!)' filename

to get it like this
Subjects incorporated to date: 001
Subject 1, 11/30/2017 16:07:17 on PKSHS260-05CP, DMDX 5.1.5.3, Windows 6.1.7601, refresh 16.67ms, ID ik60607
! DMDX is running in auto mode (automatically determined raster sync)
! Video Mode 1280,1024,32,60
! Item File <C:\Users\XXXXX\Desktop\Experiment\Version2.rtf>
Item 11213, 1372.44
Item 11213, 1052.90
Item 114109, -1102.03
Item 11131, 721.06

Now I want to use something like

awk 'NF > 5' | awk '{print $NF}'

to extract the ik60607 (which is an identifier for that participant in the experiment)

and produce something like this

ik60607,Item 11213, 1372.44
ik60607,Item 11213, 1052.90
ik60607,Item 114109, -1102.03
ik60607,Item 11131, 721.06

etc.
or ideally do something like this operation twice (using the rtf filename) to produce

Version2.rtf,ik60607,Item 11213, 1372.44
Version2.rtf,ik60607,Item 11213, 1052.90
Version2.rtf,ik60607,Item 114109, -1102.03
Version2.rtf,ik60607,Item 11131, 721.06

I think I might need to use 'paste' after making a file with the ID printed as many times as the original file has lines, and then deleting the unwanted bits. I have tried using awk but got confused by record and field separators and do not understand what I have read in the manual pages. I am not in any way skilled with this, but I was forced to do some of it twice in my life, for two-month periods 10 and 25 years ago, working with large dictionaries, which is why I even tried. Please help a very naive non-programmer (and it is pre-processing data for even more hapless students).

I have to do this on 108 (currently separate) files, all with different (irregular) names, and I am not sure whether using cat first to join them up would make things even worse (there would be record and field separators that way, but it seems even more complex).

Any advice gratefully received.

___________________________________________________________

RudiC · February 6, 2018, 5:11am

Welcome to the forum!

Please make sure to enclose ALL code and data in code tags as required by the forum rules!

Your spec is not quite clear and consistent (why is Item 111325, 1075.30 not found in your desired output?), but would

awk '
/^Item/         {print FN "," ID "," $0
                }

/^Subject /     {ID = $NF
                }

/! Item/        {gsub (/^.*\\|>$/, _)
                 FN = $0
                }

' file
Version2.rtf,ik60607,Item 11213, 1372.44
Version2.rtf,ik60607,Item 11213, 1052.90
Version2.rtf,ik60607,Item 114109, -1102.03
Version2.rtf,ik60607,Item 11131, 721.06
Version2.rtf,ik60607,Item 111325, 1075.30

come close to what you need?

samonl · February 6, 2018, 5:27am

It's not in my desired output because I copied it by hand. I couldn't copy and paste because the Windows-generated end-of-line characters were messy everywhere (in the Mac text editor, in the browser, on the Windows partition): the output sometimes shows up as one continuous line and sometimes not.

Your suggestion produced this:

,Item 11213, 1372.44
,Item 11213, 1052.90
,Item 114109, -1102.03
,Item 11131, 721.06

which is close (it added a , rather than the last field of the relevant line). Thank you though.

RudiC · February 6, 2018, 5:33am

I'm pretty sure that result is due to the "windows generated output end of line characters", making the "Item" line overwrite the file name and ID.

Try adding

                {sub (/\r$/, "")
                }

in front of the /^Item/ regex line.
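
To illustrate why a trailing carriage return produces that symptom, here is a minimal sketch. The sample text and the variable names (sample, raw, fixed) are made up for illustration, and it assumes an awk that understands \r in a regex:

```shell
# Two CRLF-terminated lines, as a Windows program would write them (made-up sample).
sample='Subject 1, ID ik60607\r\nItem 11213, 1372.44\r\n'

# Without stripping, ID is "ik60607" plus a carriage return. On a terminal the
# carriage return moves the cursor back to column 0, so the text that follows
# is drawn over the ID.
raw=$(printf "$sample" | awk '/^Subject /{ID = $NF} /^Item/{print ID "," $0}')

# With the carriage return removed from every record first, the ID survives.
fixed=$(printf "$sample" | awk '{sub(/\r$/, "")} /^Subject /{ID = $NF} /^Item/{print ID "," $0}')

printf '%s\n' "$raw" | od -c    # the dump shows \r embedded in the output
printf '%s\n' "$fixed"
```

The unconditional {sub (...)} block has to come first so that every later rule sees the cleaned line.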

samonl · February 6, 2018, 6:08am

Thank you. I think you are right. I am using 'cat' to join the whole lot together and will try to solve the labelling afterwards, once all the rtf rubbish has been removed. I am going to say this is solved because a) your code shows me the correct gsub and b) I have forced myself to read the array and loop pages of the awk manual to the point where I'll have fun trying until I give up and do the whole thing manually or with the VBA bit of Excel. All for some ungrateful third-year students. Thanks very much.

---------- Post updated at 11:08 AM ---------- Previous update was at 10:39 AM ----------

,ik60607,Item 122116, 658.49
,ik60607,Item 12313, 550.71
,ik60607,Item 50, 30111.98
,ik60607,Item 51, 3807.15
,ik60607,Item 52, 2384.38

You have no idea how much pure joy you have generated. Thank you. Really.

RudiC · February 6, 2018, 6:17am

I can't understand that result. Please use "Manage Attachments" in the "Advanced" editor window to attach the original file (meaningfully truncated if need be) for analysis. If attaching requires a higher post count than you have so far, post the output of

od -tx1c filename

as text.
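
As a rough sketch of what such a dump reveals, the snippet below runs od on one made-up line (not taken from the real file) and uses the simpler -An -tx1 form so that only the hex bytes are shown. The variable name line_dump is hypothetical.

```shell
# Dump a made-up line with two spaces after "!" and a Windows line ending.
line_dump=$(printf '!  Item File\r\n' | od -An -tx1)
printf '%s\n' "$line_dump"
# In the hex bytes, 21 is "!", 20 is a space and 49 is "I", so "21 20 20 49"
# means two spaces follow the "!". A final "0d 0a" is carriage return plus
# line feed, i.e. a Windows line ending.
```

Invisible characters like these cannot be seen in a pasted post, which is why the byte dump is useful.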

samonl · February 6, 2018, 6:30am

Hi,

Thank you for trying. I have got to go and teach now though. I had to resave both files on this Mac using TextEdit with whatever the default encoding was (UTF-8, I think). I made the awk one with ancient dredged-up emacs muscle memories. Thanks. To be honest, the subject ID is good enough and it's not hard to get rid of a leading comma, which is why I marked it solved. The files (not big) are attached. I had to edit the original output to remove a potential identifier (hastily, after I had already posted it with the identifier included).

RudiC · February 6, 2018, 6:40am

I was talking about the data file, assuming you copied and pasted the command correctly... Take your time.

RudiC · February 6, 2018, 7:01am

OK, found the problem: There are TWO spaces between ! and Item in the file name line, not obvious in your post#1 as you didn't enclose the data in code tags. Adapting the regex, it seems to work:

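awk '
                {sub (/\r$/, _)                 # strip the Windows carriage return; _ is an unset variable, i.e. ""
                }
/^Item/         {print FN "," ID "," $0         # prefix each Item line with file name and ID
                }

/^Subject /     {ID = $NF                       # participant ID is the last field of the Subject line
                }

/! +Item/       {gsub (/^.*\\|>$/, _)           # drop the path up to the last backslash and the closing >
                 FN = $0                        # what remains is the bare file name
                }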

' /tmp/mozilla_coerdtr0/3pm5Version2copy.txt
Version2.rtf,ik60607,Item 11213, 1372.44
Version2.rtf,ik60607,Item 11213, 1052.90
Version2.rtf,ik60607,Item 114109, -1102.03
Version2.rtf,ik60607,Item 11131, 721.06
Version2.rtf,ik60607,Item 111325, 1075.30
.
.
.

This assumes that your awk recognizes the \r in the first regex and removes it. Should that fail, we need to rethink.
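
Should \r not be recognized, one possible fallback (a sketch, not tested against the real data) is to delete the carriage returns with tr before awk sees the lines. The function name label_items and the file glob in the usage comment are hypothetical. Because awk simply overwrites ID and FN whenever a new Subject or Item File line appears, the same pipeline can take all the separate files at once, provided every file contains both of those lines.

```shell
# label_items is a hypothetical helper name, not something from the thread.
# It accepts one or more DMDX output files as arguments.
label_items() {
    cat -- "$@" |
    tr -d '\r' |
    awk '
        /^Item/     { print FN "," ID "," $0 }        # label each Item line
        /^Subject / { ID = $NF }                      # participant ID: last field
        /! +Item/   { gsub(/^.*\\|>$/, ""); FN = $0 } # file name without path or closing >
    '
}

# Usage (the glob and output name are only an illustration):
# label_items *.txt > labelled.csv
```

Note that tr -d removes every carriage return, not just those at line ends, which should be harmless for text output like this.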

samonl · February 6, 2018, 5:19pm

It works. I am deeply grateful, from the bottom of my heart. Especially since half the time they have saved Version 2 as Version 8 etc., and this way there is some kind of record of what they actually ran, rather than relying on the filename.