Data Processing

I have the data below:

***************************************************
********************BEGINNING-1********************

directive url is : https://coursera-eu.mokar.com/directives/96df29ff-176a-35f7-8b1b-4ce483d15762


Src urls are :
https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547
https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

directive url is : https://coursera-eu.mokar.com/directives/05570fd8-563a-316a-9428-a60a6f404303


Src urls are :
https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547
https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

directive url is : https://coursera-eu.mokar.com/directives/dc70a6d8-6422-30e4-bc9f-680ff0911a10


Src urls are :
https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX828_CB529168623_.jpg : 11.00293
https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX1242_CB529168623_.jpg : 13.24707

I want it in the following format:

Directive Url,Src Url
https://coursera-eu.mokar.com/directives/96df29ff-176a-35f7-8b1b-4ce483d15762,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547 https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

https://coursera-eu.mokar.com/directives/05570fd8-563a-316a-9428-a60a6f404303,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547 https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

https://coursera-eu.mokar.com/directives/dc70a6d8-6422-30e4-bc9f-680ff0911a10i,https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX828_CB529168623_.jpg : 11.00293 https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX1242_CB529168623_.jpg : 13.24707

Please let me know how I can solve this problem.

With the suggestions that we have provided you on more than 50 other problems, we would hope that you have learned something from all of our previous help. What have you tried to solve this problem on your own?

Hi Don,

I tried the code below, but it did not work, so I turned to the forum for help:

paste -d, -s norm.txt |awk -F "directive url is : " '{print $2 $3 $4}' | awk -F ",,,Src urls are :," '{print $1 "," $2}' | awk -F ",," '{print $1}'

Below is the output I'm getting, which is incorrect:

https://coursera-eu.mokar.com/directives/96df29ff-176a-35f7-8b1b-4ce483d15762,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

Try

awk '
BEGIN           {print "Directive Url,Src Url"
                }
sub (/^directive url is : /, "") \
                {printf "%s%s", TRS, $0
                 TRS = ORS
                }
/^https/        {printf ", %s", $0
                }
END             {printf RS
                }
' file

Hi Rudi,

Thanks for the solution, but the output I'm getting is a bit different.

I don't want a comma (,) between the src urls. I want it only after the directive url, as there are only 2 columns.

Below is the output I wanted:

Directive Url,Src Url
https://coursera-eu.mokar.com/directives/96df29ff-176a-35f7-8b1b-4ce483d15762,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547 https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

https://coursera-eu.mokar.com/directives/05570fd8-563a-316a-9428-a60a6f404303,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547 https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

https://coursera-eu.mokar.com/directives/dc70a6d8-6422-30e4-bc9f-680ff0911a10i,https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX828_CB529168623_.jpg : 11.00293 https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX1242_CB529168623_.jpg : 13.24707

I'm glad I could (almost) help. For your required modifications, why don't you give it a try, with 168 posts and a six year membership?

Hi Rudi

I was able to do it with sed; I was just curious whether it was possible within the same awk script you shared. Anyway, thanks. My solution is below.

content.txt is my data file and process.sh is the script you shared:

sh process.sh content.txt | sed 's/,\([^,]*\)$/ \1/'

Thanks again for your help

You're joking, aren't you?

I'm confused...

First: You requested empty lines between output sections that do not seem to be provided by RudiC's suggestion. Did you want them or not?

Second: It looks to me like RudiC's suggestion would duplicate the Directive URL as a Src URL in each output section (which you did not seem to want). But, you have not indicated that that is a problem. Did you want the Directive URL to be output as both a Directive URL and as a Src URL or not?

And, third: The last Directive URL in your sample input is:

directive url is : https://coursera-eu.mokar.com/directives/dc70a6d8-6422-30e4-bc9f-680ff0911a10

but the output you say you want corresponding to that input is:

https://coursera-eu.mokar.com/directives/dc70a6d8-6422-30e4-bc9f-680ff0911a10i,...

Where did the i come from in that output?
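
To illustrate the second point: sub() edits $0 in place, so a pattern that appears later in the same awk program is tested against the already-modified line. The following is only a minimal sketch; the function name demo_dup and the example.com URL are made up for the demonstration and are not part of the original data.

```shell
# demo_dup is a hypothetical name; the input line uses a made-up URL.
demo_dup() {
    printf 'directive url is : https://example.com/d/1\n' |
    awk '
    # Rule 1 strips the prefix, which changes $0 for every rule that follows.
    sub(/^directive url is : /, "") { print "rule 1: " $0 }
    # Rule 2 is tested against the modified $0, so it matches the same line.
    /^https/                        { print "rule 2: " $0 }
    '
}
demo_dup
```

Both rules fire for the single input line, which is why a directive line can also end up in the Src URL list unless the first rule ends with next.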

Assuming that you did want empty lines between your output records, that you did not want the Directive URL to be included in the <space>-separated list of Src URL field entries, and that the extraneous i in the last line of your sample output was a typo, the following minor modification of RudiC's suggestion might be worth trying:

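awk '
BEGIN		{print "Directive Url,Src Url"	# CSV header
		}
sub (/^directive url is : /, "") \
		{printf "%s%s", TRS, $0		# TRS is empty before the first record
		 TRS = ORS ORS			# later records: end the line, then add an empty line
		 TFS = ","			# the first Src URL follows a comma
		 next				# keep /^https/ from matching the stripped line
		}
/^https/	{printf "%s%s", TFS, $0
		 TFS = " "			# later Src URLs follow a space
		}
END		{printf ORS			# terminate the last output line
		}
' "$1"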

If you invoke this script with the name of a file containing the sample input you provided in post #1 in this thread as its first operand, it produces the output:

Directive Url,Src Url
https://coursera-eu.mokar.com/directives/96df29ff-176a-35f7-8b1b-4ce483d15762,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547 https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

https://coursera-eu.mokar.com/directives/05570fd8-563a-316a-9428-a60a6f404303,https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX828_CB529193423_.jpg : 11.685547 https://images-eu.ssl-images-mokar.com/images/G/31/img17/PCA/Watches/Watchestrack/Ingress/1041299_watches_1242x150_3._SX1242_CB529193423_.jpg : 12.743164

https://coursera-eu.mokar.com/directives/dc70a6d8-6422-30e4-bc9f-680ff0911a10,https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX828_CB529168623_.jpg : 11.00293 https://images-eu.ssl-images-mokar.com/images/G/31/img16/app/sweeps/courses/Books-bunk-top_1242x150._SX1242_CB529168623_.jpg : 13.24707
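
For anyone who wants to try this on a smaller input first, here is a rough usage sketch. The wrapper name to_csv, the temporary file, and the example.com URLs are assumptions made up for the test; the awk program itself is the one shown above.

```shell
# to_csv is a hypothetical wrapper name around the awk program from this post.
to_csv() {
    awk '
    BEGIN { print "Directive Url,Src Url" }
    sub(/^directive url is : /, "") {
        printf "%s%s", TRS, $0
        TRS = ORS ORS
        TFS = ","
        next
    }
    /^https/ {
        printf "%s%s", TFS, $0
        TFS = " "
    }
    END { printf ORS }
    ' "$1"
}

# Build a small sample file with made-up URLs in the same layout as the real data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
directive url is : https://example.com/directives/a

Src urls are :
https://example.com/img/a1.jpg : 1.5
https://example.com/img/a2.jpg : 2.5

directive url is : https://example.com/directives/b

Src urls are :
https://example.com/img/b1.jpg : 3.5
EOF

to_csv "$sample"
rm -f "$sample"
```

With this input there should be one header line, then one line per directive with an empty line between them, and a comma only between the directive URL and its first Src URL.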

You could also try the following slightly different approach that produces exactly the same output as the above script:

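awk '
BEGIN {	print "Directive Url,Src Url"
}
# This rule comes first, so it sees directive lines before the prefix is
# removed and does not match them.
/^https/ {
	printf "%s%s", srccnt++ ? " " : ",", $0
}
sub(/^directive url is : /, "") {
	printf "%s%s", directivecnt++ ? ORS ORS : "", $0
	srccnt = 0	# the next Src URL is the first one for this directive
}
END {	printf ORS
}' "$1"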

As always, if someone wants to try either of these on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
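
If the same script has to run on both kinds of system, one possible way to pick the awk binary is sketched below. The variable name AWK and the uname check are my own assumptions, not something required by the scripts above.

```shell
# AWK is a hypothetical variable name; default to the awk found in PATH.
AWK=awk
if [ "$(uname -s)" = "SunOS" ]; then
    # On Solaris/SunOS, prefer the XPG4 awk and fall back to nawk.
    if [ -x /usr/xpg4/bin/awk ]; then
        AWK=/usr/xpg4/bin/awk
    else
        AWK=nawk
    fi
fi

# Quick sanity check that the chosen awk runs.
"$AWK" 'BEGIN { print "awk selected" }'
```

Either script above could then be started with "$AWK" in place of awk.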