Pattern Match and Rearrange the Fields in UNIX

arunkesi · October 16, 2015, 4:53am

I need to rearrange the fields of a line as shown below.

Input :

<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">

Output :

<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

I have been using something like the command below, but I am not getting the desired output:

awk -F ' ' '/Subject/ BEGIN{OFS=" ";} {print $1,$5,$3,$4,$2}' input_File > output_file

Could someone please help?

zaxxon · October 16, 2015, 5:01am

I don't know how generic this must be but you can try this:

$ awk '{sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"}' infile
<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

RavinderSingh13 · October 16, 2015, 5:10am

Hello arunkesi,

Could you please try the following and let me know if it helps? It assumes that you need to reverse all space-separated fields except the leading <Subject string.

awk '{for(i=NF;i>1;i--){sub(/>/,X,$i);A=A?A OFS $i:$i};print "<Subject " A ">";A=""}'  Input_file

Output will be as follows.

<Subject D="010101010101" C="2015-06-30" B="1039502" A="I">

Thanks,
R. Singh

zaxxon · October 16, 2015, 5:11am

Hi Ravinder,

If it is not an error by the OP, he wants the order to be "D B C A", not reversed.

arunkesi · October 16, 2015, 5:19am

Thanks Zaxxon. I want to change only the lines that start with <Subject; all other lines should be retained as they are in the original file.

RudiC · October 16, 2015, 5:48am

Adapt Zaxxon's proposal

$ awk '/subject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' infile

arunkesi · October 16, 2015, 6:03am

I have used the command below, but I am only getting the records that match the pattern; the lines that don't match the pattern are omitted.

awk '/subject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' infile > outfile

RudiC · October 16, 2015, 6:10am

Please use code tags as required by forum rules!

Difficult to believe. The default action for the "1" pattern is "print the line in $0 unconditionally".

BTW, the "subject" might be led in by an upper case "S".
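
A minimal sketch of both points. The file name sample.txt and its <Header> and <Footer> lines are made up for the demonstration; the bracket expression [Ss] is one possible way to accept either case of the leading letter.

```shell
# Hypothetical sample input: one <Subject ...> line between two other lines.
cat > sample.txt <<'EOF'
<Header>
<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">
<Footer>
EOF

# [Ss] accepts "Subject" or "subject"; the bare 1 prints every
# line that was not handled (and skipped via next) by the first rule.
out=$(awk '/^<[Ss]ubject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' sample.txt)
printf '%s\n' "$out"
```

With this input, all three lines should appear in the output, and only the <Subject line is rearranged.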

RavinderSingh13 · October 16, 2015, 7:11am

Thank you Zaxxon. I have tried to make it more generic. Maybe the following can help the OP.

awk --re-interval '{match($0,/[A-Z]=\"[0-9]{4}\-[0-9]{2}\-[0-9]{2}\"/);val3=substr($0,RSTART,RLENGTH)};{match($0,/[A-Z]=\"[0-1]{12}\"/);val2=substr($0,RSTART,RLENGTH);{match($0,/[A-Z]=\"[0-9]{7}\"/);val4=substr($0,RSTART,RLENGTH)};{match($0,/[A-Z]=\"[a-zA-Z]+\"/);val5=substr($0,RSTART,RLENGTH);print "<Subject " val2 OFS val4 OFS val3 OFS val5 ">"}}' Input_file

Output will be as follows.

<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

In the above solution I have assumed that each line of Input_file contains four attributes of these forms: [A-Z]="letters", [A-Z]="7 digits", [A-Z]="YYYY-MM-DD" (4 digits, 2 digits, 2 digits), and [A-Z]="12 characters, each 0 or 1". If every line of Input_file follows that syntax, the above solution may help the OP.

EDIT: Adding a non one-liner form of solution as follows.

awk --re-interval '{
    # date attribute, YYYY-MM-DD
    match($0, /[A-Z]=\"[0-9]{4}\-[0-9]{2}\-[0-9]{2}\"/)
    val3 = substr($0, RSTART, RLENGTH)
    # 12-character attribute, each character 0 or 1
    match($0, /[A-Z]=\"[0-1]{12}\"/)
    val2 = substr($0, RSTART, RLENGTH)
    # 7-digit attribute
    match($0, /[A-Z]=\"[0-9]{7}\"/)
    val4 = substr($0, RSTART, RLENGTH)
    # letters-only attribute
    match($0, /[A-Z]=\"[a-zA-Z]+\"/)
    val5 = substr($0, RSTART, RLENGTH)
    print "<Subject " val2 OFS val4 OFS val3 OFS val5 ">"
}' Input_file

Thanks,
R. Singh

Klasform · October 16, 2015, 12:37pm

I've come up with this using (GNU) sed.
I have just started learning the command and would appreciate input on efficiency and style!

sed -n -r 's/(<Subject )(.*)(.* )(.*)(.* )(.*)(.* )(.*)>/\1\8 \6 \4 \2>/p' testfile

<Subject D="010101010101" C="2015-06-30" B="1039502" A="I">

Thanks

RudiC · October 16, 2015, 1:24pm

.* stands for "any string", so .*.* is any string followed by any string, which matches the same text as a single .* . Your script can be simplified to

sed -n -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \4 \3 \2>/p' file

Note that it does not print the unmodified, i.e. unmatched, lines, which the OP requested.
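
A quick check of that equivalence, as a sketch. It uses -E in place of -r, on the assumption that the local sed accepts it (GNU sed and BSD sed both do), and compares the form with adjacent .* groups against a form with one group per field, where the separating spaces are kept outside the groups so they are not captured.

```shell
line='<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">'

# Form with redundant adjacent .* groups.
long=$(printf '%s\n' "$line" |
  sed -n -E 's/(<Subject )(.*)(.* )(.*)(.* )(.*)(.* )(.*)>/\1\8 \6 \4 \2>/p')

# One group per field; the separating spaces stay outside the groups.
short=$(printf '%s\n' "$line" |
  sed -n -E 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \4 \3 \2>/p')

printf '%s\n%s\n' "$long" "$short"
```

Both commands should print the same line, with the four attributes in reversed order.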

Don_Cragun · October 16, 2015, 2:23pm

The command:

sed -n -r 's/(<Subject )(.*)(.* )(.*)(.* )(.*)(.* )(.*)>/\1\8 \6 \4 \2>/p' testfile

should work because of greedy matching forcing the 1st .* to grab everything that the following .* could also grab. Stylistically, I generally avoid putting expressions in parentheses when I don't need the string matched by that expression in the replacement. With that in mind consider this simplification:

sed -n -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \4 \3 \2>/p' testfile

But, the requested output was:

<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

while the above sed scripts produce the output:

<Subject D="010101010101" C="2015-06-30" B="1039502" A="I">

The requested output could be achieved by rearranging the replacement string references:

sed -n -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \3 \4 \2>/p' testfile

And, of course, to print the lines that don't match the search pattern without changing them, get rid of the -n option (and of the p flag, so that matching lines are not printed twice):

sed -r 's/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \3 \4 \2>/' testfile

And, if you don't have GNU sed (with the -r option), it can be done with standard basic regular expressions with:

sed 's/\(<Subject \)\(.*\) \(.*\) \(.*\) \(.*\)>/\1\5 \3 \4 \2>/' testfile
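
How -n and the p flag interact can be checked with a small sketch. The three-line testfile below is invented for the demonstration, and -E is used in place of -r on the assumption that the local sed accepts it.

```shell
# Hypothetical three-line input.
cat > testfile <<'EOF'
<Header>
<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">
<Footer>
EOF

script='s/(<Subject )(.*) (.*) (.*) (.*)>/\1\5 \3 \4 \2>/'

# -n plus p: only the rewritten line is printed.
only_match=$(sed -n -E "${script}p" testfile)

# Neither -n nor p: every line is printed once, the Subject line rewritten.
all_lines=$(sed -E "$script" testfile)

# p without -n: the rewritten line is printed twice.
doubled=$(sed -E "${script}p" testfile)

printf '%s\n' "$all_lines"
```

The middle variant is the one that keeps the unmatched lines without duplicating the rewritten one.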

arunkesi · October 25, 2015, 1:13am

Input

<Subject Q="I" W="1039502" E="2015-06-30" R="010101010101">

Output

<Subject R="010101010101" W="1039502" E="2015-06-30" Q="I">

Code

awk '/subject/ {sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"; next} 1' infile > outfile

Thanks for all the help, this seems to be working fine. But can we tweak this awk one-liner so that it handles all the cases in the input? That is, the input does not have a fixed order for Q W E R (they can be in any position), but we need to find them and write the output in R W E Q order.

Thanks so much for all your help thus far

Scrutinizer · October 25, 2015, 2:03am

Try:

awk '
  /Subject/ {
    split($0,F,/(^|=)[^ ]*( |$)/)
    for(i in F) P[F[i]]=i
    sub(/>/,x)
    print $1,$P["R"],$P["W"],$P["E"],$P["Q"] ">"
    next
  }
  1
' file

Or using the order as a variable:

awk -v order="R W E Q" '
  BEGIN{
    split(order,O)
  } 
  /Subject/ {
    split($0,F,/(^|=)[^ ]*( |$)/)
    for(i in F) P[F[i]]=i
    sub(/>/,x)
    print $1,$P[O[1]],$P[O[2]],$P[O[3]],$P[O[4]] ">"
    next
  }
  1
' file

Don_Cragun · October 25, 2015, 2:41am

For the latest fixed order you have specified you could also try:

awk '
/^<[Ss]ubject/ {
	sub(/>$/, "", $NF)
	for(i = 2; i <= NF; i++)
		d[substr($i, 1, 1)] = $i
	$0 = $1 OFS d["R"] OFS d["W"] OFS d["E"] OFS d["Q"] ">"
}
1' infile > outfile
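
A quick way to try this on a shuffled line, as a sketch. The input line below is invented, with the attributes deliberately out of order, and the comments are added for illustration.

```shell
# Hypothetical input with the attributes in a shuffled order.
line='<Subject E="2015-06-30" Q="I" R="010101010101" W="1039502">'

out=$(printf '%s\n' "$line" | awk '
/^<[Ss]ubject/ {
    sub(/>$/, "", $NF)                # drop the closing ">" from the last field
    for(i = 2; i <= NF; i++)
        d[substr($i, 1, 1)] = $i      # index each attribute by its one-letter name
    $0 = $1 OFS d["R"] OFS d["W"] OFS d["E"] OFS d["Q"] ">"
}
1')
printf '%s\n' "$out"
```

Because each attribute is looked up by its first letter, the position of Q, W, E and R in the input should not matter.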

arunkesi · October 25, 2015, 3:28am

Hi...thanks for your inputs but seem to be getting an awk field error when the pattern matches...

Don_Cragun · October 25, 2015, 3:41am

You have suggestions from me and from Scrutinizer since your post #13 in this thread. Which awk script of those three awk scripts is giving you an error?

What EXACTLY is the error?

What EXACTLY was the input line awk was processing when it reported the error?

looney · October 25, 2015, 2:36pm

Hi Zaxxon,
in code

awk '{sub(/>/,"",$NF); print $1,$5,$3,$4,$2">"}' infile
<Subject D="010101010101" B="1039502" C="2015-06-30" A="I">

Please explain the sub(/>/,"",$NF) part.
Thanks,

Don_Cragun · October 25, 2015, 3:30pm

The awk command sub(/>/, "", $NF) changes (or substitutes) the string matching the extended regular expression > (which matches a literal greater than sign) to an empty string (specified by "" ), in the last field (specified by $NF ) on the line. Or, simply stated, it removes the trailing greater than sign from the end of that input line.
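
A small sketch of that effect, printing the last field before and after the substitution (the variable names are only for illustration):

```shell
line='<Subject A="I" B="1039502" C="2015-06-30" D="010101010101">'

# Before the substitution the last field still ends in ">".
before=$(printf '%s\n' "$line" | awk '{print $NF}')

# sub() edits $NF in place, so the last field loses its ">".
after=$(printf '%s\n' "$line" | awk '{sub(/>/, "", $NF); print $NF}')

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```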

looney · October 26, 2015, 12:36am

Thanks Mr Don.