Grep Data Based on Header

pareshkp · February 26, 2016, 12:27pm

Hi guys,

File A.txt

UID,HD1,HD2,HD3,HD4
1,2,33,44,55
2,10,14,15,16

File B.txt

UID
HD1
HD4

Desired Output.txt, built from A.txt and B.txt:

UID,HD1,HD4
1,2,55
2,10,16

shamrock · February 26, 2016, 2:06pm

What specifically are you looking to do and what have you tried to solve this...

pareshkp · February 26, 2016, 4:03pm

Just copy data from one file to another based on header name.

shamrock · February 26, 2016, 6:34pm

And what have you tried to solve it on your own...

pareshkp · February 26, 2016, 8:36pm

I have tried the following:

$ cols=($(sed '1!d;s/, /\n/g' $A | grep -nf $B | sed 's/:.*$//'))

$ cut -d ',' -f 1$(printf ",%s" "${cols[@]}") $A

Don_Cragun · February 26, 2016, 9:26pm

You weren't far off. Your field separator is a comma, not a comma followed by a space. So, if you change your first sed from:

sed '1!d;s/, /\n/g' $A

to:

sed '1!d;s/,/\n/g' $A

as in:

#!/bin/ksh
A=A.txt
B=B.txt
cols=($(sed '1!d;s/,/\n/g' "$A" | grep -nf "$B" | sed 's/:.*$//'))
cut -d ',' -f 1$(printf ",%s" "${cols[@]}") "$A"

it seems to do what you want. You might also want to consider:

#!/bin/ksh
A='A.txt'
B='B.txt'
awk '
BEGIN {	FS = OFS = ","
}
FNR == NR {
	h[++hc] = $0
	next
}
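FNR == 1 {
	# For each requested header, find its column number in this file.
	for(i = 1; i <= hc; i++) {
		for(j = 1; j <= NF; j++) {
			if($j == h[i]) {
				o[i] = j
				break
			}
		}
		if(j > NF) {
			printf("Header \"%s\" not found in file \"%s\".\n",
			    h[i], FILENAME)
			exit 1
		}
	}
}
{	# Print the selected columns in the order given in the header list.
	for(i = 1; i <= hc; i++)
		printf("%s%s", $o[i], (i == hc) ? ORS : OFS)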
}' "$B" "$A"

which invokes awk once instead of invoking sed twice, grep once, and cut once; so it should run a bit faster.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
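
One caveat with the sed/grep/cut pipeline above: grep -f treats each line of B.txt as a pattern that can match anywhere in a line, so a header named HD1 would also pick up a column named HD10 (a hypothetical name, not in the sample data). A minimal sketch of a stricter variant follows. The function names header_cols and select_cols are made up for illustration, and it uses head and tr in place of the first sed so it does not depend on sed understanding \n in a replacement:

```shell
# Hypothetical helper: print the 1-based positions of the columns in CSV
# file $1 whose header names match a whole line of file $2 exactly.
header_cols() {
    head -n 1 "$1" |        # header line only
        tr ',' '\n' |       # one header name per line
        grep -nxF -f "$2" | # -x: whole-line match, -F: fixed strings, -n: line numbers
        cut -d ':' -f 1     # keep just the line (column) numbers
}

# Hypothetical helper: print only the selected columns of CSV file $1.
select_cols() {
    cut -d ',' -f "$(header_cols "$1" "$2" | paste -sd ',' -)" "$1"
}
```

Called as select_cols A.txt B.txt. Note that cut always prints columns in the order they appear in A.txt, not the order listed in B.txt, and that it fails with a usage error if no header matches.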

danmero · February 27, 2016, 10:13am

# cat A.txt
UID,HD1,HD2,HD3,HD4
1,2,33,44,55
2,10,14,15,16
# cat B.txt
UID
HD1
HD4

Solution

# awk 'FNR==NR{h[$0]=$0;next}FNR==1{for(;i++<NF;){if($i==h[$i]){o[l++]=i}}}{for(_ in o)printf "%s%s",$o[_],(_==(l-1))?RS:FS}' FS=\, B.txt A.txt
UID,HD1,HD4
1,2,55
2,10,16

RudiC · February 27, 2016, 11:05am

That doesn't preserve the desired sequence of columns for longer lists. man awk :

Small adaption:

awk '
BEGIN           {OFS = FS = ","}
FNR == NR       {h[$0] = NR
                 MX = NR
                 next
                }
FNR == 1        {for (i=1; i<=NF; i++) if ($i in h) a[h[$i]] = i
                }

                {for (i=1; i<=MX; i++) printf "%s%s", $a[i], (i==MX)?RS:OFS
                }
' file2 file1

danmero · February 27, 2016, 11:35am

https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html

Based on my experience, awk uses incremental traversal by default; maybe I'm wrong.

Don_Cragun · February 27, 2016, 12:41pm

You're wrong. RudiC is correct. As an example, on OS X version 10.11.3 the command:

printf '%s\n' 1 2 3|awk '{a[$1]}END{for(i in a)print i}'

produces the output:

2
3
1

Using a different version of awk should produce the same lines of output, but the order in which they are printed is explicitly left unspecified by the standards.
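
When the order matters, the usual way around this is to store the values under consecutive numeric subscripts and walk them with a counting loop instead of for (i in a). A minimal sketch (the function name print_in_order is hypothetical):

```shell
# Hypothetical helper: echo stdin lines back in input order using an awk array.
print_in_order() {
    awk '
    { a[NR] = $0 }              # key each line by its record number
    END {
        for (i = 1; i <= NR; i++)   # counting loop: order is fully defined
            print a[i]
    }'
}

printf '%s\n' 1 2 3 | print_in_order
```

This is the same idea the scripts in this thread use when they loop from 1 up to a stored count.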

pareshkp · March 14, 2016, 11:15am

Thanks Guys.

Perfect ....

But it gets stuck when a header is not found in file2.txt...

Don_Cragun · March 14, 2016, 3:15pm

Does the awk script I suggested in post #5 in this thread get "Stuck" in that case; or does it print a diagnostic message and exit?
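
For comparison, here is a minimal sketch of a variant that warns about a missing header and carries on with the remaining columns instead of exiting. This assumes that skipping is the behaviour wanted, which the thread does not say. The function name pick_cols is hypothetical, and the warning is written to /dev/stderr, which gawk, mawk, and the BSD awk handle but which is not guaranteed for every awk:

```shell
# Hypothetical helper; usage: pick_cols headers.txt data.csv
pick_cols() {
    awk '
    BEGIN { FS = OFS = "," }
    FNR == NR {                     # first file: requested headers, in order
        want[++n] = $0
        next
    }
    FNR == 1 {                      # header line of the data file
        for (j = 1; j <= NF; j++)
            pos[$j] = j             # header name -> column number
        for (i = 1; i <= n; i++) {
            if (want[i] in pos)
                col[++m] = pos[want[i]]
            else
                printf("Header \"%s\" not found in file \"%s\"; skipped.\n",
                    want[i], FILENAME) > "/dev/stderr"
        }
        if (m == 0)
            exit 1                  # nothing left to print
    }
    {   # print the columns that were found, in the requested order
        for (i = 1; i <= m; i++)
            printf("%s%s", $col[i], (i == m) ? ORS : OFS)
    }' "$1" "$2"
}
```

Like the other two-file awk scripts here, it expects a non-empty header list as the first file and the data file as the second.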