Extract columns from a file if the name dont exist put blank

Hi,

I am very new to Unix script. Suppose i have a file with column header:
NAME1 NAME2 Address Tel

And I always need to make a file with column header:
ID NAME1 NAME2 EMail Address Tel

For the columns that do not exist in the file, I would still like to make a column with blank.

How do I do this in UNix script?

Assuming you are using TAB as field separator:

 awk -F"\t" -vOFS="\t" '{$6=$4;$5=$3;$4=(NR==1)?"EMail":"";$3=$2;$2=$1;$1=(NR==1)?"ID":""}1' file

Since you haven't mentioned, I will assume that it's a fixed-format file -

$
$
$ cat f0
NAME1     NAME2     ADDRESS   TEL
name001   name002   addr001   tel001
name011   name022   addr011   tel011
name111   name222   addr111   tel111
$
$
$ awk 'NR==1{printf("%-10s%-10s%-10s%-10s%-10s%-s\n","ID",$1,$2,"EMAIL",$3,$4)}
       NR>1 {printf("%-10s%-10s%-10s%-10s%-10s%-s\n"," ",$1,$2," ",$3,$4)}
      ' f0
ID        NAME1     NAME2     EMAIL     ADDRESS   TEL
          name001   name002             addr001   tel001
          name011   name022             addr011   tel011
          name111   name222             addr111   tel111
$
$
$

tyler_durden

If the fields are separated by spaces:

awk 'NR==1{$1="ID" FS $1; $3=$3 FS "EMail"}{$1=FS $1; $3=$3 FS}1' file

This is what I have got so far.

abc.txt:
NAME1 NAME2 ADDRESS TEL
A B ADD 1234
 
awk -F"\t" '
NR==1{
for(i=1;i<=NF;i++)
if($i~/ID/)f[n]=i
n++
for(i=1;i<=NF;i++)
if($i~/NAME1/)f[n]=i
n++
for(i=1;i<=NF;i++)
if($i~/NAME2/)f[n]=i
n++
for(i=1;i<=NF;i++)
if($i~/EMAIL/)f[n]=i
n++
for(i=1;i<=NF;i++)
if($i~/ADDRESS/)f[n]=i
n++
for(i=1;i<=NF;i++)
if($i~/TEL/)f[n]=i
n++
}
{
if (NR==1)
print "ID NAME1 NAME2 EMAIL ADDRESS TEL"; 
}
{
if (NR > 1) {
for(i=0;i<n;i++)
printf"%s%s",i?" ":"",$f;
print""
}
}
' abc.txt

But I keep getting the error "Field is not correct"
I think the error is due to the ID AND EMAIL column not found (by searching for the column name) in abc.txt, hence f[n] evaluates to blank. When the code tries to do printf with $<blank>, it gives this error.

I am almost there, just need a tip how i can handle printf $<blank> so that it prints a blank column rather than errored out.

As i said, i am very new to unix scripting :slight_smile:

---------- Post updated 10-05-10 at 08:03 AM ---------- Previous update was 10-04-10 at 06:24 PM ----------

Hi experts,

Anyone can help me with this please?
Potentially this implementation can help save any manual work that i am doing. That means less Overtime for me and more time with my family.

Thanks

Your script doesn't have to be so convoluted.

$
$
$ cat abc.txt
NAME1   NAME2   ADDRESS TEL
A       B       ADD     1234
$
$
$ awk 'BEGIN{IFS="\t"; OFS="\t"} NR==1{$1="ID" IFS $1; $3="EMAIL" IFS $3} NR>1{$1=IFS $1; $3=IFS $3}1' abc.txt
ID      NAME1   NAME2   EMAIL   ADDRESS TEL
       A       B               ADD     1234
$
$
$

To check that those are TAB characters and not blank spaces -

$
$ # -e option displays "$" at end of line; -t option displays TABs as ^I
$ cat -et abc.txt
NAME1^INAME2^IADDRESS^ITEL$
A^IB^IADD^I1234$
$
$
$ awk 'BEGIN{IFS="\t"; OFS="\t"} NR==1{$1="ID" IFS $1; $3="EMAIL" IFS $3} NR>1{$1=IFS $1; $3=IFS $3}1' abc.txt | cat -et
ID^INAME1^INAME2^IEMAIL^IADDRESS^ITEL$
^IA^IB^I^IADD^I1234$
$
$
$

tyler_durden

Hi Tyler,
Appreciate your prompt reply.
What if I don't know beforehand the columns that were missing? Will your code still work?

I was using for loop in my earlier posting to perform a column header name matching to try to find out 1) the missing column ( as this is not fixed)
2) the position of the column if it matches

I thought this is the way to go? U think u can help me?

No, it won't. The posted code -
(a) adds an "ID" column in front of first column
(b) adds an "Email" column right before the 3rd column

Since it performs simple substitution and does not check for column headers specifically, you'd have two "ID" columns if your first column actually happens to be an "ID" column !

The fact that you do not know what all columns could be missing is most significant and it makes your problem much more difficult.

Here's what I think -
(a) Firstly, you will have to know your "desired" column headers. You won't know what's missing unless you have a complete list for comparison.

(b) Secondly, there's a question about "gaps" as well as "order". Let's say your desired column header is "c1,c2,c3,c4,c5" and your file has "c1,c3,c5". The missing columns over here are "c2,c4", but your file still has columns in order.

Now, how do you want to handle the case when your file has columns "c4,c3,c1" ? There are missing columns, but the existing columns are out of order as well. Is this possible at all ? Do you want to change that to your "desired" order - i.e. fill the gaps as well as reorder ?

tyler_durden

Yes Tyler. There will be a desired coLumn header. And the script will have to fix gaps as well as order.

I don't know how to do this in a single unix script, is this even possible?

If not then I am thinking to use a macro tool to process the header, compare with the desired column header and order, then generate an awk command that will print the columns, like this:
1) macro read the column header and position
2) macro compare this with the desired column
3) macro generate awk print statement, ordering the position returned in (1) and where there is a gap, macro will generate print "\t"
4) pass the generated awk statement to unix to execute

Possible to do this entirely in unix?

Yes it's possible.

$ 
$ cat rearrange.awk
BEGIN {
  IFS = "\t"; OFS = "\t"                                               # set the field separators
  num = split("ID,NAME1,NAME2,EMAIL,ADDRESS,TEL",main,",")             # set the "main" array
}
{
  if (NR==1) {                                                         # if we are at the first line
    for (i=1; i<=NF; i++) {                                            # then loop through the headers
      for (j=1; j<=num; j++) {                                         # and search for them in main array
        if ($i == main[j]) {map = j; break}                         # if match found then
      }                                                                # set "map" array and break
    }
  for (j=1; j<=num; j++) printf j==1 ? main[j] : OFS main[j]           # print headers in correct order
  printf "\n"
  } else {                                                             # for line number > 1
    for (i=1; i<=NF; i++) template[map] = $i                        # loop through fields; set "template"
    for (j=1; j<=num; j++) printf j==1 ? template[j] : OFS template[j] # and simply print template
    printf "\n"
  }
}
$ 
$ cat abc.txt
NAME1    NAME2    ADDRESS    TEL
A    B    ADD1    1234
X    Y    ADD2    5678
$ 
$ 
$ awk -f rearrange.awk abc.txt
ID    NAME1    NAME2    EMAIL    ADDRESS    TEL
    A    B        ADD1    1234
    X    Y        ADD2    5678
$ 
$ 
$ cat abc1.txt
NAME1    ADDRESS
A    ADD1
X    ADD2
$ 
$ awk -f rearrange.awk abc1.txt
ID    NAME1    NAME2    EMAIL    ADDRESS    TEL
    A            ADD1    
    X            ADD2    
$ 
$ 
$ cat abc2.txt
ADDRESS    NAME2    TEL    NAME1
ADD1    B    1234    A
ADD2    Y    5678    X
$ 
$ 
$ awk -f rearrange.awk abc2.txt
ID    NAME1    NAME2    EMAIL    ADDRESS    TEL
    A    B        ADD1    1234
    X    Y        ADD2    5678
$ 
$ 

Sorry, my output data isn't aligned because the TABs are lost when I copy from my terminal emulator and paste it over here. And I am too lazy to add the TABs manually.

tyler_durden

Yoz Tyler, no problem with the TAB. Cant thank you enuf for the code already, how can i fuss over the formatting. :stuck_out_tongue:

Anyway, I am executing your code, it seems like my system not able to parse these 2 lines correctly. Any idea?

printf j==1 ? main[j] : OFS main[j]
printf j==1 ? template[j] : OFS template[j]

Cheers

---------- Post updated at 09:55 AM ---------- Previous update was at 12:24 AM ----------

Hi Tyler,

I managed to work around this error by using if-else statement:

if (j==1)
printf main[j]
else
printf OFS main[j]

So the statement executed and printed OK now!
However, if abc.txt contains record with spaces, then it won't print correctly.

For example:

Field "USA SpringField" will print as "USA". I have no background in Unix but I think it is to be related to the printf statement.

Is it something i need to "tune" the printf to get it correctly? Let me google it in the meantime.

Thanks

Replace:
IFS="\t"
with
FS="\t"