Find columns in a file based on header and print to new file

Hello,

I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need,

# file with header names of required columns in required order
headers_file=$2

# read contents of headers_file into array
readarray -t headers_list < "$headers_file"

and then loop through the list to pick out each column I need,

# loop on header list
for header_value in "${headers_list[@]}"
do
   # print the current header value
   echo "$header_value"

   # look for the column in the input file
   awk -v OFS='\t' -v column_header="$header_value" 'NR==1{for (i=1; i<=NF; i++) if ($i==column_header){p=i; break}; next} {print $p}' $input_file > $output_file

done

The above awk does not work, and even if it did, each pass through the loop would overwrite the output from the previous column. How do I find all the columns I need and then print them, in the right order, so they all end up in the output file?

The only thing I could think of was to read the header line from $input_file into another array and then loop through $headers_list making a note of the numerical position of the columns I need. In theory, I could use the list of numerical positions to cobble together a cut argument to get the columns I need. That seems like it would be horribly messy syntax and could probably be done with one line of awk from someone who knows what they are doing.
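In rough, untested form, I imagine it would look something like this (one problem I can already see is that cut prints fields in file order no matter what order they are given to -f, so this wouldn't even preserve my required column order):

# untested sketch: read reads just the first line, i.e. the header row
read -r -a file_header < "$input_file"

# map each required header to its 1-based field position
positions=()
for header_value in "${headers_list[@]}"; do
    for i in "${!file_header[@]}"; do
        [ "${file_header[$i]}" = "$header_value" ] && positions+=( $((i + 1)) )
    done
done

# join the positions into a cut field list, e.g. 3,7,1
field_list=$(IFS=,; echo "${positions[*]}")

# caveat: cut writes the selected fields in input order, so the
# columns come out in file order rather than headers_file order
cut -d' ' -f"$field_list" "$input_file" > "$output_file"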

That means it's time to post and ask for help. I found a lot of topics like this one, but most of them only seemed to find one column by its header value and print it.

In case it makes a difference, the input files I am working with have < 200 columns but may have almost any number of rows. The input file is space delimited and the output should be tab delimited, though I could replace space with tab after the fact if necessary.
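For that after-the-fact replacement, I assume something as simple as this would do (file names here are just placeholders):

tr ' ' '\t' < spaced_output.txt > tabbed_output.txt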

Suggestions would be greatly appreciated.

LMHmedchem

As no sample files were provided, I made some assumptions about their contents. To put you on track:
headers file

$ cat headers 
H3
H4
H1
H2

input_file

$ cat input_file 
H1 H2 H3 H4
01 02 03 04
11 12 13 14
21 22 23 24
31 32 33 34
$ awk 'NR==FNR{a[$0]=NR;next}{for (i in a) printf "%s ", $a[i];print ""}' headers input_file

Output

H3 H4 H1 H2 
03 04 01 02 
13 14 11 12 
23 24 21 22 
33 34 31 32

Note: The order of for (i in a) is arbitrary, so it cannot reliably be used to preserve order. An alternative would be to use a for(i=min;i<=max;i++) loop,
for example:

awk 'NR==FNR{A[$1]=++c; next} {s=""; for (i=1; i<=c; i++) {if(FNR==1) P[i]=A[$i]; s=s $(P[i]) OFS} print s}' headers input_file

My bad, am I that rusty at awk? Nice catch.

Hi.

From my perspective, this is a csv manipulation problem. Consequently, a simple csv-aware tool seems appropriate. The dataset is transformed to csv format, the header-name lines are collected into a csv-style string, the named columns are extracted, and the result is converted from csv format to the TAB-separated format the OP required.

Collecting all of this together in a script, using the dataset from ripat:

#!/usr/bin/env bash

# @(#) s1       Demonstrate extraction of fields, csvtool.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf sed pass-fail
pe
dixf csvtool

FILE=${1-data1}
E=expected-output.txt
H=headers

pl " Input data file $FILE:"
head $FILE

pl " Input file converted to csv:"
sed -r 's/\s+/,/g' $FILE |
tee t1

pl " Header name file $H, name list:"
head $H
h1=$( paste -s -d, $H )
pe " Header list = $h1"

pl " Expected output:"
cat $E

pl " Results, extract columns, convert csv to TAB-spaced:"
csvtool namedcol $h1 t1 |
tee t2 |
sed -r 's/,/\t/g' |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
dixf (local) 1.21
sed (GNU sed) 4.2.2
pass-fail (local) 1.9

csvtool tool for performing manipulations on CSV files from sh... (man)
Path    : /usr/bin/csvtool
Version : - ( /usr/bin/csvtool, 2014-08-06 )
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with --help
Home    : https://github.com/Chris00/ocaml-csv

-----
 Input data file data1:
H1 H2 H3 H4
01 02 03 04
11 12 13 14
21 22 23 24
31 32 33 34

-----
 Input file converted to csv:
H1,H2,H3,H4
01,02,03,04
11,12,13,14
21,22,23,24
31,32,33,34

-----
 Header name file headers, name list:
H3
H4
H1
H2
 Header list = H3,H4,H1,H2

-----
 Expected output:
H3      H4      H1      H2
03      04      01      02
13      14      11      12
23      24      21      22
33      34      31      32

-----
 Results, extract columns, convert csv to TAB-spaced:
H3      H4      H1      H2
03      04      01      02
13      14      11      12
23      24      21      22
33      34      31      32

-----
 Verify results if possible:

-----
 Comparison of 5 created lines with 5 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.

The command csvtool can be found in the Debian repository or at github as noted.

@LMHmedchem: with 300 posts, you should know that posting data samples, expected output, and your computing environment will help make replies easier and more likely to be applicable to your situation. Please do that in your future posts.

Best wishes ... cheers, drl


Thank you all for the replies.

I can't seem to get the above working.

Here is some data; sorry this is hard to read, but I thought it best to leave it in its original single-space-delimited format.

name col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8
name,1 2 1 1 0 1 0 11.75 9.6154
name,2 7 0 0 0 1 0 12.7917 8.6310
name,3 4 1 1 0.6 1 0 18.2769 4.6420
name,4 6 1 1 0 1 0 16.1389 7.7778
name,5 2 2 3 0.833333 1 0 21.5342 4.2924

headers_file,

col_1
col_6
col_3
col_4
col_8

desired output (in most cases, some columns in the original input will not be in the output)

name	col_1	col_6	col_3	col_4	col_8
name,1	2	0	1	0	9.6154
name,2	7	0	0	0	8.6310
name,3	4	0	1	0.6	4.6420
name,4	6	0	1	0	7.7778
name,5	2	0	3	0.8333	4.2924

When I run the script above, I get:

col_1 col_1 col_1 col_1 col_1 col_1 
col_6 col_6 col_6 col_6 col_6 col_6 
col_3 col_3 col_3 col_3 col_3 col_3 
col_4 col_4 col_4 col_4 col_4 col_4 
col_8 col_8 col_8 col_8 col_8 col_8 

The code suggestion posted by ripat has a similar issue, but I haven't posted the results here because of Scrutinizer's comment about the order of the output.

I certainly should have included an example with my post, sorry about that. I am currently running this under cygwin 2.3.1 but this will also run on openSuse 13.2 x86_64.

I know that the term csv is sometimes used to refer to generic delimited text data and not just comma-separated data. I stay away from comma separation because many of the fields I use (chemical names) contain commas ( 1,1,4,4-tetrabutylpiperazine ). The values in the name column can also contain unmatched single quotes ( N,N,N',N'-tetramethylguanidine ) or parentheses ( 1-(2-aminoethyl)piperazine ). I think code that replaces space with comma would be problematic in my particular case. Yet another reason why an example of real data would have been useful for me to post.
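For example, a blanket space-to-comma substitution like sed 's/ /,/g' would turn a row such as

name,1 1,1,4,4-tetrabutylpiperazine

into

name,1,1,1,4,4-tetrabutylpiperazine

and the boundary between the two fields is no longer recoverable.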

LMHmedchem

This works if the name column is added to the headers file:

awk '
# first file (headers): map each required header name to its output position
NR == FNR       {T[$1] = NR
                 next
                }
# header line of the data file: MX = number of required columns (lines in
# the headers file); for each input column whose name is required, store
# its field number under its output position
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i
                }
# every line of the data file (header line included): print the required
# fields in order, TAB-separated, ending each line with RS (newline)
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"
                }
' file2 file1
name	col_1	col_6	col_3	col_4	col_8	
name,1	2	0	1	0	9.6154	
name,2	7	0	0	0	8.6310	
name,3	4	0	1	0.6	4.6420	
name,4	6	0	1	0	7.7778	
name,5	2	0	3	0.833333	4.2924

Thanks, that seems to work just fine.

For clarity, in your script above, "file2" is the file with the headers and "file1" is the file with the data.

I used it like,

#!/bin/bash

# get_cols_by_header.sh 

# file with header names of required columns in required order
headers_file=$1

# space delimited file with data in columns
input_file=$2

# tab delimited output goes here
output_file='./temp_output.txt'

# use the column headers listed in headers_file to locate the columns
# in input_file and print them to output_file
awk '
NR == FNR       {T[$1] = NR
                 next
                }
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i
                }
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"
                }
' "$headers_file" "$input_file" > "$output_file"

To run the script,

./get_cols_by_header.sh  header_file.txt  input_file.txt

Time to extract and print 48 columns from an input file with 73 columns and 2500 rows,

real    0m0.141s
user    0m0.170s
sys     0m0.030s

so this is pretty fast.

Are there any particular size limitations on this method as far as the size of the input file, number of rows and cols, etc?

LMHmedchem

I couldn't guess at the number of cols - this depends on your hardware situation (memory, swap) and possibly on system configuration (LINE_MAX? not sure...). Input file size or row count shouldn't matter, as only the small header-position arrays are kept in memory, and there is always just one line operated upon and then immediately printed.

Why don't you run it like

awk '...' $1 $2 ./temp_output.txt

saving all the interim variables that are used just once?

I never thought of doing that, but it's a nice idea. Do you not need a redirect to the output file?

I am still not sure if this will be a standalone script or part of a larger script, so I'm not yet sure where the values of the two input files and the output file will be coming from. For now, I wrote them out as bash variables, partly because $header_file will make more sense to me than $1 when I look at this next week.

It's good to know about the resource limits. In theory, there could be millions of rows, but there will never be all that many columns (<100). I will test on some larger files and report back if I run into any limitations.
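To build a larger test file, I will probably use a throwaway generator along these lines (hypothetical layout: 1,000,000 rows, a name column plus 99 data columns, single space delimited like my real data):

awk 'BEGIN{
    printf "name"; for(c=1;c<=99;c++) printf " col_%d", c; print ""
    for(r=1;r<=1000000;r++){
        printf "name,%d", r
        for(c=1;c<=99;c++) printf " %d", c
        print ""
    }
}' > big_input.txt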

LMHmedchem

Sorry, a lapse on my part - you of course need a redirection operator for the output.
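That is,

awk '...' $1 $2 > ./temp_output.txt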

Hi.

This issue seems to be resolved, but for completeness on the sub-topic of csvtool, there are options for specifying the input and output field separators (see -t and -u below). I should have presented this version first:

#!/usr/bin/env bash

# @(#) s2       Demonstrate extraction of fields, csvtool.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf sed pass-fail
pe
dixf csvtool

FILE=${1-data1}
E=expected-output.txt
H=headers

pl " Input data file $FILE:"
head $FILE

pl " Header name file $H, name list:"
head $H
h1=$( paste -s -d, $H )
pe " Header list = $h1"

pl " Expected output:"
cat $E

pl " Results, extract spaced-separated fields, write TAB-spaced:"
csvtool namedcol $h1 -t " " -u TAB $FILE |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

exit 0

produces the same output as in post #5.

In addition, there are facilities, via perl modules, that handle problematic embedded characters in field strings of csv-formatted files. For example, a small script (about 15 lines of working code, surrounded by 100 lines of supporting code for error detection, argument processing, etc., plus comments) adds quotes to fields, as in:

#!/usr/bin/env bash

# @(#) s3       Demonstrate short perl code to add quotes to csv files.

LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf
pe
dixf csvaq

FILE=${1-data3}

p=$( basename $0 ) t1="$Revision: 1.14 $" v=${t1//[!0-9.]/}
[[ $# -gt 0 ]] && [[ "$1" =~ -version ]] &&  { echo "$p (local) $v" ; exit 0 ; }

pl " Input data file $FILE:"
head $FILE

pl " Results, always add quotes:"
csvaq -q -t " " -u " " $FILE

exit 0

producing:

$ ./s3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
dixf (local) 1.21

csvaq   Add quotes for Comma Separated Value format files. (what)
Path    : ~/bin/csvaq
Version : 1.3
Length  : 165 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Help    : probably available with [     ]-h,--help
Modules : (for perl codes)
 warnings       1.23
 strict 1.08
 diagnostics    1.34
 English        1.09
 Carp   1.3301
 Data::Dumper   2.151_01
 Getopt::Long   2.42
 feature        1.36_01
 experimental   0.007
 Text::CSV      1.32

-----
 Input data file data3:
cn0 name
cn,1 1,1,4,4-tetrabutylpiperazine
cn,2 N,N,N',N'-tetramethylguanidine
cn,3 1-(2-aminoethyl)piperazine

-----
 Results, always add quotes:
"cn0" "name"
"cn,1" "1,1,4,4-tetrabutylpiperazine"
"cn,2" "N,N,N',N'-tetramethylguanidine"
"cn,3" "1-(2-aminoethyl)piperazine"

Best wishes ... cheers, drl