Using GAWK to combine files

paragkalra · November 13, 2009, 1:24pm

Hello All,

I have a folder containing a few files. Each file contains only one column.

I want to combine that single column from all the files using GAWK, separate the values with a delimiter, and store the result in a new file.

So basically, using GAWK, I want to combine '$1' of all files, separate them by a delimiter and save the result in a file:
i.e. '$1:$1:$1:$1:$1.....'

otheus · November 13, 2009, 2:03pm

This should work:

{ tr '\n' ':'  * ; echo ; } >../output

danmero · November 13, 2009, 2:30pm

I'd suggest paste (merge corresponding or subsequent lines of files):

paste -d\: * > newfile

paragkalra · November 13, 2009, 2:58pm

tr is throwing an error:

And paste keeps the newline characters of the original rows, so the rows do not appear on a single line. I would like to remove the newline characters from every row of every column except the column from the last file.
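
A note on the tr error: tr reads only standard input and does not accept file names as arguments, so passing it * is a likely cause. The following is a minimal sketch, assuming made-up sample files in a scratch directory (the names file1 and file2 and their contents are hypothetical), that feeds the files to tr through cat instead:

```shell
# Hypothetical sample data in a scratch directory (not from the thread).
dir=$(mktemp -d)
printf '1\n2\n' > "$dir/file1"
printf '3\n4\n' > "$dir/file2"

# tr takes no file operands, so feed it the concatenated files on stdin.
joined=$(cat "$dir"/file* | tr '\n' ':')
joined=${joined%:}   # drop the trailing ':' produced by the final newline

echo "$joined"
rm -r "$dir"
```

This joins every line of every file onto one line, which is only the wanted result when each file holds a single row.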

Franklin52 · November 13, 2009, 3:27pm

awk '{printf("%s%s", NR==1?"":":", $0)}END{print ""}' * > newfile

danmero · November 13, 2009, 3:27pm

Hmm, here is my test

# ls
file1   file2   file3   file4   file5   file6   file7   file8   file9
# cat *
1
2
3
4
5
6
7
8
9
# paste -d\: *
1:2:3:4:5:6:7:8:9

Can you post a data sample?

paragkalra · November 13, 2009, 11:53pm

I think I left out one piece of information.

Although every file has only one column, that column may have multiple rows; the number of rows is the same in all the files.

Scrutinizer · November 14, 2009, 4:08am

It is unclear to me what you are after. Suppose you have a couple of files with just one column and, e.g., 4 rows; then danmero's paste command should just work.
Example:

$> ls *.txt
a.txt  b.txt  c.txt  d.txt  e.txt  f.txt  g.txt

$> cat a.txt
a1
a2
a3
a4

$> cat g.txt
g1
g2
g3
g4

$> paste -d\: *.txt
a1:b1:c1:d1:e1:f1:g1
a2:b2:c2:d2:e2:f2:g2
a3:b3:c3:d3:e3:f3:g3
a4:b4:c4:d4:e4:f4:g4

-or-

Do you need it transposed, like so:

$> for i in *.txt ; do xargs < "$i" | tr ' ' ':' ; done
a1:a2:a3:a4
b1:b2:b3:b4
c1:c2:c3:c4
d1:d2:d3:d4
e1:e2:e3:e4
f1:f2:f3:f4
g1:g2:g3:g4

$> paste -s -d\: *.txt
a1:a2:a3:a4
b1:b2:b3:b4
c1:c2:c3:c4
d1:d2:d3:d4
e1:e2:e3:e4
f1:f2:f3:f4
g1:g2:g3:g4

paragkalra · November 14, 2009, 5:18am

Ok fixed the issue...

I was actually combining files in Windows format, which contain Windows-specific line-break characters. As a result, the columns did not appear on the same row.

Hence I had to convert the result file to Unix format as shown below:

I have a few more questions:

What is the significance of the backslash (\) in -d\:?
If I want to use a tab as the delimiter, then I guess the following should be sufficient, but somehow it's not giving the desired results:

Any pointers?

Is there a way I can use a long single delimiter like '##--##' or maybe '&+&+&'?
I tried the following but it doesn't seem to work.

Any idea?

Scrutinizer · November 14, 2009, 5:36am

In the paste example it is used as an escape character but it is not really necessary there:

paste -d: *.txt

should work too. In your tr -d example, it means the ASCII character with octal value 15 followed by the ASCII character with octal value 32 (carriage return and the substitute character).
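
To illustrate those octal escapes, here is a small sketch with made-up input. It assumes a tr -d call of the form described above; the sample text and the variable names are hypothetical and not taken from the original post.

```shell
# Made-up DOS-style input: CRLF line endings plus a trailing Ctrl-Z (octal 32).
dos=$(mktemp)
printf 'a1\r\na2\r\n\032' > "$dos"

# Delete every CR (\15) and SUB (\32) byte; the newlines are kept.
unix_text=$(tr -d '\15\32' < "$dos")

printf '%s\n' "$unix_text"
rm "$dos"
```

Run on the merged result (or on each input file before merging), this leaves plain newline-terminated lines.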

Try paste without the -d option; its default delimiter is a tab.
paste can only use single-character separators (or a list of them, which it cycles through, using one character at a time).
You could try this:

paste Columns/*.out|sed 's/\t/##--##/g' > Table.windows

See man paste for further details
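
The delimiter behaviour described above can be sketched as follows. The three sample files are made up for illustration, and a literal tab is passed to sed rather than \t, on the assumption that not every sed understands \t.

```shell
# Made-up sample columns in a scratch directory.
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/a.txt"
printf 'b1\nb2\n' > "$dir/b.txt"
printf 'c1\nc2\n' > "$dir/c.txt"

# No -d option: the columns are separated by a tab.
tabbed=$(paste "$dir"/a.txt "$dir"/b.txt "$dir"/c.txt)

# A two-character list is cycled: ':' after column 1, '@' after column 2.
cycled=$(paste -d ':@' "$dir"/a.txt "$dir"/b.txt "$dir"/c.txt)

# Replace each tab with a longer string; '&' must be escaped in a sed
# replacement because an unescaped '&' stands for the matched text.
tab=$(printf '\t')
long=$(printf '%s\n' "$tabbed" | sed "s/$tab/\\&+\\&+\\&/g")

printf '%s\n' "$cycled" "$long"
rm -r "$dir"
```

The escaping matters only for delimiters such as '&+&+&'; a string like '##--##' can be used in the replacement as it is.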

paragkalra · November 14, 2009, 8:40am

That will also replace any tab characters present in the data.

I think the following hack will do the trick:

I am changing FS because I am interested in the space characters after the columns; otherwise GAWK will trim the white space.

The advantage of this is that there is no need to use 'tr' to remove the line-break characters; GAWK takes care of it.
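
The script itself is not shown in this copy of the thread. As an illustration only, and not the poster's actual code, an awk approach along these lines might look like the sketch below. The file names, the sample data and the sep variable are assumptions; it uses $0 rather than $1, so trailing spaces survive without changing FS.

```shell
# Made-up Windows-style input files (CRLF line endings, one trailing space).
dir=$(mktemp -d)
printf 'a1 \r\na2\r\n' > "$dir/a.out"
printf 'b1\r\nb2\r\n' > "$dir/b.out"

table=$(awk -v sep='##--##' '
    FNR == 1 { nfiles++ }                      # a new input file starts
    {
        sub(/\r$/, "")                         # drop a trailing CR, if any
        row[FNR] = (nfiles == 1) ? $0 : row[FNR] sep $0
        if (FNR > nrows) nrows = FNR
    }
    END { for (i = 1; i <= nrows; i++) print row[i] }
' "$dir"/a.out "$dir"/b.out)

printf '%s\n' "$table"
rm -r "$dir"
```

Because the delimiter is an ordinary awk string, it can be of any length and may contain tabs or '&' without special handling.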

Scrutinizer · November 14, 2009, 9:08am

OK, then pick a character that is not present in the data and afterwards replace it with the string you want, e.g.

paste -d@ Columns/*.out | sed 's/@/##--##/g' > Table.windows

or some other character.
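
Whether a candidate character is really absent can be checked before it is used. The following is a minimal sketch, assuming made-up .out files in a scratch directory; the placeholder and result variables are hypothetical names, and the placeholder is assumed not to be special in a sed pattern.

```shell
# Made-up sample columns in a scratch directory.
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/a.out"
printf 'b1\nb2\n' > "$dir/b.out"

placeholder='@'
if grep -qF -- "$placeholder" "$dir"/*.out; then
    # The character already occurs in the data, so it is not safe to use.
    echo "placeholder '$placeholder' occurs in the data; pick another" >&2
    result=''
else
    # Join with the placeholder, then swap it for the long delimiter.
    result=$(paste -d "$placeholder" "$dir"/*.out | sed "s/$placeholder/##--##/g")
fi

printf '%s\n' "$result"
rm -r "$dir"
```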