Bash: combine data in loop using AWK

Hello,
sorry for the stupid question, but I can't solve this problem.
I have 2 files:

AAA.dat (3 columns x 100 rows)
BBB.dat (54 columns x 100 rows).

I would like to combine them within a for loop.
EX:

awk 'BEGIN {print "10"}' > new.dat
awk 'BEGIN{print "10 0 7"}' >> new.dat
awk -F, 'NR==1{print $1, $2, $3}' AAA.dat >> new.dat
awk 'NR==1{print $"1"}' BBB.dat >> new.dat
awk 'BEGIN{print "52 2 52"}' >> new.dat
awk -F, 'NR==1{print}' ice.dat >> new.dat

I would like the loop to run with NR==i, for i in {1..100}.

Can you help me???

@oste91, welcome, we hope you find the forum helpful.

Can you confirm the following:

  • is this coursework/homework?
  • which OS, shell, and awk versions are you using?
  • do you have experience in programming?
  • can you supply a small sample of the ACTUAL data in the files mentioned?
  • can you give an example of the expected output of the work you are asking for assistance with?

Once you've replied, we can look at giving advice.

thanks


I use macOS.

OK, you need to respond to all of the questions.

Yes, sorry.

  • this is homework
  • I use macOS
  • I don't have much experience in programming
  • I don't have an example, sorry.

If you do not have example data, and an understanding of what the output should look like, then you cannot test it. Code that only works in your head is not code.

Do you need to use awk? There is a utility command called paste that does this job.
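For instance (just a sketch, assuming both files have the same number of rows and that you want matching rows combined):

paste AAA.dat BBB.dat > new.dat           # line N of AAA.dat and line N of BBB.dat on one line, tab-separated
paste -d '\n' AAA.dat BBB.dat > new.dat   # or interleave them: line N of AAA.dat, then line N of BBB.dat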

Awk generally only reads one file at a time (there is a method to work around this, but it is regarded as "advanced").

The standard method is to read the first file into an array, and then stitch together those array elements with the corresponding lines of the second file. You need to take special care if the files are not of the same length.

Maybe try something like that yourself, and post your effort so we can help fix it. It should only be about three lines of awk.
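For reference, the shape of that array approach might be something like this (a sketch only, assuming white-space separated files of equal length, and not tested against your real data):

awk 'NR == FNR { a[FNR] = $0; next }    # first file: remember each line, keyed by its line number
     { print a[FNR], $0 }               # second file: print the remembered line next to the current one
' AAA.dat BBB.dat > new.dat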

  • file AAA.dat:
12    180    0.144
23    149    0.44
.
.
.
  • file BBB.dat:
2490 4900 4039 4847 42829 1039 3948 . . . .
2330 2923 1349 4343 43111 2333 3333 . . . .
.
.
.

I would like file OUTPUT:

12  180  0.144
2490 4900 4039 4847 42829 1039 3948 
23 149 0.44
2330 2923 1349 4343 43111 2333 3333
.
.
.

Is it possible?

@oste91, disappointed that you suddenly show some example data after saying you had none; however, remaining focused on the issue:
Why not go away and think about the process needed to 'interleave' lines from these two files?
DO NOT think about awk/paste or any other program; just think about how to do this logically, step by step.
Write that process down on a piece of paper (or electronically), review it, and walk through it step by step as if you were actually doing it physically. Once you have an algorithm that you are happy with, post it back here for the team to review; then we can look at making suggestions on what technologies to use.


Very possible. I have a couple more queries:

(a) Your original post uses awk -F, (a comma as field separator), which implies CSV (comma-separated values) file format. This data is white-space separated (any combination of tabs and spaces, although plain tab-separated would be more common, as you can then have multi-word text fields). Can you clarify?

(b) It is fairly unusual to interleave data rows with a different number of fields. When you said "combine them", I immediately assumed your first output would be more like:

12,180,0.144,2490,4900,4039,4847,42829,1039,3948

Are you sure your assignment does not intend this? Either is easy, but it is good to answer the problem exactly.

Your original post specifies combining them "within a for loop". Awk does not need a for loop for lines -- it has a built-in cycle that processes lines one by one anyway. A Bash solution would need a loop, though. Is awk stipulated in the assignment, or did you pick it because it seemed like the right tool?
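To illustrate that built-in cycle (a sketch only): this prints every line of the file with its line number, with no explicit loop anywhere:

awk '{ print NR, $0 }' AAA.dat     # the action runs once per input line automatically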

I'm sure.
The code as written above works perfectly for the first line only.
But I would like to do it for every line.

The code at the top does not work at all.

(a) You set the field separator to a comma, and there is no comma field separator in the data. When you print $1, $2, $3, field 1 is all of 12 180 0.144, and fields 2 and 3 are empty. It is not doing what you think it is. You can check that with the quick test shown after point (e) below.

(b) The first two lines with BEGIN output two lines of data that do not even come from the input files.

(c) Where does ice.dat come into it?

(d) You are running awk four times for every row of output.

(e) Sure, it runs for one line because one line cannot be out of sequence. The same logic can never work for two or more lines, because each new awk starts over at line 1 of each file. You have no mechanism to get the second (third, fourth, ...) line from each file, apart from embedding constant data in the script, instead of getting it from the file.
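Here is the quick test for point (a) (the echo line is just a made-up sample row; the brackets only mark the field boundaries):

$ echo '12 180 0.144' | awk -F, '{ print "[" $1 "]", "[" $2 "]", "[" $3 "]" }'
[12 180 0.144] [] []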

It would help if you explained exactly what tools and methods your course has covered so far, so we can come up with suggestions that conform to your stage of knowledge. No point in providing a solution that you do not learn from, or that your course supervisor knows you didn't write yourself.

Dear
It works perfectly!

$ cat Script
awk 'BEGIN {print "10"}' > new.dat
awk 'BEGIN{print "10 0 7"}' >> new.dat
awk -F, 'NR==1{print $1, $2, $3}' AAA.dat >> new.dat
awk 'NR==1{print $"1"}' BBB.dat >> new.dat
awk 'BEGIN{print "52 2 52"}' >> new.dat
awk -F, 'NR==1{print}' ice.dat >> new.dat
$ cat AAA.dat
12 180 0.144
23 149 0.44
$ cat BBB.dat
2490 4900 4039 4847 42829 1039 3948
2330 2923 1349 4343 43111 2333 3333
$ ./Script
awk: fatal: cannot open file `ice.dat' for reading (No such file or directory)
$ cat new.dat
10
10 0 7
12 180 0.144  
2490 4900 4039 4847 42829 1039 3948
52 2 52
$ 

Sure it does. Perfectly :wink:

Exactly why does NR==1{print $"1"} print seven fields, not the one you asked it for?
Why does it throw a fatal error?
Why does it output five lines, not two?
Why are there two extra (invisible) spaces after 12 180 0.144?
But mainly, why can't you see that nothing in your script will ever do anything with any data after line 1 of the input files?

My OUTPUT file:

10                                       +++   awk 'BEGIN {print "10"}' > new.dat
10 0 7                                   +++   awk 'BEGIN{print "10 0 7"}' >> new.dat
12 180 0.144                             +++   awk -F, 'NR==1{print $1, $2, $3}' AAA.dat >> new.dat
42829   (MAX VALUE each BBB.dat row)     +++   awk 'NR==1{print $"1"}' BBB.dat >> new.dat
52 2 52                                  +++   awk 'BEGIN{print "52 2 52"}' >> new.dat
2490 4900 4039 4847 42829 1039 3948      +++   awk -F, 'NR==1{print}' BBB.dat >> new.dat

This is what I get.
I would like NR == 1 to be incremented (i.e. NR == 1+1, then 1+2, and so on) n times.

ice.dat is a mistake! Sorry.

This is a script that does exactly what you ask for. It runs a Bash loop for 100 items, and in each iteration it passes in the value of the line number which awk should take from the files this time around.

#! /bin/bash --

: > new.dat		#.. Empty the file.

for (( Nr = 1; Nr <= 100; ++Nr )); do
    awk -v Nr="${Nr}" 'NR == Nr { print; }' AAA.dat >> new.dat
    awk -v Nr="${Nr}" 'NR == Nr { print; }' BBB.dat >> new.dat
done

However, this is a joke (I have to say this because all the mods will otherwise laugh at me more than usual).

awk iterates right through the file for every value of Nr. So it reads two 100-line files 100 times -- 20,000 line reads to pick out the 200 that match NR at some point. That scales at about O(n-squared), and starting 200 processes means a large constant multiplier. I hope your trainer gets the joke too.

As there are only two rows in each test data file, the last 98 iterations never find NR == Nr because Nr will be higher than 2, and awk only sets NR to 1 and 2 before it hits EOF. But they still try all 100 values of Nr. A sensible version would count the lines in AAA.dat first to make the loop do the right number of iterations.
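If you do keep the per-line loop, a slightly less wasteful sketch of that idea (still the same basic approach) lets wc -l set the bound instead of hard-coding 100:

lines=$(wc -l < AAA.dat)        #.. Loop only over rows that actually exist.
for (( Nr = 1; Nr <= lines; ++Nr )); do
    awk -v Nr="${Nr}" 'NR == Nr { print; }' AAA.dat >> new.dat
    awk -v Nr="${Nr}" 'NR == Nr { print; }' BBB.dat >> new.dat
done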

I do not comprehend your posted OUTPUT file.

(a) The data goes to new.dat, but the +++ bash command log comes out on the terminal. So you did an edit to put it together.

(b) The text 42829 (MAX VALUE each BBB.dat row) has magically dropped from the sky. Nothing in your code can possibly generate that. That speaks of a requirement you have not yet mentioned.


This is my original script:

awk 'BEGIN {print "10"}' > new.dat
awk 'BEGIN{print "10 0 7"}' >> new.dat
awk -F, 'NR==1{print $1, $2, $3}' AAA.dat >> new.dat
awk 'NR==1{print $"1"}' BBB.dat >> new.dat
awk 'BEGIN{print "52 1 52"}' >> new.dat
awk -F, 'NR==1{print}' BBB.dat >> new.data

And this is my original OUTPUT. As you can see, it works.

10
10 0 7
0.000000000 180.00000000 0.416513032
2940
52 1 52
2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2940 2938 2932 2924 2915 2906 2897 2887 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882 2882

Pretty sure you said this was your original script a few minutes ago :frowning:

This is my original script:

awk 'BEGIN {print "10"}' > new.dat
awk 'BEGIN{print "10 0 7"}' >> new.dat
awk -F, 'NR==1{print $1, $2, $3}' file1.dat >> new.dat
awk 'NR==1{print $"1"}' ice.dat >> new.dat
awk 'BEGIN{print "52 1 52"}' >> new.dat
awk -F, 'NR==1{print}' ice.dat >> new.dat

I do not understand what you mean.