Adding filename and line number from multiple files to final file

Hi all,
I have 20 files (file001.txt up to file020.txt) and I want to read each of them from the 3rd line up to the end of the file (line 1002), but in the final file the lines should appear to start from line 1.
I need the following kind of output in a single file:
Filename Line number 2ndcolumn 4thcolumn

I was able to remove the first 2 lines from each file (the ones starting with #) and keep only the 2nd and 4th columns. Here is the code:

cat *.txt | awk '{print $2, $4} | sed "/#\/d" > out.txt

Please guide me.
Thanks

If you don't show us sample output that matches what you say you want to have done, we can't figure out what you really want. You say you only want the 2nd and 4th columns, but your sample output file starts with:

1 1 23.01 119.00
1 2 56.00
.. ....
1 1000 -09.00 89.00

each of which has three or four columns.

Then there is the script you said you use to get what you want, but the sed command in this pipeline ( sed "/#\/d" ) is not a valid command and will generate a diagnostic something like:

sed: 1: "/#\/d": unterminated regular expression

Please:

  1. Use CODE tags (not QUOTE tags) when showing us input and output files.
  2. Show us input files. And,
  3. Show us output that matches your description of your desired output.

I used the following; I think the previous one was a typo on my part. Sorry for that.

cat *.txt | awk '{print $2, $4} | sed "/#ainst\|#Time/d" > out.txt

I am getting only two columns from my code, but I want four (the 1st column being the digits from the file name and the 2nd column the line number), as I showed previously.
Format of my input files:

#ainst
#Time                   tem                 pre                    apot                inst                kin
      10.000         1.95221         0.0000079230          919.62689       149.40629      3858.88908
      20.000         1.22713         0.0000000379           27.40189      -110.08021      2303.82262
....
    10000.000         0.63837        -0.0000007208         -256.43974      -242.08325      1590.95448

I wish to have the following output, with the 1st column as the digits from the file name and the 2nd column as the line number:

001   1      1.95221          919.62689       
001   2      1.22713           27.40189      
....
001   1000   0.63837         -256.43974      
002   1      4.98221           19.62689       
002   2      10.52713         127.40189      
....
002    1000   0.43837         -956.43974   
.....
020   1      8.98981           56.62689       
020   2      10.52713          29.40189      
.... 
020    1000   9.43837         -56.43974   

Thanks

You also have mismatched single quotes in the awk command in your pipeline. Please be more careful in the future when you post samples of code.

It is hard to tell with the minimal sample input provided, but I think the following awk script does what you want:

awk '
FNR==1{ fn = substr(FILENAME, 5, 3)
        n = 0
}
/^#/{   n++
        next
}
{       printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)
}' *.txt

As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.


Thanks a lot. :)
I would highly appreciate it if you could explain it.
Thanks again.

Does this help?

awk '
FNR==1{ # This is the first line in a new file...
        fn = substr(FILENAME, 5, 3) # Save 3 characters from this filename
        n = 0   # Clear number of comments found in this file
}
/^#/{   n++     # Increment number of comments found in this file
        next    # Do not do any other processing on comment lines
}
{       # Print saved characters from filename, number of non-comment lines,
        # and fields 2 and 4 from the current line
        printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)
}' *.txt # Process all files ending with ".txt" in the current directory

Thanks :)
This script is awesome but somewhat tough for me, as I am a beginner. Is there any possibility of adding something simple to my previous code (below) to do the same thing?

cat *.txt | awk '{print $2, $4}' | sed "/#ainst\|#Time/d" > out.txt
or
cat *.txt | awk 'NR >= 3 && NR <= 1002 {print $2, $4}' > out.txt

Thanks

Besides being unneeded, cat-ing the files instead of letting awk open them means awk can't recover the filenames. And the awk command can easily perform arithmetic (such as subtracting the number of comment lines found); sed can't.

You said you wanted to print a portion of the file's name on each output line. You can't do that with either of the scripts above: in your pipelines, neither awk nor sed has access to the filenames of the input files being processed.

You said you wanted to delete the 1st two lines (which start with a #) from each file. Your 2nd script can't be made to do that as long as you keep the cat: it throws away the 1st two lines of the 1st file but keeps all of the others, because awk is being handed one concatenated input stream. It could easily throw away every line starting with # (as my awk script did), but using line numbers it can only strip the start of that single combined stream, not of each file.
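
If you drop the cat and switch from NR to FNR (the per-file line number), awk can skip the header lines in every file. A minimal sketch, assuming every file has exactly two header lines (it still doesn't print the filename and line-number columns you want):

awk 'FNR > 2 { print $2, $4 }' *.txt > out.txt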

You said you wanted to print the line number (not counting the comment lines) for each line in your input files. You can easily do that with the awk script I suggested; you can't do it with either of your pipelines without making the awk portion look a lot more like what I suggested before.

What is it about the awk script I provided that is too tough to understand?


Thanks a lot for explaining the concepts in such detail. :)

I did not understand the following things:

FNR==1{ # This is the first line in a new file...

fn = substr(FILENAME, 5, 3) # Save 3 characters from this filename

printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)

I have a general question: in which cases should I use the cut/paste/cat commands, or grep, or sed, or awk? I am really confused. :-/
I am getting different answers while googling.

Thanks again.

The awk utility maintains several variables as it processes a line of text from a file. As you already know, NR is the number of records that have been read from all of the input files. FNR is the number of records that have been read from the current file. When FNR is equal to 1, the condition portion of this awk statement is true and the action portion of the statement will be executed.
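
You can watch the two variables diverge with a quick test (a.txt and b.txt here are just placeholders for any two small files):

awk '{ printf("%s: NR=%d FNR=%d\n", FILENAME, NR, FNR) }' a.txt b.txt

If a.txt has 3 lines, the first line of b.txt prints NR=4 but FNR=1; FNR resets to 1 for each new file while NR keeps counting across all of them.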

FILENAME is another variable maintained by the awk utility. It contains the name of the file that is being processed. You said your filenames were:

file001.txt
file002.txt
file003.txt
...
file020.txt

Numbering the characters in one of those names (read each digit column top to bottom to get the character number within the filename):

file001.txt
00000000011
12345678901

The substr(string, start, count) function in awk returns count characters starting at character number start from string. For example, when FILENAME is file001.txt, substr(FILENAME, 5, 3) returns characters 5 through 7 (the string 001), and the script stores them in the variable fn (i.e., fn will contain the portion of the filename you want to print at the start of each line printed from this input file).
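
You can verify this from the command line:

awk 'BEGIN { print substr("file001.txt", 5, 3) }'

which prints 001 (three characters, starting at character number 5).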

The awk printf(format, argument...) function is VERY similar to the C Language printf() function and the printf utility. In this case the function call:

printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)

prints the saved portion of the filename as a character string, a tab character, the current line number in the current file minus the number of lines starting with # as a decimal number, a tab, the 2nd field from the current line as a character string, a tab, and the 4th field from the current line as a character string, followed by a newline character.
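
For example, running just that printf() with the values from the first data line of file001.txt:

awk 'BEGIN { printf("%s\t%d\t%s\t%s\n", "001", 1, "1.95221", "919.62689") }'

produces the fields 001, 1, 1.95221, and 919.62689 separated by tabs on one line.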

You should use cut and paste when they do what you need to do more simply or more efficiently than your shell's built-in utilities could, AND you don't need the more complex processing (such as that provided by awk or sed) to get the job done.
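
For example, if the fields in your files were separated by single tabs (yours are space-aligned, so this is only an illustration), cut would be the simpler tool:

cut -f2,4 file001.txt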

You should use cat when you need to concatenate two or more files into a single output file, when you need to feed the contents of one or more files into a utility that doesn't accept pathname operands, or when you have a version of cat with a non-standard extension that performs some text manipulation you need as it copies files.
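
For example (with made-up filenames), concatenating pieces into one file is exactly what cat is for:

cat part1.txt part2.txt part3.txt > whole.txt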

You should NEVER use:

cat *.txt|awk 'awk program'

instead of:

awk 'awk program' *.txt

Creating an additional process like this takes more system resources to run your command, makes it run slower, and keeps awk from knowing how many files are being processed and what the names of the files are.
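
You can see the information loss directly:

awk '{ print FILENAME; exit }' file001.txt
cat file001.txt | awk '{ print FILENAME; exit }'

The first command prints file001.txt; the second prints an empty string or - (depending on your awk), because awk is reading standard input and never sees a filename.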

Many of the original UNIX utilities were designed to perform a transformation on data read from standard input and write the transformed data to standard output. (These utilities can be called filters.) The idea was that filters could be combined in a pipeline to perform much more complex tasks without making each utility more complex than needed. (This is an example of your basic KISS [Keep It Simple, Stupid] principle.) Unfortunately, many of today's utilities on many systems have forgotten the KISS principle.

Even with the original UNIX utilities, there were frequently many different ways to get a job done. Choosing which utilities to use depends on what you are trying to do, your ability to recognize the alternatives available, your ability to use the alternative tools available, and your knowledge of how utilities have evolved on various systems over the years so you know what will work portably on all of the systems you want to use and which code might have to be tweaked if you want to move your script to a different system.

Despite the fact that many of us have degrees in computer science or computer engineering (or both), there is a lot of art (as well as science and engineering) in programming.


There is no hard and fast rule. cut and paste go together and are very useful, along with join. cat is not needed that much. grep finds lines. sed makes quick arbitrary changes. awk works well with fields and can make programs. bash ties it all together, and can make programs too. The main thing I would recommend is to keep the code easy to read and maintain, even for other readers. Emphasize readability over performance. Avoid trying to do everything with awk, or everything with perl. It sounds like you already know it's better to learn a variety of commands, such as the key ones you mentioned, and a few others such as uniq, head, tail, and sort, and to use them within the context of shell scripts.
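
For example (data.txt is just a stand-in here), a classic pipeline combining several of those filters prints the five most frequent lines in a file:

sort data.txt | uniq -c | sort -rn | head -5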


Thanks a lot, Don Cragun, for such an extensive explanation, and thanks hanson44. :)
Does the amount of space between the lines matter, or can we write the awk program on one line too? Is the spacing only for readability?

Thanks.

---------- Post updated at 10:54 AM ---------- Previous update was at 10:36 AM ----------

Hurray!
I got my output. :) :)
Thanks

It is logically possible to write any awk script as a single line, if you're willing to type it into your shell. If the awk program is in a shell script to be executed, you'll have to restrict the length of each line to the limits supported by your editor. You could also throw away all of the comments and change all of the variable names to single characters to make the script shorter.
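
For example, the script I posted earlier collapses to this single (and much harder to read) line:

awk 'FNR==1{fn=substr(FILENAME,5,3);n=0} /^#/{n++;next} {printf("%s\t%d\t%s\t%s\n",fn,FNR-n,$2,$4)}' *.txt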

I choose to write programs in a way that is easy for me to read and understand rather than trying to artificially produce 1-liners. If you ask me about an awk script I submitted here a month ago, I don't want to deal with the obfuscation caused by collapsing an easily read script into a single line.

If you take a script I supplied, modify it slightly to add a new feature, collapse it to a single line, and then ask me to help you debug your new feature, I will definitely be slower to respond, and it will be much more likely that I won't respond at all. :P


Ok, thanks. :)

Yes, awk does not usually care about the spacing. A short one-liner is convenient and easy to read, but once the script is longer than, say, about 60 characters, it gets progressively harder to read on one line. Readability is always a high priority, so I agree with Don Cragun. But you will meet other programmers who do not place as much emphasis on readability. Hopefully, you will not have to maintain or "fix up" their code. :rolleyes:
