Spacing off when files combined using awk or cat

cmccabe · January 4, 2016, 2:06pm

I have 133 .txt files in a directory that I am combining into one file. The problem is that when I use awk or cat to combine the files, I get output like this:

output

				
		85	138662360	KCNT1
		86	138662962	KCNT1
		82	138657053	KCNT1
		83	138657635	KCNT1

		95	138646881	KCNT1
		94	138642912	KCNT1
		98	138669261	KCNT1
		96	138662309	KCNT1
		97	138662661	KCNT1

I know the input and output shown here do not match, but the format of the input is always the same; the spacing just seems to be off once I combine the files. If I copy and paste instead (copy file 1, then file 2, and paste them into a text file), I get the desired output.

example input

name	31	Index	Chromosomal Position	Gene	Inheritance
		122	2106725	TSC2	AD
		124	2115481	TSC2	AD
		121	2105400	TSC2	AD
		82	135782221	TSC1	AD
		81	135782026	TSC1	AD
		126	2138218	TSC2	AD
		123	2113107	TSC2	AD
		125	2126142	TSC2	AD
name2	12	Index	Chromosomal Position	Gene	Inheritance
		1	43396568	SLC2A1	AD, AR
name3	20	Index	Chromosomal Position	Gene	Inheritance
		188	2135240	TSC1	AD
		179	2103379	TSC1 AD
		191	2137899	TSC2	AD
		181	2110617	TSC2	AD
		190	2137857	TSC2	AD
		189	2137806	TSC2	AD
		186	2133798	TSC2	AD
		187	2135074	TSC2	AD
		180	2105400	TSC2	AD
		183	2122822	TSC2	AD
		192	2138218	TSC2	AD
		185	2125937	TSC2	AD
		184	2125788	TSC2	AD
		193	2138269	TSC2	AD
		182	2112981	TSC2	AD

desired output

name	  31	Index	Chromosomal Position	Gene	Inheritance
                  82	135782221	TSC1	AD
                  81	135782026	TSC1	AD
name3  20	Index	Chromosomal Position	Gene	Inheritance
                  188	2135240	TSC1	AD
                  179	2103379	TSC1	AD
                  191	2137899	TSC1	AD

RudiC · January 4, 2016, 2:37pm

Not the slightest idea what your problem is. Where does the combined output come into play? I can't find any of those lines in your desired output. What does "spacing is off" mean?

cmccabe · January 4, 2016, 2:57pm

I hope this helps. I think the problem is that when I combine two files, the new file contains many newlines that are not there when I copy and paste. Thank you :).

awk 'FNR==1{print ""}{print}' *.txt > example.txt

desired output (no spaces between new lines)

        name1 1 Index Chromosomal Position Gene Inheritance  
                    176 40757228 ADSL AR    
                    51 1.26E+08 ALDH7A1 AR   
                    49 1.26E+08 ALDH7A1 AR   
                    52 1.26E+08 ALDH7A1 AR   
                     50 1.26E+08 ALDH7A1 AR   
                    178 62857727 ARHGEF9 AD, AR  
                     13 1.6E+08 ATP1A2 AD     
        name2 2 Index Chromosomal Position Gene Inheritance    
                    102 52200340 SCN8A AD    
                    134 61991153 CHRNA4 AD   
                    136 62038585 KCNQ2 AD   
        name3 3 Index Chromosomal Position Gene Inheritance   
                    122 2106725 TSC2 AD    
                    124 2115481 TSC2 AD    
                    121 2105400 TSC2 AD      
        name4 4 Index Chromosomal Position Gene Inheritance    
                    4 43394661 SLC2A1 AD, AR   
                    22 1.67E+08 SCN1A AD     
        name5 5 Index Chromosomal Position Gene Inheritance    
                     75 52319081 EFHC1 AD, AR   
                      51 1.67E+08 SCN9A AD    
                       103 1.31E+08 SPTAN1 AD   
                      84 1.47E+08 CNTNAP2 AD   
                       134 6640393 TPP1 AR

Aia · January 4, 2016, 3:20pm

awk 'FNR==1{print ""}{print}' *.txt > example.txt

You are introducing those undesired lines with the FNR==1{print ""} part, which prints an empty line at the start of every file.

Perhaps?

cat *.txt > example.txt

or

awk '1' *.txt > example.txt
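
To see where the extra lines come from, here is a small sketch using two throwaway files (a.txt and b.txt are made-up names, created in a temporary directory). The FNR==1{print ""} rule fires on the first line of every input file, so it prints one empty line per file, while awk '1' only prints what is already there.

```shell
# Hypothetical sample files in a scratch directory (names are illustrative only)
tmp=$(mktemp -d)
printf 'name1\n\t\t85\t138662360\tKCNT1\n' > "$tmp/a.txt"
printf 'name2\n\t\t95\t138646881\tKCNT1\n' > "$tmp/b.txt"

# FNR resets to 1 at the start of each file, so this adds one empty line per file
with_rule=$(awk 'FNR==1{print ""}{print}' "$tmp"/*.txt | grep -c '^$' || true)

# Without that rule nothing is added (grep -c prints 0 and exits non-zero)
without_rule=$(awk '1' "$tmp"/*.txt | grep -c '^$' || true)

echo "empty lines with the FNR rule: $with_rule"
echo "empty lines with awk 1:        $without_rule"
rm -rf "$tmp"
```

If the combined file still shows gaps with awk '1' or cat, the gaps are already present in the source files.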

cmccabe · January 4, 2016, 3:33pm

Unfortunately, both the cat and awk commands produced the same output, with newlines between name1, name2, and name3.

output of command

name1


name2


name3

desired output

name1
name2
name3

Thank you :).

Aia · January 4, 2016, 3:40pm

In that case, I have to conclude that some of the source files contain those empty lines.
You can test which files contain empty lines:

grep -n '^$' *.txt

It will output filename:linenumber: for each empty line found.

To eliminate empty lines:

awk 'NF' *.txt > example.txt

cmccabe · January 4, 2016, 4:13pm

If the filenames have spaces in them:

annovar analysis CAP NGS-02 2015B-R2.txt
Annovar Analysis F41520
Annovar Analysis H52520

Is this ok? Thank you :).
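
On the filenames question: a glob such as *.txt hands each matching name to the command as a single argument, spaces included, so the one-liners above should be fine as they are. Quoting only starts to matter once a name is stored in a variable, as in a loop. A small sketch with made-up file names (the names and the combined.out output file are hypothetical):

```shell
tmp=$(mktemp -d)
# Hypothetical file names containing spaces
printf 'name1\n\n' > "$tmp/Annovar Analysis A.txt"
printf 'name2\n\n' > "$tmp/Annovar Analysis B.txt"

# The glob expands to one argument per file, so the spaces are not a problem here
awk 'NF' "$tmp"/*.txt > "$tmp/combined.out"

# In a loop, quote "$f" so the shell does not split the name at the spaces
count=0
for f in "$tmp"/*.txt; do
    if [ -f "$f" ]; then
        count=$((count + 1))
    fi
done

cat "$tmp/combined.out"
lines=$(wc -l < "$tmp/combined.out")
echo "files seen in loop: $count, lines in combined file: $lines"
rm -rf "$tmp"
```

Writing the result to a name that does not match *.txt also keeps the output file from being picked up as input on a second run.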

The issue seems to be that in each one of the individual text files there is a bunch of whitespace at the end of each file:

name1



name2



name3



name4




name5

I need to remove this whitespace from all 133 files before running cat, maybe with a bash loop? Thank you

Aia · January 4, 2016, 4:46pm

The awk command I suggested in post #6 should fix it.

awk 'NF' *.txt > example.txt

RudiC · January 4, 2016, 4:48pm

Those lines in example.txt are NOT empty, they contain sequences of <TAB>s, and, most important, DOS line terminators: <CR>.
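
A quick way to check for this kind of content, sketched on a made-up file (dos.txt is hypothetical): grep '^$' only matches truly empty lines, so a line made of tabs plus <CR> slips through, while a character-class test or a raw byte dump makes it visible.

```shell
tmp=$(mktemp -d)
# Hypothetical file: header, a line of tabs ending in CR LF, one truly empty line, header
printf 'name1\r\n\t\t\r\n\nname2\r\n' > "$tmp/dos.txt"

cr=$(printf '\r')   # a literal carriage return to search for

# Lines that are truly empty (what grep '^$' finds)
empty=$(grep -c '^$' "$tmp/dos.txt" || true)

# Lines that contain a carriage return anywhere
with_cr=$(grep -c "$cr" "$tmp/dos.txt" || true)

# Lines that hold nothing but whitespace, carriage return included
blank=$(grep -c '^[[:space:]]*$' "$tmp/dos.txt" || true)

echo "empty=$empty with_cr=$with_cr blank=$blank"

# od -c dumps the raw bytes, so \t and \r show up explicitly
od -c "$tmp/dos.txt" | head -n 3
rm -rf "$tmp"
```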

cmccabe · January 4, 2016, 4:56pm

So would the below work? Thank you

for f in /home/cmccabe/Desktop/folder/*.txt ; do
     bname=$(basename "$f")
     pref=${bname%.txt}
     # strip the trailing CR, then drop lines that are empty or whitespace-only
     sed 's/\r$//' "$f" | sed -E '/^[[:space:]]*$/d' > "/home/cmccabe/Desktop/new/${pref}_unix.txt"
done

Aia · January 4, 2016, 5:09pm

As a rule, I do not download attachments from any forum. If it is true that you have lines containing only tabs and spaces, and that carriage return characters are present, and you would like to remove both, please try the following:

perl -ne 's/\r//; print unless /^(\W+)?$/' *.txt > example.txt

That should sanitize it in both cases.

cmccabe · January 5, 2016, 3:24pm

I apologize for the long post; I am trying to avoid attachments. If I merge file1, file2, and file3 into one example.txt, it appears that things do not copy over correctly. Then the awk command does not produce the desired output. If I manually copy and paste, the output is fine, but I have too many files to do that. cat doesn't seem to work either, and I'm not sure what else to try. Thank you :).

file1.txt

name	31	Index	Chromosomal Position	Gene	Inheritance
		122	2106725	TSC2	AD
		124	2115481	TSC2	AD
		121	2105400	TSC2	AD
		82	135782221	TSC1	AD
		81	135782026	TSC1	AD
		126	2138218	TSC2	AD
		123	2113107	TSC2	AD
		125	2126142	TSC2	AD

file2.txt

name2	12	Index	Chromosomal Position	Gene	Inheritance
		1	43396568	SLC2A1	AD, AR

file3.txt

name3	20	Index	Chromosomal Position	Gene	Inheritance
		188	2135240	TSC1	AD
		179	2103379	TSC1 AD
		191	2137899	TSC2	AD
		181	2110617	TSC2	AD
		190	2137857	TSC2	AD
		189	2137806	TSC2	AD
		186	2133798	TSC2	AD
		187	2135074	TSC2	AD
		180	2105400	TSC2	AD
		183	2122822	TSC2	AD
		192	2138218	TSC2	AD
		185	2125937	TSC2	AD
		184	2125788	TSC2	AD
		193	2138269	TSC2	AD
		182	2112981	TSC2	AD

perl -ne 's/\r//; print unless /^(\W+)?$/' *.txt > example.txt

example.txt

name	31	Index	Chromosomal Position	Gene	Inheritance
		122	2106725	TSC2	AD
		124	2115481	TSC2	AD
		121	2105400	TSC2	AD
		82	135782221	TSC1	AD
		81	135782026	TSC1	AD
		126	2138218	TSC2	AD
		123	2113107	TSC2	AD
		125	2126142	TSC2	ADname2	12	Index	Chromosomal Position	Gene	Inheritance
		1	43396568	SLC2A1	AD, ARname3	20	Index	Chromosomal Position	Gene	Inheritance
		188	2135240	TSC1	AD
		179	2103379	TSC1 AD
		191	2137899	TSC2	AD
		181	2110617	TSC2	AD
		190	2137857	TSC2	AD
		189	2137806	TSC2	AD
		186	2133798	TSC2	AD
		187	2135074	TSC2	AD
		180	2105400	TSC2	AD
		183	2122822	TSC2	AD
		192	2138218	TSC2	AD
		185	2125937	TSC2	AD
		184	2125788	TSC2	AD
		193	2138269	TSC2	AD
		182	2112981	TSC2	AD

awk command to match specific name and copy header row where the match was found:

awk '
NF == 7         {HD = $0 RS}
$3 == "TSC1"    {printf "%s%s\n", HD, $0
                 HD = ""
                }
' example.txt > TSC1.txt

TSC1.txt (current output)

name	31	Index	Chromosomal Position	Gene	Inheritance
		82	135782221	TSC1	AD
		81	135782026	TSC1	AD
		188	2135240	TSC1	AD
		179	2103379	TSC1 AD

Desired output

name	31	Index	Chromosomal Position	Gene	Inheritance
		82	135782221	TSC1	AD
		81	135782026	TSC1	AD
name3	20	Index	Chromosomal Position	Gene	Inheritance
		188	2135240	TSC1	AD
		179	2103379	TSC1	AD
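
For reference, a sketch of how that awk program behaves once every header sits on its own line. The sample data is a trimmed excerpt in the same layout, written to a scratch directory; the program is the one from the post, with comments added. The header rule relies on header lines having exactly 7 whitespace-separated fields, so a header glued to the end of the previous data line is never recognised.

```shell
tmp=$(mktemp -d)
# Trimmed sample in the thread's layout, every line properly terminated
{
  printf 'name\t31\tIndex\tChromosomal Position\tGene\tInheritance\n'
  printf '\t\t122\t2106725\tTSC2\tAD\n'
  printf '\t\t82\t135782221\tTSC1\tAD\n'
  printf 'name2\t12\tIndex\tChromosomal Position\tGene\tInheritance\n'
  printf '\t\t1\t43396568\tSLC2A1\tAD, AR\n'
  printf 'name3\t20\tIndex\tChromosomal Position\tGene\tInheritance\n'
  printf '\t\t188\t2135240\tTSC1\tAD\n'
} > "$tmp/example.txt"

awk '
NF == 7         {HD = $0 RS}            # remember the most recent header line
$3 == "TSC1"    {printf "%s%s\n", HD, $0
                 HD = ""                # print each header only once
                }
' "$tmp/example.txt" > "$tmp/TSC1.txt"

cat "$tmp/TSC1.txt"
headers=$(grep -c '^name' "$tmp/TSC1.txt" || true)
rows=$(wc -l < "$tmp/TSC1.txt")
rm -rf "$tmp"
```

With this input the name and name3 headers are each printed once, and the name2 header is skipped because none of its rows match.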

Aia · January 5, 2016, 4:26pm

It appears that your files file{1,2}.txt do not end with a newline.
Please give this a try:

perl -nle 's/\r//; print unless /^(\W+)?$/' file{1..3}.txt > example.txt

The -l will ensure that a newline is added after each print.
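
To illustrate the point about the missing final newline, here is a sketch with made-up files whose last line is not terminated. cat copies bytes as they are, so the last line of one file is glued to the first line of the next; a line-oriented tool that re-adds the terminator on output keeps them apart. The awk variant is an alternative that was not proposed in the thread and is offered only as an assumption about what would also work; it strips a trailing <CR> and skips empty or blank-only lines.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs: DOS line endings, no newline after the last line
printf 'name\t31\r\n\t\t82\tTSC1' > "$tmp/file1.txt"
printf 'name3\t20\r\n\t\t188\tTSC1' > "$tmp/file2.txt"

# cat: the two files meet in the middle of a line
glued=$(cat "$tmp"/file*.txt | grep -c 'TSC1name3' || true)

# perl -l chomps the input terminator and prints a newline after every record
perl_lines=$(perl -nle 's/\r//; print unless /^(\W+)?$/' "$tmp"/file*.txt | wc -l)

# awk alternative: drop a trailing CR, skip lines with no fields, print the rest;
# awk terminates every printed record with a newline
awk_lines=$(awk '{ sub(/\r$/, "") } NF' "$tmp"/file*.txt | wc -l)

echo "glued lines from cat: $glued"
echo "lines from perl: $perl_lines, lines from awk: $awk_lines"
rm -rf "$tmp"
```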