Help with sed and replacing white spaces with commas

Dear all,

I am in a bit of a quandary. I have 400 text files which I need to edit and output in a very specific way.

Here is a sample text file copied from gedit ... The columns will come out a bit messed up but when I cat <file>, it gives a table with six columns (0-28, tot lob vol, vcsf, scsf, gray, white):

     tot lob vol        Vcsf           Scsf         gray          white
  0                      0.000000       0.000000       0.000000       0.000000
  1                      0.000000       0.000000       0.000000       0.000000
  2                      0.000000       0.000000       0.000000       0.000000
  3   26355.514477       0.000000    7986.582205   13035.582840    5333.349432
  4   61027.455763       0.000000    7617.445767   26799.098600   26610.911396
  5   16966.834205       0.000000    2342.000090    9508.623764    5116.210351
  6   16675.247439       0.000000    2196.206707    9037.121760    5441.918973
  7   57706.261817       0.000000   16951.324271   24216.177530   16538.760016
  8   91215.992021    2102.113105    7351.708891   41168.535796   40593.634229
  9   56586.444555       0.000000    6690.985687   29733.578183   20161.880685
 10   24258.571350       0.000000    4461.691120   13908.275147    5888.605082
 11   99730.945990    2593.261027   10288.256465   52107.175509   34742.252989
 12    8293.678905     967.819905      55.835764    4209.396188    3060.627049
 13   13760.413773    2139.336947      21.713908    6338.393179    5260.969738
 14   25516.944025       0.000000    7529.556139   10519.871485    7467.516401
 15   30830.647540     451.856088    3489.735233   14415.966999   12473.089220
 16   26945.925979       0.000000    7874.910677   12887.721466    6183.293835
 17   65250.293894       0.000000    7736.355264   29207.274410   28306.664221
 18   17500.375947       0.000000    2880.711810    9932.561970    4687.102167
 19   17848.832473       3.101987    2632.552860    9684.403021    5528.774605
 20   51296.522939       0.000000   14967.086667   21813.171698   14516.264575
 21   92378.203103    3180.570541    8504.614013   40641.198027   40051.820521
 22   58887.084820      14.475939    8116.865654   29876.269579   20879.473649
 23   25229.493242       0.000000    3892.993527   14250.527699    7085.972016
 24   90686.586261    1761.928544   10282.052492   48120.088380   30522.516844
 25   10664.630873    1746.418610     201.629147    5669.398010    3047.185106
 26   12702.636249    2146.574917      26.883886    5388.151200    5141.026246
 27   30967.134963       0.000000    9628.567257   13284.775786    8053.791920
 28   34047.407929     150.963361    4818.419611   15476.846510   13601.178447

If you've read that far, I thank you already.

Basically, I need to do the following:

  1. Delete the first four rows
  2. Delete the first column (numbers 1-28)
  3. Paste each row to the end of the previous row, so one file comes out as one super long line
  4. Comma-delimit all numbers
  5. Add new first field at the beginning of the line containing the file name
  6. Output all of this to a new text file which will take each of the 400 or so text files I have as above, convert them each to their own line consisting of steps 1-5.

So, an example output file from just the above file would be:

file1.txt, 26355.514477, 0.000000, 7986.582205, 13035.582840, 5333.349432, ..., 34047.407929, 150.963361, 4818.419611, 15476.846510, 13601.178447 <carriage return>
file2.txt, etc.... 

I know zippo about programming but pasting together various snippets I've found online have come up with the following which does not seem to working exactly right ... my output is joining some numbers, particularly those from the end of one line and the beginning of another.

#!/bin/sh

#script file placed in the directory above that which contains all the text files
#script run by typing ../script.sh from within the directory containing the files

files=`ls -l *_stats.txt` 
# all files end in _stats.txt

for file in ${files}
do

  cat ${file} | sed '1,4d'| awk '{print $2, $3, $4, $5, $6}' | sed 's/[:space:]+/,/g' >> output.txt  

#read file, delete lines 1-4, print all but the first column, delete all whitespace #between numbers including carriage returns and put them all on one line with #comma-delimiters; then new line with the output of the next file  


done

Any help would be most appreciated. It is sad, but I have spent HOURS trying to figure this out and I am pretty sure it would be ridiculously simple for anyone with basic bash scripting knowledge. Please help!

Best wishes,
Anthonz

For removing the first 4 rows you can use the sed command as follows

sed '1,4d' input_file

For removing the first field which contains the numbers (0-28) you can use the sed command as follows

sed 's/\(^[ ]*[0-9]\{1,2\}\)//g'

I think you want to join all the lines

Heres a sed solution for this

sed '$!N;s/\n/ /'
tr -s ' ' | sed 's/ /,/g'

Here why I am using tr -s means it contains repeated space characters. I am squeezing the repeated characters
into single space and I am converting the space to comma character.

Use this ,

files=`ls *_stats.txt`
for file in ${files}
do
        content=`cat ${file} | sed '1,4d'| sed  -r "s/^[0-9]+\s(.+$)/\1/g" | sed -r "s/\s+/,/g" | tr '\n' '\0'`
        echo "$file,$content" >> output.txt

done

Just modifying your code a little bit...

files=$(ls -1 *_stats.txt)
for file in ${files}
do
cat ${file} | sed '1,4d' | awk  '{printf ("'${file}'"==pfile?","$2","$3","$4","$5","$6:"\n""'${file}'"","$2","$3","$4","$5","$6)}{pfile="'${file}'"}' >> output.txt
done

Many thanks for your thoughtful replies. I will give them a try after some sleep and post the results.
cheers

Hi, antonz:

Welcome to the forums. The following generates output identical to your sample desired output.

for f in *_stats.txt; do
    printf '%s, ' "$f";
    sed '1,4d' "$f" | cut -d' ' -f2- | paste -sd' '  - | sed 's/ /, /g'
done > output.txt

or

for f in *_stats.txt; do
    printf '%s, ' "$f";
    sed -n '1,4d;s/[^ ]* //;H;5h;${x;s/[ \n]/, /gp;}' "$f"
done > output.txt

Safely prepending the filename to each line using sed is problematic, hence the lingering printf.

Cheers,
Alister

Another way:

awk 'FNR < 5{next}
FNR==5{printf("%s%s", f?"\n":"", FILENAME); f=1}
{$1="";printf $0}
END{printf "\n"}
' ORS=", " OFS=", " *_stats.txt > output.txt

Regards