extract multiple columns from multiple files; skip rows and include filenames; awk

manishabh · August 18, 2009, 5:47pm

Hello,

I am trying to write a bash shell script that does the following:

1. Finds all *.txt files within my directory of interest
2. Reads each of the files (25 files) one by one (all are tab-delimited and share the same data format)
3. Skips the first 10 rows of each file
4. Extracts columns 2, 14, and 15 and prints them into one output file
5. Adds a new column to the final output file with the name of the .txt file from which the data was extracted

I have written a shell script, but it is not working properly and does not yet have the code to skip the first 10 rows.

Below I have pasted a sample input file, the desired output file, and my code.

Input file format:

TYPEtexttexttexttextintegerfloatfloattexttexttextintegerintegerintegerinteger
FEPARAMSProtocol_NameProtocol_dateScan_DateScan_ScannerNameScan_NumChannelsScan_MicronsPerPixelXScan_MicronsPerPixelYScan_OriginalGUIDGrid_NameGrid_DateGrid_NumSubGridRowsGrid_NumSubGridColsGrid_NumRowsGrid_NumCols
DATAmiRNA-v1_95_May07 (Read Only)####################Agilent Technologies Scanner G2505B US45102930155a18d8bd4-628a-4054-b2ba-45c7a66de583016436_D_20070426############1119282
*
TYPEfloatfloatfloatintegerintegerfloatintegerfloatfloatfloatintegerfloatfloatinteger
STATSgDarkOffsetAveragegDarkOffsetMediangDarkOffsetStdDevgDarkOffsetNumPtsgSaturationValuegAvgSig2BkgNegCtrlgNumSatFeatgLocalBGInlierNetAvegLocalBGInlierAvegLocalBGInlierSDevgLocalBGInlierNumgGlobalBGInlierAvegGlobalBGInlierSDevgGlobalBGInlierNum
DATA26.709275.44777100012031791.11899038.717365.42632.954291202965.42632.9542912029
*
TYPEintegerintegerintegertextintegertextintegerintegertexttexttexttextfloatfloat
FEATURESFeatureNumRowColchr_coordSubTypeMaskSubTypeNameProbeUIDControlTypeProbeNameGeneNameSystematicNameDescriptionPositionXPositionY
DATA111 0 01miRNABrightCorner30miRNABrightCorner30miRNABrightCorner30 6774.29228.723
DATA212 66Structural21DarkCornerDarkCornerDarkCorner 6800.2229.421
DATA313chr14:100595916-1005958970 30A_25_P00010115hsa-miR-154*hsa-miR-154*NA6826.51228.385
DATA414chr8:135881995-1358820100 50A_25_P00010390hsa-miR-30bhsa-miR-30bNA6850.48228.853
DATA515chr14:100558179-1005581610 70A_25_P00010956hsa-miR-379hsa-miR-379NA6875.37228.408
DATA616chr19:058916206-0589161860 80A_25_P00011941hsa-miR-517bhsa-miR-517bNA6900.98229.321

Output format: tab delimited file. The last column shows the filename from which the data was extracted

1	6774.29	228.723	ABC.txt
2	6800.2	229.421	ABC.txt
3	6826.51	228.385	DEF.txt
4	6850.48	228.853	DEF.txt
5	6875.37	228.408	XYZ.txt
6	6900.98	229.321	XYZ.txt

My incomplete code:

find -name '*.txt' |
while read filename
do
awk -F"\t" -v name="$file"'
BEGIN {OFS="|"}
{print $2,$14,$15,name}
' $filename > output.txt
done

Thanks in advance for your help.

danmero · August 18, 2009, 5:56pm

Your problem can be solved using awk, but first please edit your first post and add [code] tags.

danmero · August 18, 2009, 8:06pm

Your data sample is useless; try to copy/paste it again and use [code] tags!

From your snippet:
for filename in *.txt	# no need for find: the files are in the current directory anyway
do
	# awk has a built-in FILENAME variable (read the manual)
	awk -F"\t" 'BEGIN {OFS="|"} {print $2,$14,$15,FILENAME}' "$filename" > output.txt
done

Not tested but should work if that's what you want.
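
A minimal sketch of how the snippet above might be extended to cover the remaining requirements, assuming the files are tab-delimited with at least 15 columns. It skips the first 10 rows of each file with awk's FNR, writes tab-separated output, and redirects once so earlier files are not overwritten. The demo directory, the generated sample files, the row helper, and the combined.out name are all hypothetical, only there to make the example runnable:

```shell
#!/bin/bash
# Hypothetical sample data for illustration: two files in ./demo, each with
# 10 header rows followed by 2 tab-delimited data rows of 15 columns.
mkdir -p demo
row() {  # args: feature number, X position, Y position
    printf 'DATA\t%s\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\t%s\t%s\n' "$1" "$2" "$3"
}
{ for i in 1 2 3 4 5 6 7 8 9 10; do echo "header $i"; done
  row 1 6774.29 228.723; row 2 6800.2 229.421; } > demo/ABC.txt
{ for i in 1 2 3 4 5 6 7 8 9 10; do echo "header $i"; done
  row 3 6826.51 228.385; row 4 6850.48 228.853; } > demo/DEF.txt

# FNR restarts at 1 for every input file, so FNR > 10 skips the first
# 10 rows of each file. FILENAME holds the name of the file being read.
# The single redirection outside awk collects all rows in one file.
(cd demo && awk -F'\t' 'BEGIN { OFS = "\t" } FNR > 10 { print $2, $14, $15, FILENAME }' *.txt) > combined.out
```

The output name deliberately does not end in .txt, so a rerun does not pick up the previous result as input.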

manishabh · August 18, 2009, 10:55pm

Hi Danmero,

Thanks a lot for posting the code. I apologise for your frustrating experience trying to understand the tables. I ran the code, but it throws an error:

'test1.sh: line 3: syntax error near unexpected token `do
'test1.sh: line 3: `do

Also, I forgot to mention that each of my files is in a subdirectory. The directory hierarchy is as follows:
root_folder-->ABC-->ABC.txt
           -->CDF-->CDF.txt

So I changed the code a bit as follows:

for filename in $(find -iname '*.txt') 
do
 awk -F"\t" ' 
    BEGIN {OFS="|"} {print $2,$14,$15,FILENAME}
    ' $filename > output.txt
done
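
A sketch of one way the subdirectory layout could be handled, under the same assumptions as before (tab-delimited files, at least 15 columns, 10 rows to skip). Here find passes the files straight to awk, which avoids the word-splitting problems of looping over $(find ...), and the redirection sits outside the command so rows from every file accumulate instead of each file overwriting the last. The sample tree, the row and headers helpers, and the combined_recursive.out name are hypothetical; -iname is a GNU/BSD find extension rather than POSIX:

```shell
#!/bin/bash
# Hypothetical layout mirroring the post: root_folder/ABC/ABC.txt and
# root_folder/CDF/CDF.txt, each with 10 header rows and 1 data row.
mkdir -p root_folder/ABC root_folder/CDF
row() {  # args: feature number, X position, Y position
    printf 'DATA\t%s\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\t%s\t%s\n' "$1" "$2" "$3"
}
headers() { for i in 1 2 3 4 5 6 7 8 9 10; do echo "header $i"; done; }
{ headers; row 1 6774.29 228.723; } > root_folder/ABC/ABC.txt
{ headers; row 2 6800.2 229.421; } > root_folder/CDF/CDF.txt

# find hands all matching files to awk; at the first line of each file the
# directory part is stripped from FILENAME so only e.g. ABC.txt is printed.
find root_folder -type f -iname '*.txt' -exec awk -F'\t' '
    BEGIN    { OFS = "\t" }
    FNR == 1 { name = FILENAME; sub(/.*\//, "", name) }
    FNR > 10 { print $2, $14, $15, name }
' {} + > combined_recursive.out
```

The order in which find visits the files is not guaranteed, so pipe the result through sort if a stable row order matters. The output file again avoids the .txt suffix, because the shell creates it before find starts and it would otherwise be matched as an input.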