Shell script for .Txt to .csv conversion with data processing

Gautam_Banerjee · January 3, 2015, 11:41am

Hi experts,

I want to convert a txt file having rows and columns (CNAI_DUMP_raw.txt) by comparing it with another text file (paramaters.txt) and generate a output in CSV which contains only 3rd column from CNAI_DUMP_raw.txt, and the columns mentioned in parameters.txt.

FYI: There are two columns in parameters.txt one containing some column name from CNAI_DUMP.txt and other with the ideal value against it.

The output should contain actual value of the cell and the ideal value mentioned in parameters.txt.

Both the files are attached herewith.

Please let me know for any more information.

Scrutinizer · January 3, 2015, 12:16pm

Hi, what have you tried and where are you stuck? What column from CNAI_DUMP.txt needs to be matched agains which column in parameters.txt..

Gautam_Banerjee · January 3, 2015, 1:06pm

Hi,

I need to match 1st row in CNAI_DUMP_raw.txt with 1st column of parameter.txt

Desired output in CSV:

1st column: Parameter ( This column is present in CNAI_DUMP_RAW.txt in the first row and Parameter.txt in the first column)

2nd Column: Ideal Value ( This is the 2nd column present in Parameter.txt)

3rd Column: Cell Name ( Present in 4th cloumn of CNAI_DUMP_RAW.txt)
E.g RBR201A
RBR201B...

4th Column: Actual Value ( Present in CNAI_DUMP_RAW.txt)
This is the value of the cell in (CNAI_DUMP_RAW.txt) with Row id as Cell name and Column Id as 1st Column of Parameter.txt

---------- Post updated at 12:52 PM ---------- Previous update was at 12:38 PM ----------

Hi Scrutinizer,

I have tried to explain my desired output in the post above.

Also the input files were attached in the first post.

The desired output which I want to show in in csv format and can not be shown in code.

Please let me know for more clarification

---------- Post updated at 01:06 PM ---------- Previous update was at 12:52 PM ----------

Hi Scrutinizer,

I have tried to explain my desired output in the post above.

Also the input files were attached in the first post.

The desired output which I want to show in in csv format and can not be shown in code.

Please let me know for more clarification

ongoto · January 3, 2015, 5:52pm

Has anyone found any matches in these two files?

Don_Cragun · January 3, 2015, 8:32pm

The OP didn't bother mentioning:

that the field names contain uppercase letters in paramaters.txt but contain lowercase letters in CNAI_DUMP_raw.txt ,
what operating system is being used (which with an input file with lines containing well over 10,000 bytes per line can be very important), nor
what shell is being used.

If the OP is unwilling to provide a small set of sample input files, corresponding desired sample output, and a clear explanation of the processing to be performed to produce that desired output from the input files; there isn't a high likelihood that a working solution will be provided here. (Nor is there much incentive for the volunteers here to download 150Kb of data to try to guess at what needs to be done.)

Furthermore, when asked "what have you tried and where are you stuck?", the response was two repeated demands for us to provide working code.

The UNIX & Linux Forums is here to help people learn how to use the great tools available on UNIX and Linux systems to perform wondrous computing tasks. We are not here to work as unpaid programming staff for someone who is unwilling to demonstrate any attempt to do any work to get the job done.

Gautam_Banerjee · January 4, 2015, 2:46am

Hello Experts,

Apologies for the inconvenience faced, as I may have not able to present the problem.

Of course neither I am here to get my work completely done not I am a hacker but to learn and take help for experts like you for a text processing issue.

My OS version:

Linux L9AHD91 3.8.0-44-generic #66~precise1-Ubuntu SMP Tue Jul 15 04:01:04 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Shell is bash shell

Work done so far:

a=$( cat CNAI_DUMP_raw.txt | awk '{print $4}') ;

awk '{print $2}' paramaters.txt > b.txt

b= cat b.txt

c=$( cat paramaters.txt | awk '{print $1}') ;

echo $a;
echo $b;
echo $c;

Sample i/p files are CNAI_DUMP_raw.txt and paramaters.txt already attached
The output file required is in .csv format. How can I attach it here?

Don_Cragun · January 4, 2015, 4:27am

Rather than uploading well over a hundred kilobytes of data, it would be much better to:

describe the number of fields, number of bytes / line, and total sizes of your real input and expected output files,
provide abbreviated (fewer fields and fewer lines, but enough to show the structure of your data) sample input (in CODE tags) in your post instead of hoping that some of the volunteers here have space to burn on the systems they use to try to help you,
provide sample output corresponding to your sample input (also in CODE tags), and
describe in English what processing needs to be done to produce the output you want from the input files you have supplied (including case translations, input and output field separators [and when an input file uses space as a field separator and as data in a field, explain how you know which spaces are data and which spaces are separators]).

Don't just show us the code you have written. Explain what it is trying to do and how that relates to producing the output you want. We can help you fill in missing pieces of code once we understand what you're trying to do.

If you absolutely must upload a .csv file; assuming you have the desired output in a file named output.csv , move or link it to output.txt and upload output.txt .

RudiC · January 4, 2015, 5:08am

Having read and appreciating the comments, and being as overwhelmed as the others by the data files, and having removed the <CR> line terminators from the DOS/windows files, I'd like to propose for discussion (hoping to have guessed the requirements correctly)

awk     'FNR==NR        {T[$1]=$2; next}
         FNR==1         {$0=toupper($0) 
                         for (n=1; n<=100; n++) if ($n in T) C[$n]=n
                         getline
                         next   
                        }
                        {for (n in T) if (C[n]) print FNR, n, T[n], $4, $(C[n])
                        }
        ' OFS="," /tmp/paramaters.txt /tmp/CNAI_DUMP_RAW1.txt

This is working on a subset of the data; applying it to the entire data set as well as redirecting it to the desired output file is left to the reader.

derekludwig · January 4, 2015, 11:44am

Gautam Banerjee,

Some of the space separated key/value pairs of parameters.txt are malformed:

NCCPERM As Per BSIC PLAN
CS 1 for non MBC
CHAP 6 for MBC
CHAP 11 for BCCH on OL

Also, providing five lines of 4068 columns each is not a good example of an input datafile. Can you provide the columns that are pertinent to the problem at hand?

As Don Cragen as requested, please give better description of what you are trying to accomplish, including output? This should include something like:For example, on line 4 of the input sample data, the value of the 5 column ("a.dent") is used as the lookup key in the parameters table. The returned value ("towel") is used as column 4 of the output file.

The result should look something like:

$1 $2 $4 lookup($5)

ongoto · January 4, 2015, 11:58am

Also for discussion, here's how CNAI_DUMP_RAW1.txt appears when loaded into Gnumeric. Each line extends to column EZL. I don't suppose it matters how many columns that is?

Gautam_Banerjee · January 4, 2015, 1:28pm

Hi experts,

I have pasted a small sample of two inputs and one output desired.

Any suggestions on how to get the desired output using awk, sed or a shell script file?

Please suggest some code to achieve the desired output.csv

ongoto · January 4, 2015, 1:29pm

Parameter   Ideal Value     Cell Name       Actual Value
---------   -----------     ---------       ------------
accmin      Ranges          RBR201A         100
cchpwr      33              RBR201A         33
cchpwr      30              RBR201A         33
att         yes             RBR201A         yes
...
...
maxret      2               RBR201A         1

@ Gautam Banerjee
Is this sample output close to the output you are looking for?

derekludwig · January 4, 2015, 2:18pm

Gautam Banarjee,

Can you check the parameters.txt file that your provided? There are differences between your OnGoto's postings. For example, the parameter accmin has a value of '100' as shown by your posting, while the file has 'RANGES', as shown by OnGoto.

Please also remember that standard CSV does not provide a method to highlight values or cells.

DL

Don_Cragun · January 4, 2015, 2:19pm

Please help us help you.

I applaud your efforts to show us reasonable sized sample inputs and corresonding output. But, PLEASE, don't show us pictures of data. (Especially with some data truncated due to field widths in the pictures of tables you show us.) Give us text (in CODE tags) that we can copy and paste into files we can use as test data to try to test code we might write to solve your problem!

I see that the name of one of your input files has changed from paramaters.txt to parameters.txt (and that is good), but why does parameters.txt show four data rows while your expected output has six data columns? Why is the 3rd column in your output (with heading limit1 ) present when limit1 does not appear in parameters.txt ?

You don't show what field separators are to be used in your output. Although most .csv files use comma as the separator, that isn't always the case.

And, flat text .csv files don't contain red coloring. Explain how (or if) this distinctive output is supposed to actually appear in output.csv .

And, please be sure that the sample data you provide is representative of problems your script will have to handle. For example, your original data files contained things like:

NCCPERM As Per BSIC PLAN
CHAP 6 for MBC
CHAP 11 for BCCH on OL

which I assume in your latest example should now be:

nccperm As Per BSIC PLAN
chap 6 for MBC
chap 11 for BCCH on OL

but your latest sample doesn't show us anything like these cases.

Please explain what the output should look like for a column with the header nccperm ? Should it be nccperm(As) or nccperm(As Per BSIC PLAN) or nccperm(as) or nccperm(as per basic plan) or something else?

Which input line is supposed to be used for chap ? Or are there supposed to be two copies of that input column in the output.csv file with different values in parentheses on the header line?

ongoto · January 4, 2015, 3:08pm

Guys
I've been trying to decypher post#3 to come up with an output example to work from. The OP listed 4 columns, but maybe I'm missing the point. The way it reads to me is; column 1 and 2 get info from Parameter.txt; columns 3 and 4 from CNAI_DUMP_RAW.txt after a lookup for a match in the header row, and then values from the other rows in those matching columns.
That's how I came up with 'Ranges' (Parameters.txt col 2).

P.S. I sympathize with the OP in not being able to use his native tongue. I would hate to have to explain all this in French; being that English is all I know ( and I could use some lessons with that even ).

Gautam_Banerjee · January 4, 2015, 10:27pm

Ongoto: Yes I am looking for that output only. Instead of Ranges, Lets keep 100 as a value.

Derekludwig: Ranges actually refers to [100-103], Here to make our work somewhats easier, I have selected 100.

ongoto · January 5, 2015, 10:42am

A program which extracts the values you want to see shouldn't be hard to do, but you will have to edit the parameters.txt file yourself and provide the values you want to see, and omit what you dont want to see. The script can only extract the values you offer. It can't do the editing. In order for scripts to work, the data types and format have to be consistent.

You need to use single words with no spaces, or use numeric values, in the values column:
HSN AsperPlan # would be good (camel case)

# The following types can be a problem for Linux:
CS 1 for non MBC
CHAP 6 for MBC 
CHAP 11 for BCCH on OL
etc.
# Better would be CS_1_for_non_MBC, etc. (or camel case)

If you want to use 100 in place of Ranges , then you will have to make those changes yourself, and give us (or yourself) a modified parameters.txt file to work from.

The field names in parameters.txt that are not found in CNAI_DUMP_RAW1.txt should be omited from parameters.txt.

---------- Post updated 01-05-15 at 07:42 AM ---------- Previous update was 01-04-15 at 11:17 PM ----------

First attempt in bash. Awk would be much faster I'm sure.

#!/bin/bash
###############################################################################
# Use these commands to strip carriage returns from input files
# sed -i 's|\r$||g' CNAI_DUMP_RAW1.txt
# sed -i 's|\r$||g' parameters.txt
###############################################################################
dumpfil="./CNAI_DUMP_RAW1.txt"
parmsfil="./parameters.txt"
outfil="./csv.out"
> $outfil

hdr=($(head -n 1 $dumpfil))
lnum=0
while read raw
do
    lnum=$(( $lnum + 1 ))
    [ $lnum -lt 3 ] && continue     # skip header and row separators
    while read parm1 parm2
    do
        fnum=0  # lookup actual values
        for fld in ${hdr
[*]};
        do
            fnum=$(( $fnum + 1 ))
            if [[ $fld == ${parm1,,} ]]; then   # field headers are lower case
                break
            fi
        done
        cell=$(echo $raw | cut -d ' ' -f 4 2>&1)
        act_val=$(echo $raw | cut -d ' ' -f $fnum 2>&1)
        printf "%s,%s,%s,%s\n" "$parm1" "$parm2" "$cell" "$act_val" | tee -a $outfil
    done < $parmsfil
done < $dumpfil

# output
# ------
# CCMIN,RANGES,RBR201A,100
# CCHPWR,33,RBR201A,33
# CCHPWR,30,RBR201A,33
# ATT,YES,RBR201A,YES
# CB,NO,RBR201A,NO
# CBQ,HIGH,RBR201A,HIGH
# ...
# ...
# etc.

# real    2m8.955s
# user    1m59.455s
# sys   0m12.274s