Append next line to previous line when NF is less than 1

Hi All,

This is very urgent. I have a data file with 1.7 million rows, and the delimiter is a cedilla. I need to format the data so that if NF in the next row is less than 1, that row's value is appended to the previous line.

Any help will be appreciated.

Thanks,
cumeh1624

---------- Post updated at 10:50 PM ---------- Previous update was at 10:46 PM ----------

I would prefer a suggestion using the awk command, for performance reasons.

By NF < 1, do you mean the row is empty? (a blank line, perhaps)

 awk '{printf "%s",(NF>0?$0:"\n")}' filename

It means the line has only one field, without a cedilla field separator, or it is a blank line.

awk 'BEGIN{FS='¸'}
  NR == 1 {p = $0; next}
  NF > 1 {print p; p = $0}
  NF <= 1 {p = (p " " $0)}
  END {print p}' input.txt > output.txt

Hi Srinishoo,

The script works fine, but there is a limitation. I tested it with a 1.7-million-row data file, and about 10,471 rows, containing mostly numerical values plus a few characters, with 5 to 9 fields, did not get loaded into the output file.

So when processing the file, this statement:

NF > 1 {print p; p = $0}

did not send about 10,471 rows to the output file. The only difference from the other rows is that they contain mostly numeric values in most of the column fields.

This is an example of what one of the rows that did not get loaded looks like:

29863¸890000000¸543209911¸CHNGOHG¸000000001¸055¸

Do you have a suggestion for handling this issue?

That should not be the case. Try the code below; I exchanged the single quotes for double quotes in the FS assignment.

awk 'BEGIN{FS = "¸"}
  NR == 1 {p = $0; next}
  NF > 1 {print p; p = $0}
  NF <= 1 {p = (p " " $0)}
  END {print p}' input.txt > output.txt

The number of rows will change between input and output, since you are appending some lines to the previous line.
Also verify whether both the input and the output contain the row you provided:

awk '$0 ~ /29863¸890000000¸543209911¸CHNGOHG¸000000001¸055¸/' input.txt
awk '$0 ~ /29863¸890000000¸543209911¸CHNGOHG¸000000001¸055¸/' output.txt

No.
With the awk script:

awk -F'¸' '{print NF, $0}'

The number of fields printed as the 1st field in the output will be the number of ¸ characters present on the line plus 1 for any line that contains any characters other than the terminating <newline> character. The only lines that will have NF < 1 (i.e., NF == 0) will be empty lines. Blank lines (lines containing only <space> and <tab> characters and the terminating <newline> character) that are not empty lines (lines containing only the <newline> character) will have NF == 1 when the field separator is ¸.
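To see this concretely, here is a small demonstration (on made-up sample data) of NF for a data line, an empty line, and a spaces-only blank line:

```shell
# NF per line: a ¸-delimited data line, an empty line, a blank line
printf 'a¸b¸c\n\n   \n' | awk -F'¸' '{print NF}'
# prints 3, then 0, then 1
```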

The command we had in the script that appends the next line to the previous line takes almost 4 hrs because of the while loop; for performance reasons we are looking to use a faster command like awk, but your lines of code do not produce the same data file count.

The lines of code in the script that take four hrs to complete produce an output data file count of 176,060, more than the 175,044 that your lines of code produce.
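One way to reconcile the two counts (a sketch on made-up data, since I can't share the real file) is to count the lines that should get joined; every NF <= 1 line after the first is absorbed into the previous one:

```shell
# stand-in sample, since the real file cannot be shared
printf 'a¸b¸c\ncontinuation line\nd¸e¸f\n' > sample.txt
total=$(wc -l < sample.txt)
joins=$(awk -F'¸' 'NF <= 1' sample.txt | wc -l)
# expected output line count = input lines - absorbed lines
echo $((total - joins))   # prints 2
```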

Below is what the lines of code we had in the script look like:

FLAG=0
cat filename | while read CUR_LINE
do
if [[ $FLAG -ne 0 ]];then
If [[ `echo ${CUR_LINE} | awk -F "¸" '{print NF -1}'` -le 0 ]];then
PREV_LINE="{PREV_LINE} ${CUR_LINE}"
NEW_LINE=`echo ${PREV_LINE} | tr -d '\n' | tr -d '^M'`
PREV_LINE=${NEW_LINE}
else
echo ${PREV_LINE} >> ${OUT_FILE}
PREV_LINE='${CUR_LINE}
fi
else
PREV_LINE=${CUR_LINE}
FLAG=1
fi
done
echo ${PREV_LINE} >> ${OUT_FILE}

What I'm looking for is to reduce the 4-hr completion time, which your lines of code would do, but the output file count is different and the formatting is also different.

Please let me know if you have a suggestion to this issue.

That script is indeed inefficient and would take a long time, but this cannot be the actual script, since it contains several syntax errors. Also, the last line will probably be lost, since in most shells this while loop is executed in a subshell because of the pipe.

Please post a relevant input file and desired output and specify what OS and version you are using. Also, are there carriage returns in your input file?

After reformatting your code so we can see the structure, getting rid of the subshell issue Scrutinizer mentioned, adding missing <dollar-sign> characters, changing <single-quote> characters to <double-quote> characters, and adding missing <double-quote> characters to get around syntax errors:

OUT_FILE=out
FLAG=0
while read CUR_LINE
do
        if [[ $FLAG -ne 0 ]]
        then
                if [[ `echo ${CUR_LINE} | awk -F "¸" '{print NF -1}'` -le 0 ]]
                then
                        PREV_LINE="${PREV_LINE} ${CUR_LINE}"
                        NEW_LINE=`echo ${PREV_LINE} | tr -d '\n' | tr -d '^M'`
                        PREV_LINE="${NEW_LINE}"
                else
                        echo ${PREV_LINE} >> ${OUT_FILE}
                        PREV_LINE="${CUR_LINE}"
                fi
        else
                PREV_LINE="${CUR_LINE}"
                FLAG=1
        fi
done < filename
echo ${PREV_LINE} >> ${OUT_FILE}

we can see that this is grossly inefficient code. Having a while loop is not your problem; executing awk once for each of your 1.7 million input lines (except the 1st) and tr twice for both empty lines and lines with only one field (especially since one of those invocations of tr is always a no-op) is going to be extremely slow.
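If a pure-shell loop were kept, the per-line awk call could at least be replaced by parameter expansion; a sketch, assuming ksh93 (as on AIX) or bash for the ${var//...} form:

```shell
# count fields by deleting the delimiters; no fork per line
line='a¸b¸c'
d='¸'
stripped=${line//"$d"/}                         # line with every delimiter removed
nf=$(( (${#line} - ${#stripped}) / ${#d} + 1 )) # delimiters removed = NF - 1
echo "$nf"   # prints 3
```

Dividing by ${#d} keeps the arithmetic correct whether the shell counts the cedilla as one character or as its byte length.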

Your code seems to be trying to remove <carriage-return> characters from your input (which you never mentioned were present before). And, we can't tell if you're trying to remove <carriage-return> or circumflex and upper-case M characters. (The above code removes all circumflex and upper-case M characters from your input.)

It also converts all sequences of one or more adjacent <space> and <tab> characters to a single <space> character (which again was not mentioned as a requirement until now). Is this intentional, or an accident? Or does your input contain no <tab> characters and no occurrences of multiple adjacent <space> characters?
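The blank squeezing comes from the unquoted expansions in the echo commands; a two-line illustration (with a hypothetical value):

```shell
CUR_LINE='a   b'
echo $CUR_LINE     # unquoted: word splitting collapses the spaces -> "a b"
echo "$CUR_LINE"   # quoted: spacing preserved -> "a   b"
```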

It gets rid of backslash characters at the ends of input lines and joins lines that end with <backslash> characters no matter how many fields are on the joined lines. Is this intentional, or an accident? Or, are you sure that none of your input lines end with a <backslash> character just before a <newline> character?

And, depending on what shell you're using and what operating system you're using, any other <backslash> characters in your input could be deleted or converted to other characters by your uses of echo.

Please show us the code you are really using. Please also upload a SMALL sample input file (not more than 50 lines) that contains examples of all of the transformations that need to take place while removing characters, joining lines, and squeezing blanks, AND upload the desired output corresponding to that input. I explicitly say upload because we need to be sure that we will be able to see the difference between spaces and tabs in your desired input and output and see the <carriage-return> characters in your input.

Hi Don,
That's exactly what the code looks like, and all I'm looking for is to reduce the completion time. I didn't mention the removal of newlines, the squeezing of blanks, or the control-M characters because if I can figure out how to reduce the completion time, I can easily implement the other functionality.

I'm not allowed to share a sample of the data file, but if you can suggest a way around it, or commands to use to reduce the completion time, I will really appreciate it.

Cumeh1624

The code you showed us had syntax errors and would not run with any shell we have ever seen.

The code we have suggested would do exactly what you asked for, but clearly doesn't do what you want with the data you have. If you aren't able to give us a representative sample of data (scrubbed of any private data) that we can use to see what you're actually trying to do, we can't help you.

You have said that the script we have provided doesn't correctly format your output. How can we possibly guess at what that means if we can't see representative input and desired corresponding output?

awk -F "¸" 'NR == 1 {p = $0; next}
  NF > 1 {gsub(/\r/, "", p); print p; p = $0; next}
  {p = (p " " $0)}
  END {gsub(/\r/, "", p); print p}' filename

Hi SriniShoo,

Using that script, I want to be able to remove spaces in the fields, using the cedilla delimiter to detect each field in the data file.

Thanks,
cumeh1624

Please do not leave people guessing. Show a representative sample of input, desired output, attempts at a solution and specify what OS and versions being used, or this thread will be closed.

This is what the input data file rows look like:

09 1598760                                      ¸chnge¸                           03773634¸dgedfr¸2014 04 21 00 00 00PM¸                            hdgete09¸

After removing the spaces in the fields, it should be in this format:

09 1598760¸chnge¸03773634¸dgedfr¸2014 04 21 00 00 00PM¸hdgete09¸

Thanks,
cumeh1624

---------- Post updated at 08:00 PM ---------- Previous update was at 07:21 PM ----------

The OS is AIX and the shell is ksh.
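Given that sample, the trimming step could be sketched like this (assuming only leading and trailing spaces in each field need to go, and that the AIX awk accepts ERE alternation in gsub; the piped-in line is a shortened stand-in for the real row):

```shell
# trim leading/trailing spaces from every ¸-delimited field
printf '09 1598760          ¸chnge¸  03773634¸\n' |
awk -F'¸' -v OFS='¸' '{ for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i); print }'
# prints: 09 1598760¸chnge¸03773634¸
```

Internal spaces, as in 09 1598760 and the timestamp field, are kept; only the padding around each field is removed.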
