Removal of multiple characters with in double quotes

Jag_1981 · January 11, 2018, 10:18pm

For one of my need I was going through post "Removal of new line character in double quotes"

Which alister has replied like

$ cat data
"leave me alone"

"ABCD RENT-A-
CAR XYZ LTD","00N0H","Enterprise Lake","
100 View Way"
$ sed -n 'H;g;/^[^"]*"[^"]*\("[^"]*"[^"]*\)*$/d; s/^\n//; y/\n/ /; p; s/.*//; h' data
"leave me alone"

"ABCD RENT-A- CAR XYZ LTD","00N0H","Enterprise Lake"," 100 View Way"

In my case requirement is to remove new line character as well as pipe (|) character when it is present inside double quotes. So for that how to enhance this code snippet. I am new to to this area and unable to under stand how mentioned code snippet work. Thanks for any help.

RavinderSingh13 · January 11, 2018, 11:39pm

Hello Jag_1981,

If your Input_file is same as shown sample then following may help you in same too.

awk '{printf("%s%s",$0~/^"/?(FNR==1?"":RS):FS,$0)} END{print ""}'  Input_file

Thanks,
R. Singh

Don_Cragun · January 12, 2018, 1:20am

jag_1981:

For one of my need I was going through post "Removal of new line character in double quotes"

Which alister has replied like
$ cat data
"leave me alone"

"ABCD RENT-A-
CAR XYZ LTD","00N0H","Enterprise Lake","
100 View Way"
$ sed -n 'H;g;/^[^"]*"[^"]*$"[^"]*"[^"]*$*$/d; s/^\n//; y/\n/ /; p; s/.*//; h' data
"leave me alone"

"ABCD RENT-A- CAR XYZ LTD","00N0H","Enterprise Lake"," 100 View Way"
In my case requirement is to remove new line character as well as pipe (|) character when it is present inside double quotes. So for that how to enhance this code snippet. I am new to to this area and unable to under stand how mentioned code snippet work. Thanks for any help.

You say that your requirement is to remove <newline> and <vertical-bar> characters found in double-quotes, but the sample data you have provided doesn't contain any <vertical-bar> characters (either inside or outside) pairs of double-quotes.

If you can't provide representative sample input data and the sample output that you want to produce from that input (in CODE tags), it makes it hard for us to know whether or not we are on the right track when making suggestions that might work for you.

It is also a very good idea to tell us what operating system and shell you're using whenever you start a new thread in this forum. Although the standard utilities perform the same basic operations, many utilities have additional features on some operating systems. If we know what operating system and shell you're using, we can limit our suggestions for you to things that will work in your environment.

When you read the sed manual page on your system to help you figure out how the code above works, where did you get stuck trying to understand what it does? What modifications have you tried on your own to meet your additional requirements?

Jag_1981 · January 12, 2018, 11:10am

My apology for not providing all reqd details..

Operating system - RHEL 6.9

Sample Input file.

$ cat data
111|"IKJA - SPORTS"|00IIQ|Normal|100 Hall Road|

123|"ABCD RENT-A-
CAR XYZ LTD"|00N0H|Enterprise Lake|"
100 View Way"|

244|"DEFG Travel | Tour
World LTD"|"AK|0Q"|Praire Lake|"
105 NE Main St"|

Expected Output File:

$ cat data
111|IKJA - SPORTS|00IIQ|Normal|100 Hall Road|

123|ABCD RENT-A-CAR  XYZ LTD|00N0H|Enterprise Lake|100 View Way|

244|DEFG Travel  Tour World LTD|AK0Q|Praire Lake| 105 NE Main St|

The input file is a pipe delimited csv file. However, there are some data which contain a new line character or pipe symbol ( every time enclosed with in double quote). And I do not have the option at present to change the file in source side. So looking for an option to correct the file on myside.

RudiC · January 12, 2018, 6:07pm

So you want to lose ALL double quotes as well? Try

awk -F"\"" -vRS= -vOFS= '
    {for (i=2; i<=NF; i+=2) gsub (/[|\n]/, "", $i) 
     $1 = $1
     print $0 ORS
    }
'  file
111|IKJA - SPORTS|00IIQ|Normal|100 Hall Road|

123|ABCD RENT-A-CAR XYZ LTD|00N0H|Enterprise Lake|100 View Way|

244|DEFG Travel  TourWorld LTD|AK0Q|Praire Lake|105 NE Main St|

Jag_1981 · January 12, 2018, 7:05pm

Thanks Rudic for your help. I tried the same, but it's removing new line character from end of line too. Also, the pipes (|) in following lines are not getting removed as expected.

$ awk -F"\"" -vRS= -vOFS= '
    {for (i=2; i<=NF; i+=2) gsub (/[|\n]/, "", $i)
     $1 = $1
     print $0 ORS
    }
'  test.csv > test1.csv
$ cat test.csv
111|"IKJA - SPORTS"|00IIQ|Normal|100 Hall Road|
123|"ABCD RENT-A-
CAR XYZ LTD"|00N0H|Enterprise Lake|"
100 View Way"|
244|"DEFG Travel | Tour
World LTD"|"AK|0Q"|Praire Lake|"
105 NE Main St"|

$ cat test1.csv
111|IKJA - SPORTS|00IIQ|Normal|100 Hall Road|123ABCD RENT-A-CAR XYZ LTD|00N0H|Enterprise Lake|100 View Way244|DEFG Travel  TourWorld LTDAK|0QPraire Lake105 NE Main St|

$ cat expected_output.csv
111|IKJA - SPORTS|00IIQ|Normal|100 Hall Road|
123|ABCD RENT-A-CAR XYZ LTD|00N0H|Enterprise Lake|100 View Way|
244|DEFG Travel Tour World LTD|AK0Q|Praire Lake|105 NE Main St|

$

RudiC · January 12, 2018, 8:16pm

Your test.csv does not comply to the data structure you posted earlier - data in post#4. There you had a blank line as a record separator on which to rely the script was laid out.

Don_Cragun · January 12, 2018, 8:29pm

Hi Jag_1981,
No. You changed the format of the input data in between post #4 and post #6!

In your sample input data in post #4 in this thread, records are separated by a blank line. But, the sample data that you say is not working has no separator between records and, therefore, there is no way to determine where one record ends and the next begins.

Your sample output also differs by removing the blank lines between output records.

Please do not blame RudiC for providing you with bad code when the code he provided works perfectly with the description of what was to be done on the sample input you originally provided.

Note also that your output samples are not consistent with your input samples in either post #4 or post #6. You say you want double-quoted <newline>s and <vertical-bar>s to be removed, but that is only true part of the time. In the sample outputs that you have shown us some of those characters are replaced by <space> characters instead of being removed. For an example look at Tour World versus TourWorld in both of those posts and RudiC's code removes all of them as requested in your problem statements, but it doesn't match (and can't match) the sample output you provided in either case. If some <newline> characters are to be replaced by <space> instead of being removed, you need to CLEARLY specify the logic that can be used to determine which action is to be taken. You also sometimes replace a single <space> with two adjacent <space> characters in one place in post #4.

Similarly, if there aren't any blank lines between records in your input file, you need to clearly specify how the end of a record is supposed to be identified. Please help us help you by clearly specifying what is supposed to happen and by providing sample inputs and outputs that match the behavior that you describe.

Jag_1981 · January 12, 2018, 9:00pm

Dear Don/RudiC,

My sincere thanks for being patience with me as well as helping me with my need.

I understand fully now that by sharing incorrect or partial input/output file without paying full attention to the same, I am wasting your valuable time.

I am attempting to summarize again my need with below details.

My input file is Pipe (|) Delimited CSV file.
It has multiple records and end of record is identified by new line character.
There is no blank lines between each record ( either in input or output file)
I want only double-quoted <newline>s and <vertical-bar>s to be removed. (replaced by Null)
The double quotes itself should be removed. (Replaced by Null)

Sample Input File:

111|"IKJA - SPORTS"|00IIQ|Normal|100 Hall Road|
123|"ABCD RENT-A-
CAR XYZ LTD"|00N0H|Enterprise Lake|"
100 View Way"|
244|"DEFG Travel | Tour
World LTD"|"AK|0Q"|Praire Lake|"
105 NE Main St"|

Expected Output file:

111|IKJA - SPORTS|00IIQ|Normal|100 Hall Road|
123|ABCD RENT-A-CAR XYZ LTD|00N0H|Enterprise Lake|100 View Way|
244|DEFG Travel  TourWorld LTD|AK0Q|Praire Lake|105 NE Main St|

Don_Cragun · January 12, 2018, 10:08pm

With the sample input shown in post #9 stored in a file named file.csv , the following code:

awk -F'\n' -v dq='"' '
{	record = record $0
	#printf("record in:\n%s\n", record)
}
(n = split(record, f, dq)) % 2 {
	#printf("split into %d fields\n", n)
	for(i = 2; i <= n; i += 2) {
		gsub(/[|]/, "", f)
		#printf("f[%d] updated to: \"%s\"\n", i, f)
	}
	for(i = 1; i <= n; i++)
		printf("%s%s", f, (i == n) ? ORS : "")
	record = ""
}' file.csv

produces the output requested in post #9:

111|IKJA - SPORTS|00IIQ|Normal|100 Hall Road|
123|ABCD RENT-A-CAR XYZ LTD|00N0H|Enterprise Lake|100 View Way|
244|DEFG Travel  TourWorld LTD|AK0Q|Praire Lake|105 NE Main St|

But note that it removes <newline> and <vertical-bar> characters found between pairs of <double-quote> characters; it does NOT replace them with <NUL> characters. Replacing those characters with <NUL> characters would give you a binary file instead of a text file.

If someone else wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

If you uncomment the commented out printf() statements, you can get an inside view at how it accumulates records and removes unwanted <vertical-bar>s and <newline>s.