Egrep or awk for removing values within CSV file?

macurdy · February 25, 2015, 9:56pm

Hello,

I have a large CSV file that contains all of its values in a single, very long row (i.e. no line breaks until the end, with all data values separated by commas).

The file has two types of values. One begins with the letters rs followed by some numbers. The other begins with the letter i followed by some numbers. An example is below (the IDs are genome identifiers).

rs28931576,rs11542040,rs28931577,rs429358,i6007484,i6007510,rs28931578,i6007500,i6007489,i5000217,i6007504,i6007493,rs769455,i6007507,i6007497,i6007512,i6007495,i6007485,i6007492,i5000216,i5000205,rs7412

My Unix command line knowledge was enough to use the cat and cut commands to get the above data to this point.

I can't seem to figure out how to remove all of the values that begin with the letter i. I've tried some awk and egrep commands, but don't have the mastery yet to get this figured out.

I also need a way to get rid of duplicate commas after the i values are removed.

Right now, I'm using Find and Replace in TextEdit on a Mac to do these steps, but I'd love to be able to script this.

Any help is much appreciated!

pilnet101 · February 25, 2015, 11:53pm

Try:

awk '!/i/' RS="," ORS="," file.csv

Don_Cragun · February 26, 2015, 12:33am

Please use CODE tags around sample input, output, and code in your posts. Without them, long lines get split into multiple lines adding extraneous whitespace into your sample data.

You could also use something like:

sed -e 's/i[^,]*,//g' -e 's/i[^,]*$//' file

With your sample input, this produces the output:

rs28931576,rs11542040,rs28931577,rs429358,rs28931578,rs769455,rs7412
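One caveat with the command above: if the last value on the line began with i, the second expression would remove it but leave the comma in front of it, and an i inside a value (not at its start) would also match. The following is only a sketch of a variant that anchors the match to the start of a field. The function name strip_i_sed is made up for illustration, and it assumes a sed that accepts -E for extended regular expressions (GNU sed and the BSD sed on a Mac both do):

```shell
# Hypothetical helper: remove every comma-separated field that starts with "i".
# (^|,) ties the match to the start of a field, so the comma in front of a
# removed field goes with it. The second expression drops the one comma that
# is left at the start of the line when the first field was removed.
strip_i_sed() {
    sed -E -e 's/(^|,)i[^,]*//g' -e 's/^,//'
}

# Sample input with i values in the first and last positions.
printf 'i6007484,rs429358,i6007510,rs7412,i5000205\n' | strip_i_sed
# prints: rs429358,rs7412
```

Because each removed field takes its leading comma with it, no duplicate or trailing commas are left behind to clean up afterwards.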

If your input lines are longer than LINE_MAX on your system, sed may fail due to line length limitations, while the awk script pilnet101 suggested should still work. But if the line length limit isn't a problem, sed is usually smaller and faster than awk for jobs like this. You can find the value of LINE_MAX on your system with the command:

getconf LINE_MAX

but it should not be less than 2048.
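To decide up front whether the line length limit matters for a given file, something like the sketch below might help. Since the whole file is one line, its size in bytes is the line length. The function names are hypothetical. The second function is a different technique from the sed and awk commands in this thread: it puts each value on its own line with tr, filters with grep, and joins the survivors again with paste, so the line-oriented tools in the middle only ever read short lines.

```shell
# Hypothetical helper: succeed if the file holds more bytes than LINE_MAX.
# For a file that is one long line, the byte count is the line length.
exceeds_line_max() {
    size=$(wc -c < "$1" | tr -d ' ')   # some wc versions pad the count with spaces
    [ "$size" -gt "$(getconf LINE_MAX)" ]
}

# Hypothetical helper: one value per line, drop lines starting with "i",
# then join what is left back into a single comma-separated line.
strip_i_stream() {
    tr ',' '\n' | grep -v '^i' | paste -s -d ',' -
}

printf 'rs429358,i6007484,rs7412,i5000205\n' | strip_i_stream
# prints: rs429358,rs7412
```

A script could call exceeds_line_max file and use sed when it fails (the line is short enough), falling back to awk or the tr pipeline otherwise.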

Scrutinizer · February 26, 2015, 3:36am

With this particular input, this might also work:

sed 's/i[^,]*,\{0,1\}//g' file

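pilnet101's awk suggestion works fine, but it does leave a trailing comma instead of a newline at the end of the output.
It would not hurt to use !/^i/ instead of !/i/, which would provide extra security here because it only matches values that begin with i, even though it is not required with this particular input.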
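As a rough sketch of how both points could be handled in awk (the function name strip_i_awk and the variable sep are made up for illustration): keep RS="," so each value is a record, test with !/^i/, and print the separator in front of every kept value except the first, so nothing trails at the end.

```shell
# Hypothetical helper: filter comma-separated values read from standard input.
strip_i_awk() {
    awk -v RS=, '
        { sub(/\n$/, "") }                            # last value carries the line ending
        !/^i/ { printf "%s%s", sep, $0; sep = "," }   # comma only between kept values
        END   { print "" }                            # finish with a single newline
    '
}

printf 'rs429358,i6007484,rs7412,i5000205\n' | strip_i_awk
# prints: rs429358,rs7412
```

With a real file, the same function would be used as strip_i_awk < file.csv.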

macurdy · February 26, 2015, 2:43pm

Thank you for the help. With these suggestions, I was able to get the text formatted correctly.