Split a line into multiple lines based on delimiters

redse171 · August 19, 2014, 2:24pm

Hi,

I need help splitting any line that contains ; or , into multiple lines.

input.txt

Ac020	 Not a good chemical process
AC030	 many has failed, 3 still maintained
AC040	 Putative; epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative; upregulated in 48;h biofilm vs planktonic

The output should be:
Output.txt

Ac020	 Not a good chemical process
AC030	 many has failed 
AC030    3 still maintained
AC040	 Putative
AC040    epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative
AC060    upregulated in 48
AC060    h biofilm vs planktonic

I tried the code below, but it does not give me the ID in the first column for the split lines:

sed -e 's/\(.\), /\1\n\t\t /g' input.txt | sed -e 's/\(.\);/\1\n\t\t/g' > Output.txt

The result that I got is:

Ac020	 Not a good chemical process
AC030	 many has failed 
         3 still maintained
AC040	 Putative
         epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative
         upregulated in 48
         h biofilm vs planktonic

I don't know how to make it show the ID. Can anyone advise or help me with this? Thanks.

Akshay_Hegde · August 19, 2014, 2:31pm

Try

$ cat file
Ac020	 Not a good chemical process
AC030	 many has failed, 3 still maintained
AC040	 Putative; epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative; upregulated in 48;h biofilm vs planktonic

$ awk 'gsub(/[;,]/,RS $1 OFS) + 1' OFS='\t' file

Result:

Ac020	 Not a good chemical process
AC030	 many has failed
AC030	 3 still maintained
AC040	 Putative
AC040	 epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative
AC060	 upregulated in 48
AC060	h biofilm vs planktonic
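
For anyone reading the one-liner later, here is a sketch of the same idea written out in long form. gsub() returns the number of replacements it made, so "+ 1" in the one-liner only makes the pattern always true and every line gets printed; the version below uses an explicit print instead. The function name split_with_gsub is made up for this illustration, and it assumes the ID is the first whitespace-separated field.

```shell
# Hypothetical helper: same logic as the one-liner, one step per line.
# Usage: split_with_gsub input.txt
split_with_gsub() {
  awk -v OFS='\t' '
  {
    # Replace every ; or , with: a newline (RS), the ID ($1), then a tab (OFS).
    # The replacement string is built from $1 before the line is modified.
    gsub(/[;,]/, RS $1 OFS)
    print
  }' "$1"
}
```

Calling split_with_gsub input.txt > Output.txt should give the same lines as shown above, including the last line where no blank followed the ";" in the input.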

redse171 · August 19, 2014, 2:40pm

Hi Akshay Hegde,

Thanks a bunch! It worked great.

Akshay_Hegde · August 19, 2014, 3:12pm

This will also work, though it is a bit lengthy:

$ awk 'match($0,regex){ n=split(substr($0,length($1)+1),A,regex); for(i=1;i<=n;i++)print $1,A[i]; next }1' regex='[;,]' file

Or, without the match() call:

$ awk 'n=split(substr($0,length($1)+1),A,regex){for(i=1;i<=n;i++)print $1,A[i]; next }1' regex='[;,]' file
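
If the split-based approach is hard to follow on one line, this is roughly what it does, spread out with comments. It is only a sketch: the function name split_by_delims is invented here, it assumes each line starts with the ID (no leading blanks), and unlike the one-liners it also trims the blanks around each piece and joins ID and text with a single tab.

```shell
# Hypothetical helper: split-based version with one step per line.
# Usage: split_by_delims input.txt
split_by_delims() {
  awk '
  {
    id   = $1                               # first field is the ID
    rest = substr($0, length($1) + 1)       # everything after the ID
    n    = split(rest, parts, /[;,]/)       # cut the text at every ; or ,
    for (i = 1; i <= n; i++) {
      piece = parts[i]
      gsub(/^[ \t]+|[ \t]+$/, "", piece)    # trim blanks (added in this sketch)
      print id "\t" piece                   # repeat the ID on every output line
    }
  }' "$1"
}
```

Lines without a delimiter go through the same loop with n equal to 1, so they are printed once with their blanks normalized.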

redse171 · August 19, 2014, 3:18pm

Yeah, I tried both and they work great too. They are a little harder to understand than the first one, but I am glad they give me something to think about. Thanks.

Chubler_XL · August 19, 2014, 5:00pm

You can also do this with sed:

sed -E ':a ; s/^([^ \t]+[ \t]+)([^,;]+)[,;][ \t]*/\1\2\n\1/; ta' infile
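
The same sed command can be laid out over several lines with comments, which may make the loop easier to see. This is a sketch that assumes GNU sed (it relies on \t inside brackets and \n in the replacement, which other sed implementations may not accept); the function name split_with_sed is made up for the example.

```shell
# Hypothetical helper: the sed loop, annotated. Assumes GNU sed.
# Usage: split_with_sed input.txt
split_with_sed() {
  sed -E '
    :a
    # \1 = the ID plus the blanks after it
    # \2 = the text up to the first , or ; (blanks after the delimiter are dropped)
    s/^([^ \t]+[ \t]+)([^,;]+)[,;][ \t]*/\1\2\n\1/
    # If the substitution matched, jump back to :a and handle the next delimiter.
    ta
  ' "$1"
}
```

Each pass replaces one delimiter with a newline followed by a copy of the ID prefix, and the loop ends when no delimiter is left on the line.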

redse171 · August 19, 2014, 5:14pm

Yeah, it worked great too. Thanks, Chubler_XL.

RavinderSingh13 · August 20, 2014, 3:44am

Hello redse171,

The following may also help.

awk '{a=$1; gsub(/[,;]/,"\n" a OFS); print}' filename

Output will be as follows.

 
Ac020    Not a good chemical process
AC030    many has failed
AC030  3 still maintained
AC040    Putative
AC040  epithelial cells
AC050    Predicted binding activity
AC060    rodC Putative
AC060  upregulated in 48
AC060 h biofilm vs planktonic

Thanks,
R. Singh

redse171 · August 20, 2014, 9:08am

Hi R. Singh,

Yeah, it worked perfectly too. Thanks so much.