Discussion:
[clug] Splitting a file using bash
Hal Ashburner
2014-09-15 00:54:04 UTC
I want to turn stdout into multiple files. I have markers where I
would like this to happen.

stdout:
data
data
data
this_is_a_marker
data2
data2
data2
data2


only it's very large

I have this function which works, but is slow.
Better ideas would include:
1) re-write everything in another language, e.g. Python
2) re-write split reports in C
3) ask CLUG if anyone has a faster way of doing this using standard
bash 4.1.2 or older on a redhat enterprise/centos system.

Yeah I just asked about optimising a shell script, I already feel bad
and you don't have to point out that I should. ;-)

function split_reports()
{
    local input_file="$1"
    local first_report="$2"
    local second_report="$3"
    # generalise the above using "$@" if more than 2 needed
    local breaks_seen=0
    local line=""
    # clobber the first part too, so a stale ${input_file}.0 is not appended to
    : > "${input_file}.0"
    # IFS= and -r keep leading whitespace and backslashes intact
    while IFS= read -r line
    do
        if [[ $line =~ start_report ]]; then
            breaks_seen=$((breaks_seen + 1))
            # clobber it before using it
            # don't write out the marker
            : > "${input_file}.${breaks_seen}"
        else
            case $breaks_seen in
                [0-9]) printf '%s\n' "${line}" >> "${input_file}.${breaks_seen}" ;;
                # echo_stderr is a helper that is not shown in this post
                *) echo_stderr "error breaks_seen is ${breaks_seen} - should be 0-1" ;;
            esac
        fi
    done < "${input_file}"

    mv "${input_file}" "${input_file}.orig"
    mv "${input_file}.0" "${first_report}"
    mv "${input_file}.1" "${second_report}"
}
Andrew Janke
2014-09-15 00:57:41 UTC
Isn't this what csplit is made for?
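
For readers who have not used csplit, here is a minimal sketch of how it could be applied to the sample from the original post. The file names (input.txt, the part. prefix) and the temporary directory are made up for illustration. csplit leaves the marker as the first line of every piece after the first, so the sketch strips it afterwards. Newer GNU coreutils also have a --suppress-matched option that drops the marker directly; it may not be available on older RHEL/CentOS releases, so the sketch avoids it.

```shell
#!/bin/bash
# Sketch: split on marker lines with csplit, then drop the marker line
# that csplit leaves at the top of every piece after the first.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input taken from the original post.
printf '%s\n' data data data this_is_a_marker data2 data2 data2 data2 > input.txt

# -s       : do not print the byte count of each piece
# -f part. : name the pieces part.00, part.01, ...
# '{*}'    : GNU extension, repeat the pattern for every marker found
csplit -s -f part. input.txt '/^this_is_a_marker$/' '{*}'

# Every piece except part.00 starts with the marker line; strip it.
for piece in part.*; do
    if [ "$piece" != part.00 ]; then
        tail -n +2 "$piece" > "$piece.tmp" && mv "$piece.tmp" "$piece"
    fi
done
```

To split the output of a command directly rather than a file, csplit accepts - as the file name and then reads standard input.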


a
--
linux mailing list
linux at lists.samba.org
https://lists.samba.org/mailman/listinfo/linux
Hal Ashburner
2014-09-15 01:01:54 UTC
I hadn't encountered csplit before now.
Looks like this is exactly it, and exactly what I'd hoped for when asking CLUG.
Thanks Andrew!
Scott Ferguson
2014-09-15 01:00:59 UTC
Post by Hal Ashburner
I want to turn stdout into multiple files. I have markers where I
would like this to happen.
Curious - why didn't you use 'split' or 'csplit'??

i.e. for split, 1st pass find byte markers/line numbers, 2nd pass split
at byte markers/line numbers.
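
The two-pass idea can be sketched as follows for a single marker. This is only an illustration under assumptions: split itself cuts at fixed sizes, so the sketch swaps in grep -n for the first pass and head/tail for the second, and the file names are invented.

```shell
#!/bin/bash
# Two-pass sketch for one marker: pass 1 finds the marker's line number,
# pass 2 cuts the file around it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input taken from the original post.
printf '%s\n' data data data this_is_a_marker data2 data2 data2 data2 > input.txt

# Pass 1: -n prefixes each hit with "N:", -m1 stops at the first hit,
# -x only matches whole lines.
marker_line=$(grep -n -m1 -x 'this_is_a_marker' input.txt | cut -d: -f1)
if [ -z "$marker_line" ]; then
    echo "marker not found" >&2
    exit 1
fi

# Pass 2: everything before the marker, then everything after it.
head -n "$((marker_line - 1))" input.txt > first_report
tail -n "+$((marker_line + 1))" input.txt > second_report
```

Because the input is read twice, this needs a regular file rather than a pipe.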

What are the criteria for splitting?
Are these splits mid-line?

-------------8<----------->8------------------


Kind regards
Hal Ashburner
2014-09-15 01:07:18 UTC
Only because I didn't know any better.
Now, thanks to CLUG, I do. :-)

Kim Holburn
2014-09-15 08:39:10 UTC
I haven't used split but couldn't you use a script like this:

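#----------------------------------
#!/bin/bash
# reads stdin; the output file names are given as arguments, one per section

exec > "$1"
shift
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line; do
    # literal comparison: the unquoted right-hand side of == would be a glob pattern
    if [[ $line == this_is_a_marker ]] ; then
        exec > "$1"
        shift
        continue
    fi
    printf '%s\n' "$line"
done
#---------------------------------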
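
A possible way to try the script above, sketched with made-up names: split_on_marker, first_report and second_report are hypothetical, and printf stands in for whatever command produces the real output. The loop is a lightly adapted copy of the script, wrapped in a function whose body runs in a subshell so the exec redirections do not affect the calling shell.

```shell
#!/bin/bash
# split_on_marker is a hypothetical name; the body is adapted from the
# script above. The parentheses run it in a subshell.
split_on_marker() (
    exec > "$1"
    shift
    while IFS= read -r line; do
        if [ "$line" = "this_is_a_marker" ]; then
            # Marker seen: switch stdout to the next file name.
            exec > "$1"
            shift
            continue
        fi
        printf '%s\n' "$line"
    done
)

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Pipe the sample from the original post through the function.
printf '%s\n' data data data this_is_a_marker data2 data2 data2 data2 |
    split_on_marker first_report second_report
```

This still reads line by line in bash, so on very large input it is likely to be slower than a dedicated tool; it needs as many file names as there are sections.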
--
Kim Holburn
IT Network & Security Consultant
T: +61 2 61402408 M: +61 404072753
mailto:kim at holburn.net aim://kimholburn
skype://kholburn - PGP Public Key on request