Splitting archive and combining later on the fly

Many of us use tar (many times with gzip or bzip2) for archiving purposes. When performing such an action, a large file, usually, too large, remains. To extract from it, or to split it becomes an effort.

This post will show an example of a small script to split an archive and later on, to directly extract the data out of the slices.

Let’s assume we have a directory called ./Data . To archive it using tar+gzip, we can perform the following action:

tar czf /tmp/Data.tar.gz Data

For verbose display (although it’s could slow down things a bit), add the flag ‘v’.

Now we have a file called /tmp/Data.tar.gz

Lets split it to slices sized 10 MB each:

cd /tmp
mkdir slices
i=1 # Our counter
skip=0 # This is the offset. Will be used later
chunk=10 # Slice size in MB
let size=$chunk * 1024 # And in kbytes
file=Data.tar.gz # Name of the tar.gz file we slice
while true ; do
# Deal with numbers lower than 10
if [ $i -lt “10” ]; then
j=0${i}
else
j=${i}
fi
dd if=${fie} of=slices/${file}.slice${j} bs=1M count=${chunk} skip=${skip}
# Just to view the files with out own eyes
ls -s slices/${file}.slice${j}
if [ `ls -s slices/${file}.slice${j} | awk ‘{print $1}’` -lt “${size}” ]; then
echo “Done”
break
fi
let i=$i+1
let skip=$skip+$chunk
done

This will break the tar.gz file to a files with running numbers added to their names. It assumes that the number of slices would not exceed 99. You can extend the script to deal with three digits numbers. The sequence is important for later. Stay tuned 🙂

Ok, so we have a list of files with a numerical suffix, which, combined, include our data. We can test their combined integrity:

cd /tmp/slices
i=1
file=Data.tar.gz
for i in `ls`; do
cat ${file}.slice${i} >> ../Data1.tar.gz
done

This will allow us to compare /tmp/Data.tar.gz and /tmp/Data1.tar.gz. I tend to use md5sum for such tasks:

md5sum Data.tar.gz
d74ba284a454301d85149ec353e45bb7 Data.tar.gz
md5sum Data1.tar.gz
d74ba284a454301d85149ec353e45bb7 Data1.tar.gz

They are similar. Great. We can remove Data1.tar.gz. We don’t need it anymore.

To recover the contents of the slices, without actually wasting space by combining them before extracting their contents (which requires time, and disk space), we can run a script such as this:

cd /tmp/slices
file=Data.tar.gz
(for i in `ls ${file}.slice*`; do
cat $i
done ) | tar xzvf –

This will extract the contents of the joined archive to the current directory.

This is all for today. Happy moving of data 🙂

Similar Posts

2 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.