Anyone using pbzip2?

Is anyone using pbzip2 for backups or packaging of similarly large files (hundreds of MB, even some GB)? Just wondering what your experiences are.

It is in essence a parallel bzip2 and produces bzip2-compatible files, so I suppose it should be much faster than regular bzip2 while still producing smaller files than gzip.
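For reference, a minimal sketch of typical usage; the file name and the -p4 core count are just placeholders:

```
# Compress a file using 4 processor cores; output is an ordinary .bz2
pbzip2 -p4 backup.tar

# Decompression works with plain bunzip2, since the output is bzip2-compatible
bunzip2 backup.tar.bz2
```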

13 Replies

I've seen a few people who use it in #linode, but I myself haven't bothered…

I've used it. It worked out very well CPU-wise, although I was doing it in a pipeline with a very I/O-intensive task, so I switched back to using normal bzip2 to slow things down a bit.

It's how I back up my claims on IRC that I'm always able to get ~400% CPU when I want it :-)

I use it whenever I can; it's great. Often I find myself in the same boat as Azathoth, where I'm piping the output through gpg, so it doesn't gain me much there. But for other uses, absolutely.
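Roughly the kind of pipeline meant here, as a sketch; the path and GPG recipient are made up:

```
# tar -> parallel bzip2 -> gpg, all streaming, no intermediate files
tar -cf - /srv/data | pbzip2 -c -p4 | gpg --encrypt -r backup@example.com -o data.tar.bz2.gpg
```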

Hmm… I tried using it but I don't see much improvement for my particular case (1.3 GB of Gentoo Portage, including distfiles and compiled binaries). pbzip2 working on 4 cores completes in slightly less time than gzip (a few seconds less, at about 1 minute overall), but the size difference (roughly 10% smaller bz2) isn't enough to justify taxing 4 cores to get the same job done.

Although I'm sure pbzip2 works as intended, because regular bzip2 would produce a slightly smaller archive but take much longer.

I guess it would work best for larger archives (sizes in the gigabytes).
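If anyone wants to reproduce that kind of comparison, a rough sketch (the Portage path is just an example; adjust -p to your core count):

```
# Compare wall-clock time and resulting size of gzip vs pbzip2 on the same tree
time tar -cf - /usr/portage | gzip -c > portage.tar.gz
time tar -cf - /usr/portage | pbzip2 -c -p4 > portage.tar.bz2
ls -lh portage.tar.gz portage.tar.bz2
```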

You're comparing apples and oranges, bzip2 and gzip:

gzip --> pigz

bzip2 --> pbzip2

You're comparing gzip to pbzip2 for some reason. If you want to speed up gzip, use pigz. If you want to speed up bzip2, use pbzip2.

Actually, I'm not. The idea is to produce smaller archive files, so I'm comparing the bz2 and gz algorithms, or more precisely trying to achieve bz2 compression ratio at gz speed.

But you can't, because "gz speed" would be the speed achieved with pigz (the parallel version of gzip). And pbzip2 will obviously be nowhere near as fast as pigz.

@Guspaz:

But you can't, because "gz speed" would be the speed achieved with pigz (the parallel version of gzip). And pbzip2 will obviously be nowhere near as fast as pigz.

And pigz would still produce archives as large as gz. So I want bzip2 file sizes at gz speed or better.

The question remains: do I want the increased I/O when four processes start asking for disc access at once, and which matters more to me, a smaller archive or lower I/O? No one can answer that for me but myself. ;)

Perhaps I need to explain what I want in more detail.

I have 1.3 GB of files to archive and ship out, compressed and encrypted, via FTP every morning at 5 am local time, when the server is least loaded.

gz takes approx. 56 seconds and produces an 800 MB archive.

bz2 takes a few minutes and produces a smaller file (I don't remember the exact figures).

pbzip2 takes 54 seconds and produces a 740 MB archive, but at 4 times the I/O of gz, because I use 4 processes (one per core).

Now, if I used pigz, I'm sure it would take much less than 50 seconds, but it would produce an 800 MB archive just like gz, and peak the I/O four times higher.

This is just a test, in preparation for rather larger archives (a few GB) under a rather larger load than I currently have on the server, which will be needed once we start a new local service in January.

So I will need to balance between:

  • smaller or bigger I/O peak

  • longer or shorter network hogging to get the backup over, including a smaller or larger archive to store on the backup server.

  • back up locally first and then ship the encrypted tarball away, or tar, compress, encrypt and ship on the fly without intermediate local files (and I have yet to test whether parallel compression works on data piped through stdin; see the sketch below this list).
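For that last option, a sketch of what the on-the-fly variant could look like; the host, GPG recipient and the use of curl for the FTP upload are my assumptions, not something tested here:

```
# tar -> pbzip2 -> gpg -> FTP upload, with no intermediate file on local disk
tar -cf - /srv/data \
  | pbzip2 -c -p4 \
  | gpg --encrypt -r backup@example.com \
  | curl -T - --user backupuser:secret \
      ftp://backup.example.com/daily/data-$(date +%F).tar.bz2.gpg
```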

Sure, pigz will produce the archives faster, but I want them smaller, and in doing so I want to see how much I/O and CPU it takes to produce them smaller, and which is better: the longer but less taxing serial compression, or the quicker but more I/O-intensive parallel one. Since I still want the smaller archive, pbzip2 is better for me than pigz.

pbzip2 finishing in 54 seconds should use about the same amount of I/O as gz does. They're reading the same source data in about the same amount of time.

For limiting CPU and I/O usage, you may want to play around with nice and ionice (article).
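For example (a sketch only; the priority values are illustrative, not a recommendation):

```
# Run the compression at lowest CPU priority and in the idle I/O class
nice -n 19 ionice -c 3 pbzip2 -p4 backup.tar
```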

Here's something for consideration: 7zip. It's open source (and the author made his LZMA algorithm public domain), and it should be in most distro repos. It also supports multi-core compression/decompression.

Compression ratio is a bit better than bzip2, and it's usually much faster. RAM requirements are usually higher, though (depending on the dictionary size, which affects compression).

It's also got decent support. The app itself is available for *nix, but also for other platforms like Windows, and WinRAR can also decompress 7zip archives.
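As a rough illustration of the multi-threaded usage (archive and directory names are placeholders):

```
# Create a 7z (LZMA) archive using multiple threads
7z a -t7z -mmt=on archive.7z /path/to/data

# Extract it again
7z x archive.7z
```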

Thanks for all your interesting suggestions.

@Guspaz

I guess you're right. For more or less the same file sizes, shorter processing time can only mean lower I/O and smaller peaks.
