The Art of Lossless Data Compression vol. 17

Here are the results of tests performed in July 2000 to compare lossless compression of English texts by all known good-enough programs developed for this purpose, including RK, DC, PPMDF, Bzip2, IMP, RAR and 7-zip. See the Archive Comparison Test by J.Gilchrist for more details: http://ACT.BY.net

If anybody wants to start or continue such tests, can suggest other sets of texts or other compression programs (programs for DOS or Windows only, not sources or algorithm descriptions), or knows that we have missed something important (some fantastic new technology, an algorithm, or even a program capable of lossless compression of up to 1000:1, etc.), please let us know immediately: ratush@srsc-gw.sscc.ru

Thank you!

[[1]] COMPRESSION QUALITY

(see also [[2]] Speed [[3]] Details [[4]] Comments)

Fifth line shows results for the sum of four Canterbury Corpus Large Set files, tenth line - for the sum of all 556 files in five sets.

(modeling and ppm-based, slow-extracting programs)

 original       RK ppmonstr    PPMDF      BOA      ACB      777      UFA Arhangel    UHARC
          -mx3-ft+  -o7-m56  -o7-m56     -m15        u -m5-mu32 -m5-mu32 -2-mm-mt   -m3-mm

  569.47%     100%   103.03   103.94   104.20   105.75   112.36   112.36   113.46   136.80
  411.40%   100.03   101.95   101.98   100.56   102.85   100.50   100.50     100%   100.84
  572.82%    *100%   103.43   104.45   104.59   104.73   110.27   110.27   113.35   138.89
  644.43%    ^100%   106.03   107.28   110.29   109.01   124.88   124.88   136.57   134.41
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  521.41%    *100%   103.06   103.73   103.68   104.76   108.93   108.93   111.32   123.80
  486.73%     100%   102.67   104.25   105.02   107.57   112.71   112.71   114.30   133.35
  398.62%    ^100%   101.75   103.39   103.55   107.73   108.80   108.80   108.41   128.69
  438.62%     100%   102.23   103.88   104.81   108.93   110.61   110.61   111.99   133.70
  704.14%     100%   103.10   104.02   107.75   112.77   112.93   112.93   134.06   148.99
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  454.91%     100%   102.18   103.74   104.68   108.70   110.41   110.41   112.91   133.42

(dictionary-based and block-sorting, fast-extracting programs)

       DC       BA     ZZip     SZip      ERI    BZip2      IMP      RAR    7-zip    PkZip
  -b16300   -k50-m  -a4-b12  -o10b41      -m5    -k -9   -2 -s4   -m5-mm      -mx     -exx

   102.63   107.29   110.25   108.91   109.94   118.98   117.30   135.80   156.39   165.00
   101.46   103.86   102.41   103.83   106.17   110.95   109.09   112.46   111.08   115.52
   100.82   105.20    108.7   109.36   107.74   118.50   116.25   138.68   158.53   166.77
   108.77   108.37   109.43   113.00   110.32   127.55   125.74   138.57   181.35   187.54
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   102.42   105.53   106.78   107.62   107.94   116.91   114.99   128.30   143.37   150.05
   102.17   106.92   108.93   111.23   110.76   117.07   115.80   135.44   152.85   159.24
   101.01   106.37   107.97   110.12   110.03   113.89   113.57   135.65   143.11   149.32
   103.00   107.94   110.32   111.16   112.13   117.50   117.17   137.22   149.56   155.62
   107.00   115.13   119.33   114.02   115.00   131.86   139.43   149.70   173.61   180.91
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   102.79   107.96   110.27   110.97   111.66   117.74   117.87   137.34   149.92   156.13

* RK -mx2 (not -mx3 -ft+ )
^ RK -mx3
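The text does not spell out how the percentages in the quality tables are computed. A minimal Python sketch, assuming each row is normalized to the smallest compressed size in that row (which would explain both the 100% entries and the "original" column); the function name and the byte counts are hypothetical, chosen only for illustration:

```python
def relative_percent(sizes):
    """Express every size as a percentage of the smallest size in the row.

    `sizes` maps a label (program name, or "original" for the uncompressed
    file) to a size in bytes. The normalization rule is an assumption
    inferred from the tables, not something the test report states.
    """
    best = min(sizes.values())
    return {name: 100.0 * size / best for name, size in sizes.items()}


if __name__ == "__main__":
    # Hypothetical byte counts, NOT taken from the test results.
    row = {"original": 400000, "prog_a": 100000, "prog_b": 103030}
    for name, pct in relative_percent(row).items():
        print(f"{name:10s} {pct:7.2f}%")
```

Under this reading, a value such as 103.03 means the archive is about 3% larger than the best one for that row.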

[[2]] Speed

The Canterbury Corpus Large Set http://corpus.canterbury.ac.nz/ftp/large.zip was used for this test, on an AMD-K6-400 machine with 64M RAM and Windows98.

                            Overall Average Users' Compress  Extract Compressed
                              score          score     time     time       size
Programs, options     seconds     %  seconds     %  seconds  seconds      bytes

777 a -m5 -mu32          1354  147%     1171  133%      203      222    3343996
777 a -mg -s             1880  205%     1262  144%      688      139    3793939
7zip a                   1307  142%     1232  140%       83        4    4393623
7zip a -mx               1358  148%     1240  141%      131        4    4401160
acb B                    2540  276%     1818  207%      803      808    3346915
acb b                    2997  326%     2059  235%     1042     1047    3267480
acb u                    3802  414%     2496  285%     1452     1456    3221349
arhangel a               1205  131%     1117  127%       98       94    3647060
arhangel a -2 -mm        1203  131%     1117  127%       96       94    3647060
arhangel a -2 -1         1514  165%     1148  131%      407       94    3647060
arhangel a -mt           1173  127%     1069  122%      115      109    3417110
arhangel a -mtf          1177  128%     1071  122%      118      110    3418181
ba -k                    1057  115%      988  112%       78       26    3432541
ba -k -m                 1057  115%      988  112%       78       26    3432541
ba -k -1                 1170  127%     1122  128%       54       26    3927264
ba -k -10                1056  115%      986  112%       79       26    3424345
ba -k -50                1046  114%      954  109%      103       17    3337823
boa -m1                  1623  176%     1387  158%      263      281    3886856
boa -a                   1560  170%     1266  144%      327      340    3217347
boa -m15                 1588  173%     1277  145%      346      358    3182732
bzip2 -k                 1075  117%     1025  117%       56       16    3611558
bzip2 -k -s              1145  124%     1102  125%       48       14    3902513
bzip2 -k -1              1201  130%     1159  132%       47       13    4109767
bzip2 -k -5              1089  118%     1046  119%       48       14    3697142
bzip2 -k -9              1070  116%     1023  116%       53       15    3611558
dc e                      950  103%      918  104%       36       22    3214240
dc e -a                   950  103%      921  105%       33       23    3223329
dc e -d                  3567  388%     3547  405%       24        2   12751141
dc e -b16300             1098  119%      875  100%      248       64    2829394
eri a -m1                1119  122%      983  112%      153       29    3378440
eri a -m2                1117  121%      975  111%      158       30    3346586
eri a -m3                1123  122%      971  110%      169       32    3318853
eri a                    1136  123%      972  111%      183       33    3313568
eri a -m5                1167  127%      975  111%      215       33    3313559
imp98 a -2               1043  113%     1002  114%       46       11    3547964
imp98 a -2 -s4           1040  113%      998  114%       48       11    3535351
imp a -2 -s4             1041  113%     1001  114%       45       11    3548156
pkzip -es                1659  180%     1655  189%        5        3    5945608
pkzip -a                 1326  144%     1307  149%       22        2    4691477
pkzip -exx               1498  163%     1303  148%      217        2    4605928
ppmd e -o5                958  104%      937  107%       24       23    3279292
ppmd e -o7                983  107%      953  108%       34       34    3296502
ppmd e -o9               1057  115%     1015  116%       47       48    3464715
ppmd e -o5 -m56           950  103%      932  106%       20       23    3268214
ppmd e -o7 -m56           917  100%      893  102%       28       30    3095512
ppmd e -o9 -m56           985  107%      944  107%       46       46    3215327
ppmonstr e -o5            997  108%      958  109%       43       43    3278191
ppmonstr e -o7           1023  111%      972  111%       57       59    3265897
ppmonstr e -o9           1097  119%     1031  117%       74       78    3406265
ppmonstr e -o5 -m56       989  107%      954  109%       40       42    3268306
ppmonstr e -o7 -m56       965  105%      918  104%       53       56    3083063
ppmonstr e -o9 -m56      1036  112%      967  110%       77       77    3178172
rar a                    1226  133%     1134  129%      103        4    4029077
rar a -mm                1227  133%     1134  129%      105        4    4029077
rar a -m1                1247  135%     1205  137%       48        4    4304853
rar a -m5                1555  169%     1144  130%      457        4    3938348
rar a -s                 1227  133%     1134  129%      104        4    4028163
rar a -s -mda            1307  142%     1236  141%       79        4    4408220
rar a -s -mdc            1252  136%     1168  133%       93        4    4157251
rar a -s -m5             1560  170%     1144  130%      463        4    3937052
rar32 a -s -m5           1560  170%     1144  130%      463        4    3937052
rk -mf1                  1194  130%     1166  133%       32       21    4110184
rk -mf2                  1308  142%     1149  131%      177       76    3798456
rk -mf3                  1504  164%     1151  131%      392       72    3742232
rk -mx1                  1736  189%     1350  154%      430      449    3089384
rk -mx2                  1825  199%     1403  160%      470      502    3074900
rk -mx2 -ft+             1915  208%     1452  165%      514      540    3099400
rk -mx2 -fe+             1844  201%     1413  161%      480      510    3074904
rk -mx3                  1891  206%     1440  164%      502      535    3076136
szip -v0                 1040  113%     1003  114%       41       34    3473957
szip -o4                 1061  115%     1044  119%       19       29    3646906
szip -o8                 1040  113%      993  113%       53       35    3429112
szip -o0                 1063  115%      979  111%       94       24    3403202
szip -v0 -b41            1019  111%      984  112%       39       34    3405120
szip -o4 -b41            1045  113%     1029  117%       17       30    3591824
szip -o8 -b41            1021  111%      974  111%       53       36    3356744
szip -o0 -b41            1055  115%      959  109%      107       24    3326271
ufa a -m5 -mu32          1378  150%     1185  135%      216      234    3343996
ufa a -mg -mu32          1381  150%     1185  135%      219      234    3343996
ufa a -m5 -mu16          1323  144%     1156  132%      186      203    3363895
ufa a -m5 -mu10          1312  143%     1154  131%      177      195    3387619
ufa a -m5 -mu4           1342  146%     1187  135%      173      192    3519553
ufa a -mg -s             1630  177%     1161  132%      522       28    3889878
uharc a                  1381  150%     1183  135%      220       27    4081072
uharc a -m1              1354  147%     1244  142%      122       29    4333271
uharc a -m3              1514  165%     1125  128%      432       26    3801399
uharc a -m3 -mm          1515  165%     1126  128%      433       26    3801399
uharc a -m3 -md64        1501  163%     1221  139%      311       28    4184881
uharc a -m3 -md2048      1515  165%     1126  128%      433       26    3801399
zzip a                   1085  118%     1030  117%       62       28    3584447
zzip a -mm               1085  118%     1030  117%       61       28    3584447
zzip a -lm               1085  118%     1030  117%       61       28    3584447
zzip a -a1               1085  118%     1030  117%       61       28    3584447
zzip a -a2               1080  117%     1021  116%       66       31    3543392
zzip a -a3               1076  117%     1014  115%       69       30    3517619
zzip a -a4               1085  118%     1015  116%       79       30    3517619
zzip a -a4 -b12          1029  112%      950  108%       88       31    3277976

Overall score is calculated by adding compression time, extraction time, and the time it would take to transfer the compressed file over a 28,800bps network: (compressed_size)/3600, because 28800 bits_per_second is 3600 bytes_per_second.

Average Users' score is calculated by adding (compress_time/10) + extract_time + the time it would take to transfer the compressed file over a 28,800bps network. Compression time is divided by 10 here because more than 90% of people would never compress anything during their life (with compression programs), but they use compressed data almost _every_ time they use computers and/or the Internet. That's why compression time matters less to them.
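The two score formulas above can be sketched in a few lines of Python. The function names are ours, not part of the ARTest .BAT files, and the inputs are the whole-second times printed in the table, so a result may differ by a second from the published score where the original measurements had fractions:

```python
LINE_SPEED_BPS = 28800                 # modem speed assumed by the test
BYTES_PER_SECOND = LINE_SPEED_BPS / 8  # 3600 bytes per second


def transfer_seconds(compressed_size):
    """Time to send the compressed file over the 28,800bps link."""
    return compressed_size / BYTES_PER_SECOND


def overall_score(compress_s, extract_s, compressed_size):
    """Compression time + extraction time + transfer time, in seconds."""
    return compress_s + extract_s + transfer_seconds(compressed_size)


def average_users_score(compress_s, extract_s, compressed_size):
    """Same as overall_score, but compression time counts only one tenth."""
    return compress_s / 10 + extract_s + transfer_seconds(compressed_size)


if __name__ == "__main__":
    # Row "777 a -m5 -mu32": 203 s compress, 222 s extract, 3343996 bytes.
    print(round(overall_score(203, 222, 3343996)))        # 1354
    print(round(average_users_score(203, 222, 3343996)))  # 1171
```

Both results match the first row of the speed table. The % columns appear to be each score relative to the best score in its column (917 and 875 seconds), but the text does not state this.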

[[3]] Details

Details are no longer included in this main text (738 lines reporting 22796 results on 556 files in 5 sets), but can be found in the FULL version, with TEXTS.DAT and *.BAT, at
http://geocities.com/SiliconValley/Bay/1995/artest17.zip
or
https://artest1.tripod.com/artest17.zip

[[4]] Comments

Links to download programs:

7-Zip 2.11                :W http://www.7-zip.com/dl/7zip211.exe 493K
777 0.04b1                :W http://www.7-zip.com/dl/ufa/777004b1.zip 72K
UFA 0.04b1                :W http://www.7-zip.com/dl/ufa/ufa004b1.zip 64K
ArHanGeL 1.40             :a http://geocities.com/SiliconValley/Lab/6606/arh140.zip 50K
ERI32 4.6fre              :e http://geocities.com/eri32/eri46fre.zip 91K
Imp 1.1                   :e http://www.winimp.com/imp110d.zip 266K
Imp-win 1.12              :W http://www.winimp.com/imp112.exe 122K
PkZip 2.50                :a ftp://ftp.simtel.net/pub/simtelnet/msdos/arcers/pk250dos.exe 202K
RK 1.02a5                 :W http://malcolmt.tripod.com/downloads/rk102a05.exe 191K
RAR32 2.71                :e ftp://ftp.netlab.sk/public/rarsoft/rar/rarx271.exe 257K
WinRAR 2.71               :W ftp://ftp.netlab.sk/public/rarsoft/rar/wrar271.exe 588K
PPMD var.F, PPmonstr v.F  :W ftp://ftp.simtel.net/pub/simtelnet/win95/compress/ppmdf.zip 97K
ACB 2.00c                 :e ftp://ftp.simtel.net/pub/simtelnet/msdos/compress/acb_200c.zip 42K
BOA 0.58b                 :e ftp://ftp.cdrom.com/.3/sac/pack/boa058.zip 74K
DC 0.98b                  :W ftp://ftp.cdrom.com/.3/sac/pack/dc124.zip 55K
BA 1.00 beta              :e ftp://ftp.cdrom.com/.3/sac/pack/ba100b.zip 60K
Bzip2 1.0.1               :W ftp://sourceware.cygnus.com/pub/bzip2/v100/bzip2-100-x86-win32.exe 68K
SZip 1.12a                :W http://www.compressconsult.com/szip/szip_112a_win32.zip 71K
ZZip 0.35a                :W http://www.via.ecp.fr/~damien/zzip/zzip-win32.zip 28K

:a - any DOS  - DOS programs, will run under pure DOS or in a DOS box
:e - extender - DOS programs using DOS extenders like DOS/4GW or CWSDPMI
:W - windoze  - Windows95/98/NT/etc programs

If a direct link doesn't work, most probably a newer version of the program has appeared at the same site: visit the web page, or read the whole directory from the ftp server (i.e. try the same URL, but without the filename).

Homepages:

Arhangel     : http://geocities.com/SiliconValley/Lab/6606
Eri32        : http://geocities.com/eri32
  mirror     : http://artest1.tripod.com
RK           : http://malcolmt.tripod.com
Imp,WinImp   : http://www.technelysium.com.au
  mirror     : http://www.winimp.com
PkZip        : http://www.pkware.com
Ufa,777,7-Zip: http://www.7-zip.com
RAR,WinRAR   : http://www.rarsoft.com
BZip2        : http://sources.redhat.com/bzip2
SZip         : http://www.compressconsult.com/szip
ZZip         : http://www.via.ecp.fr/~damien/zzip

What's new:

All contents of this page.

407 Megabytes of plain (English) texts in 556 files in 5 sets, including the four Canterbury Corpus Large Set files. Non-English texts will probably be added in future, but don't expect the results to differ by more than 1%. One file (pgwht04.txt) is an HTML file, and one (E.TXT, originally E.COLI), the first of the Large Set, is pseudo-text.

19 archivers and file-to-file compressors, known to be the best at plain text compression (plus a few of the most popular tools).

The .BAT files used for tests are more compact and readable - see TEXT_ALL\*.BAT inside artest17.zip. The .BATs used for calculations are also added this time.

The DOS prompt calculator with user-defined functions (math.exe, used for ARTest) can be found at ftp://ftp.simtel.net/pub/simtelnet/msdos/calculte/mathfc24.zip (26K)

Ultra Precision Command Timer 1.6 - Freeware (C) 1993 by Erik de Neve (upct.exe, used for ARTest) can be found at ftp://ftp.cdrom.com/.3/sac/utilmisc/upct16.zip (7K)

MultiEdit 7.00jP-386 was used for editing files with macro commands, blocks etc., and the standard fc.exe from any DOS/Windows package for comparing files.

WARNINGS:

RK 1.02a5 was unable to correctly decompress CHNBG10.TXT compressed with any of -mx1, -mx2, -mx3 ("This program has performed an illegal operation and will be shut down"), and also MISCC10.TXT with -ft+ and any of -mx1, -mx2, -mx3, reporting ERROR 303: CRC check failed.

BA 1.00beta can't decompress any file compressed with -mf, and reports nothing like "CRC fails".

DC 0.98b failed to decompress 1DFRE10.dc, ANDES10.dc, and BTI0110.dc, saying "Corrupted block" (while the t(est) command writes "Test successful").

UFA and 777 can't handle files with the symbol ` (ASCII code 96) in their names. It was replaced with _ in nine filenames.

ERI32 4.6 can't compress files larger than (free DPMI memory)/6, i.e. about 10Mb on a PC with 64Mb RAM. The largest 44Mb file was split into 5 chunks of 9000000 bytes each (the last chunk was 8894190 bytes).

The LATEST RELEASE, and thirteen previous versions of these tests, can be found at http://geocities.com/SiliconValley/Bay/1995/ and https://artest1.tripod.com/
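The ERI32 workaround in the warnings (splitting the largest file into 9000000-byte chunks) comes down to simple arithmetic. A minimal Python sketch follows; the function name is ours, and the total of 44894190 bytes is not quoted in the text but derived from it (four full chunks plus the 8894190-byte remainder):

```python
def chunk_sizes(total_bytes, chunk_bytes):
    """Sizes of the pieces when a file is cut into fixed-size chunks.

    All chunks are `chunk_bytes` long except possibly the last one,
    which holds whatever remains.
    """
    full_chunks, remainder = divmod(total_bytes, chunk_bytes)
    sizes = [chunk_bytes] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    # Derived from the text: 4 * 9000000 + 8894190 bytes.
    largest_file = 4 * 9000000 + 8894190
    print(chunk_sizes(largest_file, 9000000))
    # Upper bound on the ERI32 limit if all 64 MB of RAM were free DPMI
    # memory (an assumption; the actual free amount is not reported).
    print(64 * 2**20 // 6)
```

The 9000000-byte chunk size stays below that upper bound of roughly 11 million bytes, consistent with the "about 10Mb" limit mentioned above.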

The FINAL PART

[[5]] "PLEASE read THIS before replying to this article" was removed from this text, but can easily be found at
http://geocities.com/SiliconValley/Bay/1995/artest10.html
https://artest1.tripod.com/artest10.html

Send your suggestions and comments to ratush@srsc-gw.sscc.ru

With best kind regards, RAO Inc.