Bunnings site rip

Some time ago I fool­ish­ly vol­un­teered to per­form a site rip of https://www.bunnings.com.au/ for the local SES group I am a mem­ber of. This was to allow our accoun­tant mem­ber to more accu­rate­ly assign a val­ue to our assets. I under­stand this is an impor­tant thing for an accoun­tant.

I have done a number of site rips in the past, and the Bunnings site is probably the most painful so far. The product pages are very complex for what they are.

Each Bunnings product page is roughly 300k. I extracted 1.1k of content from each page, so 99.63% of it is basically useless, an efficiency rate of about 0.4%. The vast majority of the space is taken up by the nested menu at the top, the ads near the bottom take a bit, and then there is a fairly extensive site map across the bottom. At least the CSS is in an external file; well, four of them.

There is a mobile web­site which is a bit slim­mer. I think the page served is trig­gered by brows­er fin­ger­print­ing and cook­ies. I didn’t dis­cov­er it until too late though.

There are also two different HTML structures used for product pages. They look similar but have different tags with different classes.

And a fun trick, these two links go to the same page:
https://www.bunnings.com.au/romak-m6-high-tensile-course-hex-nut-10-pack_p1100797
https://www.bunnings.com.au/nobody-nibbles-nuts-like-noddy_p1100797

That trick gets less awesome when you realise that they actually do this themselves, linking to the same product with different URLs 626 times.
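
As a rough illustration of coping with those duplicate links, here is a minimal Python sketch that keys each URL on its product ID. It assumes the ID is always the trailing `_p<digits>` part of the URL, as in the two examples above; the function names are made up for this example and are not taken from the repository.

```python
import re

# Assumption: the product ID is the trailing "_p<digits>" part of the URL,
# as in the two example links above.
PRODUCT_ID = re.compile(r"_p(\d+)/?$")

def product_id(url):
    """Return the numeric product ID from a product URL, or None."""
    match = PRODUCT_ID.search(url.split("?", 1)[0])
    return match.group(1) if match else None

def dedupe(urls):
    """Keep the first URL seen for each product ID."""
    seen = {}
    for url in urls:
        pid = product_id(url)
        if pid is not None and pid not in seen:
            seen[pid] = url
    return seen

urls = [
    "https://www.bunnings.com.au/romak-m6-high-tensile-course-hex-nut-10-pack_p1100797",
    "https://www.bunnings.com.au/nobody-nibbles-nuts-like-noddy_p1100797",
]
unique = dedupe(urls)
print(unique)
```

Keying on the ID rather than the full URL means the slug text is ignored entirely, which is what the site itself appears to do.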

In case anyone else is feeling foolish enough to try this themselves, and brave enough to look at my code, the end result of my trials and tribulations is on GitHub. All the mistakes have of course been purged from the history so it looks like I just brilliantly did it in one go.

https://github.com/lod/bunnings-siterip

Install Jammer Extractor

I recently spent a few days reverse engineering an Install Jammer generated binary installer, specifically the LPCXpresso installer supplied by NXP. The goal was to try and install the program without running the binary installer as root. I managed to create a Perl script which unpacks the install files into a local directory.

UPX

One of the first things I noticed when examining the installer was a UPX header:

00000070: 0010 0000 ea2d 27a5 5550 5821 e811 0d0c  .....-'.UPX!....
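
The marker in that dump is just the ASCII string `UPX!`, so spotting it does not need a hex viewer. A small Python sketch, purely illustrative (the function name is invented here), that reports every offset at which the magic appears:

```python
def find_magic(data, magic=b"UPX!"):
    """Return every offset at which `magic` occurs in `data`."""
    offsets = []
    pos = data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets

# The 16 bytes from the hex dump above, which sit at file offset 0x70.
line = bytes.fromhex("00100000ea2d27a555505821e8110d0c")
print([hex(0x70 + off) for off in find_magic(line)])
```

On a real installer you would pass the whole file contents, for example the result of `open(path, "rb").read()`.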

I hadn’t played with UPX before but it is a sys­tem to com­press exe­cutable files. There are two parts, a pro­gram which com­press­es the exe­cutable and a decom­pres­sion pro­gram which gets prepend­ed to the com­pressed file.

When the exe­cutable is run it uncom­press­es the pay­load and restarts the exe­cu­tion at the start of the new exe­cutable.

UPX is an open source project with some nice tools. Specif­i­cal­ly they pro­vide a pro­gram which can read the UPX head­ers and pro­vide infor­ma­tion and decom­press the bina­ry. They strong­ly advo­cate not mess­ing things up so that these tools can func­tion.

Unfortunately all the leading Google results, Stack Overflow entries and forum queries are centered around preventing people from uncompressing the binary. Given the way UPX works it is easy to slightly modify the compression and decompression process in a way that causes incompatibility. UPX also makes a special effort to allow GDB to work, which is easy to sabotage. These things contribute to make UPX very popular with virus writers as a masking element.

Nat­u­ral­ly Install Jam­mer did all of this. I extract­ed the UPX head­er by hand but it refers to a com­pres­sion scheme which doesn’t exist in the orig­i­nal pro­gram. The sec­tions and sec­tion head­ers that UPX uses are miss­ing or masked, a com­mon­ly rec­om­mend­ed tech­nique to pre­vent decom­pres­sion. Attempt­ing to run using GDB didn’t pro­vide any use­ful infor­ma­tion.

It should be pos­si­ble to extract the assem­bler instruc­tions and fig­ure out or run the decom­pres­sion rou­tine. How­ev­er that was beyond me and I found an eas­i­er approach.

Install Jammer Extractor

The Install Jam­mer pro­gram which gen­er­ates the final install bina­ries comes with bina­ry blobs that are prepend­ed to the final installer.

This pre­com­piled pro­gram looks at the rest of the file and extracts from it the install files. Look­ing at the strings there are what looks like file names in the install blob.

I sim­pli­fied the prob­lem by cre­at­ing an Install Jam­mer installer of my own con­tain­ing a small col­lec­tion of scripts.

Inside the gen­er­at­ed bina­ry is a sec­tion with the fol­low­ing lines (there are actu­al­ly two, iden­ti­cal sec­tions… no idea why):

0015af60: 0000 0000 0000 0000 0000 0000 0046 494c  .............FIL
0015af70: 455a 4c30 3637 3239 3039 412d 3946 3236  EZL0672909A-9F26
0015af80: 2d33 4539 312d 4242 4546 2d30 3241 3230  -3E91-BBEF-02A20
0015af90: 3633 3238 3639 3200 0000 0000 0000 0000  6328692.........
0015afa0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0015afb0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0015afc0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0015afd0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0015afe0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0015aff0: 0000 0034 3431 0000 0000 0000 0000 0032  ...441.........2
0015b000: 3632 0000 0000 0000 0000 0031 3437 3337  62.........14737
0015b010: 3432 3339 3400 0000 0031 3137 3830 3031  42394....1178001

It looks like a filename and several numbers encoded as strings. I found the filename portion in one of the intermediate files generated while building the installer, Linux-x86-files.tcl, which allows much of the detail to be identified. The blob address and compressed size refer to the position and size of a blob within the install binary; this was confirmed by checking that multiple adjacent entries line up in sequence.

File ::0672909A-9F26-3E91-BBEF-02A206328692 -name compiles.t -parent 81FF3CF4-D2FD-4649-FA7F-C2640F59BE65 -directory <%InstallDir%>/t -size 441 -mtime 1473742394 -permissions 00644 -filemethod 0
FILE                                  start marker
ZL                                    flag
0672909A-9F26-3E91-BBEF-02A206328692  id string
441                                   extracted size
262                                   compressed size
1473742394                            mtime
1178001                               blob address (#11F96F)
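
The table entries can be pulled out of the binary with a pattern match. The following Python sketch is only a guess at the layout, inferred from the single dump above: the `FILE` marker, a two-letter flag, a NUL-padded name, then four NUL-separated decimal strings. The field names and the `parse_entries` helper are invented for this example, and the exact field widths are deliberately not assumed.

```python
import re
from collections import namedtuple

Entry = namedtuple("Entry", "flag ident size csize mtime address")

# Assumed layout, inferred only from the dump above: "FILE", a two-letter
# flag, a NUL-padded name, then four NUL-separated decimal strings.
ENTRY = re.compile(
    rb"FILE(ZL|LZ)([^\x00]+)\x00+(\d+)\x00+(\d+)\x00+(\d+)\x00+(\d+)"
)

def parse_entries(data):
    """Yield an Entry for every record found in the raw installer bytes."""
    for m in ENTRY.finditer(data):
        flag, ident, *numbers = m.groups()
        yield Entry(flag.decode(), ident.decode("latin-1"), *map(int, numbers))

# Rebuild the record shown in the dump (padding lengths counted from it).
sample = (
    b"FILEZL0672909A-9F26-3E91-BBEF-02A206328692" + b"\x00" * 92
    + b"441" + b"\x00" * 9 + b"262" + b"\x00" * 9
    + b"1473742394" + b"\x00" * 4 + b"1178001"
)
for entry in parse_entries(sample):
    print(entry)
```

A regular expression is used instead of fixed offsets because the padding between fields in the dump is not uniform, so fixed widths would be a further assumption.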

ZLib files

I extracted the compressed blob and grabbed the matching uncompressed file. I tried several different compression techniques on the uncompressed file and tried matching them to the extracted blob. Zlib, attempted due to the ZL flag, was a very close match. Below is an example using a very small file:

> zlib-flate -compress < original_file | xxd
00000000: 789c 2b4a 4db3 5228 4a4d 2bd6 2f4a cdcd  x.+JM.R(JM+./J..
00000010: 2f49 2dd6 cf2f ca4c cfcc d3cf 4d2c 2e49  /I-../.L....M,.I
00000020: 2de2 0200 c596 0bf2                      -.......

Extract­ed blob, lined up to match:

00000000: 0000 2b4a 4db3 5228 4a4d 2bd6 2f4a cdcd  ..+JM.R(JM+./J..
00000010: 2f49 2dd6 cf2f ca4c cfcc d3cf 4d2c 2e49  /I-../.L....M,.I
00000020: 2de2 0200                                -...

The ZLib header and footer are both missing. The two-byte header sets the compression method and options such as the window size and whether a preset dictionary is used. Adding the standard header bytes allowed the extracted blob to be uncompressed using zlib-flate -uncompress. The four-byte footer is an Adler-32 checksum, which seems to be optional.
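
For comparison, Python's zlib module can read such a stream directly in raw deflate mode, which avoids splicing a header back on. This is a swapped-in technique rather than the header-prepending approach described above, and the sample data is simulated (a normal zlib stream with its header and checksum stripped), not a real installer blob.

```python
import zlib

def inflate_headerless(blob):
    """Decompress a deflate stream that has no zlib header or checksum."""
    # wbits=-15 selects raw deflate: no 2-byte header, no Adler-32 trailer.
    return zlib.decompressobj(wbits=-15).decompress(blob)

# Simulate an installer blob: compress normally, then strip the 2-byte
# header and 4-byte checksum the way the extracted blobs appear to be.
original = b"example payload\n" * 4
blob = zlib.compress(original)[2:-4]
print(inflate_headerless(blob) == original)
```

Because no checksum is read, corruption in a blob would go unnoticed; comparing the output length against the extracted size from the entry table is a cheap sanity check.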

This technique allowed all the install files to be extracted; however, their names and the structure of the directory tree were lost.

LZMA files

Along with the ZLib com­pressed install files are a bunch of tcl files with an LZ flag. These have full names and seem to be the files nec­es­sary to run the installer, includ­ing files for tcl and the nec­es­sary libraries.

The tcl files are not from my sys­tem, some of them have dif­fer­ent ver­sions or do not exist at all. I chose iso8859-3.enc to exam­ine, assum­ing that it was like­ly to be the same as my ver­sion.

I assumed the encod­ing used was LZMA (Lempel–Ziv–Markov chain algo­rithm) par­tial­ly because I had noticed a bina­ry library called craplz­ma in the Install Jam­mer appli­ca­tion files. Unfor­tu­nate­ly LZMA is, like the name sug­gests, an algo­rithm which is used by mul­ti­ple dif­fer­ent archivers such as 7-Zip, LZip, XZ and more. Most of the archive con­tain­ers spec­i­fy how to store mul­ti­ple files but for a sin­gle file it turns out you can just tack the appro­pri­ate head­er on and any pro­gram will extract it.

The header that matched most closely was LZMA alone, or LZMA1, which is conveniently supported by the Perl Compress::Raw::Lzma module:

cat /usr/share/tcltk/tcl8.5/encoding/iso8859-3.enc | lzma -z | xxd
00000000: 5d00 0080 00ff ffff ffff ffff ff00 1188  ]...............
00000010: 0528 b979 d70b 91f8 28ae b6ac 59fc 1cbb  .(.y....(...Y...

Extract­ed blob, first line:

5d00 0080 0000 1188 0528 b979 d70b 91f8

The LZMA file format defines a header:

  • 1 byte properties
  • 4 bytes dictionary size
  • 8 bytes uncompressed size

Our extracted blob is missing the uncompressed size field. Fortunately passing a size with all eight bytes set to FF (FFFF FFFF FFFF FFFF) to the decompression routine indicates an unknown size; splicing this field in allowed all the LZ flagged files to be extracted.
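
The splice can be sketched in a few lines of Python using the standard lzma module (an illustration only, not the Perl Compress::Raw::Lzma code). It assumes the blob keeps the first five header bytes, one properties byte plus four dictionary-size bytes, and lacks only the eight-byte size field, as the dumps above suggest. The sample blob is simulated by cutting the size field out of a normal LZMA alone stream.

```python
import lzma

UNKNOWN_SIZE = b"\xff" * 8  # eight FF bytes mean "size not recorded"

def decompress_lz_blob(blob):
    """Splice the missing size field back in and decode as LZMA alone."""
    # First 5 bytes: 1 properties byte + 4 dictionary-size bytes.
    stream = blob[:5] + UNKNOWN_SIZE + blob[5:]
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    return decoder.decompress(stream)

# Simulate an installer blob by removing the size field from a normal
# LZMA alone stream (bytes 5..12 of the 13-byte header).
original = b"# Encoding file: example\n" * 8
full = lzma.compress(original, format=lzma.FORMAT_ALONE)
blob = full[:5] + full[13:]
print(decompress_lz_blob(blob) == original)
```

The decompressor object is used rather than `lzma.decompress` so that a stream without an end marker still yields whatever data it contains instead of raising an error.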

TCL scripts

Install Jammer is largely a TCL project. I believe it is a C++ base which uses TCL to perform the GUI tasks, allow scriptable extension and do most of the work.

The intermediate files created by the installer build process include a bunch of TCL generated scripts. These scripts rename the extracted files from their stored ID names to the final name. They also create the directories, symlinks if required and set the mtime for the files. It looks like the script is meant to set the permissions for the files but this doesn't actually work; everything is set to 777. There is no facility to set the ownership.
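
An extractor can do better than 777 by applying the -mtime and -permissions values from each File entry itself. A minimal Python sketch, assuming the permissions field is an octal string such as the 00644 seen in the File line earlier; `apply_metadata` is a name invented for this example.

```python
import os
import tempfile

def apply_metadata(path, mtime, permissions):
    """Set a file's mode and modification time from File entry values."""
    # -permissions is an octal string such as "00644"; keep only the
    # permission bits and drop any file-type bits (e.g. 040755 for a dir).
    os.chmod(path, int(permissions, 8) & 0o7777)
    os.utime(path, (mtime, mtime))  # (atime, mtime)

# Demonstrate on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo = handle.name
apply_metadata(demo, 1473742394, "00644")
info = os.stat(demo)
print(oct(info.st_mode & 0o777), int(info.st_mtime))
os.remove(demo)
```

Ownership is left alone here, which suits the goal of installing without root.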

After extracting the files from the installer, this script can be found in main2.tcl for my generated file or main.tcl for the lpcxpresso installer. I ended up just processing every tcl script in the root directory to be safe.

The tcl script con­tains lines like the fol­low­ing which are fair­ly sim­ple to parse. By com­bin­ing these lines with the entry table extract­ed from the installer bina­ry each file can be extract­ed, decom­pressed and placed in the appro­pri­ate loca­tion.

File ::4D49D586-0ADF-966C-3FC4-8DB31B47B741 -name dumpio2curl -parent 81FF3CF4-D2FD-4649-FA7F-C2640F59BE65 -directory <%InstallDir%> -type dir -permissions 040755 -filemethod 0
File ::381BB57B-2E9F-3012-F9BB-C1752B423A6E -name .travis.yml -parent 81FF3CF4-D2FD-4649-FA7F-C2640F59BE65 -directory <%InstallDir%> -size 164 -mtime 1473742394 -permissions 00644 -filemethod 0
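
Lines in this shape can be split into key/value pairs with very little code. The Python sketch below assumes values never contain spaces, which holds for the lines shown here but would need a proper Tcl list parser for quoted names; `parse_file_line` is a hypothetical helper.

```python
def parse_file_line(line):
    """Parse one 'File ::<id> -key value ...' line into a dict."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "File" or not tokens[1].startswith("::"):
        return None
    entry = {"id": tokens[1][2:]}
    # Remaining tokens alternate "-key value". Assumption: values contain
    # no spaces; a real Tcl list parser would be needed for quoted names.
    for key, value in zip(tokens[2::2], tokens[3::2]):
        entry[key.lstrip("-")] = value
    return entry

line = (
    "File ::381BB57B-2E9F-3012-F9BB-C1752B423A6E -name .travis.yml "
    "-parent 81FF3CF4-D2FD-4649-FA7F-C2640F59BE65 -directory <%InstallDir%> "
    "-size 164 -mtime 1473742394 -permissions 00644 -filemethod 0"
)
entry = parse_file_line(line)
print(entry["id"], entry["name"], entry["directory"], entry["size"])
```

The `id` value is what links each parsed line back to the matching record in the entry table from the binary.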

The last step was to parse the tcl script for the info variable block. This gives the variables, such as InstallDir, which are embedded in the File entries. Several of these variables would typically be set by the install wizard; we support this by allowing the user to pass values on the command line, either to customise the install or to provide variables which are missing.
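
The substitution step might look something like this in Python. The placeholder syntax `<%Name%>` is taken from the File lines above; the variable values and the `expand` helper are hypothetical, and nested placeholders inside variable values are not handled.

```python
import re

def expand(text, variables):
    """Replace <%Name%> placeholders using `variables`; leave unknown ones."""
    def lookup(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"<%(\w+)%>", lookup, text)

# Values parsed from the script, overridden by anything the user supplies.
from_script = {"InstallDir": "/opt/example"}        # hypothetical default
from_user = {"InstallDir": "/home/me/lpcxpresso"}   # hypothetical override
variables = {**from_script, **from_user}
print(expand("<%InstallDir%>/t", variables))
```

Leaving unknown placeholders untouched makes it easy to report which variables still need to be supplied on the command line.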


Weekly Wrap 18–25 April

Marbled butter biscuits

Work

  • Not much pub­lic work to show, most­ly inves­ti­gat­ing poten­tial man­u­fac­tur­ing part­ners.
  • Far­nell order is drib­bling in, my PCBs and AliEx­press orders have been shipped. All the pieces should be ready when I get back next week.

Play

  • Went to Can­ber­ra on Fri­day to catch up with folk and par­ty, stay­ing for the week.
  • Dis­cov­ered Coconuts Duo, amaz­ing­ly fun to play and spec­tate. It’s prob­a­bly even fun sober.
  • Made mar­bled but­ter bis­cuits (pic­tured). Annoy­ing­ly frag­ile to being burnt but for­tu­nate­ly I made so many that after chuck­ing 15% I still need­ed two con­tain­ers to store them all.

Other

Last week’s wrap
