Some time ago I foolishly volunteered to perform a site rip of the Bunnings website for the local SES group I am a member of. This was to allow our accountant member to more accurately assign a value to our assets. I understand this is an important thing for an accountant.

I have done a number of site rips in the past; the Bunnings site is probably the most painful so far. The product pages are very complex for what they are.

Each Bunnings product page is roughly 300k, from which I extracted 1.1k of content. So 99.63% of each page is basically useless, an efficiency rate of about 0.4%. The vast majority of the space is taken up by the nested menu at the top; the ads near the bottom take a bit, and then there is a fairly extensive site map across the bottom. At least the CSS is in an external file. Well, four of them.
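The arithmetic behind those percentages, using the rough figures above:

```python
# Rough numbers from the rip: ~300k per product page,
# ~1.1k of useful content extracted from each.
page_size_kb = 300.0
useful_kb = 1.1

efficiency = useful_kb / page_size_kb  # fraction of the page that is content
waste = 1.0 - efficiency               # fraction that is menus, ads and site map

print(f"efficiency: {efficiency:.2%}")  # 0.37%, i.e. roughly 0.4%
print(f"waste: {waste:.2%}")            # 99.63%
```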

There is a mobile website which is a bit slimmer. I think which page gets served is triggered by browser fingerprinting and cookies. I didn't discover it until too late, though.
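If the server is sniffing the browser, sending a mobile User-Agent is the obvious first thing to try. This is only a guess at how to coax the slimmer pages out; the UA string and URL here are placeholders, untested against the live site:

```python
from urllib.request import Request

# A plausible mobile User-Agent string (placeholder, not verified
# against what the Bunnings server actually keys on).
MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1"

def mobile_request(url: str) -> Request:
    """Build a request that claims to come from a mobile browser."""
    return Request(url, headers={"User-Agent": MOBILE_UA})

req = mobile_request("https://example.com/some-product-page")
```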

There are also two different HTML structures used for product pages; they look similar but use different tags with different classes.
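The way I ended up dealing with that sort of thing is a fallback pattern: accept any of the known class names for a field. A minimal sketch with the standard library parser; the class names and product name here are made up, the real ones differed:

```python
from html.parser import HTMLParser

# Hypothetical class names -- the two real layouts used different ones;
# the point is the either/or matching, not these particular selectors.
TITLE_CLASSES = {"product-title", "pdp-heading"}

class TitleExtractor(HTMLParser):
    """Grab the text of the first tag whose class matches either layout."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.title is None and classes & TITLE_CLASSES:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.title = data.strip()
            self.capturing = False

def extract_title(html: str):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title

# Both layouts yield the same result:
layout_a = '<h1 class="product-title">Sledge Hammer 10lb</h1>'
layout_b = '<span class="pdp-heading">Sledge Hammer 10lb</span>'
```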

And a fun trick: these two links go to the same page:

That trick gets less awesome when you realise that they actually do this, linking to the same product with different URLs, 626 times.
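Duplicates like that mean you can't rely on the URL itself to tell pages apart. One way around it is to normalise each link down to some stable product identifier before fetching. A sketch, assuming a made-up URL shape where a trailing numeric segment identifies the product (the real Bunnings URLs will differ):

```python
from urllib.parse import urlsplit

def product_id(url: str) -> str:
    """Reduce a product URL to its trailing ID segment (assumed format)."""
    path = urlsplit(url).path.rstrip("/")
    return path.rsplit("-", 1)[-1]

seen = set()

def is_new(url: str) -> bool:
    """True the first time a product ID is seen, False for duplicates."""
    pid = product_id(url)
    if pid in seen:
        return False
    seen.add(pid)
    return True

# Two different paths, same trailing ID, so the same product:
print(product_id("https://example.com/products/sledge-hammer-1234"))
print(product_id("https://example.com/tools/hammers/sledge-hammer-1234"))
```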

In case anyone else is feeling foolish enough to try this themselves, and brave enough to look at my code, the end result of my trials and tribulations is on GitHub. All the mistakes have of course been purged from the history, so it looks like I just brilliantly did it in one go.