Some time ago I foolishly volunteered to perform a site rip of https://www.bunnings.com.au/ for the local SES group I am a member of. This was to allow our accountant member to more accurately assign a value to our assets. I understand this is an important thing for an accountant.

I have done a number of site rips in the past, and the Bunnings site is probably the most painful so far. The product pages are very complex for what they are.

Each Bunnings product page is roughly 300k. I extracted 1.1k of content from each page. So 99.63% of it is basically useless, an efficiency rate of about 0.4%. The vast majority of the space is taken up by the nested menu at the top, the ads near the bottom take up a bit, and then there is a fairly extensive site map across the bottom. At least the CSS is in an external file. Well, four of them.
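For anyone who wants to check the sums, here is a tiny sketch of the calculation. It assumes "k" means 1000 bytes (the ratio comes out the same either way), and the function name is just something I made up for illustration.

```python
def efficiency(page_bytes, content_bytes):
    """Percentage of the downloaded page that is content worth keeping."""
    return 100.0 * content_bytes / page_bytes


# Rough figures from above: ~300k per page, ~1.1k of extracted content.
useful = efficiency(300_000, 1_100)
wasted = 100.0 - useful
print(f"useful: {useful:.2f}%, wasted: {wasted:.2f}%")
```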

There is a mobile website which is a bit slimmer. I think which version gets served is decided by browser fingerprinting and cookies. I didn’t discover it until too late though.

There are also two different HTML structures used for product pages. They look similar but use different tags with different classes.

And a fun trick, these two links go to the same page:
https://www.bunnings.com.au/romak-m6-high-tensile-course-hex-nut-10-pack_p1100797
https://www.bunnings.com.au/nobody-nibbles-nuts-like-noddy_p1100797

That trick gets less awesome when you realise that they actually do this themselves, linking to the same product with different URLs, 626 times.
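One way to cope with this is to ignore the descriptive part of the URL and key everything on the trailing product number. The sketch below is not lifted from my actual code. It assumes every product URL ends in `_p` followed by digits, as the two example links do, and the function names are made up for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: product URLs end in "_p<digits>", as in the two example links.
PRODUCT_ID = re.compile(r"_p(\d+)$")


def product_id(url):
    """Return the trailing product id of a product URL, or None if absent."""
    match = PRODUCT_ID.search(urlparse(url).path.rstrip("/"))
    return match.group(1) if match else None


def group_by_product(urls):
    """Map each product id to the distinct URLs that point at it."""
    groups = defaultdict(list)
    for url in urls:
        pid = product_id(url)
        if pid is not None and url not in groups[pid]:
            groups[pid].append(url)
    return dict(groups)


urls = [
    "https://www.bunnings.com.au/romak-m6-high-tensile-course-hex-nut-10-pack_p1100797",
    "https://www.bunnings.com.au/nobody-nibbles-nuts-like-noddy_p1100797",
]
groups = group_by_product(urls)
```

Any id that ends up with more than one URL in `groups` is a duplicate, so each product only needs to be fetched once.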

In case anyone else is feeling foolish enough to try this themselves, and brave enough to look at my code, the end result of my trials and tribulations is on GitHub. All the mistakes have of course been purged from the history so it looks like I just brilliantly did it in one go.

https://github.com/lod/bunnings-siterip