Posts

Internet Archaeology: Scraping time series data from Archive.org

A guide to scraping historical snapshots of webpages from the Archive.org Wayback Machine.

Advanced Web Scraping: Bypassing "403 Forbidden," captchas, and more

The full code for the completed scraper can be found in the companion repository on GitHub. I wouldn’t really consider web scraping one of my hobbies or anything, but I guess I sort of do a lot of it. It just seems like many of the things that I work on require me to get my hands on data that isn’t available any other way. I need to do static analysis of games for Intoli, and so I scrape the Google Play Store to find new ones and download the APKs.

The stories that Hacker News removes from the front page

An analysis of which stories are removed from the front page of Hacker News due to moderator intervention.

How many people will actually die this week because of Daylight Savings Time?

A data analysis of how many deaths the DST transition causes due to tired driving.

Reverse Engineering the Hacker News Ranking Algorithm

A data-driven exploration of how the Hacker News ranking algorithm works.

A Greedy Image Unshredder

A brief response to Nayuki’s post about the use of simulated annealing to solve an image unshredding problem. An interactive demo is used to show that a simple greedy algorithm outperforms simulated annealing in terms of both results and computation time.

Finding an Optimal Keyboard Layout for Swype

An overview of my work on optimizing phone keyboard layouts for Swype and T9. There’s some interesting history here as well as a novel simulation-based approach to keyboard optimization.