Every time my ASP.NET web site running in Windows Azure throws a C# exception, the site code sends me an email that includes the call stack, path information, and user agent string. I can use these emails to find the problems in the code that caused the error. For a long time I ignored the emails (about 50 a day) and let them stack up, being too busy with other projects. Lately, I have resolved to clean up the code so that I get fewer emails and the web site has fewer bugs. I thought this would lead to a better user experience for people visiting the site.
My typical process is to read the call stack, look at the path information, and try to reproduce the issue to gain insight into how to fix it. Basically, it is a debugging process that I keep in my toolkit no matter what environment I am programming in.
As I worked through my stack of emails I started to notice an emerging trend: it was not my users that were having problems with my web pages, it was the robots. Robots, bots, or web crawlers are mainly search engine programs that traverse your web site and glean information to build results for searches on their sites.
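Since every error email already carries the user agent string, the split between bots and browsers can be counted rather than eyeballed. The following is only a minimal sketch in Python, not the site's actual code: the function names and the list of substrings are assumptions made for illustration, and a substring match will miss any crawler that does not identify itself.

```python
# Hypothetical substrings that commonly appear in crawler user agent strings.
# This list is an assumption for illustration, not an authoritative registry.
BOT_TOKENS = ("bot", "crawler", "spider", "slurp")


def is_probable_bot(user_agent):
    """Return True if the user agent string looks like a web crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


def count_by_client(user_agents):
    """Tally error reports by whether the client looks like a bot or a browser."""
    counts = {"bot": 0, "browser": 0}
    for ua in user_agents:
        counts["bot" if is_probable_bot(ua) else "browser"] += 1
    return counts
```

Feeding it the user agent strings pulled from the error emails gives a rough count of how many exceptions came from crawlers versus browsers.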
The interesting thing about bots is that they never seem to work as well as the browsers. Or so it would appear, because I built my web site for browsers, not bots, which means that the web site works best for users. The difference in thinking is that the bots aren’t working wrong; they are just not first class citizens on the web site because I coded the site for browsers. Think of it as a car that is built for adults and tested by adults, which a child then tries to drive. It sort of works, but if you are a kid it is hard to reach the pedals and see over the dashboard at the same time. The same goes for the bots: they are trying to consume something that was tested for browsers.
The simple approach would be to ignore the errors from the bots, since they are not my target audience. In fact, I can restrict the bots from the web site altogether with a robots.txt file. However, my intent is to make a better user experience for my users – so does fixing the errors for the bots create a better user experience for people that are really human? The answer is yes – if the web crawlers can find the content on my site (without getting errors first) they can drive traffic to the site. This traffic driven from the search engines is real traffic from humans.
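For reference, shutting out every crawler takes only two lines of robots.txt. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would read such a file; the `allowed` helper and the example.com URL are placeholders made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
BLOCK_ALL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the file above, `allowed(BLOCK_ALL_ROBOTS_TXT, "Googlebot", "http://example.com/page")` is false. Keep in mind that robots.txt is only a request; it depends on the crawler choosing to honor it.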
Now that I know I want to fix the errors from the bots, let’s take a look at my debugging technique. Key to the process is simulating the environment by using the appropriate web browser, that is, the client that made the request that caused the error. However, I have no access to the web crawlers (the client for the bots) and cannot simulate a request from that client. In fact, I am not even sure how they handle the response (the outputted HTML), because a lot of how the web crawlers work is kept secret; their interactions with the site are proprietary technology. All I have to go on is the HTTP standard, which dictates how the requests are made, and some speculation about how the search engines work, which falls within the black arts of Search Engine Optimization.
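The closest available approximation is to replay the failing path with the user agent string copied from the error email. This is only a sketch under stated assumptions: it imitates one request header, not the crawler's real behavior, and the function names and the sample Googlebot string are chosen here for illustration.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Illustrative value; in practice, paste the exact string from the error email.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_bot_request(url, user_agent=GOOGLEBOT_UA):
    """Build a GET request that presents the given crawler user agent string."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_status(request):
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urlopen(request, timeout=10) as response:
            return response.status
    except HTTPError as err:
        # A 4xx or 5xx response still carries the status code of interest.
        return err.code
```

If the page only fails for that user agent, this may reproduce the exception; if the crawler also differs in cookies, other headers, or script handling, it will not.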
This leaves me in a limbo land of fixing the suspected bug without being able to reproduce it. I have to deploy the fix live to see if it solves the web crawler’s problem, all without breaking the human experience via the browsers. Sounds like it isn’t worth the effort, right? Not true: 97% of my human traffic comes from the search engines. So maybe I should be writing my web site for the bots.