
Archive for August, 2007

Browser toolbars reveal more than you think

Monday, August 27th, 2007

All the major search engines provide toolbars that you can download and install in your browser. Each toolbar has some nifty features that are not commonly found in browsers, which makes it compelling enough to download and install. One feature of all toolbars is to be able to search the web using the search engine that made the toolbar. This is of course the reason for the toolbar’s existence: to funnel more searches to the search engine.

Another common “feature” of search engine toolbars is to report home about each web page that you visit. Even though you can in most cases turn off this feature, the toolbar offers some compelling extra benefit so that most users keep it enabled. (Or they are just unaware of the “call home” feature.)

If we for the moment disregard the privacy aspects of reporting every web page that you visit, there is another implication that most web site owners are not aware of: The web pages reported by toolbars are fed into the search engine’s web crawler. (I don’t have proof that this is the case for all toolbars, but I know it’s true in at least one case. And that’s enough to cause trouble for web masters.)

What’s the problem with that, you say? One example could be that you’re working on a new web site that is not quite ready to be public yet. And you haven’t bothered to password protect it during the development. Who is going to guess your new domain name anyway? As you’re busy developing your site, the toolbar sends the URL of every page - finished or not - to the search engine.

Another, perhaps more serious, example is the thank you page of web sites that sell digital products. When you - or any one of your customers - go to the thank you page, the toolbar reports the URL to the search engine. If you don’t have any additional protection on the thank you page it can end up in the search engine index. Then when a potential customer uses that search engine it’s possible that your thank you page shows up in the search results. And it’s very likely that the person searching was looking to buy your product. But now, with direct access to the thank you page, the potential customer can download the product for free. You just lost a sale.

If you have good web analytics it may be possible to see these direct accesses and calculate how much money you’re losing. But it’s also very likely that the search engine has cached your page, and possibly even the product download itself. In that case you will never even know that your product was downloaded without payment.

My Digital Security Report has advice on how to protect your digital products from overzealous search engine toolbars.

Can anyone view your WordPress plugins?

Monday, August 20th, 2007

If you are running WordPress go to www.yourdomain.com/wp-content/plugins. If you see a directory listing of all your installed plugins you may want to follow the steps described by Shoemoney here.
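
If you would rather script this check than click around, here is a minimal sketch in Python. It assumes the server produces Apache-style listing pages whose title starts with "Index of /"; other servers format their listings differently. The function names are my own illustration and not part of WordPress.

```python
import re
import urllib.request

# Assumption: Apache-style auto-generated listings carry a title like "Index of /path".
_LISTING_TITLE = re.compile(r"<title>\s*Index of\s+/", re.IGNORECASE)


def looks_like_directory_listing(html):
    """Return True if the HTML looks like an auto-generated directory listing."""
    return _LISTING_TITLE.search(html) is not None


def plugins_directory_is_listed(base_url):
    """Fetch <base_url>/wp-content/plugins/ and report whether a listing is shown."""
    url = base_url.rstrip("/") + "/wp-content/plugins/"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
    except OSError:
        # Covers HTTP errors (403, 404), DNS failures and timeouts: no listing visible.
        return False
    return looks_like_directory_listing(body)
```

Calling plugins_directory_is_listed("http://www.yourdomain.com") gives you a quick yes or no for your own site.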

This is not a major security hole and you are not alone in exposing your plugins. Google has indexed over 500,000 plugin directory listing pages.

It appears that this will be fixed in the 2.3 release of WordPress.

robots.txt

Monday, August 13th, 2007

Back in the days around 3 B.G. (Before Google) AltaVista was the new search engine on the block. In an effort to show off the power of their minicomputers, the AltaVista team at Digital decided to crawl and index the entire web. This was at the time a new concept. Many web masters didn’t relish the idea of a “robot” program accessing every page on their web site as this would add more load to their web servers and increase their bandwidth costs. The Robots Exclusion Standard, first proposed in 1994, gave web masters a way to address these concerns.

Using a simple text file called robots.txt you can instruct web crawlers (a.k.a. robots) to stay out of certain directories. Here is a very simple robots.txt which disallows all robots (User-agents) access to the /images directory.

User-agent: *
Disallow: /images

By disallowing /images you are also implicitly disallowing all subdirectories under /images, such as /images/logos and any files beginning with /images such as /images.html.
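
The prefix behavior is easy to see in a few lines of Python. This is only a sketch of the basic matching idea described above, and the function name is made up for the illustration.

```python
def is_disallowed(path, disallowed_prefixes):
    """Return True if the path starts with any non-empty Disallow prefix."""
    return any(prefix and path.startswith(prefix) for prefix in disallowed_prefixes)


rules = ["/images"]
blocked_subdirectory = is_disallowed("/images/logos/logo.png", rules)  # under /images
blocked_lookalike = is_disallowed("/images.html", rules)               # also starts with /images
blocked_other = is_disallowed("/img/logo.png", rules)                  # different prefix
```

If you only want to block the directory and not /images.html, use "Disallow: /images/" with the trailing slash.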

Curiously there was no “Allow” directive in the first draft of the standard. It was added later, but it’s not guaranteed to be supported by all robots. So anything that is not specifically disallowed should be considered fair game for web crawlers.

To disallow access to your entire web site use a robots.txt like this:

User-agent: *
Disallow: /

If User-agent is * then the following lines apply to all search engine robots. By specifying the signature of a web crawler as the User-agent you can give specific instructions to that robot.

User-agent: Googlebot
Disallow: /google-secrets
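
One way to try out rules like these before publishing them is Python's standard urllib.robotparser module. The sketch below feeds it the Googlebot example above; the example.com URLs and the robot name SomeOtherBot are placeholders I made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /google-secrets
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot is kept out of /google-secrets ...
googlebot_secret = parser.can_fetch("Googlebot", "http://example.com/google-secrets/plan.html")
# ... but may fetch everything else.
googlebot_home = parser.can_fetch("Googlebot", "http://example.com/index.html")
# No rules apply to other robots, so they are allowed everywhere.
other_secret = parser.can_fetch("SomeOtherBot", "http://example.com/google-secrets/plan.html")
```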

Since the original spec was published several search engines have extended the protocol. One popular extension is to allow wildcards.

User-agent: Slurp
Disallow: /*.gif$

This prevents Yahoo! (whose web crawler is called Slurp) from indexing any files on your site that end with “.gif”. Keep in mind that wildcard matches are not supported by all search engines so you have to preface these lines with the appropriate User-agent line.
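
To get a feel for how such patterns behave, here is a rough Python sketch that translates a wildcard pattern into a regular expression. It assumes "*" matches any run of characters and a trailing "$" anchors the end of the URL path; individual search engines may differ in the details.

```python
import re


def wildcard_matches(pattern, path):
    """Return True if a robots.txt pattern with * and an optional trailing $ matches the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each * stand for any run of characters.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    if anchored:
        return regex.fullmatch(path) is not None  # must match the whole path
    return regex.match(path) is not None          # plain prefix match


gif_in_root = wildcard_matches("/*.gif$", "/img.gif")
gif_in_subdirectory = wildcard_matches("/*.gif$", "/images/logo.gif")
gif_in_the_middle = wildcard_matches("/*.gif$", "/img.gif.html")
```

Here the first two paths match and the third does not, because the "$" requires the path to end with ".gif".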

You can combine several of the above techniques in one robots.txt file. Here’s a theoretical example.

User-agent: *
Disallow: /bar


User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /

This would result in the following access results for a few URLs:

URL                           Googlebot   Other robots
example.com/foo.html          Allowed     Allowed
example.com/food.html         Allowed     Allowed
example.com/foo/              Allowed     Allowed
example.com/foo/index.html    Allowed     Allowed
example.com/foo.gif           Allowed     Allowed
example.com/fu.html           Blocked     Allowed
example.com/bar.html          Blocked     Blocked
example.com/bar/index.html    Blocked     Blocked
example.com/img.gif           Blocked     Allowed
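
If you want to check the table yourself, the following Python sketch evaluates the combined example. It assumes that a robot uses only the most specific User-agent section that names it, and that within a section the first matching line wins. Some crawlers instead prefer the most specific (longest) matching rule, in which case a URL such as /foo.gif can come out differently. The names in the code are my own.

```python
import re

# The combined example, in file order. Keys are lower-cased robot names.
RULES = {
    "*": [("Disallow", "/bar")],
    "googlebot": [
        ("Allow", "/foo"),
        ("Disallow", "/bar"),
        ("Disallow", "/*.gif$"),
        ("Disallow", "/"),
    ],
}


def rule_matches(pattern, path):
    """Prefix match, with * as a wildcard and a trailing $ anchoring the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    if anchored:
        return regex.fullmatch(path) is not None
    return regex.match(path) is not None


def is_allowed(robot, path):
    """Assumption: first matching line wins; anything unmatched is allowed."""
    rules = RULES.get(robot.lower(), RULES["*"])
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            return directive == "Allow"
    return True


paths = ["/foo.html", "/food.html", "/foo/", "/foo/index.html", "/foo.gif",
         "/fu.html", "/bar.html", "/bar/index.html", "/img.gif"]
results = {p: (is_allowed("Googlebot", p), is_allowed("Slurp", p)) for p in paths}
```

Each entry in results holds a pair: whether Googlebot may fetch the path, and whether another robot (Slurp in this case) may.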

Computer programs are pretty good at following instructions like these. But for a human brain it can quickly get overwhelming, so I highly encourage you to keep it simple. One of the longer robots.txt files I’ve encountered is from www.seobook.com - it’s over 300 lines long. The site owner Aaron Wall is the author of the excellent SEO Book; he knows what he’s doing.

For us mortals there is a robots.txt analysis tool in Google’s webmaster tools (http://google.com/webmasters/sitemaps/siteoverview). Highly recommended. Another good resource for more information on the Robots Exclusion Standard is www.robotstxt.org

 

Today when companies are spending a lot of money to be included in search engine listings, the idea of excluding your content may seem quaint. But from a security perspective there are many valid reasons for limiting what a search engine indexes on your site. See my Digital Security Report for more information.

 

Update to WordPress 2.2.2

Monday, August 6th, 2007

If you are using WordPress 2.2.1 you should immediately get the 2.2.2 security update.

The discovered bug is a Cross-Site Scripting vulnerability. See http://trac.wordpress.org/ticket/4689 for more details.

The WordPress developers assigned this bug a priority of “highest omg bbq” :-)

