Contents of this page
Introduction
Is it a human? is it a spider? how do I tell the difference?
How do I find out what a particular agent is?
Tips on searching for user agents in search engines
Identifying search engines and other agents that visit your site isn't rocket science, but it can be a painstaking process with a real possibility of failure. This page describes some of the methods I've used to track down the search engine spiders, webbots and other user agents that visit my site.
First you need to have access to your server logs. How you access these will depend on your ISP. Some of the free ISPs may not grant access to such logs. Others may grant FTP access to full logs. I have a sample log file explained if you're unfamiliar with them.
The server log typically contains a one-line entry for each "hit" on your site, where a "hit" in this context is usually a request for an HTML page, an image file, a style sheet (.css) file, or whatever else you're serving from your site. Each entry contains many fields, but the ones of interest here are the visitor's IP address, the referring URL and the user agent.
Of these three bits of information only the first can be relied upon, and will always be present. Often your ISP will have done a DNS lookup and give you the node name, rather than the IP number. The remaining two are set optionally by the user agent visiting your site and depending on your ISP you may not have access to the referring URL or user agent. Without these your search is pretty much at an end, unless the IP/DNS is sufficient to identify your visitor. If you don't have this information in your logs ask your ISP if it can be enabled.
You may occasionally want to refer to other fields, such as the number of bytes transferred and the HTTP status code, when trying to determine the webbot's "behaviour" with respect to the content on your site.
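As a rough illustration, the fields discussed above can be pulled out of a log line with a regular expression. This is only a sketch: it assumes your server writes the common Apache/NCSA "combined" log layout, which may not match what your ISP gives you, and the sample line, the address 192.0.2.10 and the agent name "ExampleBot" are invented for the example.

```python
import re

# Assumption: Apache/NCSA "combined" log format. Adjust if your logs differ.
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_entry(line):
    """Return the interesting fields of one log line, or None if it doesn't match."""
    m = COMBINED.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # A "-" means the visitor (or the server) did not supply the field.
    for key in ("referrer", "agent"):
        if fields[key] == "-":
            fields[key] = None
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

# Invented sample line, for illustration only.
sample = ('192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 "-" "ExampleBot/1.0"')
entry = parse_entry(sample)
print(entry["host"], entry["referrer"], entry["agent"])
```

Lines that don't fit the assumed layout come back as None, so they can be counted or inspected by hand rather than silently mis-parsed.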
Humans, spiders and webbots have different patterns of browsing. You can't be absolute about this, but here are some guidelines:
User agents that are being driven by humans will usually come from a variety of IP addresses that will often have a DNS lookup that is recognisable as being an ISP. However, this pattern will only become apparent if you get multiple visits from users using the same agent.
Most spiders always come from the same range of IP addresses, and these addresses will often have the same domain name as the parent site (e.g. piano.excite.com is one of Excite's spider engines).
Webbots may come from the same IP address each time (usually indicating that they are some form of web-based service), or from a user's ISP IP address (indicating it is some piece of software running on a user machine).
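The guidelines above amount to counting how many different addresses each agent arrives from. Here is a minimal sketch of that idea, assuming you already have (host, agent) pairs extracted from your log. The host names, agent names and the threshold of two addresses are all made up for illustration, and the "guess" is only a hint, not a verdict.

```python
from collections import defaultdict

def hosts_per_agent(entries):
    """Map each user agent string to the set of hosts it was seen from."""
    seen = defaultdict(set)
    for host, agent in entries:
        seen[agent].add(host)
    return dict(seen)

def rough_guess(hosts, few=2):
    # 'few' is an arbitrary illustrative threshold, not a rule.
    if len(hosts) <= few:
        return "few addresses: spider or web-based service?"
    return "many addresses: software on users' machines?"

# Hypothetical visits, for illustration only.
visits = [
    ("crawl1.spider.example", "ExampleSpider"),
    ("crawl1.spider.example", "ExampleSpider"),
    ("dial1.isp-a.example", "ExampleGrabber"),
    ("dial7.isp-b.example", "ExampleGrabber"),
    ("cable3.isp-c.example", "ExampleGrabber"),
]
grouped = hosts_per_agent(visits)
for agent, hosts in sorted(grouped.items()):
    print(agent, len(hosts), rough_guess(hosts))
```

With only a handful of visits the counts mean little, so treat the result as a prompt for a closer look at the addresses themselves.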
You can work out the agent from its name. First look at my webbots page. I've tracked down dozens of these already, so it's worth seeing if I got there first. Currently that page is updated every few weeks, and new entries are added at the rate of around 10/month.
If it's not yet there, take the following step :-
More usually the agent field doesn't have this information. So now look at the IP address(es) that you've seen associated with this user agent. If you've seen many visits then you'll know if they come from a single IP address (or a set of similar IP addresses) or from a wide range.
If the IP address has no DNS lookup (i.e. it's all numbers), then either run TRACERT to trace a route to that IP address, or use something like VisualRoute to do so. This will give you a series of IP "hops" that will route data from your machine to the specified IP address. The first few hops will be close to your machine, and will probably all be to do with your ISP, but the last few hops will be close to the machine that visited you. Often when there's a numerical IP address, running a trace will reveal a more meaningful DNS name a few hops before the destination is reached. Using a utility such as VisualRoute can yield a lot of information about these last few hops.
Again, if the last few hops yield a meaningful DNS lookup, then extract the domain name and visit the web site.
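Before reaching for a trace, it can be worth attempting a reverse DNS lookup yourself, and once you have a name, cutting it down to the parent domain. The sketch below uses Python's standard socket.gethostbyaddr for the lookup. The helper names are my own, and parent_domain simply keeps the last two labels, which is an assumption that suits names like piano.excite.com but is wrong for country-style domains such as example.co.uk.

```python
import socket

def reverse_lookup(ip):
    """Return the DNS name for an IP address, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def parent_domain(hostname):
    """Naively keep the last two labels of a DNS name (assumption: .com-style names)."""
    labels = hostname.rstrip(".").split(".")
    if len(labels) < 2 or all(label.isdigit() for label in labels):
        return None  # too short, or still a numerical IP address
    return ".".join(labels[-2:])

def site_url(hostname):
    """Guess the web site to visit for a given DNS name."""
    domain = parent_domain(hostname)
    return None if domain is None else "http://www." + domain + "/"

print(parent_domain("piano.excite.com"))
print(site_url("piano.excite.com"))
```

reverse_lookup needs a working network connection, so it is defined here but not called. A None result means you are back to tracing the route by hand.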
You can also try visiting the IP address directly in your browser, using a URL of the form
http://nnn.nnn.nnn.nnn
where "nnn.nnn.nnn.nnn" is the numerical IP address. This will sometimes be routed to the parent web site, but more commonly it will either simply time out or display some standard screen or error message that will tell you nothing about the owner. Sometimes it's worth repeating this a few weeks later, as webbot owners notice attempts to access this node, and put up a web page explaining what they're doing.
If you can't identify the source from a single IP, or if the user agent shows signs of being run on various user machines, then you'll need to turn your attention to the user agent name and see if you can track this down using search engines.
One last trick before we resort to using search engines. Look at the user agent text. Does it have a snappy name like webtwin, webstripper or the like? If it does, try constructing a domain name from this, usually by adding www. and .com before and after. Chances are that if www.webtwin.com exists (and it does), it will be the web site describing this particular agent.
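The domain-guessing trick is easy to sketch in code. The function below is hypothetical: it takes the leading run of letters, digits and hyphens in the agent string as the "name" and wraps it in www. and .com. It does not check that the site exists, and for browser-style strings beginning "Mozilla/..." the guess will be useless.

```python
import re

def candidate_site(agent):
    """Guess a web site name from the first word of a user agent string."""
    m = re.match(r"[A-Za-z][A-Za-z0-9-]*", agent.strip())
    if m is None:
        return None
    name = m.group(0).rstrip("-").lower()
    return "www." + name + ".com"

print(candidate_site("webtwin"))
print(candidate_site("WebStripper/2.03"))
```

The version suffix on the second example is invented. Only the names webtwin and webstripper come from the text above.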
That didn't work? Oh dear. Now things get tough.
If the user agent is named after a common word (like "Jack") you probably should give up now while you've got some hair left. Searching for a user agent by name can be a needle in a haystack job, and you need to have a good understanding of how to refine searches using a suitably powerful search engine. I usually use a mixture of Altavista and Google as between them they have a comprehensive reach, produce sensible results, and can be highly refined (particularly Altavista).
Your problems come from a number of sources :-
First try entering the user agent string (or a suitable substring) into your search engine as a search phrase. In doing these searches it's best to omit any version numbers etc. in the string. Look at the number of results, many of which will be web server logs, unless the agent is comparatively new or rare.
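Trimming the version numbers by hand gets tedious if you have many agents to look up. Here is one possible way to turn an agent string into a quoted search phrase. The regular expressions are my own guess at what usually counts as version noise (parenthesised comments, and numbers following a slash or space), and the agent strings are invented examples.

```python
import re

def search_phrase(agent):
    """Strip comments and version numbers, then quote what is left."""
    text = re.sub(r"\([^)]*\)", " ", agent)             # drop "(...)" comments
    text = re.sub(r"[/\s]v?\d+(\.\d+)*\w*", " ", text)  # drop "/1.0", " v2.03b"
    text = " ".join(text.split())                       # tidy the whitespace
    return '"' + text + '"'

print(search_phrase("ExampleBot/1.0 (compatible; test)"))
print(search_phrase("Web Stripper 2.03b"))
```

Note that this throws away anything in parentheses, so check the original string first in case it contains a URL or email address that identifies the agent outright.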
Assuming you've got too many hits, try some or all of the following :-
-title:statistics -title:stats -title:access -title:agents
i.e. exclude pages with the word "statistics" etc. in the title. Of course we may just have lost the page we want, so another option would be to look for pages not containing another common user agent's text.
+url:<word> -title:statistics -title:stats -title:access -title:agents
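If you find yourself retyping these refinements, a small helper can assemble them. The sketch below only builds a query string in the syntax shown above. The function name and its parameters are hypothetical, and it makes no assumption about which search engine will accept the result.

```python
EXCLUDED_TITLE_WORDS = ("statistics", "stats", "access", "agents")

def refined_query(phrase, url_word=None, excluded=EXCLUDED_TITLE_WORDS):
    """Build a search query: quoted phrase, optional +url: term, -title: exclusions."""
    parts = ['"' + phrase + '"']
    if url_word:
        parts.append("+url:" + url_word)
    parts.extend("-title:" + word for word in excluded)
    return " ".join(parts)

print(refined_query("webtwin"))
print(refined_query("webtwin", url_word="webtwin"))
```

Passing a different tuple as excluded lets you drop or add title words when the default exclusions turn out to hide the page you want.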
Once you get down to a sensible number of pages (<30), start looking at these pages to see if any tell you about the agent. Usually such pages will stand out through their title, description or (sometimes) their URL. Good URLs to check are from software library sites like DaveCentral and ZDNet as these will often describe the software involved better (or be higher placed in the results) than the product's own homepage.
If the search engines don't yield a result, try a visit to DejaNews searching for the desired user agent string. It's most likely you'll simply find posts from people quoting parts of their server logs, and you may even find posts asking what the agent is. If you find such a post, follow the thread to see if anyone posted a useful answer (most times they won't if all the above methods have failed). In a last gasp of desperation you could email the original poster to see if they ever found out. If nothing else you'll get a shoulder to cry on.
Of course, if you finally succeed, please do drop me a line at info@jafsoft.com so I can add it to my list :-)
This page is © 2000-2004 John A Fotheringham. It may not be
reproduced without permission,
although you are welcome to save a copy for personal use to your hard disk.