Looks like a spider to me. I know there are several search engines that are indexing dynamic pages now. The main reason they never did before is on many sites the spider would end up in a loop because everything points back to the same site and there are a ton of different links to follow. Apparently they are starting to figure out how to keep this from happening.
They obviously don't register so they won't enter and index closed/private boards, so as long as they don't go posting strange stuff I'd say it's okay. And so far they only handle basic CGI (the GET method) and not POST so it should be easy to prevent.
This 'scan everything' of course affects the 'read' counters but I don't see how this can be prevented... We would need a spider filter then and keeping such a beastie up-to-date would be a real hassle!
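A spider filter for the 'read' counters could be as small as the sketch below. It is only an illustration in Python (the board itself is Perl), and the token list, `is_probable_spider` and `record_read` are hypothetical names, not part of w3t. As noted above, the list would need regular upkeep.

```python
# Hypothetical list of User-Agent substrings; it is an assumption and
# would have to be kept up to date by hand.
KNOWN_SPIDER_TOKENS = ("googlebot", "slurp", "scooter", "spider", "crawler")


def is_probable_spider(user_agent):
    """Return True if the User-Agent string contains a known spider token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_SPIDER_TOKENS)


def record_read(counters, post_id, user_agent):
    """Bump the read counter for post_id unless the visitor looks like a spider."""
    if not is_probable_spider(user_agent):
        counters[post_id] = counters.get(post_id, 0) + 1
    return counters
```

A spider that sends a browser-like User-Agent would still be counted, so this only filters the honest ones.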
If the spiders are well-behaved they should respect the 'robots' META tag, and a simple "noindex,nofollow" should do the trick. Put this into the header.include file and you're set, something like this: <META NAME="robots" CONTENT="noindex,nofollow">
Yours,
Per Gøtterup, System Administrator, NetGroup A/S
Indeed it is a problem. I saw that one was indexing all user profiles, which is not something users would like to get indexed! That's why I hacked my viewprofile.pl to be available only when users are logged in. The additional unwanted traffic could also be a problem on hosted sites.

You can also edit your robots.txt to exclude the directory that holds your w3t. Use a User-agent: * line followed by Disallow: /forum/ where /forum/ is the directory where you have your w3t. Here, for the perl version, it would be Disallow: /perl/. The robots.txt has to sit in the HTML root of your domain, e.g. www.yourdomain.com/robots.txt

Another trick would be to use .htaccess to shut out a particular spider:

<Limit GET>
Deny from spider
</Limit>

Replace spider with the IP of the spider.
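To check what a robots.txt rule like the one above actually blocks, Python's standard `urllib.robotparser` can evaluate it offline. This is a rough sketch: the `/forum/` path and the www.yourdomain.com URLs come from the post, `AnyBot` and `showflat.pl` are made-up examples, and the `User-agent: *` line is included because a robots.txt record needs one before any `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; /forum/ stands in for the w3t directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if a well-behaved spider may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "AnyBot" is a hypothetical spider name used only for illustration.
print(allowed(ROBOTS_TXT, "AnyBot", "http://www.yourdomain.com/forum/showflat.pl"))
print(allowed(ROBOTS_TXT, "AnyBot", "http://www.yourdomain.com/index.html"))
```

This only tells you what a spider that honors robots.txt would do. It says nothing about software that ignores the file.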
"The main reason they never did before is on many sites the spider would end up in a loop..."
I doubt this is really the reason. Lots of sites have a "site map" link on every page (including static ones) which would cause looping if the search engine spiders were really that dumb. I don't see any reason why they can't avoid looping problems by simply keeping a list of URLs they have visited and then comparing to that list before following a link. I suspect that they have been intentionally avoiding spidering dynamic pages because they figure that the pages change too rapidly for the search engine's index to keep up with them (most search engines take weeks or months to update a changed page).
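The "keep a list of visited URLs" idea can be sketched in a few lines. This is not how any particular search engine works, just a minimal illustration. `SITE` is a made-up three-page site where every page links to every other one, and `get_links` is a hypothetical stand-in for fetching a page and extracting its links.

```python
from collections import deque


def crawl(start_url, get_links):
    """Breadth-first crawl that never processes the same URL twice."""
    visited = set()
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already seen: this check is what breaks the loop
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# Hypothetical site: every page links back to the others, like a
# "site map" link on every page.
SITE = {
    "/index": ["/sitemap", "/forum"],
    "/sitemap": ["/index", "/forum"],
    "/forum": ["/index", "/sitemap"],
}

pages = crawl("/index", lambda url: SITE.get(url, []))
print(pages)  # ['/index', '/sitemap', '/forum']
```

Despite the circular links, each page is visited exactly once and the crawl terminates.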
Bill Dimm, MagPortal.com - free feeds for your site.
Do you have access to the log files on your server? If you do, the best way to tell if it's a legitimate spider is to look at the "user agent" field in the log file. This is the part that often looks like: "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)" Legitimate spiders will supply a user agent name that can be looked up in the Web Robots Database. You might also want to look at its behavior. Does it read /robots.txt (if yes, good)? Does it hit your site more than a few times per minute (if yes, bad)? The spiders from the major search engines are all very well behaved. There is an awful lot of other software out there which is written/used by idiots and you really don't want it on your site. If it's being really aggressive about which pages it's visiting, it is more likely looking for email addresses for a spam list than looking to index your pages for a search engine. We've had tons of them going through our site. For very good info on that, see "protect your site from spam harvesters".
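The checks described here (which user agent, did it read /robots.txt, how many hits per minute) can be pulled out of a log with a short script. The sketch below assumes the server writes the Apache "combined" log format. If your log format differs, the pattern needs adjusting. The sample lines, IP addresses and `ExampleBot` are invented for illustration.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Return hits per (host, minute), hosts that read /robots.txt, agents per host."""
    hits_per_minute = Counter()
    read_robots = set()
    agents = {}
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        host = m.group("host")
        minute = m.group("time")[:17]  # e.g. "10/Oct/2000:13:55"
        hits_per_minute[(host, minute)] += 1
        agents.setdefault(host, set()).add(m.group("agent"))
        parts = m.group("request").split()
        if len(parts) >= 2 and parts[1] == "/robots.txt":
            read_robots.add(host)
    return hits_per_minute, read_robots, agents


# Invented sample data for illustration only.
SAMPLE = [
    '192.0.2.10 - - [10/Oct/2000:13:55:01 -0700] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [10/Oct/2000:13:55:40 -0700] "GET /forum/ HTTP/1.0" 200 2326 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [10/Oct/2000:13:56:02 -0700] "GET /forum/ HTTP/1.0" 200 2326 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"',
]

hits, robots_readers, agents = summarize(SAMPLE)
for (host, minute), count in sorted(hits.items()):
    print(host, minute, count, "read robots.txt" if host in robots_readers else "")
```

A host with a high per-minute count that never appears in `robots_readers` is the kind of visitor worth a closer look.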
I have no idea what the "spider" in spider-ta071.proxy.aol.com is supposed to mean (maybe you should ask AOL), but I can tell you that MagPortal.com has had visits from similar IP addresses which trace back to similar "spider" domains on AOL, and they are not search engine spiders. They appear to be regular users simply browsing around. Quite likely you are seeing an AOL user using software to skim your site to build a spam list. Congratulations :(
Bill Dimm, MagPortal.com - free feeds for your site.
I've never really thought about spiders being used for a spam list or such. And I've never used a robots.txt file either (no surprise here huh).
I have had one visitor that I'm not really sure about: webtop.com. It may be just someone from this site because they have been on both politilcalforums.net and oldhouseforums.net. This visitor has been on my forums for three straight days now. I've been to the webtop.com site and it seems to be a local provider, but why would just a regular kinda guy be on a forum every single hour of the day for three days? I mean a person has to sleep ya know.
Another odd thing about this particular visitor is that he has looked at a few of the boards that only have three or four posts but the log file says they have looked at the board nearly a hundred times.....I wonder what's up with that? Looking at the same post over and over?
I included an attachment that shows his details. I don't mind someone that's checking out the forums....That's what they're there for. It just seems odd is all.
I'll look into the spamming spiders a little closer now. Thanks for the heads up!
Unfortunately, all robots.txt does is tell well-behaved spiders what you would like them to do (or not do). The really annoying software (not used by legitimate search engines) ignores robots.txt completely, so it doesn't help. You can ban an IP address under Apache by creating a ".htaccess" file in the directory and putting in an entry like: deny from insert_IP_address_here for each IP to be banned.
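If the ban list grows, the "deny from" entries can be generated rather than typed by hand. A small sketch follows. `deny_lines` is a hypothetical helper and the addresses are placeholder documentation IPs, not real offenders. It only builds the text, and writing it into the ".htaccess" file is left to you.

```python
import ipaddress


def deny_lines(addresses):
    """Build one 'deny from' line per IP address, rejecting malformed input."""
    lines = []
    for addr in addresses:
        ipaddress.ip_address(addr)  # raises ValueError if addr is not a valid IP
        lines.append(f"deny from {addr}")
    return "\n".join(lines)


# Placeholder addresses for illustration only.
print(deny_lines(["192.0.2.10", "192.0.2.99"]))
```

Validating each address first keeps a typo from ending up in the server configuration.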
I don't know why you say webtop.com looks like a local provider. It looks like some sort of search tool that is integrated with your browser to me. There is software out there which you can install to make it periodically sample a page on a web site and notify you when the page changes--I don't know if webtop has that sort of functionality or not, but if it does that might be what you're seeing.
Bill Dimm, MagPortal.com - free feeds for your site.
Well, I can see the spam collection angle and understand you on that. But many search engines give you a better rating (higher position) the more pages you have featuring your keywords, and a forum where people discuss some topic all the time will improve your rating on those keywords significantly. I for one like to get a higher rating... ;)
Yours,
Per Gøtterup, System Administrator, NetGroup A/S