Joined: Jan 2001
Posts: 97
Enthusiast
I have this one visitor in my boards that has looked at each and every post in my forum.

Using the domain name hack, in the Who's Online thingy, I see that this "person" is spider-ta071.proxy.aol.com (152.163.205.76).

Does that look like an indexing spider to you?

Brew
CustomShowCars.com
OldHouseForums.net
PoliticalForums.net



Brew
CustomShowCars.com
OldHouseForums.com
PoliticalForums.net
pcgnetworks.com
ut2003news.com
rtcwnews.com
bf1942news.com
nolf2news.com
Joined: May 1999
Posts: 3,039
Guru
Looks like a spider to me. I know there are several search engines that are indexing dynamic pages now. The main reason they never did before is on many sites the spider would end up in a loop because everything points back to the same site and there are a ton of different links to follow. Apparently they are starting to figure out how to keep this from happening.


UBB.threads Developer
Joined: May 2000
Posts: 15
Journeyman
Scream,
In reply to:

I know there are several search engines that are indexing dynamic pages now


By chance could you list one of these search engines? I'm curious :)

- Da Birk
http://www.ncaastrategies.com

Joined: Aug 2000
Posts: 3,590
Moderator
Is it a problem that spiders index our forums?

They obviously don't register so they won't enter and index closed/private boards, so as long as they don't go posting strange stuff I'd say it's okay. And so far they only handle basic CGI (the GET method) and not POST so it should be easy to prevent.

This 'scan everything' of course affects the 'read' counters but I don't see how this can be prevented... We would need a spider filter then and keeping such a beastie up-to-date would be a real hassle!
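For what it's worth, a spider filter for the 'read' counters could start out as small as the sketch below. It is only an illustration in Python, not anything from w3t itself: the function name should_count_view and the token list are made up for the example, and the list is exactly the part that would need the ongoing upkeep mentioned above.

```python
# Substrings that some crawlers put in their User-Agent header.
# Illustrative only: a real list would be longer and needs regular updates.
KNOWN_SPIDER_TOKENS = ("googlebot", "slurp", "examplebot")


def should_count_view(user_agent):
    """Return True if a page view should bump the 'read' counter."""
    ua = (user_agent or "").lower()
    return not any(token in ua for token in KNOWN_SPIDER_TOKENS)


browser = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"
crawler = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
print(should_count_view(browser), should_count_view(crawler))
```

A spider that sends a browser-like user agent slips straight through, so this only filters the honest ones.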

If the spiders are well-behaved they should respect the 'robots' META tag and a simple "noindex,nofollow" should do the trick. Put this into the header.include file and you're set, something like this: <META NAME="robots" CONTENT="noindex,nofollow">
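To check that the tag really ends up in the generated pages, something like this Python sketch could be run against a saved copy of a forum page. The class and function names are made up for the example, and the sample page is a stand-in rather than real forum output. It accepts the tag whether it is written with NAME or HTTP-EQUIV.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any robots META tag found in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Accept both the NAME= and the HTTP-EQUIV= spelling.
        label = attr.get("name") or attr.get("http-equiv") or ""
        if label.lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head><body>hi</body></html>'
print(robots_directives(page))
```

An empty set means the header.include change did not make it into the page.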


Yours,

Per Gøtterup
System Administrator, NetGroup A/S



Joined: Jan 2000
Posts: 111
Kahuna
Indeed it is a problem. I saw that one was indexing all user profiles, and this is not something users would like to get indexed! That's why I hacked my viewprofile.pl to be available only when users are logged in. The additional unwanted traffic could also be a problem on hosted sites.

You can also edit your robots.txt to exclude the directory that has your w3t. Just use
User-agent: *
Disallow: /forum/
where /forum/ is the directory where you have your w3t. Here, for the perl version, it would be
User-agent: *
Disallow: /perl/
Put this robots.txt in the html root of your domain, e.g. www.yourdomain.com/robots.txt

Another trick would be to use .htaccess to disallow any particular spider:
<Limit GET>
Deny from spider
</Limit>
Replace spider with the IP of the spider.
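If you want to test a robots.txt rule before uploading it, Python's standard urllib.robotparser can evaluate it offline. This is just a sketch: "ExampleBot" is a made-up spider name, and the sample puts a "User-agent: *" line in front of the Disallow rule because a robots.txt record needs one.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, as it would appear in www.yourdomain.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is a hypothetical user agent name used only for this check.
forum_ok = parser.can_fetch("ExampleBot", "http://www.yourdomain.com/forum/viewprofile.pl")
home_ok = parser.can_fetch("ExampleBot", "http://www.yourdomain.com/index.html")
print(forum_ok, home_ok)
```

This only shows what a well-behaved spider would conclude from the file. It does nothing against software that never reads robots.txt.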

Gerrit
SpiritBoard
http://www.channeling.net

Edited by Gerrit on 04/15/01 04:32 AM (server time).


Joined: Jul 2000
Posts: 82
Member
In reply to:

The main reason they never did before is on many sites the spider would end up in a loop...


I doubt this is really the reason. Lots of sites have a "site map" link on every page (including static ones) which would cause looping if the search engine spiders were really that dumb. I don't see any reason why they can't avoid looping problems by simply keeping a list of URLs they have visited and then comparing to that list before following a link. I suspect that they have been intentionally avoiding spidering dynamic pages because they figure that the pages change too rapidly for the search engine's index to keep up with them (most search engines take weeks or months to update a changed page).
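The visited-list idea fits in a few lines. Here is a minimal Python sketch that walks a made-up, in-memory link graph instead of fetching real pages. Every page links back to the site map, yet each URL is processed exactly once.

```python
from collections import deque


def crawl(start, links):
    """Visit every page reachable from start exactly once, breadth first."""
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already seen, so a link loop cannot trap us
        visited.add(url)
        order.append(url)
        for target in links.get(url, []):
            if target not in visited:
                queue.append(target)
    return order


# Hypothetical site where every page points back at the site map.
site = {
    "/index": ["/sitemap", "/a"],
    "/sitemap": ["/index", "/a", "/b"],
    "/a": ["/sitemap", "/index"],
    "/b": ["/sitemap"],
}
print(crawl("/index", site))
```

Dynamic pages are harder mainly because the same content can hide behind many different query strings, which a plain URL comparison like this one treats as distinct pages.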

Bill Dimm, MagPortal.com - free feeds for your site.


Joined: Jul 2000
Posts: 82
Member
Brew,

Do you have access to the log files on your server? If you do, the best way to tell if it's a legitimate spider is to look at the "user agent" field in the log file. This is the part that often looks like:
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"
Legitimate spiders will supply a user agent name that can be looked up in the Web Robots Database.
You might also want to look at its behavior. Does it read /robots.txt (if yes, good)? Does it hit your site more than a few times per minute (if yes, bad)? The spiders from the major search engines are all very well behaved. There is an awful lot of other software out there which is written/used by idiots and you really don't want it on your site. If it's being really aggressive about which pages it's visiting, it is more likely looking for email addresses for a spam list than looking to index your pages for a search engine. We've had tons of them going through our site. For very good info on that see protect your site from spam harvesters.
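Those checks can be scripted. The Python sketch below assumes your server writes the common "combined" log format (the one with the user agent at the end). The sample lines are invented for illustration and use reserved documentation addresses, not real visitors. For each IP it records the user agents seen, whether /robots.txt was requested, and how many hits landed in each minute.

```python
import re
from collections import Counter, defaultdict

# Matches one line of the "combined" access log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Group log lines by client IP and collect the facts worth checking."""
    summary = defaultdict(
        lambda: {"agents": set(), "read_robots": False, "per_minute": Counter()}
    )
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        entry = summary[match["ip"]]
        entry["agents"].add(match["agent"])
        if match["path"] == "/robots.txt":
            entry["read_robots"] = True
        # Keep the timestamp up to the minute, e.g. "15/Apr/2001:04:32".
        entry["per_minute"][match["time"][:17]] += 1
    return summary


# Invented sample data for the sketch.
sample = [
    '192.0.2.10 - - [15/Apr/2001:04:32:10 -0500] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [15/Apr/2001:04:33:40 -0500] "GET /forum/ HTTP/1.0" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [15/Apr/2001:04:32:01 -0500] "GET /forum/1 HTTP/1.0" 200 900 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"',
    '192.0.2.20 - - [15/Apr/2001:04:32:02 -0500] "GET /forum/2 HTTP/1.0" 200 900 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"',
    '192.0.2.20 - - [15/Apr/2001:04:32:03 -0500] "GET /forum/3 HTTP/1.0" 200 900 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)"',
]

report = summarize(sample)
for ip, info in sorted(report.items()):
    peak = max(info["per_minute"].values())
    print(ip, sorted(info["agents"]), "robots.txt:", info["read_robots"], "peak/min:", peak)
```

A visitor that never asks for /robots.txt and shows a high peak per minute fits the "bad" pattern described above. What counts as too high is a judgment call for your own site.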

I have no idea what the "spider" in spider-ta071.proxy.aol.com is supposed to mean (maybe you should ask AOL), but I can tell you that MagPortal.com has had visits from similar IP addresses which trace back to similar "spider" domains on AOL, and they are not search engine spiders. They appear to be regular users simply browsing around. Quite likely you are seeing an AOL user using software to skim your site to build a spam list. Congratulations :(


Bill Dimm, MagPortal.com - free feeds for your site.

Joined: Jan 2001
Posts: 97
Enthusiast
I've never really thought about spiders being used for a spam list or such. And I've never used a spider.txt file either (no surprise here huh).

I have had one visitor that I'm not really sure about. It may be just someone from this site because they have been on both politicalforums.net and oldhouseforums.net. The visitor comes from webtop.com and has been on my forums for three straight days now. I've been to the webtop.com site and it seems to be a local provider but why would just a regular kinda guy be on a forum every single hour of the day for three days? I mean a person has to sleep ya know.

Another odd thing about this particular visitor is that he has looked at a few of the boards that only have three or four posts but the log file says they have looked at the board nearly a hundred times.....I wonder what's up with that? Looking at the same post over and over?

I included an attachment that shows his details. I don't mind someone that's checking out the forums....That's what they're there for. It just seems odd is all.

I'll look into the spamming spiders now a little closer. Thanks for the heads up!

Brew
CustomShowCars.com
OldHouseForums.net
PoliticalForums.net

Attachments
10-34355-screenie.jpg (0 Bytes, 26 downloads)


Brew
CustomShowCars.com
OldHouseForums.com
PoliticalForums.net
pcgnetworks.com
ut2003news.com
rtcwnews.com
bf1942news.com
nolf2news.com
Joined: Jul 2000
Posts: 82
Member
In reply to:

And I've never used a spider.txt file either


It's "robots.txt". You can read more at Standard for Robot Exclusion.

Unfortunately, all robots.txt does is tell well-behaved spiders what you would like them to do (or not do). The really annoying software (not used by legitimate search engines) ignores robots.txt completely so it doesn't help. You can ban an IP address under Apache by creating a ".htaccess" in the directory and putting in an entry like:
deny from insert_IP_address_here
for each IP to be banned.
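If the ban list gets long, the entries can be generated instead of typed. A small Python sketch follows. The function name deny_lines is made up, the addresses are reserved documentation examples, and the output uses the same "deny from" form as above. Newer Apache releases (2.4 and later) prefer Require directives, so check what your server version expects.

```python
import ipaddress


def deny_lines(addresses):
    """Build one 'deny from' line per address, rejecting malformed input."""
    lines = []
    for raw in addresses:
        # ip_address() raises ValueError for anything that is not a valid IP,
        # which keeps typos out of the .htaccess file.
        ip = ipaddress.ip_address(raw.strip())
        lines.append(f"deny from {ip}")
    return "\n".join(lines)


# Example addresses only; substitute the IPs taken from your own logs.
print(deny_lines(["192.0.2.10", " 192.0.2.20 "]))
```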

I don't know why you say webtop.com looks like a local provider. It looks like some sort of search tool that is integrated with your browser to me. There is software out there which you can install to make it periodically sample a page on a web site and notify you when the page changes--I don't know if webtop has that sort of functionality or not, but if it does that might be what you're seeing.

Bill Dimm, MagPortal.com - free feeds for your site.


Joined: Aug 2000
Posts: 3,590
Moderator
Well, I can see the spam collection angle and understand you on that. But the way many search engines work is to improve your rating (higher position) the more pages you have featuring your keywords, so a forum where people discuss some topic all the time will improve your rating on those keywords significantly. I for one like to get a higher rating... ;)


Yours,

Per Gøtterup
System Administrator, NetGroup A/S



The UBB.Developers Network (UBB.Dev/Threads.Dev) is ©2000-2024 VNC Web Services

 