Internal Matters
My weblog has disappeared – at Google, that is. A search for andrea.editthispage.com yields – nothing.
Not only are other people not able to find Al’s fried rice recipe (one of the most popular search strings), I’m not able to search my site using Google any more. And the Manila search engine only indexes the messages that get posted to the home page.
If this doesn’t change, I’ll have to think about moving my weblog. The problem is that as much as I’d like to do that, I don’t have the money to pay for (Manila) webhosting – at least not until I’ve finished my degree and have a paying job.
History
Indivisible – Stories of American Community “is an exploration of community life in America by some of this country’s most accomplished photographers, radio producers, and folklorists. Here are the stories of twelve communities where people are coming together to make their small piece of the world a better place to live.”
The project also includes an exhibition that is currently touring the USA.
By the way, have you ever seen a four-horned sheep?
Snakes
Visit the American International Rattlesnake Museum in Albuquerque, New Mexico. This animal conservation museum has “artifacts, memorabilia, and the largest collection of different species of live rattlesnakes in the world”. The best part of the site is the section about the different types of rattlesnakes.
sandra wasn’t too thrilled with ‘oscar’ the bullsnake living under our outdoor fireplace. she got on the ‘net and found out bull snakes can have up to 24 babies, by the end of june.
i put some gloves on, picked him up (he was very demure), and walked him down to the end of our property. he’ll find a new home, i’m sure. besides, he’ll keep the mouse population down … but i doubt you’ll see sandra walk down there for the next six months!
(grin)
Although snakes are really fascinating animals, I’m sure I wouldn’t want one of those fellas living in my garden or near the house either. Even if they are not venomous.
As far as I know, very few snakes live in Germany, and I’ve never seen one. My closest encounter with a snake was while climbing Angel’s Landing in Zion.
in reference to google, i think jeremy over at irights has an article about it (about three or four stories down). it seems google did its usual update, but it happened while the etp servers were malfunctioning or down.
Want to know why your site isn’t listed on Google anymore? Because Userland didn’t like the load that Google’s indexer was putting on their servers, so they have prevented Google from indexing any of their sites.
Want proof? I sent an HTTP GET request for your site, with a special header that allowed me to masquerade as the Google Bot, which is their indexer. Here’s what I got back:
Inktomi, your crawler is repeatedly hitting our servers getting the same WAP files over and over. Please stop pounding us, it’s hurting the service we provide to our customers. Thanks. webmaster@userland.com.
(Sorry that text is so big, but that’s exactly what they sent back.)
This shows that they’ve shut off more than just the Google Bot, because “Inktomi” is a completely different company; they make a bot called “Slurp”.
Yikes… makes me wonder several things. What exactly did you do to masquerade as a GoogleBot? (hey, I’m just curious!)
…and… well, what next for those of us who like all the random search requests that come through Google? how to fix this? go to weblogger.com?
Susan
Thanks, Seth… I’ve updated my blog to reflect that info.
Perhaps we need to petition Inktomi to stop whacking Userland’s servers, per Dave’s request. IMHO, denying that robot was perfectly reasonable given its behavior… bear in mind I’m a techie and so my opinion may be suspect :-)
I can’t think of a good way to set up a Manila site for petitions, so we need to come up with something else. Are we interested in trying to band together and at least get some sort of official response from Inktomi? Together with a push from Dave we might even get motion.
I can finish my e-mail program here in a bit and we could all send e-mails to me, which I could use to generate a page on iRights (or indeed, any Manila site) with a list of all the interested parties. Are we interested?
userland is trying to reinvent the wheel. why don’t they just set up a robots.txt that closes the /wap/ directory for crawlers? that’s easy and it will solve the problem.
instead, they use a new mr-script that shows every browser the middle finger.
the fix? there is no fix. this is a free service, i can completely understand if userland tries to minimize the traffic.
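(For reference, a robots.txt along the lines suggested above is only a couple of lines. This assumes the WAP pages really are served under a /wap/ path, which is just a guess based on the message quoted earlier:)

    # keep all crawlers out of the WAP pages
    User-agent: *
    Disallow: /wap/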
I followed a hint from André (just in case you don’t know the URL) and am moving my pages to Launchpoint. I pay for that, but there is also a free Manila hosting service there.
Maybe that’s an alternative for you, too.
And please, no corrections of my (D)English.
http://manilanewbies.userland.com/discuss/msgReader$5517 is quite interesting.
If they just wanted to shut off Inktomi, why did they shut off Google, also? Our experience at Free-Conversant is that Google provides a lot more hits than any of the other search engines, and I would expect the same at editthispage.com. (Conversant isn’t based on Manila, and so may have different technical requirements than they do.)
What we did was, instead of shutting off access to the search engines, we just removed some of the links if the request came from a known search engine. We also added some ‘conditional macros’ that allow template designers to change the output of the page if the request came from a known robot (actually, that’s just one of many conditions they can test).
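(Conversant’s actual macros aren’t shown here; purely as an illustration of the approach Seth describes – checking the User-Agent against a list of known robots and trimming what gets rendered – a rough Python sketch might look like this:)

    # Rough sketch (not Conversant's actual code): detect known robots by
    # User-Agent and render a stripped-down page for them.
    KNOWN_ROBOTS = ("googlebot", "slurp", "ia_archiver")  # partial, illustrative list

    def is_robot(user_agent):
        ua = (user_agent or "").lower()
        return any(bot in ua for bot in KNOWN_ROBOTS)

    def render_page(parts, user_agent):
        if is_robot(user_agent):
            # Leave out navigation, edit links, and other churn so the
            # crawler only sees (and re-requests) the article text itself.
            return parts["body"]
        return parts["header"] + parts["body"] + parts["footer"]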
Duncan Smeed mentioned on his site that he likes that better than completely shutting off the indexers. ;-)
Seth
Andrea, I can resolve the apparent “contradiction” — Brent knows what our servers are actually doing, Seth is making the same old mistake Seth always makes, saying what we think, instead of asking.
(It’s disappointing to see him pitch his product in this context, btw.)
Many of the search engines have brutal buggy crawlers. We even started a mail list to try to engage them in conversation, to no avail.
To explain, some search engines just check the host name to determine if they are hitting the same site. Some search engines go by IP address. Userland hosts many (hundreds? thousands? I don’t know, I don’t work there) sites with different host names on the same IP address. So does almost everyone else.
Another hostile search engine is ia_archiver. It hits my site several times a day from 3 or 4 IP addresses at the same time. My site doesn’t change that often but it kept coming back, so I blocked it. I still let google in because I only have 3 sites on one IP address.
Wait a second. I know what your servers are doing, because I requested a page as google, and got that response. Want the script I used? You can see it for yourself.
I just checked, the “message to Inktomi” still comes through when you request a page as the googlebot. It’s possible that Brent forgot that Google uses two different User-Agent fields, and he may have only re-enabled one of them.
As for pitching my product: I purposely didn’t include a URL, so that it wouldn’t be taken that way. I was talking about what we did to solve the same problem, and I admitted that you might not have been able to do the same thing because we are, in fact, entirely different products.
Besides, Andrea already admitted that she can’t afford a paid service right now, so there was nothing for me to gain. (Right, Andrea?)
I was just trying to help answer a question that I’ve seen in a few places (here, on the SIT site, and on smeed.org). All editthispage.com sites are out of Google’s index, and it’s going to be a while before they’re all indexed again (especially if the crawler isn’t allowed access).
Andrea, I’m sorry this turned into a debate on your site. I don’t mind if you want to delete my messages.
Seth’s tests are not illuminating for a few reasons:
1. He’s not testing all our servers.
2. He’s not testing them all the time.
The policies vary over time and across different servers depending on whether we’re having trouble keeping them responsive.
That Doc’s site has been de-indexed by Google completely punctures Seth’s theory, which unfortunately he didn’t state as a theory, and so now it’s become part of the folklore that UserLand is nasty or whatever. I’m already getting flames thanks to Seth’s post here.
Anyway, when we started offering free sites in 1999 it was a different world. Now we have software that distributes the work. Radio UserLand has a great weblog editor that generates static sites that are very search-engine-friendly. They can pound them all they want with their sloppy algorithms.
Jake will post a how-to later today that explains how to move a UserLand-hosted Manila site to Radio. I hope people take a look. Let’s form a migration community, let’s solve the problem instead of fumbling around looking for someone to blame. In all likelihood, when and if we ever figure out what’s happening, it will turn out that there’s no one to blame.
fyi: on ndx google is responsible for about 50 or 60% of the traffic. this is excluding the bots.
i don’t understand today’s sn. if you blocked the crawlers it takes a while before they come back. i don’t think there is anything irregular with this.
1. No, I’m not testing all of your servers, but I’ve just tested Doc’s site and that server gives exactly the same response.
2. Actually, I’ve been testing a number of your sites on a regular basis, for the last 24 hours, and so far they’ve ALL responded the same way, EVERY TIME.
I’m not saying you’re a bad guy for this, but don’t tell everybody I’m wrong when I’m not.
People can run this test for themselves, if they want to. I’ve posted the script on my site.
http://www.TruerWords.net/659
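(Seth’s script is the one at that URL. For readers who just want the idea, here is a rough Python sketch of that kind of test; the site URL is only an example, and the User-Agent string shown is one Googlebot has used – as Seth notes above, Google has used more than one, so a thorough test would try each of them:)

    # Fetch a page while presenting Googlebot's User-Agent string and print
    # whatever the server sends back.
    import urllib.request

    URL = "http://andrea.editthispage.com/"
    GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

    req = urllib.request.Request(URL, headers={"User-Agent": GOOGLEBOT_UA})
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.reason)
        print(resp.read().decode("utf-8", errors="replace"))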
Dave: Let’s form a migration community, let’s solve the problem
I really want to do that. But I’m not able to back up my site Der Schockwellenreiter (for transferring it to launchpoint.net) because the URL
http://DerSchockwellenreiter.editthispage.com/downloadMySite/MySite.root
gives the following error page:
Sorry! There was an error: Can't call the script because the name "manilaSiteHostingSuite" hasn't been defined.
And that is a server error on editthispage.com’s side, because I was able to get a backup from Rollberg News, which is AFAIK on a different UserLand server.
And sure – my site has the same visitor profile (approx. 65 per cent from Google) as ndx.
Search engines are very temporal things. What you’re seeing today and what we’re doing today can be very different things; what you’re seeing there might be what we were doing two weeks ago.
Hmmmm. Well, while I can’t do anything about Seth and Dave’s problems, I can do something about the fried rice recipe problem.
I’ll fix the link (and the other recipe links) so Google can index them, Andrea. Thanks for pointing that out; I thought the link would update automagically with the rest of the site when I moved it to Weblogger.
I’ll bite: what was going on two weeks ago? Why block Google’s very well-behaved robots? Why not offer a blog that tracks which filters are applied when so no one has to guess?
I can’t believe that you’d attack a competitor in public for offering an interpretation of what’s going on based on the facts as he sees them, which I agree with. They represent his view of contemporary reality.
“They can pound them all they want with their sloppy algorithms.”
I run a site that gets a few hundred thousand hits a day and about 10,000 real users, all performing computationally intensive tasks, using only BerkeleyDB-based libs on a dual 850 MHz Pentium III box. This is a lot of brain power and a lot of disk churning. I feed the search engines 100,000 page impressions some days, and am happy to do so.
There are ways to avoid the pounding, including creating static versions of pseudo-static pages that are specifically fed to search engines.
I moved my site off Weblogs.com for a variety of reasons that include some I won’t discuss publicly. But one of them was that I wanted to a) generate more stats about access and usage, b) offer a search within the site option, and c) be exposed all the time to all search engines.
I hate to rage. I just want to get what I pay for, and when I don’t pay for something, it’s really hard to complain about the limitations of the service.
[I’ve revised this post to reflect Seth’s message below.]
Glenn, we don’t offer a paid version of Manila.
Conversant is not based on, built on, or developed along the lines of Manila. It’s an unrelated product that happens to do some of the same things as, was developed at the same time as, and uses the same scripting language as Manila.
I’m not being defensive! :-) I just wanted to make it clear. Our free demo service is similar in some ways to editthispage.com or weblogs.com, but we’re really not the same thing at all.
I’ve revised my post to reflect reality. My apologies for misreading and then writing something that didn’t reflect the truth.
Glenn that’s an interesting story, but I’m sure you’re not offering free hosting to thousands of people. Whatever, I’m glad you graced our servers for a few months (no sarcasm), and wish you the very best.
> Glenn that’s an interesting story, but I’m sure you’re not offering free hosting to thousands of people.
It’s an incredible service you offer, but my point was rather that I built in the necessary scripting, redirections, custom code, hardware, and bandwidth to handle the vast amount of spider traffic. The search engine companies are non-responsive, on the whole, so I could either block them (at times or always), or figure out how to accommodate them.
This is part of why I want Userland to introduce a tiered structure for pay: I would have preferred to give you money and stay and get a higher level of service, including being open to spiders, etc.
The subdomain problem is a huge one. Search engines use IPs, so I’m not sure why they can’t limit access via IP if you’re using the Host directive. But we can’t get them to agree to anything, of course.
> This is part of why I want Userland to introduce a tiered structure for pay: I would have preferred to give you money and stay and get a higher level of service, including being open to spiders, etc.
We’re getting close to making a decision about that. First we want to be sure Radio is a solid migration path for people. I wrote a few months ago (and linked to it yesterday) that we know we’re not going to be able to continue as we have been. We want to do it with the minimum disruption for people, and lots of benefits, and have a chance to make weblogging a profitable business for UserLand. We have some concrete ideas, but aren’t ready yet to talk about them publicly.
> The subdomain problem is a huge one. Search engines use IPs, so I’m not sure why they can’t limit access via IP if you’re using the Host directive. But we can’t get them to agree to anything, of course.
Right on Glenn! I can tell you understand the problem. The spiders don’t know that xxx.weblogs.com and yyy.weblogs.com are the same machine. Of course they could figure it out, but they haven’t done the coding. But after Doc talked with Cindy yesterday and she called me, I think we’re going to get a dialog going. When that happens I’m going to want to be open about it, and I hope you share your thoughts. I didn’t understand that you had so much experience running a server. Let’s tap into that.
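(The coding Dave says the spiders haven’t done is, at least in outline, not much. A rough Python sketch – the hostnames and delay value are made up – of grouping virtual hosts by the IP address they resolve to and applying the politeness delay per machine instead of per hostname:)

    # Group hostnames by resolved IP so xxx.weblogs.com and yyy.weblogs.com
    # count as one machine, then space out requests per IP.
    import socket
    import time
    from collections import defaultdict

    HOSTNAMES = ["xxx.weblogs.com", "yyy.weblogs.com", "zzz.editthispage.com"]
    CRAWL_DELAY = 10  # seconds between hits on the same physical server

    def fetch_front_page(host):
        print("would fetch http://%s/" % host)  # stand-in for the real fetch routine

    by_ip = defaultdict(list)
    for host in HOSTNAMES:
        try:
            by_ip[socket.gethostbyname(host)].append(host)
        except socket.gaierror:
            pass  # hostname doesn't resolve; skip it

    for ip, hosts in by_ip.items():
        for host in hosts:
            fetch_front_page(host)
            time.sleep(CRAWL_DELAY)  # the delay protects the whole IP, not one hostname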
One of my dreams is to have a special URL they can query that returns the list of URLs that have changed on a site since a given date, and only index those pages. This would make JIT-SEs a real possibility for big search engines like Google. Imagine if they indexed a few of the big weblogs and the things they pointed to every night – all of a sudden you could use Google to track news.
It’s a trivial programming task on both ends. We’d be glad to add that feature to Manila, and I bet some other CMSes would follow suit. It would help separate the men from the boys: if your CMS can’t generate such a file, what’s it doing?
Anyway, just thinking out loud here on Andrea’s DG!
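(Nothing like this exists in Manila yet, as Dave says. Purely as a sketch of how small the server side of that “changed since date X” URL could be, with made-up page URLs and dates:)

    # Return the URLs changed on or after a given date, one per line.
    # The table of pages and dates is made up; a real CMS would read it
    # from its content database.
    from datetime import date

    LAST_CHANGED = {
        "http://example.editthispage.com/2001/06/25": date(2001, 6, 25),
        "http://example.editthispage.com/2001/06/28": date(2001, 6, 28),
        "http://example.editthispage.com/friedRice":  date(2001, 5, 14),
    }

    def changes_since(since):
        return "\n".join(url for url, changed in sorted(LAST_CHANGED.items())
                         if changed >= since)

    # A crawler asking for everything changed since June 27, 2001:
    print(changes_since(date(2001, 6, 27)))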
Great idea. This is sort of how robots.txt started, as I understand it: some engineer(s) at Webcrawler wrote the spec. Posted it. Other search engine engineers followed it. Pointed to it. No committees, no agreements, no W3C, no payments. A perfect illustration of the Net.
So – if you get Google to go along with it and they post a permanent link explaining the XML DTD to link to, or the – hey! – SOAP document that describes the query…hmm…. Could solve several million problems at once.
This would also solve part of the dynamic link problem: many sites generate most or all content through links that look like they’re feeding arguments to a script.
Yikes – this is like a way to supersize RSS (sorta).
(Andrea: wild parrots (“Wilde Papageien”) are in Chicago, too. I’ve seen the nests. Very odd. Also in New York City.)
>It’s a trivial programming task on both ends. We’d be glad to add that
> feature to Manila, and I bet some other CMSes would follow suit.
This is the reason I joined the CMS-vendor list about 5 minutes after you started it.
Seth
sorry for joining so late. as the creator of the disturbing search requests site and webmaster of a frontier server i developed an obsession with search engines and their algorithms.
i actually believe that webloggers are the worker bees for google, because the page-ranking algorithm is based on posted links. according to google’s algorithm every link counts as a vote for a page. webloggers post loads of links and so they help google create a good index.
i noticed in my log files that many crawlers first ask for the age of pages before they start to index. this is done with a HEAD instead of a GET request. if you want to minimize the traffic from crawlers you should make sure that the server returns information about the age of pages. (see the http rfc for details)
the idea of a page with all pages and their last updates goes one step further. this will even minimize these HEAD requests.
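(As an illustration of Kris’s point, not anybody’s actual server code: the essential behaviour is to send a Last-Modified header, answer HEAD requests without building the page, and return 304 Not Modified when the crawler’s copy is still current. A rough Python sketch, assuming the CMS can supply each page’s last-modified time as a UTC datetime:)

    # Sketch of age-aware responses. `last_modified` must be a timezone-aware
    # UTC datetime; `render_page` is a stand-in for the CMS's page builder.
    from email.utils import formatdate, parsedate_to_datetime

    def respond(method, request_headers, last_modified, render_page):
        headers = {"Last-Modified": formatdate(last_modified.timestamp(), usegmt=True)}
        if method == "HEAD":
            return 200, headers, b""          # age information only, no body
        ims = request_headers.get("If-Modified-Since")
        if ims and parsedate_to_datetime(ims) >= last_modified:
            return 304, headers, b""          # crawler's copy is still current
        return 200, headers, render_page()    # full page for everyone else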
p.s.: speaking of separating the men from the boys, what is a cms that returns a 200 instead of a 404?
Kris, re your postscript.
It was probably a mistake to return a 200 on a FNF in mainresponder, but it would probably not be a good idea to change it now.
At the time the rationale was that most of the browsers (MSIE) did something really awful with error messages, and I was confused.
I’m working on a new responder now, and it certainly returns a 404 when the file is not found.
Hindsight, as they say, is 20/20.
Dave
I can’t find it anywhere, but I have a recollection of one of the high techies of InfoSeek (maybe Steve Kirsch, maybe not) posting a document way back in 96/97 proposing that sites keep a chrono-by-changedate listing of documents on their site. The idea being that a spider would read down the file until it got to a changedate that was older than its last-visit-date, and go on to read all those newer files.
I think this was a really simple format: url and changedate on one line, maybe just a space between them.
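(From that description the file would have looked something like the lines below; the URLs and the date format here are purely illustrative, not the actual spec:)

    http://example.editthispage.com/stories/fooBar 2001-06-25
    http://example.editthispage.com/2001/06/28 2001-06-28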
Hmmm, I think I’ll email Kirsch and ask…
Aha, found the reference in my 1997 Bookmarks file (thanks Netscape).
The proposed filename was “sitelist.txt”. Here’s the new location for the spec: http://www.inktomi.com/products/search/support/docs/sitelist.html.
It appears that the InfoSeek Ultraseek server spider (now owned by Inktomi) actually does this already. http://www.inktomi.com/products/search/support/docs/faqs/faq092.htm. Thanks Python!
Dear Andrea,
I found you at http://www.google.com/search?q=andrea.editthispage.com so you’re still there. However, strange things happen with Google, and I don’t use it 100% any longer. In fact, we use mostly Alta Vista and Thunderstone. Our site http://www.geocities.com/cmalerts/ is not on Google.
Al