Ready for your head to explode? Vanessa Fox let me know this session will likely make me want to cry because it’s all Q&A, except for a few super technical slides from Todd Nemet. So, if you need me, I’ll basically just be sobbing under my chair. Maybe bring me a blanket or something. A cookie would work too.
Hopping right into the madness, Todd is up with his super technical deck.
He says one thing he’s learned about titles at SMX is that they don’t really matter. So he renamed his. But I didn’t write it down. Go ahead, cry about it. I already am.
Todd says we’re not going to look at the Web site or any of that stuff. Instead we’ll look at things like
- Web access logs
- HTTP response codes
- HTTP response headers
- And talk to the network admin/developers
Questions for IT/Developers
- Is your load balancing round robin?
- How do you monitor your site? [Because it's rude to ask "DO you monitor your site?"]
- Are there any reverse proxies or CDNs in your configuration?
- Do you do any URL rewriting?
- May I have a sample of your web access log files?
From there he’ll:
- Check load balancing
- Check server latency. Do 10 real quick grabs of the home page and time it.
- Check network latency: They can see a slow network, packet loss. You’ll want to talk to a network engineer.
- Check for duplicate sites to see how many have DNS records
- He’ll do a port scan to see all the ports that are open. He’ll find duplicate sites that are running on high ports. Those are potential duplicate sites. Don’t keep your FTP port open, this is 2011.
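The "10 real quick grabs" latency check can be roughed out like this. It's a minimal sketch, not Todd's actual tooling: the function names are made up for illustration, and the `fetcher` argument exists only so the timing loop can be exercised without touching the network.

```python
import statistics
import time
import urllib.request


def fetch(url):
    """Download the page once and throw the body away."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def time_fetches(url, attempts=10, fetcher=fetch):
    """Return the elapsed seconds for each of `attempts` sequential fetches."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        fetcher(url)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to the fastest, median, and slowest grab."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Calling `summarize(time_fetches("https://www.example.com/"))` (a placeholder URL) would time ten real requests. A big gap between the fastest and slowest grab is the kind of thing worth raising with whoever runs the servers.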
Web access log analysis
You have browsers that go to a Web server and a bot that goes to a Web server. Every time something is accessed, an entry gets written into a log file. He wants to get all the log files from the server. What kind of data is in there?
- IP address and user-agent: Who’s doing the crawling?
- Date: How often are we being crawled?
- Referrer: What’s the referring inbound link?
- The URL being requested
- What are the http response codes?
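To make those fields concrete, here is a rough sketch of pulling them out of a single log line. It assumes the Apache/NGINX "combined" log format; IIS logs are laid out differently and would need a different pattern. The sample line is invented for illustration.

```python
import re

# Pattern for the "combined" access log format (an assumption; adjust for your server).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Made-up example line, not taken from any real site.
SAMPLE = (
    '66.249.66.1 - - [08/Jun/2011:10:15:32 -0700] '
    '"GET /products?sort=price HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])  # make the response code numeric
    return entry
```

`parse_line(SAMPLE)` gives back the IP, date, requested URL, response code, referrer, and user-agent as separate keys, which is all you need to start counting who crawls what and how often.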
They’ve created a web log analyzer. Their clients upload web log files and they extract the relevant fields. In the Excel file you get the bot activity, a hierarchical view of what’s happening, query parameters, reverse DNS, HTTP response codes, etc.
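The reverse DNS column matters because anyone can put "Googlebot" in a user-agent string. The sketch below follows the commonly documented check (reverse lookup, confirm the hostname is on a Google domain, then forward lookup back to the same IP). It is not the analyzer described in the talk, it only handles IPv4, and the lookup arguments are there so the logic can be tried with fake resolvers.

```python
import socket


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname):
    """Return True only if reverse and forward DNS both point at Google."""
    try:
        hostname = reverse_lookup(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup must resolve back to the IP we started with.
        return forward_lookup(hostname) == ip
    except OSError:
        return False
```

An IP that claims to be Googlebot in the logs but fails this check is a good candidate for the "bad scraper" pile.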
The rest of his presentation is about examples he found and how they fixed it. I will do my bestest, people.
He shows an example of crawling inefficiency and a site with about a zillion dynamic sitemaps. Naturally, Google was spending all its time crawling the million sitemaps and nothing else. To fix it they changed the way sitemaps were generated. Magic.
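One way to spot that kind of crawl waste is to see what share of bot requests each top-level section of the site is eating. This is a hedged sketch under the assumption that you already have a list of URLs requested by the bot; the function name is made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def crawl_share_by_section(urls):
    """Map each first path segment to its fraction of all requests."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        section = "/" + segments[0] if segments else "/"
        counts[section] += 1
    total = sum(counts.values())
    # most_common() orders sections from most to least crawled
    return {section: count / total for section, count in counts.most_common()}
```

If something like `/sitemaps` shows up with most of the crawl and your product pages barely register, you've found the problem from this example.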
He shows an example of a site getting badly scraped. What they did was monitor it to block all the bad IPs.
Duplicate content problems: Site has 7 versions of their home page indexed. Links are going all over the place and being diffused and you’re wasting your crawl time. The solution was URL rewriting: URLRewrite (IIS7+) and URLRewriter (IIS6+).
Other duplicate content problems come from sorting issues. The solution is to use the canonical tag or redirects so Google ignores certain parameters. Your log files will tell you how bad that problem really is.
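To put a number on "how bad," you can count how many different query strings the bot requested for each path. A minimal sketch, assuming a list of crawled URLs pulled from the logs:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def query_variants_by_path(urls):
    """Count the distinct query strings seen for each path."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        variants[parts.path].add(parts.query)  # "" means no parameters
    return {path: len(queries) for path, queries in variants.items()}
```

A path with dozens of variants is one page being crawled dozens of times under different sort orders, which is where the canonical tag or a redirect earns its keep.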
Poor error handling: Your error pages tend to be the most crawled pages so they’ll look like they’re very important pages, which will bump down other pages.
Look to see if a site is cache friendly.
Look for character encoding. You want to URL encode those characters.
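For the character encoding point, this is roughly what URL encoding a path looks like. It's a sketch that assumes UTF-8 and leaves existing percent-escapes alone so nothing gets encoded twice; the example URL is invented.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url):
    """Percent-encode unsafe characters in the path, keeping '/' and '%'."""
    parts = urlsplit(url)
    encoded_path = quote(parts.path, safe="/%")  # UTF-8 is quote()'s default
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )
```

So a path like `/café/menu item` comes out as `/caf%C3%A9/menu%20item`, and a URL that was already encoded passes through unchanged.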
[Todd then talks a whole lot more about really smart things that are over my head. I'm sure it was all very genius. I don't speak g33k.]
Many more areas to investigate
- Cache control headers
- DNS configuration
- Domain health
And with that, he’s done. Thanks, Todd.
From here, we’re going to head to Q&A.
The client’s home page had an internal server error which was corrected after Google reindexed the page. Now the home page has been taken out of Google’s index. How can we get it back in?
Vanessa thinks maybe it hasn’t been reindexed since…it’s not…there. Google does crawl home pages really frequently so you shouldn’t have to submit your home page to Google. If your home page is not in the index, something is wrong. Her guess is there’s still something wrong with the page. Even crappy sites should get their home pages crawled every day.
Question regarding running multiple sites. Should multiple sites be on different IPs to help with link development? Should domain registrations be private or public?
[everyone's giggling already]
If you’re not a spammer, then the answer would be that there is no reason to hide from the search engines the site that you own. There is no problem with owning multiple sites. If you ARE a spammer, Google will find out that you own those 15,000 domains that are linking to one another. This isn’t something I would worry about at all.
We’re about to go through a redesign – lots of page consolidation – for about 20 sites and moving everything to new domains. On maintaining 301 redirects – my developer said it’s going to overload the server and that we shouldn’t do it.
Todd and Vanessa: No, you’re fine. If you don’t do it, you’ll have a major problem.
Googlebot is not crawling the domain name; it is making requests via IP.
Todd: I see this quite a bit. It will be in the Webmaster Tools as one of the most frequent linking domains. The search engines are smart enough to figure that out. Todd says not to worry about it, but to 301 it to the right thing.
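"301 it to the right thing" boils down to checking the Host header and redirecting anything that isn't your real hostname. Here's a small sketch of that decision. The canonical hostname is a placeholder, the function name is made up, and it only handles plain hostnames and IPv4 addresses.

```python
def canonical_redirect(host_header, path, canonical_host="www.example.com"):
    """Return (301, target_url) for non-canonical hosts, else None."""
    hostname = host_header.split(":")[0].lower()  # drop any :port suffix
    if hostname == canonical_host:
        return None  # already on the right host, serve the page
    return 301, f"http://{canonical_host}{path}"
```

A request that arrives with the bare IP as its host gets sent to the same path on the canonical domain, while a request for the canonical host itself is left alone. In practice this usually lives in the web server's rewrite rules rather than application code.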
Google is crawling every refinement of our content. I want Google to index my tab but I don’t want Google to index every refinement users make on my page. [I may have just murdered that. Pardon me.]
Come to the pagination session tomorrow. Vanessa thinks you can probably use the canonical tag to canonicalize all the different variations into that route.
Michael Gray is in the audience and hops in to say that you can detect server-side when you’re getting those and noindex the page.
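Michael's suggestion could look something like this on the server side. It's only a sketch: the parameter names in `REFINEMENT_PARAMS` are hypothetical stand-ins for whatever your site uses for refinements, and a real setup would wire this into the page template.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical refinement parameters; replace with the ones your site uses.
REFINEMENT_PARAMS = {"sort", "color", "size"}


def robots_meta_for(url):
    """Return a noindex meta tag when the URL carries a refinement parameter."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if params.keys() & REFINEMENT_PARAMS:
        return '<meta name="robots" content="noindex">'
    return ""  # plain page: no tag needed, let it be indexed
```

The unrefined page stays indexable, while every sorted or filtered view tells the engines to leave it out.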
Question on going international: they can only sell specific products if they’re going to Canada. But it’s their regular .com product pages which are ranking really well in the top few positions. He knows he can geotarget, but what are some ideas for how to deal with that and outrank himself?
Vanessa: What happens with internationalization is that Google says it’s fine to have 4 different English language sites (US, UK, Australia, etc) but there’s all these different relevance signals that go into ranking. One of them certainly is the country. Part of that is currency, shipping, TLD, having it on a subdomain, etc. All of those should go toward country relevance. The problem is there are other signals that go into ranking as well. Sometimes those other signals outweigh those location relevance signals.
You can have IP detection that redirects Canadian users to the Canadian version. It’s a tough problem. You want to start getting local authority by going after local links.
Log files – During the Panda Disaster, he looked through the log files. They have a lot of visitors and they have a hosting provider so he had to ask for the log files. Is there an easier request that he can give them so he can just get the information he needs out of them?
Vanessa: I don’t know who the hosting provider is and what they have access to. They wrote a tool to make it work in-house.
Todd: If you can get shell access, then you can filter that and then zip it.
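Todd's "filter that and then zip it" can be sketched in a few lines. This assumes you can run a script where the logs live and that matching on a substring like "Googlebot" is good enough for a first pass; the function name and file paths are made up.

```python
import gzip


def filter_and_compress(source_path, dest_path, needle="Googlebot"):
    """Copy only lines containing `needle` into a gzip file; return the count."""
    kept = 0
    with open(source_path, "r", encoding="utf-8", errors="replace") as source:
        with gzip.open(dest_path, "wt", encoding="utf-8") as dest:
            for line in source:  # stream line by line, so big logs are fine
                if needle in line:
                    dest.write(line)
                    kept += 1
    return kept
```

Something like `filter_and_compress("access.log", "googlebot.log.gz")` leaves you with a much smaller file to download, which is also an easier thing to ask a hosting provider for than the raw logs.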
Sears ranks #1 for [dishwasher] but the page that’s ranking for the term is completely random. Why?
Vanessa: It’s possible the page that was ranking is having a technical problem and is no longer indexed. It’s possible your autograph page (the one ranking) went viral and got a bunch of links to it.
After some investigation, Vanessa finds that Google’s in an infinite loop because the site is pointing at a short URL. When someone has a cookie they can load that URL, but when you arrive for the first time, it appends session information. So Googlebot is never getting a 200 on that short URL.
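Here's a rough way to picture what Googlebot runs into. The sketch follows redirects the way a cookieless client would and reports whether it ever lands on a 200. The `fetch` argument is a stand-in that returns a `(status, location)` pair, so the URLs and responses you feed it are simulated, not the real site's behavior.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(start_url, fetch, max_hops=10):
    """Follow redirects and return (urls_visited, outcome)."""
    seen = []
    url = start_url
    while len(seen) < max_hops:
        if url in seen:
            return seen + [url], "loop"  # back where we have already been
        seen.append(url)
        status, location = fetch(url)
        if status == 200:
            return seen, "ok"
        if status not in REDIRECT_CODES or not location:
            return seen, f"stopped at {status}"
        url = location
    return seen, "too many hops"
```

If the short URL redirects to a session-stamped URL and that one redirects straight back, the trace comes out as a loop and no 200 is ever seen, which matches what Vanessa describes.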
Wow. Okay. That was a lot. We’ll be back to finish out Day 2 in just a bit. :)