Welcome back, my friends. It’s time for a session on technical search engine optimization, which means if you hear whining from where you are, don’t worry, that’s just me trying to keep up and understand what they’re saying. Sigh. Why do I do this to myself? I have no idea.
Up on stage we have Vanessa Fox moderating friends Greg Boser, Jonathan Hochman, Todd Nemet, and Brian Ussery. Hello, all. Oh, it seems Vanessa is also presenting. And she’s deciding she’s going first. She brings up her slides and notices she’s spelled her own name wrong on her Twitter account. Technology, 1; Vanessa, 0.
The Importance of Technical SEO: A Case Study
We’re going back to a site Vanessa analyzed at SMX East back in October. There’s lots of code showing Googlebot unsuccessfully trying to fetch sitemaps. A large percentage of the site’s crawl allotment was spent trying to crawl sitemaps, so there wasn’t much time left over to crawl the actual URLs. There was no way to find this out other than through the server logs. Because the site has a dynamic setup, every one of its sitemaps changed all the time as new listings were added, causing the bots to keep re-fetching them. To fix this they built static sitemaps for everything that stayed the same.
A new site analysis was done yesterday and found that Google can now fetch the actual listings on the site. Yay. Progress. In October 2010, Google spent 27 percent of its time crawling sitemaps. In March 2011, it spent just 0.05 percent. Their traffic has doubled. This is why SEOs and Vanessa are awesome.
Next up is Greg. He heads up organic strategy at BlueGlass. This is the first time I’ve ever seen Greg Boser deliver a PowerPoint presentation. I can’t even process this. He swears he’s done it before. I’m not buying it.
Technical SEO: Understanding CPR
Core Concepts You Need to Understand
- Proper Prioritization is the key to success – every site has tons of things that could/should be fixed. Know where to start.
- Proper top-level analysis is critical – in order to properly prioritize, you need to have a thorough understanding of the “big picture” before you start.
- Google is no longer “page” focused – The days of Google determining what will or won’t rank primarily based on page-level analysis are gone. Overall content performance is the key. Google is looking at your site as a whole, so YOU need to start looking at your site that way, too.
Content Performance Ratios (CPR)
Taking the time to understand how your content is performing will help you determine where to start.
Questions you need to answer:
- What is the ratio between total pages indexed and the total number of pages generating current organic traffic?
- How do those numbers break down based on landing page type and content topic?
If Google was his engine, Greg would think less of a site if they continually fed him a high percentage of garbage they’d never show. We have to assume Google is doing the same thing.
Breaking Down Your CPR
- Total URLs indexed: 537,000
- Total visits: 215,273
- Total URLs generating traffic: 17,445
- CPR = 3 percent
Google has decided to actually show somebody 3 percent of the total pages that have been indexed on the site. That is terrible and is a very bad thing. If you look at some of the footprints of the sites that got hit in the Panda update they probably fit this characteristic.
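Greg’s CPR figure can be reproduced from the numbers on his slide. A quick sketch of the arithmetic:

```python
# Content Performance Ratio: the share of indexed URLs that actually
# generate organic traffic. Figures are from Greg's example slide.
indexed_urls = 537_000   # total URLs Google has indexed
traffic_urls = 17_445    # URLs that received at least one organic visit

cpr = traffic_urls / indexed_urls * 100
print(f"CPR = {cpr:.1f} percent")  # roughly 3 percent
```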
Break it down further
- Browse URLs: 16,676
- Browse Visits: 67,884
- Category URLs: 109
- Category Visits: 6,754
- Product Detail URLs: 121
- Product Detail Visits: 90k
- Pagination URLs: 490
- Pagination Visits: 2,421
These are the kinds of things you need to know before you start making technical tweaks so you don’t make things worse.
Follow up with linking metrics:
- Total URLs indexed: 537,000
- Total Visits: 215,273
- Total URLs generating traffic: 17,445
- Total URLs with External Links: 3,365
Those numbers show a pretty poor distribution and they’re going to tie into the other numbers.
What the Data Tells Me
- Definite duplicate content issue
- We’re irritating Google – making them work too hard to find the content
- Poor architecture focus – not enough torso or head traffic
- Possible poor external link support for torso/head
This data maps out his plan for how he’s going to use canonical or noindex/nofollow to sculpt the architecture and make it clear to Google where the pages are that are most important. He can also see there’s going to be an issue with links at the top category level. And he can get someone working on that while they’re figuring out the best way to do others. When you walk through that process it will pinpoint where you need to go.
Where We’re Going to Focus First
- Trim down the total indexed content – shooting for an initial goal of a 30 percent CPR
- Exploring external backlink structure
- Further analyze site structure to determine effectiveness on top-level category support
- Build an initial action plan based on those three items
- Deal with page-level items after this work is done
Use the information, hammer out an action plan and don’t move on to the page-level stuff until a plan is in place. If you take the time to do that, things will jump out at you that are painfully obvious but that you’d never see when you just go through the site thinking up ideas. Make it about analytics.
Vanessa congratulates herself on getting Greg to make PowerPoint slides. Greg tries to redeem himself, saying it wasn’t him who made them; they have a graphic designer for that stuff. Hee! :)
Next up is Todd.
Evaluating Technical Architecture
What can we evaluate about the technical architecture? You can analyze your network, looking at the GSLB (global server load balancing), local load balancing, DNS, etc. There are also Web access logs, HTTP headers, the domain registrar, etc.
Network analysts are very confident, so you have to ask them questions.
Interviewing a network engineer
- Is your load balancing round robin?
- How does the server do health checks? Are there any reverse proxies in your configuration?
- Do you do any URL rewriting?
- May I have a sample of your web access log files?
We can ask those questions, but we can also check ourselves. We don’t have to wait for him (or her!) to do it.
He looks at server latency. Do 10 quick grabs of the home page and time them.
Isolate the network latency. You can spot a slow network or packet loss. If you find either, you’ll want to talk to a network engineer.
[He's showing all the codes on how to do this but, um, yeah, I'm not a robot, people.]
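For those of us who didn’t catch the commands on the slides, here’s a rough sketch of the kind of latency check Todd describes — fetch the home page ten times and time each grab (the URL is a placeholder):

```python
import time
import urllib.request

def time_fetches(url, n=10):
    """Fetch a page n times and return each round-trip time in seconds."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # pull the whole body so the timing is end-to-end
        timings.append(time.perf_counter() - start)
    return timings

# Example (swap in your own home page):
# for t in time_fetches("https://example.com/"):
#     print(f"{t:.3f}s")
```

High variance across the ten grabs is the kind of thing worth taking to your network engineer.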
Check for duplicate sites and shut them down to clean things up.
Go to Robtex.com – they mine DNS information.
Web Access Logs Analysis
If a browser goes to a Web site, that gets written in a log file. He shows a typical log file. Yup, looks like keyboard mashing to me.
- Figure out when and how often you’re being crawled
- Referers: What links are bringing actual visitors
- URL Path: Where is this crawler spending its time
- HTTP Response: Am I redirecting correctly? Errors?
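The bullets above boil down to a few counters over the access log. A minimal sketch, assuming the common combined log format (the sample lines are made up for illustration):

```python
import re
from collections import Counter

# A few combined-format access log lines (invented for illustration).
SAMPLE_LOGS = """\
66.249.66.1 - - [22/Mar/2011:10:00:01 +0000] "GET /products/widget HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [22/Mar/2011:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.0.0.5 - - [22/Mar/2011:10:00:03 +0000] "GET /products/widget HTTP/1.1" 200 5123 "http://example.com/" "Mozilla/5.0"
"""

# Pull the request path and the HTTP status out of each line.
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')

bot_paths = Counter()
statuses = Counter()
for line in SAMPLE_LOGS.splitlines():
    m = LINE_RE.search(line)
    if not m:
        continue
    statuses[m.group("status")] += 1
    if "Googlebot" in line:
        bot_paths[m.group("path")] += 1

print(bot_paths)  # where the crawler is spending its time
print(statuses)   # response codes: redirects, errors, etc.
```

In practice you’d read the real log file line by line instead of the inline sample, and verify claimed Googlebot hits with reverse DNS.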
Nine By Blue Web Log Parser
- Bot activity
- Hierarchical View
- Query Parameters
- Reverse DNS
- HTTP Response codes
He shows a lot of screen shots of actual logs…but again…real person, not robot. This is where a ticket to SMX actually comes in handy.
Many more areas to investigate
- Cache control headers
- Domain health
- Page level analysis
Next up is Brian.
Types of SEO
Brian says that framing isn’t true, because the other two types don’t exist if you don’t have the architectural part down. Take THAT!
Path to Indexing
- URL discovered via links/sitemap
- Time allocated for crawling URL
- Accessible unique content
- Don’t block the bots and remove obstacles
Hosting can actually have a big impact. Most hosts don’t know much about search engine optimization.
Hosting Obstacles: 403 Errors
- He says Google Webmaster Tools is a great place to go for information
Hosting Obstacles: Robots.txt
- The host can actually block access to your Web site.
- User-Agent Switchers don’t switch IPs
Crawl efficiency is very important to search engines and to you. The more efficient your page is, the less time it takes to crawl and the more of your pages get crawled. You can find this information in the Crawl Stats section of Webmaster Tools.
- DNS: This is the time taken for the DNS lookup of the hostname
- Connect: This is the first phase of the http GET request when the TCP/IP connection is setup by the remote server
- First byte: This is the time from when the last byte of the http GET request is sent until the first byte of the response header is received
- Total: The time from when the http GET request is started until the last byte of data is received
- Server efficiency: Compress files and be sure your server supports If-Modified-Since
- Response time: Be sure your server responds quickly
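The four phases above can be measured yourself rather than waiting on Webmaster Tools. A rough sketch at the socket level, for a plain-HTTP fetch (hostname and path are placeholders):

```python
import socket
import time

def fetch_timings(host, path="/", port=80):
    """Rough breakdown of an HTTP GET into the phases described above:
    DNS lookup, TCP connect, time to first byte, and total."""
    t0 = time.perf_counter()
    ip = socket.gethostbyname(host)              # DNS lookup
    t_dns = time.perf_counter()

    sock = socket.create_connection((ip, port), timeout=10)
    t_connect = time.perf_counter()              # TCP connection established

    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode())
    sock.recv(1)                                 # first byte of the response
    t_first_byte = time.perf_counter()

    while sock.recv(4096):                       # drain the rest of the response
        pass
    t_total = time.perf_counter()
    sock.close()

    return {
        "dns": t_dns - t0,
        "connect": t_connect - t_dns,
        "first_byte": t_first_byte - t_connect,
        "total": t_total - t0,
    }

# Example:
# print(fetch_timings("example.com"))
```

If "first_byte" dominates, the server is slow to generate the page; if "dns" or "connect" dominates, the problem is upstream of your application.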
How? Netcraft publishes detailed monthly hosting reports with complete ratings. He calls it a very helpful resource when looking for a host.
Unique Content vs Duplicate Content
He uses the Google store as an example
If you go to http://googlestore.com it redirects you to the .aspx page instead of the root. If you look at the Google Store in a text browser you’ll find two links – one has the pound sign. The Google Store is actually two different Web sites. The US site isn’t great; the UK site [http://google-store.com/], however, is a mess. He shows a canonical issue between different versions of the same site that Google doesn’t seem to have figured out yet.
Front End Speed
80 percent of load time comes from the front end. He shows a waterfall chart. There are a lot of great tools you can use. Google measures page speed and site performance differently: page speed is the amount of time it takes the page to load, while site performance is the time it takes for the page to load plus redirects. Forty percent of people will leave a site for good if it takes more than two minutes to load. Dude, who’s sticking around for two minutes? I’m not.
Technical SEO Checklist
- Host access
- Host crawl efficiency
- Provide a clear path
- Unique content
- Use Google Webmaster Tools
Next up is Jonathan. He says his presentation is going to be dirty because there’s lots of details and code. I’m basically never blogging a technical session ever again. I’ll just nap instead.
Why Details Matter
Staging Server Mischief
Be careful when copying files from staging to live. In one case, the contents of a staging robots.txt file ended up on the live site and stayed there for five years. Oops! Never put a robots.txt file on your staging server because it may go live by accident. A better way to protect a staging server is to require a password via .htpasswd on Apache.
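A sketch of the Apache approach Jonathan mentions — the file path and realm name are placeholders for your own setup:

```apache
# .htaccess on the staging host: password-protect everything so neither
# visitors nor crawlers can reach the staging content, with no robots.txt
# file that could accidentally go live.
AuthType Basic
AuthName "Staging"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Crawlers get a 401 instead of the content, and there’s nothing to accidentally copy to production.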
Duplicate content in CMS and ecommerce systems
- Some osCommerce configs have funky session IDs in the URL parameters. You can download a module that fixes them.
- For WP, the All in one SEO pack
Running out of crawl time
- If your site has millions of pages, code optimization should be a high priority strategy to get more pages indexed. Good code is often five times shorter than average code.
- Watch out for infinite URL spaces, such as calendars.
- If all your pages are indexed, this tactic might improve the frequency of indexing
- Submit sitemap.xml to Google/Bing to get accurate feedback on how many of your pages are indexed
- Under 500 pages: see XML-Sitemaps.com
- 500 or more pages: download GSiteCrawler
- A sitemap won’t help indexed pages rank better. However, it may help pages that aren’t indexed or help identify duplicate content.
- Inspect sitemaps.xml top to bottom and make sure each page is listed and unique
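Inspecting a sitemap top to bottom for uniqueness is easy to automate. A minimal sketch, using a small inline sitemap for illustration (in practice you’d read your real file):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# A tiny example sitemap with a deliberate duplicate entry.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

root = ET.fromstring(SITEMAP_XML)
locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc")]
dupes = {u for u in locs if locs.count(u) > 1}

print(f"{len(locs)} URLs, {len(set(locs))} unique")
if dupes:
    print("Duplicates:", sorted(dupes))
```

Any URL listed more than once is a candidate duplicate-content problem worth chasing down.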
Unique Titles and Meta Descriptions
- The same metadata on multiple pages is a sign of low quality, generates less clickable search listings and makes pages more likely to be considered duplicates.
- Best to have some code that provides acceptable titles and descriptions by default.
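The "acceptable defaults" idea can be sketched as a simple fallback in your templating code. This is an illustration, not Jonathan's actual code; the site name and field names are made up:

```python
def page_meta(page):
    """Return (title, description), falling back to generated defaults
    so no two pages ship identical boilerplate metadata."""
    title = page.get("title") or f"{page['name']} | Example Store"
    description = page.get("description") or (
        f"Details, photos and pricing for {page['name']}."
    )
    return title, description

# A page with hand-written metadata keeps it; one without gets a
# unique-enough generated default.
print(page_meta({"name": "Blue Widget", "title": "Buy Blue Widgets Online"}))
print(page_meta({"name": "Red Widget"}))
```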
Spelling and Typos
Why does my listing in Google have a spelling error? Why don’t I rank? It has happened numerous times that pages didn’t rank because of typos in critical places such as title tags and anchor text.
Use Xenu Link Sleuth whenever Google Webmaster Tools reports broken internal links, and whenever you do a major overhaul. Dead links are bad for user experience and a waste of link juice.
Hacking & Malware
If your site gets hacked, traffic will tank. Malware scanning is weak; the best scans only detect 30 percent of threats. Real security requires regular software upgrades, file integrity monitoring, version control and strong access controls. The top reason for hacks is failure to patch the CMS. Jonathan mentions how the IMCharityParty Web site was hacked, which ruined their marketing efforts for the event that took place last night. They saw far fewer attendees and donations than normal because of the warning that deterred people from going to the site.
Code Validation
- People love to argue about whether code validation is worth the trouble.
- Validation increases the chance of cross platform/browser compatibility. Not a magic SEO strategy. Don’t expect rankings to instantly improve, they won’t.
- Validation helps you check for errors automatically. It is easier to clear all errors and warnings than to pick and choose.
- Search engines can parse messed-up code, but sometimes bad code confuses spiders.
- If you look in Google Webmaster tools and see a code snippet appearing in the most common keywords, that may be a symptom of missing or malformed HTML tags.
- Happy visitors generate referrals, tweets, bookmarks and links; unhappy visitors don’t. Happy visitors are more likely to trust you and convert.
- What tends to make happy visitors? Sites that load correctly, quickly and smoothly on any browser, any computer and any mobile device. It doesn’t matter if you have the perfect keywords when your site is slow or won’t render.
- Some people like to print Web pages. Do you have a print media stylesheet? For larger ticket items or B2Bs, printing may be important.
- Do you still have those obnoxious messages chastising your visitors if they have the wrong browser?
The Big Picture
- Technical SEO won’t magically lift your rankings, but correcting errors may help.
- Don’t think about technical SEO only in terms of ranking signals. User behavior is a ranking signal; when users react favorably to a Web site, search engines eventually notice.
And that’s it. Technical SEO hurts my brain. We’ll see you after lunch, kids.