Solving the Big URL Issues

With the keynote and my MiFi problems out of the way, it’s time to dive into the first session of SMX East Day 2.  Talking to us about big URL issues today are John Carcutt, Brian Cosgrove, Stoney deGeyter, and Joe Rozsa.  Hopefully they won’t fry my brain too badly with all the technical stuff.  I promise nothing.  Vanessa Fox is moderating.  We compared socks earlier to see whose were cuter.  I’m going with hers this time. But only ’cause they’re foreign.

Up first is Joe Rozsa.  His company is called KaLor and it’s named after his two daughters. How cute is that?

Joe starts off telling people to be consistent in their site architecture.

  • Add/Remove Trailing Slashes?
  • Enforce lower case URLs
  • Redirect to https://
  • Dub-dub-dub or not?

Add/Remove Trailing Slashes?

Is your site a directory-based site? If so, keep the slashes. If not, drop them; inbound links are more likely to come without.  Be sure to 301 redirect the alternate URL to the format you choose going forward, and use that same format in your internal linking structure and your sitemap.xml file.  Your home page should always be slash-free – http://domain.com
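A minimal Python sketch of that normalization rule (the helper name and the `keep_slash` flag are my own, not from the session):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_trailing_slash(url, keep_slash=False):
    """Pick one trailing-slash convention and map every URL to it.
    keep_slash=True suits directory-based sites; the home page is
    always left slash-free."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if path in ("", "/"):
        path = ""  # home page: http://domain.com, no slash
    elif keep_slash:
        path = path if path.endswith("/") else path + "/"
    else:
        path = path.rstrip("/")
    return urlunsplit((scheme, netloc, path, query, fragment))
```

The non-canonical form would then be 301-redirected to whatever this function returns.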

Lowercase URLs

It matters. URLs are case sensitive in the eyes of the search engines, so you want to make sure they’re consistent when you’re doing link building.  Same with internal linking: refer to your URLs consistently.
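A tiny Python illustration of the idea (my own sketch, not Joe’s): pick lowercase as the convention and map everything to it before you link.

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_url(url):
    """Lowercase the scheme, host, and path. Hostnames are
    case-insensitive anyway; paths are not, so pick one convention
    (lowercase) and 301 any other casing to it."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme.lower(), netloc.lower(), path.lower(), query, fragment))
```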

Serving Secure Pages

If you are serving https pages, you do not want to return a 403 status code when someone comes in via http. PayPal uses 301 redirects to take visitors from http://paypal.com to https://www.paypal.com.
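A rough sketch of that PayPal-style behavior (status codes modeled as plain tuples, canonical host passed in as an assumption):

```python
from urllib.parse import urlsplit, urlunsplit

def https_redirect(url, canonical_host):
    """Instead of a 403 on http, answer with a 301 pointing at the
    one canonical https host. Returns (status, Location) where
    Location is None when the request is already canonical."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if scheme == "https" and netloc == canonical_host:
        return (200, None)  # already canonical, serve the page
    target = urlunsplit(("https", canonical_host, path, query, fragment))
    return (301, target)
```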

Canonical Issues

WWW is a subdomain and some people feel it’s outdated. Should you use it? It’s your call, but handle the alternative: choose one form and 301 redirect the other to it.   He shares a code snippet for how to do that but…I’m only one woman, people. I can’t liveblog code.

Common Ingredients for Best URLs

  • Short and sweet and easy to read
  • Touch of static would be awesome
  • Don’t/have/a/hundred/folder/deep/navigation/path  [I see what you did there]
  • No sub (domain) it out either.
  • A pinch of keywords and a-dash-of-separation

Joe says that eCommerce SEO can be ugly but calls Magento the best eCommerce platform for SEO out of the box.   He calls out a number of ecommerce sites that have good URL structure.   It’s basically all completely over my head because I’m not a real SEO. I just play one on the Internet.  Ooo, he shows a product page on Papyrus and says it’s a great example.  I mean, who doesn’t love Papyrus?    You want to get your category/sub category/product in your URL if you can. That’s going to set you up really well.  It’s great for SEO and users.  Home Depot has bad URLs because their pages are duplicated depending on how you search.  Also, because Home Depot is boring.

How do breadcrumbs impact your search results? He shows how StubHub does it. The search engines show the breadcrumbs in the SERPs instead of the URL. That’s helpful for users.

[Vanessa asks how many people know about the canonical tag.  Lots of people raise their hand. She says if you don't know how to use it, there's lots of stuff you can read. I recommend this post from the Google Webmaster Central blog on specifying your canonical.]

Next up is John Carcutt.  He took the session title literally so he’s going to talk about BIG URLs. As in, really long ones.

John says that Google has the longest URL, but you should check out Darren Slatten’s post called What Is The Maximum Character Length For A URL That Google Will Index?

How many characters can you use in a URL?

HTTP Protocol: Does not set a limit on the number of characters in a URL

Browser:

  • Internet Explorer – supports up to 2,083 characters
  • Firefox – successfully tested up to 100,000 characters
  • Safari – successfully tested up to 90,000 characters
  • Opera – successfully tested up to 190,000 characters

The HTML 3 specification said URLs could not have more than 1,024 characters, but HTML 4 removed the limit.  The Sitemap protocol caps URLs at 2,048 characters. To him, that’s an indication of Google trying to establish a limit.

So what’s the answer? John says that 2,047 characters is all that can be used successfully across all browsers, servers, and protocols.  Realistically, URLs should be as short as possible. If your URLs are over 200 characters, start looking for solutions to fix this.

What Makes a URL Long?

SEOs: We make URLs long by seeing how many keywords can fit in a single URL. Does every category level need to be included? You should be able to look at a URL and get a clear idea of what you’re going to find on that page.  But don’t take it too far.

Technology: Dynamic pages need to pull data from the server based on a set of specific instructions. As the amount of information that needs to be passed grows, so does the length of your URLs.

Find alternative methods to pass data when possible. Parameters in the URL are not the only way to get data to a server.

  • Cookies can provide a wide array of data to help generate dynamic pages.
  • Global.ASA or Global.ASAX files
  • Hardcoding
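For the cookie option, a hedged sketch using Python’s standard-library SimpleCookie (the preference names are hypothetical): display preferences travel in a cookie instead of bloating the query string.

```python
from http.cookies import SimpleCookie

def prefs_to_cookie(prefs):
    """Instead of .../results?sort=price&view=grid&page=2 growing the
    URL, stash display preferences in a Set-Cookie header and keep
    the URL short."""
    cookie = SimpleCookie()
    for name, value in prefs.items():
        cookie[name] = value
    return cookie.output(header="Set-Cookie:")
```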

Drill-Down Navigation

The deeper the drill, the longer the URL. The number of drill-down options is also the number of parameters you can add to the end of the URL with this type of navigation.

The path of the users builds the URL: Users will use a variety of paths and end up at the same selection of parameters. This can create multiple paths of navigation to the same content.

Search engines follow every link and will index every version of the URLs that they find. Many of the URLs are just an alternate path to the same content. The “nofollow fix” that some recommend can work but takes continual maintenance if the navigation is updated regularly and can be complicated.  Also, who knows how well nofollow is really working anymore. Don’t use band aids.

Get rid of duplicate categories.  Add only the necessary parameters.

Dealing with the multiple paths of navigation issue: Consistency of parameter placement in the URL will solve the potential duplicate content issue associated with users’ pathing behavior. This applies to mod_rewrite URLs, too.
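A sketch of what that consistency could look like in code: canonicalize the parameter order (alphabetical here, but any fixed order works) so every navigation path produces the same URL for the same content. The function name is my own.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_query(url):
    """Emit query parameters in one fixed (alphabetical) order so
    different drill-down paths collapse to a single URL."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    params = sorted(parse_qsl(query))
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))
```

Two users arriving via different drill-down orders now get byte-identical URLs, so the engines see one page instead of several.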

Large Scale Content Sites

When a site publishes 60-70 or more pages of content every day, there are unique challenges to making sure that content is indexed due to the “flow rate” of content through the site. Internal paths to content may change based on content being pushed to deeper pages. Older content may be archived or placed in different sections of a site. Categorization of content may also complicate the issue if the system allows content to be filed under multiple categories. Many of these sites have multiple departments publishing content and use internal tracking parameters to monitor cross-channel support.

[Vanessa Fox chimes in again and reminds people that you want your parameters to appear in the same order. This helps to train the search engines.]

Next up is Stoney.

When you have duplicate URLs, it creates too many pages in the index for the engines to spider through on your site. That on its own isn’t bad, but once they start indexing duplicate pages, they get a little wary about your site and want to move on to something else.  He shows how some pages can get indexed twice, while other pages won’t get indexed at all.  Duplicate content slows down the spider on your site and may mean some of your important pages get left out. It also splits the link juice and can leave some of your pages effectively inaccessible.

What are the causes of duplicate content & problem URLs?

  • Redundant URLs: The same page accessible through 3-5 different URLs – example.com, www.example.com, www.example.com/.
  • Secure Pages: http vs. https.  He sees this quite a bit: you can go to http OR https and pull up the exact same content.
  • Unfriendly Links: If the link itself can’t be followed, it creates a problem – the page is blocked from the search engines.  If you’re using JavaScript for the URL, that can be a problem. You want to avoid that.
  • Session IDs: You end up creating a duplicate page farm through session IDs, because every page gets multiple unique IDs appended to its URL.

The Solutions!

  • Search-Engine Friendly Links: You want to have a properly written HTML link. Avoid JavaScript links or anything that is out of the ordinary.
  • Link Consistently: However you decide to link to your site, keep it consistent in all of your internal links. If you’re going to use the slash, use the slash.  He encourages you to use Xenu to check your links.
  • Secure Shopping Path: When you’re in your checkout system, any links that you provide going back to your products or links should be hard coded links using http not https so you don’t have any duplicate content.   Block the search engines from getting into the shopping cart. They don’t need to be there.
  • Canonical URLs: Link only to the canonical page.
  • Redirect Old Links: If a page is gone, make sure you redirect the page (and all link value) to the new page. This is important.
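A bare-bones illustration of that last point (the paths are made up): keep a redirect map so retired URLs 301 to their replacements instead of 404ing, and the link value follows.

```python
# Hypothetical redirect map: old, retired paths point at the pages
# that replaced them.
REDIRECTS = {
    "/old-product": "/products/new-product",
    "/2008/sale": "/products/new-product",
}

def resolve(path):
    """Return (status, path): 301 to the new home for retired URLs,
    200 with the path untouched otherwise."""
    target = REDIRECTS.get(path)
    if target:
        return (301, target)
    return (200, path)
```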

Next up is Brian.  He says he’s going to get into the more technical side of things and I basically groan loud enough so he can hear me.  Technical is hard to liveblog. It would be like liveblogging something in Chinese. I understand both equally well.

Despite my whining, Brian starts.

  • Redirecting is when the Web server tells your browser to go somewhere else.
  • Rewriting is when the server internally serves a different piece of content for the requested URL, without telling the browser.

The traditional way of handling bad URLs is to fix all of the issues one at a time.   But that’s a long process and can create a lot of SEO headaches if you release them intermittently.  When your URLs are indexed nicely, you get a lot of benefit in the search engines. Things are just displayed better.  He shows the results page for [smx] and all the added benefits they get for doing things right.

URL Abstraction

URL abstraction refers to the separation of URLs and backend code. What you want your URL to be doesn’t have to be your backend technology. It becomes really easy to optimize your URLs. You can manage your URLs independently of the application deployed.  URLs can be simplified to support more streamlined analytics reporting.
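A toy Python version of the idea (field names and backend resources invented for illustration): the public URL is just a CMS-managed key, looked up in both directions.

```python
# Hypothetical CMS-managed mapping: public URL -> backend resource.
url_map = {
    "/widgets/blue-widget": ("product.jsp", {"id": 4417}),
    "/about-us": ("page.jsp", {"page": "about"}),
}

def route(public_url):
    """Inbound rewrite: look up which backend resource serves this URL."""
    return url_map.get(public_url)

def link_for(resource):
    """Outbound rewrite: code 'looks up' the public URL before publishing,
    so backend changes never change the URLs."""
    for public_url, backend in url_map.items():
        if backend == resource:
            return public_url
    return None
```

Swap out product.jsp for a new stack and only the map’s values change; the public URLs, and the links pointing at them, survive intact.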

URLs are based on content needs

  • Extreme: Any URL on any domain on that server can be mapped to any resource by updating a field in the CMS
  • Common: Certain elements like the headline of an article become components of the URL but other URL elements are forced to conventions.

URLs do not change with backend architecture changes.  Calls to links in the code “look up” the URL before publishing.   He shows a few examples of URL abstraction.  I think they’re written in hieroglyphics. Or maybe the slide is upside down. No, it’s just me.

URL Rewrite Filter is a J2EE Web Filter that provides:

  • Inbound URL rewriting
  • Outbound URL rewriting
  • URL redirection
  • Method Invocation
  • XML backend for rule storage

Potential Complications: The architecture of your production environment could impact the complexity of distributing the URL Rewrite Filter’s configuration files.  Hyperlinking might need to be adjusted once in order to support the outbound rewriting features.

Action Items

  • Create content managed URLs
  • Complete 1-1 URL scheme with redirects for alternative versions
  • Make decision on WWW subdomain
  • Make non-form pages http instead of https
  • Manage redirects and rewrites carefully
  • Manage secondary domains
  • Use cookies for sessions
  • Use the Post/Redirect/Get pattern for cross-site request forgery prevention, and do not allow state-changing application requests via GET
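For that last action item, a minimal sketch of Post/Redirect/Get (handlers return (status, body) tuples; the order store and paths are hypothetical): the POST does the work and answers with a 303 to a GET-able URL, so a browser refresh never resubmits the form.

```python
orders = []  # stand-in for a database

def handle_order_post(form):
    """Do the state change on POST, then redirect (303 See Other)
    to a confirmation URL instead of rendering a response body."""
    orders.append(form)
    order_id = len(orders)
    return (303, f"/orders/{order_id}")

def handle_order_get(order_id):
    """The redirect target is a plain, refresh-safe GET."""
    return (200, orders[order_id - 1])
```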

And we’re done! Did that make sense? I hope so…


About the Author

Lisa Barone

Lisa Barone co-founded Outspoken Media in 2009 and served as Chief Branding Officer until April 2012.


7 thoughts on “Solving the Big URL Issues”

  1. Thanks for the excellent write-up Lisa! I noticed that you don’t use the www in the domain of your own site, and I have chosen to omit it too because I think it looks cleaner that way. Unfortunately many people who link to my site keep adding the www because they either think it is required or do it out of habit, and I wonder if this means I don’t get the full benefit of those links in my rankings. Even though I am 301-ing the www to the non-www domain, I read somewhere that 301s do not transfer 100% of the PageRank (don’t remember where I read this though…)

    What is your view on this? Should I stick with the non-www version or add the www?

  2. For those that need the code (and are operating websites on an apache based server):

    From www to no-www:

    RewriteEngine on

    # 301 redirect to the domain without 'www.'
    RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
    RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

    From no-www to www:

    RewriteEngine on

    # 301 redirect to the domain with 'www.'
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

    I used to be no-www until I wanted to make use of pipelining and CDNs to speed up my site and wanted to control what content sent cookies and what didn’t. Using sub-domains allows you to do this.
