Solving the Big URL Issues
by Lisa Barone on 10/05/2010 | Internet Marketing Conferences
With the keynote and my MIFI problems out of the way, it’s time to dive into the first session of SMX East Day 2. Talking to us about big URL issues today are John Carcutt, Brian Cosgrove, Stoney deGeyter, and Joe Rozsa. Hopefully they won’t fry my brain too badly with all the technical stuff. I promise nothing. Vanessa Fox is moderating. We compared socks earlier to see whose was cuter. I’m going with hers this time. But only cause they’re foreign.
Up first is Joe Rozsa. His company is called KaLor and it’s named after his two daughters. How cute is that?
Joe starts off telling people to be consistent in their site architecture.
- Add/Remove Trailing Slashes?
- Enforce lower case URLs
- Redirect to https://
- Dub-dub-dub or not?
Add/Remove Trailing Slashes?
Is your site a directory-based site? If so, keep the slashes. If not, drop them. Inbound links are more likely to be without. Be sure to 301 redirect the alternate URL to the URL format that you choose going forward. Use that format in your internal linking structure and sitemap.xml file. Your home page should always be slash-free – http://domain.com
Enforce lower case URLs
It doesn’t matter which case you pick, as long as you stick to it. URLs are case sensitive in the eyes of the search engines, so you want to make sure they’re consistent when you’re doing link building. Same with internal linking: refer to your URLs properly.
Serving Secure Pages
If you are serving https pages, you do not want to return a 403 status code if someone comes in via http. PayPal uses 301 redirects to take visitors from http://paypal.com to https://www.paypal.com
Dub-dub-dub or not?
WWW is a subdomain and some people feel it’s outdated. Should you use it? It’s your call, but handle the alternative. Choose one method and redirect the other form to the selected method. He shares a code snippet for how to do that but…I’m only one woman, people. I can’t liveblog code.
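Since Joe's snippet didn't make it into the liveblog, here is a minimal Python sketch of the same idea, not his actual code. It picks one form for each of the four consistency choices (slash, case, https, www) and reports the 301 target for anything else. The policy constants and the function names `canonical_url` and `redirect_for` are assumptions made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed site policy; every value here is an illustrative choice.
CANONICAL_SCHEME = "https"
USE_WWW = True
TRAILING_SLASH = False  # False suits a site that is not directory-based


def canonical_url(url: str) -> str:
    """Return the single preferred form of a URL under the policy above."""
    parts = urlsplit(url)

    # Normalize the host to the chosen www / non-www form (any port is dropped).
    host = (parts.hostname or "").lower()
    bare = host[4:] if host.startswith("www.") else host
    host = "www." + bare if USE_WWW else bare

    # Lowercasing the path is only safe if the server's resources are lowercase.
    path = parts.path.lower()
    if TRAILING_SLASH:
        if not path.endswith("/"):
            path += "/"
    else:
        path = path.rstrip("/")

    # The query string is kept untouched; the fragment never reaches the server.
    return urlunsplit((CANONICAL_SCHEME, host, path, parts.query, ""))


def redirect_for(url: str):
    """Return (301, target) when the URL is not canonical, else None."""
    target = canonical_url(url)
    return None if target == url else (301, target)
```

Under this assumed policy, `redirect_for("http://paypal.com/")` yields `(301, "https://www.paypal.com")`, while a URL that is already in the chosen form yields `None` and can be served as-is.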
Common Ingredients for Best URLs
- Short and sweet and easy to read
- Touch of static would be awesome
- Don’t/have/a/hundred/folder/deep/navigation/path [I see what you did there]
- Don’t sub(domain) it out either.
- A pinch of keywords and a-dash-of-separation
Joe says that eCommerce SEO can be ugly but calls Magento the best eCommerce platform for SEO out of the box. He calls out a number of ecommerce sites that have good URL structure. It’s basically all completely over my head because I’m not a real SEO. I just play one on the Internet. Ooo, he shows a product page on Papyrus and says it’s a great example. I mean, who doesn’t love Papyrus? You want to get your category/sub category/product in your URL if you can. That’s going to set you up really well. It’s great for SEO and users. Home Depot has bad URLs because their pages are duplicated depending on how you search. Also, because Home Depot is boring.
How do breadcrumbs impact your search results? He shows how StubHub does it. The search engines show the breadcrumbs in the SERPs instead of the URL. That’s helpful for users.
[Vanessa asks how many people know about the canonical tag. Lots of people raise their hand. She says if you don’t know how to use it, there’s lots of stuff you can read. I recommend this post from the Google Webmaster Central blog on specifying your canonical.]
Next up is John Carcutt. He took the session title literally so he’s going to talk about BIG URLs. As in, really long ones.
John says that Google has the longest URL, but you should check out Darren Slatten’s post called What Is The Maximum Character Length For A URL That Google Will Index?
How many characters can you use in a URL?
HTTP Protocol: Does not set a limit on the number of characters in a URL
- Internet Explorer – 2083 characters supported
- Firefox – successfully tested up to 100,000 characters
- Safari – successfully tested up to 90,000 characters
- Opera – successfully tested up to 190,000 characters
HTML 3 said URLs cannot have more than 1024 characters, but the HTML 4 specifications removed that limit. The Sitemap protocol says a URL has to be less than 2,048 characters. To him that’s an indication of Google trying to establish a limit.
So what’s the answer? John says that 2047 characters is all that can be successfully used in all browsers, servers and protocols. Realistically, URLs should be as short as possible. If your URLs are over 200 characters, start looking for solutions to fix this.
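John's two thresholds are easy to turn into a quick audit. The sketch below is one way to do it in Python; the constant names and `classify_length` are made up for illustration, and the numbers are simply the 200 and 2047 figures quoted above.

```python
SOFT_LIMIT = 200    # the "start looking for solutions" threshold from the talk
HARD_LIMIT = 2047   # the cross-browser ceiling John cites


def classify_length(url: str) -> str:
    """Label a URL as 'ok', 'review', or 'too long' by character count."""
    length = len(url)
    if length > HARD_LIMIT:
        return "too long"
    if length > SOFT_LIMIT:
        return "review"
    return "ok"


def audit(urls):
    """Return (url, label) pairs for every URL that is not 'ok'."""
    labelled = ((url, classify_length(url)) for url in urls)
    return [(url, label) for url, label in labelled if label != "ok"]
```

Feeding it the URLs from a sitemap or crawl export gives a short list of pages worth a second look, with the over-2047 ones flagged separately.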
What Makes a URL Long?
SEOs: We make URLs long by seeing how many keywords can fit in a single URL. Does every category level need to be included? You should be able to look at a URL and get a clear idea of what you’re going to find on that page. But don’t take it too far.
Technology: Dynamic pages need to pull data from the server based on a set of specific instructions. As the amount of information that needs to be passed grows, so does the length of your URLs.
Find alternative methods to pass data when possible. Parameters in the URL are not the only way to get data to a server.
- Cookies can provide a wide array of data to help generate dynamic pages.
- Global.ASA or Global.ASAX files
The deeper the drill, the longer the URL. The number of drill-down options is also the number of parameters you can add to the end of the URL with this type of navigation.
The path of the users builds the URL: Users will use a variety of paths and end up at the same selection of parameters. This can create multiple paths of navigation to the same content.
Search engines follow every link and will index every version of the URLs that they find. Many of the URLs are just an alternate path to the same content. The “nofollow fix” that some recommend can work but takes continual maintenance if the navigation is updated regularly and can be complicated. Also, who knows how well nofollow is really working anymore. Don’t use band aids.
Get rid of duplicate categories. Add only the necessary parameters.
Dealing with the multiple paths of navigation issue: Consistency of parameter placement in the URL will solve the potential duplicate content issue associated with users’ pathing behavior. This applies to mod_rewrite stuff, too.
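To make the parameter-placement point concrete, here is a minimal Python sketch, assuming the site keeps a fixed list of the parameters a page actually needs. The `PARAM_ORDER` names and the `ordered_url` function are hypothetical. Whatever path the user drilled down, the parameters come out in one order, and anything not on the list is dropped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of the only parameters the page needs, in the one
# order the site will always emit them.
PARAM_ORDER = ["category", "brand", "color", "size"]


def ordered_url(url: str) -> str:
    """Rebuild the query string with necessary parameters in a fixed order."""
    parts = urlsplit(url)
    # If a parameter repeats, the last value wins; blank values are ignored.
    params = dict(parse_qsl(parts.query))
    kept = [(name, params[name]) for name in PARAM_ORDER if name in params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Two drill-down paths such as `?color=red&category=shoes` and `?category=shoes&color=red` both come out as `?category=shoes&color=red`, so the engines only ever see one version.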
Large Scale Content Sites
When a site publishes 60-70 or more pages of content every day, there are unique challenges to making sure that content is indexed due to the “flow rate” of content through the site. Internal paths to content may change based on content being pushed to deeper pages. Older content may be archived or placed in different sections of a site. Categorization of content may also complicate the issue if the system allows content to be filed under multiple categories. Many of these sites have multiple departments publishing content and use internal tracking parameters to monitor cross channel support.
[Vanessa Fox chimes in again and reminds people that you want your parameters to appear in the same order. This helps to train the search engines.]
Next up is Stoney.
When you have duplicate URLs it creates too many pages in the index for the engines to come and start spidering through your site. That isn’t bad. But once they start indexing duplicate pages, they start getting a little wary about your site and want to move on to something else. He shows how some pages can get indexed twice, while other pages won’t get indexed at all. Duplicate content slows down the spider on your site and may mean some of your important pages get left out. It also splits the link juice and makes your pages inaccessible.
What are the causes of duplicate content & problem URLs?
- Redundant URLs: It’s the same page accessed through 3-5 different URLs – example.com, www.example.com, www.example.com/.
- Secure Pages: http vs https. He sees this quite a bit. You can go to http OR https and you’ll pull up the exact same content.
- Session IDs: You end up creating a duplicate page farm through session IDs because every page gets multiple unique IDs.
- Link Consistently: However you decide to link to your site, keep it consistent in all of your internal links. If you’re going to use the slash, use the slash. He encourages you to use Xenu to check your links.
- Secure Shopping Path: When you’re in your checkout system, any links that you provide going back to your products or links should be hard coded links using http not https so you don’t have any duplicate content. Block the search engines from getting into the shopping cart. They don’t need to be there.
- Canonical URLs: Link only to the canonical page.
- Redirect Old Links: If a page is gone, make sure you redirect the page (and all link value) to the new page. This is important.
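A rough way to spot the duplicates Stoney describes is to boil every crawled URL down to a key and see which keys collect more than one URL. This Python sketch assumes you already have a list of URLs from a crawler such as Xenu. The session parameter names and the functions `page_key` and `duplicate_groups` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical session parameter names; substitute whatever your platform uses.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def page_key(url: str) -> str:
    """Collapse scheme, www, trailing slash, and session IDs into one key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = [(name, value) for name, value in parse_qsl(parts.query)
             if name.lower() not in SESSION_PARAMS]
    key = host + path
    return key + "?" + urlencode(query) if query else key


def duplicate_groups(urls):
    """Map each page key to its URLs, keeping only keys with more than one."""
    groups = defaultdict(list)
    for url in urls:
        groups[page_key(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Each group that comes back is a set of URLs serving the same page, which makes it a candidate for a single canonical URL plus 301 redirects from the rest.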
Next up is Brian. He says he’s going to get into the more technical side of things and I basically groan loud enough so he can hear me. Technical is hard to liveblog. It would be like liveblogging something in Chinese. I understand both equally well.
Despite my whining, Brian starts.
- Redirecting is when the Web server tells your browser to go somewhere else.
- Rewriting is when the Web server quietly serves a different piece of content for the URL you asked for, without sending your browser anywhere else.
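The difference between the two is easier to see side by side. This is a toy Python sketch, not what Brian showed. The paths in both tables and the `respond` function are invented for illustration.

```python
# Hypothetical tables; every path here is illustrative only.
REDIRECTS = {"/old-shoes": "/shoes"}                # browser is sent elsewhere
REWRITES = {"/shoes": "/catalog.php?category=12"}   # server swaps the resource


def respond(path: str):
    """Return (status, headers, internal_resource) for a requested path."""
    if path in REDIRECTS:
        # Redirect: the browser gets a 301 and makes a second request.
        return 301, {"Location": REDIRECTS[path]}, None
    if path in REWRITES:
        # Rewrite: the browser's URL stays put; only the backend target changes.
        return 200, {}, REWRITES[path]
    return 404, {}, None
```

A request for `/old-shoes` costs the visitor a round trip and changes the address bar, while a request for `/shoes` is answered directly and the visitor never sees `/catalog.php`.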
The traditional way of handling bad URLs is to fix all of the issues one at a time. But that’s a long process and can create a lot of SEO headaches if you release them intermittently. When your URLs are indexed nicely, you get a lot of benefit in the search engines. Things are just displayed better. He shows the results page for [smx] and all the added benefits they get for doing things right.
URL abstraction refers to the separation of URLs and backend code. What you want your URL to be doesn’t have to be your backend technology. It becomes really easy to optimize your URLs. You can manage your URLs independently of the application deployed. URLs can be simplified to support more streamlined analytics reporting.
URLs are based on content needs
- Extreme: Any URL on any domain on that server can be mapped to any resource by updating a field in the CMS
- Common: Certain elements like the headline of an article become components of the URL but other URL elements are forced to conventions.
URLs do not change with backend architecture changes. Calls to links in the code “look up” the URL before publishing. He shows a few examples of URL abstraction. I think they’re written in hieroglyphics. Or maybe the slide is upside down. No, it’s just me.
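For those of us who can't read the hieroglyphics, URL abstraction can be pictured as a two-way lookup table between public URLs and backend resources. The Python sketch below is an assumption about the general shape, not a reconstruction of Brian's slides. The class `UrlMap`, its methods, and the resource IDs are all hypothetical.

```python
class UrlMap:
    """Two-way lookup between public URLs and backend resource IDs."""

    def __init__(self):
        self._by_url = {}
        self._by_resource = {}

    def assign(self, resource_id: str, url: str) -> None:
        """Set (or change) the public URL for a resource, as a CMS field would."""
        owner = self._by_url.get(url)
        if owner is not None and owner != resource_id:
            raise ValueError(f"{url} is already assigned to {owner}")
        old = self._by_resource.get(resource_id)
        if old is not None:
            del self._by_url[old]
        self._by_url[url] = resource_id
        self._by_resource[resource_id] = url

    def resolve(self, url: str):
        """Inbound: which backend resource serves this URL? None if unknown."""
        return self._by_url.get(url)

    def link_to(self, resource_id: str) -> str:
        """Outbound: look up the current URL before publishing a link."""
        return self._by_resource[resource_id]
```

Templates call `link_to("article:42")` instead of hard-coding a path, so changing the URL in one place updates every link. The sketch simply forgets the old URL; a real setup would also want to 301 it to the new one.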
URL Rewrite Filter is a J2EE Web Filter that provides:
- Inbound URL rewriting
- Outbound URL rewriting
- URL redirection
- Method Invocation
- XML backend for rule storage
Potential Complications: Architecture of production environment could impact the complexity of distributing URL rewrite Filter’s configuration fields. Hyperlinking might need to be adjusted once in order to support the outbound rewriting features.
- Create content managed URLs
- Complete 1-1 URL scheme with redirects for alternative versions
- Make decision on WWW subdomain
- Make non-form pages http instead of https
- Manage redirects and rewrites carefully
- Manage secondary domains
- Use the Post-Redirect-Get pattern for cross-site-request-forgery prevention and do not allow application requests via GET
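The last item is the most code-shaped, so here is a minimal Python sketch of Post-Redirect-Get with no state changes allowed over GET. The `/order` paths, the `ORDERS` list, and the `handle` function are hypothetical. Note that the pattern on its own mainly stops duplicate submissions on refresh; forgery protection normally also relies on a per-form token, which this sketch leaves out.

```python
ORDERS = []  # stand-in for a real datastore


def handle(method: str, path: str, form=None):
    """Return (status, headers, body) for a tiny order form at /order."""
    if path == "/order":
        if method == "POST":
            ORDERS.append(dict(form or {}))
            # 303 See Other tells the browser to fetch the result with GET,
            # so refreshing the confirmation page cannot resubmit the form.
            return 303, {"Location": f"/order/{len(ORDERS)}"}, ""
        if method == "GET":
            # GET only displays the form; it never changes anything.
            return 200, {}, "order form"
        return 405, {"Allow": "GET, POST"}, ""
    if path.startswith("/order/"):
        if method == "GET":
            return 200, {}, "order confirmation"
        return 405, {"Allow": "GET"}, ""
    return 404, {}, ""
```

The only way to create an order is a POST to `/order`; every GET is read-only, which also keeps crawlers from triggering application actions by following links.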
And we’re done! Did that make sense? I hope so…
About the Author
Lisa Barone co-founded Outspoken Media in 2009 and served as Chief Branding Officer until April 2012.