Deep Linking

Preventing Deep Linking
Stephen P. Morse, San Francisco

Background

Deep linking refers to one website linking directly to an internal page at another site rather than to that site’s home page. For example, the site being linked to might be a catalog store with separate pages describing each item it is selling. The linking site might be a competitor selling the same products. Rather than having its own descriptions of the merchandise, it deep-links to the descriptions that the other site already put on the web.

Another example of deep linking occurs when a site has a search form for its users to fill out along with a corresponding search engine and associated database that the search form submits (links) to. A foreign site could develop its own search form and brand it with its site’s logo, but have that search form submit directly to the original site’s search engine. If the results page generated by the original site’s search engine doesn’t clearly brand itself as coming from the original site, users at the foreign site would think that the foreign site is responsible for the entire search.

As an example, the search form that is used by the Google website is at http://www.google.com/index.html. That form is clearly branded as a Google form by having the Google logo on it. When you fill out that form and submit it, a request is made to the Google search engine that is at http://www.google.com/search. The Google search engine then returns a page containing the search results. If the Google logo did not appear on the search results, a foreign site could put up its own search form at http://www.giggle.com/index.html and have that form use Google’s search engine. The user of the giggle site would think that giggle performed the entire search for him.

Understandably the original site might be annoyed because it has spent time and money developing the catalog pages or the search engine, and the foreign site is piggy-backing on it and taking all the credit. So the original site would want to take some measures to prevent this deep linking.

One thing the original site could do is threaten legal action. But that involves the expense of an attorney, has the effect of alienating the foreign site rather than getting that site to voluntarily cooperate, and is on questionable legal grounds since deep linking has never been ruled illegal in the US.

A more sensible approach is to institute technical measures that prevent foreign sites from deep linking to pages that you don’t want them to link to. That’s less expensive and more effective than legal action, and is very simple to do. This paper describes how to accomplish it. The concept is a bit technical, but all necessary terms are clearly defined so this can be read and understood by a layperson. The actual implementation details (computer code) are in a separate section that you can skip if you don’t understand it.

Cookies versus Referrer

The whole technical issue boils down to a website being able to determine if a site that is linking to it is friend or foe. One way of doing that is by the use of the so-called referrer field and another is by the use of cookies.

When a user visits a webpage that contains a link to another page, and the user clicks on that link, the user’s browser sends a page request over the Internet for that other page. The page request contains a referrer field identifying the original page that contained the link. So in theory the linked-to website could test the referrer field and use that to determine if it wants to deliver the requested page. However, not all browsers transmit the referrer field when making a page request, so any test based on the referrer field could have the effect of blocking valid linking, in effect throwing out the baby with the bath water.
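As a minimal sketch of such a referrer test, and of why it can block valid users, consider the function below. The function name isOwnReferrer and the host www.ourserver.com are assumptions made for illustration only. Note that the HTTP header carrying this field is spelled Referer.

```javascript
// Hypothetical helper: is the referrer one of our own pages?
// `referrer` is the raw referrer value from the request (may be missing or empty).
function isOwnReferrer(referrer, ownHost) {
  if (!referrer) {
    // No referrer was sent, so we cannot tell friend from foe.
    // Rejecting here is exactly what blocks valid users whose browsers omit the field.
    return false;
  }
  try {
    return new URL(referrer).hostname === ownHost;
  } catch (e) {
    return false; // the referrer was not a well-formed URL
  }
}
```

A request arriving from http://www.ourserver.com/ourform.html would pass this test, while one arriving with no referrer at all would be rejected even if it came from our own form.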

A more reliable test is to use cookies. Cookies are small pieces of information that a website plants on the user’s computer. That information is sent back to the website when the user’s browser requests future pages from that same website. However, the browser never sends the cookies back to any site other than the one that planted them. So if a website’s search form plants a cookie when the user clicks on the submit button to start a search, and that website’s search engine tests for the presence of the cookie before doing the search, the site can know for certain that the link is coming from within the site itself and not from a foreign site.

The downside of using cookies is that, for privacy reasons, some users might have cookies disabled in their browser. In that case such users would look like they are coming in from a foreign site and their page request would be rejected. However, cookies are a pervasive feature of the world wide web which many websites rely on, so any user having cookies disabled is going to have problems at many sites. Therefore it’s perfectly reasonable to reject valid users that have disabled cookies. But before doing so, it’s a good idea to test whether the user has disabled cookies (that’s easy to do) and give him a message saying that your site relies on cookies. That way he won’t think that your site is broken when you reject his valid page request.
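One way to test for disabled cookies, sketched here under the assumption that the test runs in the browser, is to plant a throwaway cookie and see whether it can be read back. The function name cookiesEnabled and the cookie name cookietest are hypothetical.

```javascript
// Hypothetical check: can this browser store cookies at all?
// `doc` is the browser's document object (passed in so the function is easy to test).
function cookiesEnabled(doc) {
  doc.cookie = "cookietest=1"; // try to plant a test cookie
  var found = String(doc.cookie).indexOf("cookietest=1") !== -1;
  // remove the test cookie again by giving it an expiry date in the past
  doc.cookie = "cookietest=; expires=Thu, 01 Jan 1970 00:00:00 GMT";
  return found;
}
```

In a page this would be called as cookiesEnabled(document), and the message about relying on cookies would be shown when it returns false.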

Cookies and Query-String

Here in a nutshell is how the cookie solution works. Let’s look at the search form/search engine example. As already mentioned, your search form would plant a cookie and your search engine would test for the presence of that cookie before it delivers its results. Since only your site can plant a cookie that will be sent on to other page requests from your site, there is no way that a foreign site can plant the necessary cookie.

However you must make sure to clear the cookie after doing the search. Otherwise, if the user has used your search form once and had the cookie planted, then any future links from the foreign site’s search form to your site’s search engine would result in the cookie being sent with the page request. Your search engine would then be fooled into thinking that the request came from within your own site.
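For illustration, a cookie is cleared by setting it again with an expiry date in the past. The helper name clearedCookie below is hypothetical; it only builds the string that would be assigned to document.cookie.

```javascript
// Build a cookie string that tells the browser to discard the named cookie.
// An expiry date in the past makes the browser delete it.
function clearedCookie(name) {
  return name + "=; expires=Thu, 01 Jan 1970 00:00:00 GMT";
}

// In a page: document.cookie = clearedCookie("time");
```

If the cookie was planted with an explicit path or domain, the same path and domain would have to be included when clearing it.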

Clearing cookies is not difficult to do, but a safer thing is to make the cookies good for one usage only. The way to do that is to pass information about the cookie in what is known as the query string. The query string is that part of the URL following a question mark. It is not part of the address of the page being requested, but rather is information that is being passed to that page. For instance, you can pass a name of JACK to a webpage by appending ?name=JACK to the URL of the webpage.
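To make the query string concrete, here is a small sketch that reads one parameter back out of a URL using the standard URL class. The function name getQueryParam and the page address in the usage comment are assumptions for illustration.

```javascript
// Read one parameter out of a URL's query string.
// Returns null when the parameter is absent.
function getQueryParam(url, name) {
  return new URL(url).searchParams.get(name);
}

// getQueryParam("http://www.ourserver.com/ourform.html?name=JACK", "name") gives "JACK"
```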

Let’s illustrate this with an example. Suppose that we have a search form at http://www.ourserver.com/ourform.html and the search engine is at http://www.ourserver.com/ourengine.cgi. When the user presses the submit button on our search form, a cookie is planted. A cookie consists of a name part and a value part. In this case the name of the cookie is “time” and the value of the cookie is the current time measured in seconds since some starting date (usually January 1, 1970). Let’s assume that the current time is 100. Our search form then passes that value of 100 to our search engine by linking to the URL http://www.ourserver.com/ourengine.cgi?time=100. Our search engine compares the value in the cookie with the value in the query string, and if they match it knows that the request came from within the ourserver.com website. And that cookie with a time of 100 can be used only at time 100 (since that is the value that is passed in the query string) so any future links to our search engine by a foreign site will not succeed.
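The comparison just described can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the cookie text is assumed to arrive in the usual "name=value; name2=value2" form, and the query string is assumed to arrive without its leading question mark.

```javascript
// Pull a named value out of a "name=value; name2=value2" cookie string.
function cookieValue(cookieHeader, name) {
  var pairs = String(cookieHeader || "").split(/;\s*/);
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf("=");
    if (eq !== -1 && pairs[i].slice(0, eq) === name) {
      return pairs[i].slice(eq + 1);
    }
  }
  return ""; // no such cookie
}

// Pull a named value out of an "a=1&b=2" query string.
function queryValue(queryString, name) {
  return new URLSearchParams(queryString || "").get(name) || "";
}

// True only when a time cookie exists and equals the time in the query string.
function cameFromOurForm(cookieHeader, queryString) {
  var cookieTime = cookieValue(cookieHeader, "time");
  return cookieTime !== "" && cookieTime === queryValue(queryString, "time");
}
```

With a cookie of time=100, a request for ourengine.cgi?time=100 is accepted, while a later request carrying time=250 against that same stale cookie is rejected.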

How to Block

Everything so far dealt with when to block, that is, how to determine if the request was coming from our own search form or from a search form on a foreign site. But we never mentioned what to do when we determine that we should block.

We could just put up a message saying “Search Request Denied” or maybe something even nastier. But a much more effective procedure would be to redirect the user of the foreign search form to our own search form. So when the user presses the submit button on the foreign search form, rather than receiving back the results of the search he will suddenly find himself at our search form from which he can submit the search and obtain the results. So instead of alienating the user of the foreign website, we are converting him into a bona fide user of our website.
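As a rough sketch of that decision, the search engine either performs the search or answers with a redirect to our own form. The function name decide and the shape of the returned object are assumptions for illustration; the form address is the one used in the earlier example.

```javascript
// Hypothetical decision step for the search engine: search, or send the user to our form.
var OUR_FORM = "http://www.ourserver.com/ourform.html";

function decide(cookieTime, queryTime) {
  if (cookieTime !== "" && cookieTime === queryTime) {
    return { status: 200, action: "search" };
  }
  // Status 302 with a Location header is the standard HTTP way to redirect a browser.
  return { status: 302, location: OUR_FORM };
}
```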

Cooperating Websites

The above assumed that the search form and search engine are in the same domain, so the search form can set a cookie that the search engine can see. But what if two organizations have a partnership and they want to put the search form on one organization’s domain and the search engine on the other’s? In this case there needs to be some additional handshaking involved so that the webpages on each domain can verify that they are getting information from the other one.

Let’s assume that the search form is at http://ourpartner.com/hisform.html and the search engine is at http://ourself.com/ourengine.cgi. Here’s how the handshaking works.

1. The search form at our partner’s website sets the time cookie and passes the time parameter in the query string as normal. But it doesn’t link directly to our search engine. Instead it links to an intermediate page that we established at http://ourself.com/verify.cgi.

2. Our verify page is going to verify that the request really came from our partner. Since it can’t see our partner’s cookie, it will have to ask our partner if he just made a request. It does this by linking to an intermediate page that our partner has set up at http://ourpartner.com/verify.cgi. In order for our partner to know that this request is really coming from our website, our verify page will put the time parameter into the query string when it links to our partner’s verify page. It also sets a cookie with the time value. We’ll see why it does that shortly.

3. Our partner’s verify page now receives our request and it can compare the time value in the query string with the time value in the cookie that it had already set in step 1. This allows it to verify that this request really came from us and not from someone trying to gain access to our search engine. If the time in the cookie and the query string match, our partner’s verify program will now link to the search engine on our website. And of course it passes the time value in the query string.

4. Our search engine receives the request but it needs to verify that this request really came from our partner. Now it can test the time value in the query string with the time value in the cookie that our verify page set in step 2. If they match, then our search engine knows that this request had to come from our partner since nobody else would know the correct time value to put into the query string. And in that case our search engine will go ahead and perform the search.

Well that was a little circuitous, but it’s a clever way of verifying when you can’t see each other’s cookies directly.
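The four steps can be traced with a small simulation. This is only a model under stated assumptions: the browser's per-domain cookie storage is represented by the hypothetical object jars, each page is a plain function, and a "link" is an object holding the target page and the time value from the query string.

```javascript
// The browser keeps a separate cookie jar per domain; a page sees only its own jar.
var jars = { "ourpartner.com": {}, "ourself.com": {} };

// Step 1: our partner's search form plants its cookie and links to our verify page.
function partnerForm(now) {
  jars["ourpartner.com"].time = String(now);
  return { page: "ourself.com/verify.cgi", time: String(now) };
}

// Step 2: our verify page plants our own cookie and asks our partner to confirm.
function ourVerify(request) {
  jars["ourself.com"].time = request.time;
  return { page: "ourpartner.com/verify.cgi", time: request.time };
}

// Step 3: our partner's verify page compares the query string with its own cookie.
function partnerVerify(request) {
  var cookie = jars["ourpartner.com"].time;
  if (!cookie || cookie !== request.time) {
    return null; // not a request that our partner's form started
  }
  return { page: "ourself.com/ourengine.cgi", time: request.time };
}

// Step 4: our search engine compares the query string with the cookie from step 2.
function ourEngine(request) {
  var cookie = jars["ourself.com"].time;
  return Boolean(request && cookie && cookie === request.time);
}

// A search started from our partner's form at time 100 passes all four steps.
var allowed = ourEngine(partnerVerify(ourVerify(partnerForm(100)))); // true
```

A foreign form that links to our verify page fails at step 3, because our partner never set a matching cookie for that user.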

Implementation

This section is being written for web developers and a basic knowledge of the programming languages involved is assumed. However, a knowledge of the coding is not required if you already have a search form and search engine and simply want to embed blocking into it. This would be the case if you generated the form and engine automatically by using my search-application generator, which is at http://stevemorse.org/create

The techniques for blocking presented in this paper are quite easy to implement. A search form is usually written in html plus javascript code. We would want the search form to plant the cookie and create the necessary query string. Here is the html/javascript code that accomplishes that.

We need to add javascript code that gets executed after the page is loaded. So it should appear inside a javascript function that is called by the onload handler of the <body> tag. Assuming our <body> tag looked as follows

<body onload="Init();">

we would add the following code to the Init() function:

function Init() {
     ...
     var time = new Date().getTime(); // current time, in milliseconds since January 1, 1970
     document.cookie = "time=" + time; // plant the cookie
     document.searchform.time.value = time; // insert time in a hidden field
     ...
}

Somewhere between our <form> tag and our </form> tag we would add a hidden field as follows. The nature of submitting a form automatically puts the values of all fields (hidden or not) into a query string and appends that query string at the end of the URL that it submits to.

<form name="searchform">
    ...
    <input type="hidden" name="time">
    ...
</form>

The search engine can be written in a variety of languages. Two popular ones are PHP and Perl. Here is the PHP and Perl code that should be added to the appropriate search engine.

First the php code

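// Block when the cookie is missing or empty, or when it differs from the query string
if (empty($_COOKIE['time']) || !isset($_GET['time']) || $_COOKIE['time'] != $_GET['time']) {
    exit;
}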

Note that we explicitly test for the lack of a cookie. We do that because a foreign site’s search form could simply not transmit a query and if our site never set a cookie it would look as though the cookie and query string matched (they are both blank).

And now the perl code

# Check for time cookie
    $cookieTime = "";   # empty means no time cookie was received
    $rcvd_cookies = $ENV{"HTTP_COOKIE"} || "";
    @cookies = split /;\s*/, $rcvd_cookies;   # cookies are separated by "; "
    foreach $cookie (@cookies) {
      ($name, $value) = split(/=/, $cookie);
      if ($name eq "time") {
        $cookieTime = $value;
      }
    }

# Check for time query string parameter

    $queryTime = "0";
    @pairs = split(/&/,$ENV{'QUERY_STRING'});
    foreach $pair (@pairs) {
      ($name, $value) = split(/=/, $pair);
      if ($name eq "time") {
        $queryTime = $value;
      }
    }

# Exit if the time cookie doesn't match the time query parameter

    if ($cookieTime eq "" || $cookieTime ne $queryTime) {
      exit;
    }

The search engine can even be written in javascript and executed on the browser instead of the server. In that case the code to add is

// Check for time cookie
    var cookieTime = "";
    var cookies = String(document.cookie).split(/;\s*/); // cookies are separated by "; "
    for (var index = 0; index < cookies.length; index++) {
      var parts = cookies[index].split("=");
      if (parts[0] == "time") {
        cookieTime = parts[1];
      }
    }

// Check for time query string parameter

    var queryTime = "";
    var query = String(document.location.search).substring(1); // text after the "?", or "" if none
    var params = query.split("&"); // avoid the name "arguments", which is special inside functions
    for (var index = 0; index < params.length; index++) {
      var parts = params[index].split("=");
      if (parts[0] == "time") {
        queryTime = parts[1];
      }
    }

// Exit if the time cookie is missing or doesn't match the time query parameter
// (this code is assumed to be inside a function, so that return is legal)

    if (cookieTime == "" || cookieTime != queryTime) {
      return;
    }

In the above code, the search engine simply exits and does nothing when the time cookie and time query parameter don't match. As mentioned earlier, a better solution is to redirect the user to our real search form. The simplest way to redirect the user is to have the search engine return the following webpage to him instead of giving him the results of his search.

<html>
    <head>
      <script>
        document.location.replace("http://ourserver.com/ourform.html");
      </script>
    </head>
</html>

Implementation continued -- domains and paths

The above code for cookies assumed that the search form and the search engine are at the same place on our server. That is, if our search form is at http://www.ourserver.com/ourfolder/ourform.html, then our search engine must be at http://www.ourserver.com/ourfolder/ourengine.cgi. But that is often not the case. Although both may be on the same site, they may be at different places on that site.

In order to clarify this, we need to introduce some terms. Rather than giving rigorous definitions (which might be hard to comprehend), I’ll give informal ones.

A host is the first part of the URL and usually corresponds to a particular server (machine) on the Internet. In our example, the host is www.ourserver.com. A domain is a group to which the host belongs. In our example www.ourserver.com is in the .ourserver.com domain. This is significant because a domain usually belongs to a single company, so all host machines in the same domain are owned by the same company. This loosely corresponds to what we think of as a website. A path is a directory location (sometimes called a folder) on a host. And a web page (search form, search engine, image, etc.) is in a directory. So the search form and search engine in our example are both in the /ourfolder directory on the www.ourserver.com host.

Unless specified otherwise, cookies set by a page on a particular host are sent back to other pages on that same host only. To get around that limitation, we can explicitly state that we want the cookies to go to any page in the same domain. But in no case can we have the cookie sent to a page outside of the domain. Similarly, cookies set by pages in one directory are sent back only to pages in that same directory (or in directories beneath it) unless specified otherwise.

Since a domain name is usually associated with a single company, it seems strange that cookies can be restricted to a specific path. The reason is that although a single company may own the domain and even the host, that company might be in the business of providing server space to different individuals or companies. For example, the network provider pacbell has the host home.pacbell.net and provides each of its customers with a directory on that host into which the customer places his webpages. So if I had a pacbell account with the user name morse, my webpages would all go in http://home.pacbell.net/morse. I certainly wouldn’t want my cookies going to my competitor’s webpages at http://home.pacbell.net/antimorse and vice versa.
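The host, domain and path rules can be modeled roughly as follows. This is a simplified sketch and not a full statement of what browsers do; the function name cookieIsSent and the shape of the scope object are assumptions for illustration.

```javascript
// Simplified model: would a cookie with this scope be sent with a request for `url`?
// scope.host   - host of the page that set the cookie
// scope.domain - optional domain given when the cookie was set, e.g. ".ourserver.com"
// scope.path   - directory the cookie applies to, e.g. "/morse"
function cookieIsSent(scope, url) {
  var target = new URL(url);
  var hostOk;
  if (scope.domain) {
    var bare = scope.domain.replace(/^\./, "");
    hostOk = target.hostname === bare || target.hostname.endsWith("." + bare);
  } else {
    hostOk = target.hostname === scope.host; // no domain given: same host only
  }
  var dir = scope.path.replace(/\/$/, ""); // "/" becomes "", "/forms/" becomes "/forms"
  var pathOk = target.pathname === dir || target.pathname.startsWith(dir + "/");
  return hostOk && pathOk;
}
```

Under this model a cookie scoped to /morse on home.pacbell.net goes to pages under http://home.pacbell.net/morse but not to pages under http://home.pacbell.net/antimorse.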

The time to specify the domain and path for a cookie is when the cookie is set. This occurs in the search form where we had the line

document.cookie = "time=" + time;

Suppose for example that our search form is at http://www.ourserver.com/forms/ourform.html and our search engine is at http://www.ourserver.com/cgi-bin/ourengine.cgi. In other words, our search form is in the /forms directory and our search engine in the /cgi-bin directory. In this case the common directory location between the two of them is simply /. So the above statement would be written as

document.cookie = "time=" + time + ";path=/";

As another example, assume that our search form is at http://www.ourserver.com/ourform.html and our search engine is at http://engines.ourserver.com/ourengine.cgi. Now the two are on different hosts but both are in the .ourserver.com domain. So the above statement would be written as

document.cookie = "time=" + time + ";domain=.ourserver.com";

And of course we could specify both a path and a domain when setting a cookie as follows

document.cookie = "time=" + time + ";path=/;domain=.ourserver.com";
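The choices above can be gathered into two small helpers. Both names, commonDirectory and timeCookie, are hypothetical and are offered only as a sketch: the first finds the directory shared by the form and the engine, and the second builds the string to assign to document.cookie.

```javascript
// Longest directory shared by two URL paths, e.g. "/forms/ourform.html" and
// "/cgi-bin/ourengine.cgi" share only "/".
function commonDirectory(pathA, pathB) {
  var a = pathA.split("/").slice(0, -1); // drop the file name
  var b = pathB.split("/").slice(0, -1);
  var shared = [];
  for (var i = 0; i < Math.min(a.length, b.length) && a[i] === b[i]; i++) {
    shared.push(a[i]);
  }
  return shared.join("/") || "/";
}

// Build the cookie string, adding path and domain only when they are given.
function timeCookie(time, path, domain) {
  var cookie = "time=" + time;
  if (path) cookie += ";path=" + path;
  if (domain) cookie += ";domain=" + domain;
  return cookie;
}

// In a page:
// document.cookie = timeCookie(time, commonDirectory("/forms/ourform.html", "/cgi-bin/ourengine.cgi"));
```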

Conclusion

Blocking another site from deep linking into our website is not difficult to do. We don’t need to rely on legislation to make the practice illegal since we can take our own proactive measures to prevent it. Just as we would lock the door to our house if we don’t want intruders to come in, it stands to reason that we should block access to our website if we don’t want to be deep-linked.