How to convert a URL string to a unique integer in PHP?

PHP provides the popular md5() hash function out of the box, which returns a string of 32 hexadecimal characters. It’s a great way to generate a fingerprint for a string of arbitrary length. But what happens when you need an integer fingerprint instead?

Background:

Currently, when a web admin wants to add more than one instance of our rating widget to a website, they have to assign a unique id to every rating. It’s a very simple task, but we receive at least three emails a week asking “Why do all the ratings on the page show the same values?” or “Why does rating one of the ratings affect all the others?” As a customer-driven company, we strongly follow the KISS principle, so we decided to provide an easier way of embedding our rating widgets: each rating is uniquely identified by the combination of the site’s ID, the URL of the hosting page, and the rating’s position (order) among the other ratings in the HTML.

Challenge:

We use a LAMP stack (Linux, Apache, MySQL, PHP) for the Rating-Widget backend. The id column of our ratings table is a BIGINT, not a string, so we had to find a way to convert SiteID + page URL + order into a 64-bit integer fingerprint. We also had to canonize the URL so that the same query string parameters in a different order would not produce different fingerprints, and therefore different ratings.

Solution:

We already had a piece of code, originally written for generating security tokens, that ordered the query string parameters, so the canonization method was a simple task. A few quick manipulations led to the following code:

function canonize($url)
{
    $url = parse_url(strtolower($url));

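    // The path is missing for URLs such as http://example.com, so fall back to ''.
    $canonic = $url['host'] . (isset($url['path']) ? $url['path'] : '');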

    if (isset($url['query']))
    {
        parse_str($url['query'], $queryString);
        $canonic .= '?' . canonizeQueryString($queryString);
    }

    return $canonic;
}

function canonizeQueryString(array $params)
{
    if (!is_array($params) || 0 === count($params))
        return '';

    // Urlencode both keys and values.
    $keys = urlencode_rfc3986(array_keys($params));
    $values = urlencode_rfc3986(array_values($params));
    $params = array_combine($keys, $values);

    // Parameters are sorted by name, using lexicographical byte value ordering.
    // Ref: Spec: 9.1.1 (1)
    uksort($params, 'strcmp');

    $pairs = array();
    foreach ($params as $parameter => $value)
    {
        $lower_param = strtolower($parameter);

        if (is_array($value))
        {
            // If two or more parameters share the same name, they are sorted by their value
            // Ref: Spec: 9.1.1 (1)
            natsort($value);
            foreach ($value as $duplicate_value)
                $pairs[] = $lower_param . '=' . $duplicate_value;
        }
        else
        {
            $pairs[] = $lower_param . '=' . $value;
        }
    }

    if (0 === count($pairs))
        return '';

    return implode('&', $pairs);
}

function urlencode_rfc3986($input)
{
    if (is_array($input))
        // This is a plain function, not a method, so there is no $this to bind to.
        return array_map('urlencode_rfc3986', $input);
    else if (is_scalar($input))
        return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));

    return '';
}
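
To illustrate what the canonization buys us, here is a minimal, self-contained sketch rather than the production code above. It assumes PHP 7 or later, and canonicalUrlSketch() is a hypothetical name. It leans on the built-in ksort() and http_build_query() instead of our own helpers, so edge cases such as duplicate parameter names may be handled differently.

```php
<?php
// Hypothetical helper, not the code listed above: lowercases the URL,
// drops the scheme, and sorts the query string parameters by name.
function canonicalUrlSketch(string $url): string
{
    $parts = parse_url(strtolower($url));

    // Host and path may be absent for unusual inputs, so default to ''.
    $canonic = ($parts['host'] ?? '') . ($parts['path'] ?? '');

    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        // Byte-wise ordering of parameter names.
        ksort($params, SORT_STRING);
        // PHP_QUERY_RFC3986 encodes spaces as %20 instead of '+'.
        $canonic .= '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
    }

    return $canonic;
}

// The same page with its parameters in a different order.
$a = canonicalUrlSketch('http://Example.com/page?b=2&a=1');
$b = canonicalUrlSketch('http://example.com/page?a=1&b=2');

var_dump($a === $b);
```

Both calls produce the same canonical string, so both URLs end up with the same fingerprint.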

Creating the integer fingerprint was the more challenging part of this job. After Googling “convert string to 64 bit integer”, we found this great post from Code Project that shows how to do something very similar in C#. The post showed us how to use a hash method like md5, convert its substrings from hex to decimal and run XOR bitwise manipulations. Simple.

We tried to implement it in PHP, and what was supposed to be a very simple code translation became a real hassle. It would not work because 64-bit integers are not PHP’s best friend: none of the functions bindec(), decbin(), or base_convert() consistently supports 64-bit values. To have native 64-bit integer support, PHP must be a 64-bit build running on a 64-bit operating system.

After further digging on Google, we were led to a post about 32-bit limitations in PHP, which included the suggestion to use GMP, a really cool library for multiple-precision integer support. Using this library, we managed to create this one-line hash function that generates a 64-bit integer from a string of arbitrary length.

function get64BitHash($str)
{
    return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
}
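
One thing to watch: the first 16 hex characters of an MD5 digest cover the full unsigned range, 0 to 2^64 - 1, so get64BitHash() can return values larger than a signed BIGINT can hold, and the column would need to be BIGINT UNSIGNED. If a signed column or a native PHP integer is preferable, one possible variant is to clear the top bit and keep 63 bits. The sketch below assumes a 64-bit PHP 7+ build and does not need GMP; get63BitHash() is a hypothetical name, not part of our codebase.

```php
<?php
// Hypothetical variant of get64BitHash(): keeps 63 bits so the result fits
// in a native 64-bit PHP integer and in a signed BIGINT column.
function get63BitHash(string $str): int
{
    if (PHP_INT_SIZE < 8) {
        throw new RuntimeException('This sketch needs a 64-bit PHP build.');
    }

    // Leading 64 bits of the MD5 digest as 16 hex characters.
    $hex = substr(md5($str), 0, 16);

    // Clear the most significant bit by masking the first hex digit to 0..7.
    $hex[0] = dechex(hexdec($hex[0]) & 0x7);

    // At most 0x7fffffffffffffff, so hexdec() returns an int, not a float.
    return hexdec($hex);
}

var_dump(get63BitHash('example.com/page?a=1&b=2'));
```

Dropping one bit halves the hash space, which is usually an acceptable trade for not depending on an extension.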

Combined with the URL canonization, the final solution looks like this:

function urlTo64BitHash($url)
{
    // The canonization function defined above is named canonize().
    return get64BitHash(canonize($url));
}
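
urlTo64BitHash() covers only the URL. As described in the background, a rating is identified by the site’s ID, the page URL, and the rating’s order on the page, so all three have to go into the hashed string. One way to do that is sketched below. ratingKey() and keyToHex64() are hypothetical names, the ':' separator is an assumption, and the sketch stops at the 16 hex characters that get64BitHash() would then convert to decimal. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: builds the string to hash from the three parts.
// The two integers come first, so the ':' separators stay unambiguous
// even if the URL itself contains a colon.
function ratingKey(int $siteId, string $canonicalUrl, int $order): string
{
    return $siteId . ':' . $order . ':' . $canonicalUrl;
}

// First 16 hex characters of the MD5 digest, i.e. its leading 64 bits.
function keyToHex64(string $key): string
{
    return substr(md5($key), 0, 16);
}

// Two ratings on the same page of the same site differ only in their order.
$first  = keyToHex64(ratingKey(42, 'example.com/post?a=1&b=2', 1));
$second = keyToHex64(ratingKey(42, 'example.com/post?a=1&b=2', 2));

echo $first, "\n", $second, "\n";
```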

Collision and Performance Test of get64BitHash
Platform: Intel i3, Windows 7 64 bit, PHP 5.3
Hashes generated: 10,000,000
Elapsed time: 460 milliseconds per 100,000 generations
Collisions: none found
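
A test along these lines can be reproduced with a small harness. The sketch below is not the script we ran: countCollisions() and the generated example.com URLs are made up for illustration, and it assumes PHP 7 or later. It hashes N distinct strings and counts how often a hash value repeats.

```php
<?php
// Hypothetical harness: hashes $n distinct URLs and counts repeated hashes.
function countCollisions(callable $hash, int $n): int
{
    $seen = [];
    $collisions = 0;

    for ($i = 0; $i < $n; $i++) {
        $h = $hash('example.com/page?id=' . $i);
        if (isset($seen[$h])) {
            $collisions++;
        } else {
            $seen[$h] = true;
        }
    }

    return $collisions;
}

// The same 64-bit prefix that get64BitHash() converts to decimal.
$hex64 = function (string $s): string {
    return substr(md5($s), 0, 16);
};

$start = microtime(true);
$collisions = countCollisions($hex64, 100000);
$elapsedMs = (microtime(true) - $start) * 1000;

printf("collisions: %d, elapsed: %.0f ms\n", $collisions, $elapsedMs);
```

The $seen array grows with N, so a run over millions of strings needs a correspondingly higher memory limit.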

Summary

This method can be very useful when storing a large number of URLs in your database. Instead of querying the URL column, you can add an extra indexed 64-bit column that stores the URL hash and query that column instead, which should drastically improve query performance.

We hope this information will help you in your projects. If you have any comments or any additional use-cases where this info can be applied, please feel free to comment below.


2 Comments

 

  1. Gur Dotan

    Great post!

    Two side notes:
    1. I see you’re doing URL canonicalization. I wrote a package called Domo that does top-level domain name canonicalization, i.e “www1.motors.ebay.co.uk” => “ebay.co.uk”. It’s available for both Ruby and Javascript if you ever need it:
    https://github.com/gurdotan/domo-rb
    https://github.com/gurdotan/domo.js

    2. When I blogged about this, people pointed out my mistake – it’s “canonicalize”, not “canonize”. See http://en.wikipedia.org/wiki/Canonicalization

     
  2. Roger

    I think people who have the issues you described with widgets like this are actually too lazy to read the installation tutorials. It´s pretty simple to understand that you need an unique identifier for each post for the widget to work the right way. You don´t have to be an expert to get that.

    I myself prefer to use the post title or permalink rather than the post id as the unique id. In fact, years ago, when i found out about rating-widget, the reason why i decided not to use it in my blog was exactly the fact that you could only set the post id as the id for the widget. That´s why I chose the no longer available js-kit-rating at that time.

     