When I started in web development I just used the autoincrement integer as an identifier and returned it in responses. I did not care about security or large datasets in distributed systems back then. Since then I have learned there are quite a lot of different identifier types a domain object/resource can have, each with its own difficulties, advantages and disadvantages. So let's dive into the wonderful world of identifier types.
Auto-increment integers
Auto-increment integers are the simplest form of id's. You store the object in the database and get back a generated id: the first item gets id 1, the next one 2, etc. Of course this means the storage has to be centralized, since no two servers should ever hand out the same id. Another downside is that you reveal how many records there are, and it's easy to predict the next generated id.
Example url: /my-resource/345
Advantages:
- Fast inserts and lookups, natively supported by the database.
Disadvantages:
- You leak internal information, like the number of records.
- It's also easy to guess possible id's.
- You have to store the object in the database before you have an identifier.
- It's harder to make it work in distributed systems, even though you can work around this with database sharding techniques.
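To make the "store first, get the id afterwards" behaviour concrete, here is a minimal sketch in plain PHP. It assumes the PDO SQLite extension is available; the table name `my_resource` and the helper `insertResource` are made up for this example and are not part of Apie.

```php
<?php
declare(strict_types=1);

// In-memory SQLite database, used only for illustration.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE my_resource (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)');

/** Stores a record and returns the id the database generated for it (hypothetical helper). */
function insertResource(PDO $pdo, string $name): int
{
    $statement = $pdo->prepare('INSERT INTO my_resource (name) VALUES (:name)');
    $statement->execute(['name' => $name]);

    // The id only exists after the INSERT has been executed.
    return (int) $pdo->lastInsertId();
}

$firstId = insertResource($pdo, 'first');
$secondId = insertResource($pdo, 'second');
```

Anyone who sees the second id can guess that exactly one record was created before it, which is the information leak described above.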
In my Apie library auto increment integers can be created with AutoincrementInteger:
use Apie\Core\Identifiers\AutoincrementInteger;
$id = new AutoincrementInteger();
$id->toNative(); //returns null
$object = $datalayer->find($id);
$object->getId()->toNative(); //returns the database id.
Hashed id's
Hashed id's are a variation of autoincrement integers where you try to hide internal information. Instead of exposing the integer in the url, you encrypt it with a secret key, using an algorithm that can also decrypt it again (either asymmetric with a private key, or symmetric with a single shared key). Despite the name, this is reversible encryption and not a one-way hash.
It's important that this encryption/decryption only happens on the server and that there is no way to find the integer in some other way.
For example to insert a record:
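A minimal sketch of that flow in plain PHP (not the Apie API, which is not implemented yet): the record is stored first, the database hands back an integer, and only an encrypted token of that integer is sent to the client. The function names `deriveKey`, `encodeId` and `decodeId`, the secret, and the choice of a single AES-128 block through the OpenSSL extension are all assumptions made for illustration. Dedicated libraries usually produce much shorter tokens than the 32 characters you get here.

```php
<?php
declare(strict_types=1);

/** Derives a 16-byte AES key from a secret string (hypothetical helper). */
function deriveKey(string $secret): string
{
    return substr(hash('sha256', $secret, true), 0, 16);
}

/** Encrypts an integer id into a 32-character hex token. */
function encodeId(int $id, string $secret): string
{
    // 8 bytes for the id plus 8 zero bytes is exactly one AES block.
    // The zero bytes double as a cheap validity check when decoding.
    $block = pack('J', $id) . str_repeat("\0", 8);
    $flags = OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING;
    $encrypted = openssl_encrypt($block, 'aes-128-ecb', deriveKey($secret), $flags);
    if ($encrypted === false) {
        throw new RuntimeException('Could not encrypt the id.');
    }

    return bin2hex($encrypted);
}

/** Decrypts a token back into the integer id, or returns null if it is not valid. */
function decodeId(string $token, string $secret): ?int
{
    if (strlen($token) !== 32 || !ctype_xdigit($token)) {
        return null;
    }
    $flags = OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING;
    $block = openssl_decrypt(hex2bin($token), 'aes-128-ecb', deriveKey($secret), $flags);
    if ($block === false || substr($block, 8) !== str_repeat("\0", 8)) {
        return null;
    }

    return unpack('Jid', $block)['id'];
}

// 345 stands for the id the database returned after the insert.
$token = encodeId(345, 'my-secret');
$id = decodeId($token, 'my-secret');
```

The same integer always produces the same token, so urls stay stable as long as the secret does not change.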
If the encryption/decryption leaks out you can figure out the internal integer very easily.
A good example where the hashing was figured out is the video game Super Mario Maker, where people can upload their own Super Mario levels. Because the algorithm was figured out, people could generate every possible level code that would be uploaded in the future, and even build a level search website.
Example url: /my-resource/adfbadad
Advantages:
- Almost as fast as integers as it only adds an encrypt/decrypt step.
Disadvantages:
- Can give a false sense of security. Once the encryption/decryption is figured out, it has no advantages over autoincrement integers.
- It can be a challenge to ensure the integer is never communicated in some other way.
- If the key is lost or changed, all identifiers ever communicated become useless.
Hashed id's are not implemented yet in Apie, but they are planned and will work like this:
use Apie\Core\Identifiers\HashedId;
class EmployeeId extends HashedId {
protected function getEncryptionKey(): string {
return $_ENV['employee_id_key'] ?? 'dummy';
}
}
Uuid
Uuid's rely on generating a code so large that it is very unlikely another server will ever generate the same one. The identifier is a 128-bit number (largely random, depending on the version). Because this number is so big, it's often shown in a specific 'text-friendly' format. There are multiple ways to display a UUID:
- very common display: 550e8400-e29b-41d4-a716-446655440000
- as number: 113059749145936325402354257176981405696
- GUID format found in the Windows registry: {550e8400-e29b-41d4-a716-446655440000}
- binary: 01010101000011101000010000000000111000101001101101000001110101001010011100010110010001000110011001010101010001000000000000000000
- some legacy format: 34dc23469000.0d.00.00.7c.5f.00.00.00
A very common mistake is storing uuids as a string and not as a 128-bit number in the database. Storing it as a 128-bit number costs only 16 bytes, while the readable string costs more than double that (36 characters in the common format). Storing it as a 128-bit number does come with the problem that writing a raw SQL query to search for one record can be annoying. Sadly, a very common framework like Laravel stores them as strings and I don't see that being fixed soon. Symfony does it better in later versions, but some of its code still forgets to convert the string into its binary form.
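The conversion between the two forms is small enough to write by hand. The sketch below uses only built-in PHP functions; the helper names `uuidToBinary` and `binaryToUuid` are made up for this example, and the binary value is what you would put in a 16-byte binary column.

```php
<?php
declare(strict_types=1);

/** Converts the common text form into the 16 raw bytes to store. */
function uuidToBinary(string $uuid): string
{
    $hex = str_replace('-', '', $uuid);
    if (strlen($hex) !== 32 || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException('Not a valid UUID: ' . $uuid);
    }

    return hex2bin($hex);
}

/** Converts 16 raw bytes back into the 8-4-4-4-12 text form. */
function binaryToUuid(string $bytes): string
{
    if (strlen($bytes) !== 16) {
        throw new InvalidArgumentException('A UUID is exactly 16 bytes.');
    }
    $hex = bin2hex($bytes);

    return sprintf(
        '%s-%s-%s-%s-%s',
        substr($hex, 0, 8),
        substr($hex, 8, 4),
        substr($hex, 12, 4),
        substr($hex, 16, 4),
        substr($hex, 20, 12)
    );
}

$text = '550e8400-e29b-41d4-a716-446655440000';
$binary = uuidToBinary($text);
```

Here `strlen($text)` is 36 and `strlen($binary)` is 16, which is where the storage difference comes from.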
Another common mistake is to use a fully random uuid, which is called Uuid version 4, as the identifier of a database record. SQL databases can not index this efficiently: the index is a B+ tree that keeps all values sorted and works best when newly added values arrive in roughly sorted order. Random values end up all over the tree instead.
Because uuids are used differently in different situations, they are grouped into 8 versions that determine how a UUID is created and in which use cases you can use it. Uuid version 4 is almost entirely random and has the lowest chance that 2 systems generate the same uuid at the same time. Uuid version 1 or 6 is often used instead because both contain a timestamp; version 6 in particular puts the most significant timestamp bits first, so new values sort after older ones in a database index. The downside is that you leak the time at which an object was created.
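To show where the version ends up inside the identifier, here is a small sketch that builds a version 4 uuid from 16 random bytes using only built-in PHP functions. The function name `generateUuidV4` is made up for this example; in a real project you would normally use a library such as the Apie classes shown further down.

```php
<?php
declare(strict_types=1);

/** Generates a random (version 4) uuid in the common text format. */
function generateUuidV4(): string
{
    $bytes = random_bytes(16);

    // The high nibble of byte 6 holds the version number: 4.
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);
    // The two highest bits of byte 8 hold the variant: binary 10.
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);

    // 32 hex characters, split in 8 groups of 4 and joined as 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

$uuid = generateUuidV4();
```

The third group always starts with a 4 and the fourth group with 8, 9, a or b; the other 122 bits are random.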
An example url would be /my-resource/550e8400-e29b-41d4-a716-446655440000
Advantages:
- Can be generated before you store it.
- Unless Uuid version 1 or 6 is being used, no hidden information is leaked.
Disadvantages:
- Either you leak internal information or you store records inefficiently for searching and inserting in the database.
- Since the index is large it could result in very large indexing tables or inefficient searches in the database.
Uuid versions 1 to 6 are supported in Apie:
use Apie\Core\Identifiers\Uuid;
use Apie\Core\Identifiers\UuidV1;
use Apie\Core\Identifiers\UuidV2;
use Apie\Core\Identifiers\UuidV3;
use Apie\Core\Identifiers\UuidV4;
use Apie\Core\Identifiers\UuidV5;
use Apie\Core\Identifiers\UuidV6;
$faker = \Faker\Factory::create();
$anyUuid = Uuid::createRandom();
$uuid1 = UuidV1::createRandom();
$uuid2 = UuidV2::createRandom($faker);
$uuid3 = UuidV3::createRandom($faker);
$uuid4 = UuidV4::createRandom();
$uuid5 = UuidV5::createRandom($faker);
$uuid6 = UuidV6::createRandom();
Ulids
Ulids are a variation that is gaining more and more traction because it deals better with some of the shortcomings of uuid's.
The full specs of ulids can be found on GitHub.
ULID's are also 128 bits long, but the bits are grouped better. The first 48 bits are used for the timestamp in milliseconds and the remaining 80 bits are random data. If 2 id's are created in the same millisecond there are still 2^80 possible ulids in that millisecond. The order is not guaranteed if they were created in the same millisecond. A ulid also reveals the time at which the id was generated.
Just as with UUID's, ULID's have several formats to display them as text. The text is shorter than a UUID because it uses an alphabet of more than 16 characters. The most common format is base32 encoded, for example: 01AN4Z07BY79KA1307SR9X4MV3.
- Another format is base58, for example: 1BKocMc5BnrVcuq2ti4Eqm
- Or formatted like a uuid according to rfc4122: 0171069d-593d-97d3-8b3e-23d06de5b308
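The layout described above (48 bits of timestamp followed by 80 random bits) can be sketched in a few lines of plain PHP. The function name `generateUlid` and its optional timestamp parameter are assumptions for this example; the alphabet is the Crockford base32 alphabet from the ulid spec, which leaves out I, L, O and U.

```php
<?php
declare(strict_types=1);

const CROCKFORD_ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

/** Generates a 26-character base32 ulid; the timestamp can be passed in for testing. */
function generateUlid(?int $timestampMs = null): string
{
    $time = $timestampMs ?? (int) floor(microtime(true) * 1000);

    // 48-bit timestamp: 10 characters of 5 bits each, most significant first.
    $timePart = '';
    for ($i = 0; $i < 10; $i++) {
        $timePart = CROCKFORD_ALPHABET[$time & 31] . $timePart;
        $time >>= 5;
    }

    // 80 random bits: 16 characters of 5 bits each.
    $randomPart = '';
    for ($i = 0; $i < 16; $i++) {
        $randomPart .= CROCKFORD_ALPHABET[random_int(0, 31)];
    }

    return $timePart . $randomPart;
}

$older = generateUlid(1000);
$newer = generateUlid(2000);
```

Because the timestamp comes first, a ulid created in a later millisecond always sorts after an earlier one as plain text, which is what makes it friendlier to a database index.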
Advantages:
- Better suited for storing and searching in the database than a UUID, while still keeping the same UUID benefits.
- Needs fewer characters to display than a UUID.
Disadvantages:
- Always leaks the timestamp.
- Just as with uuid's, you still get the best performance if you store the id as a binary number and not in any text format.
In Apie we can easily use the Ulid class to get a ulid in rfc4122 format:
use Apie\Core\Identifiers\Ulid;
$id = Ulid::createRandom();
Slugs
Slugs are small unique texts often created from the object itself. For example a books database containing a book like 'De ontdekking van de hemel' could have a slug like 'de-ontdekking-van-de-hemel' and use this in the url like /my-resource/de-ontdekking-van-de-hemel
The biggest benefit is that the url is very 'user-friendly' as it often describes the product. The downside is that uniqueness is not guaranteed. So we need to find an unused slug before storing it, by appending things like '-2', '-3', or by allowing the user to enter a slug manually.
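A minimal sketch of both steps in plain PHP: turning a title into a slug and appending '-2', '-3', ... until an unused one is found. The function names `slugify` and `uniqueSlug` are made up for this example, and the list of taken slugs stands in for what would normally be a database lookup backed by a unique index. It only handles ASCII letters and digits.

```php
<?php
declare(strict_types=1);

/** Lowercases the text and replaces every run of other characters with a dash. */
function slugify(string $text): string
{
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($text));

    return trim($slug, '-');
}

/**
 * Returns the slug for $text, with '-2', '-3', ... appended when it is taken.
 *
 * @param string[] $takenSlugs slugs that are already in use
 */
function uniqueSlug(string $text, array $takenSlugs): string
{
    $base = slugify($text);
    $slug = $base;
    for ($suffix = 2; in_array($slug, $takenSlugs, true); $suffix++) {
        $slug = $base . '-' . $suffix;
    }

    return $slug;
}

$slug = uniqueSlug('De ontdekking van de hemel', ['de-ontdekking-van-de-hemel']);
```

With the title already taken once, `$slug` becomes 'de-ontdekking-van-de-hemel-2'.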
Advantages:
- Results in user-friendly URL's or can be entered manually by the user.
Disadvantages:
- Hard to keep a slug unique, not every user wants to enter one manually.
In Apie I have different slug classes for writing slugs in a specific case: pascal case, camel case, kebab case and snake case. They can also be converted into each other with toXxx methods:
use Apie\Core\Identifiers\CamelCaseSlug;
use Apie\Core\Identifiers\KebabCaseSlug;
use Apie\Core\Identifiers\PascalCaseSlug;
use Apie\Core\Identifiers\SnakeCaseSlug;
$id = new CamelCaseSlug('exampleOfSlug');
$id = new KebabCaseSlug('example-of-slug');
$id = new PascalCaseSlug('ExampleOfSlug');
$id = new SnakeCaseSlug('example_of_slug');
$id = $id->toPascalCaseSlug();
JWT's
You might think: how can you use a JWT as an identifier? In most cases you will not need a JWT as an identifier.
In a previous project I had to deal with the PowerDNS API. This API has a very annoying way of updating DNS records: you have to provide the previous DNS record and tell PowerDNS which other DNS record it should change into. There was no id column. Since our API was just a proxy in front of it that works with ID's, we used a JWT as ID to deal with this situation. We could decode and verify the JWT to get back a valid DNS record.
Since the JWT contains the data, you can make this stateless. For example, you could use it to make your own RPG game without a database that tracks everybody's status.
Advantages:
- You could make a stateless API without a database this way, as you can always assume that a verified JWT was created by you in the past.
- Could be a simple solution if you want a REST API in front of a different API that has no identifiers.
Disadvantages:
- If you lose the private key you can no longer use any of the previously communicated identifiers.
- The id can be very large if the dataset is large, as you communicate the current dataset indirectly with the JWT.
- A JWT can expire. If you ignore the expiry date instead, an id that worked once will always work.
- ID's change after the data changes, which makes no sense for PUT or PATCH requests in a REST API.
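To illustrate the idea, here is a minimal sketch in plain PHP that packs a DNS record into a signed JWT and gets it back out. It assumes HS256, so a single shared secret is used for both signing and verifying instead of a private key, and it skips expiry handling. The function names and the record fields are made up for this example; in production you would use an established JWT library.

```php
<?php
declare(strict_types=1);

function base64UrlEncode(string $data): string
{
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

function base64UrlDecode(string $data): string
{
    // Restore the padding that base64UrlEncode() stripped.
    $padded = $data . str_repeat('=', (4 - strlen($data) % 4) % 4);

    return (string) base64_decode(strtr($padded, '-_', '+/'), true);
}

/** Turns a record into a signed JWT that can be used as its id. */
function encodeJwtId(array $record, string $secret): string
{
    $header = base64UrlEncode(json_encode(['alg' => 'HS256', 'typ' => 'JWT']));
    $payload = base64UrlEncode(json_encode($record));
    $signature = base64UrlEncode(hash_hmac('sha256', $header . '.' . $payload, $secret, true));

    return $header . '.' . $payload . '.' . $signature;
}

/** Returns the record inside the id, or null if the signature does not match. */
function decodeJwtId(string $token, string $secret): ?array
{
    $parts = explode('.', $token);
    if (count($parts) !== 3) {
        return null;
    }
    [$header, $payload, $signature] = $parts;
    $expected = base64UrlEncode(hash_hmac('sha256', $header . '.' . $payload, $secret, true));
    if (!hash_equals($expected, $signature)) {
        return null;
    }
    $record = json_decode(base64UrlDecode($payload), true);

    return is_array($record) ? $record : null;
}

$record = ['name' => 'www.example.org.', 'type' => 'A', 'content' => '192.0.2.10', 'ttl' => 3600];
$id = encodeJwtId($record, 'my-secret');
$decoded = decodeJwtId($id, 'my-secret');
```

The id is already well over a hundred characters for this tiny record, which shows the size disadvantage from the list above.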
Snowflake id's
Snowflake ID's were hardly used until big companies like Twitter and Stripe adopted them for their API's. They are strings that contain the information a server needs to find the record, but they are also written in such a way that they are pleasant for a user to see or remember. There is no standard format, but a snowflake id is often built something like this:
'<standard prefix><server_id><autoincrement_integer><random code>'
Twitter and Stripe are such big companies that they have more than one server creating new records. Every server gets its own server id when it is started, so there is no possibility of a collision. Internally every server just autoincrements the number of objects it has created. It could store this counter in a local sqlite database, for example, or on the file system with dummy files. The random code comes at the end so it does not give trouble with indexing, but it does fix the predictability of the id's.
The standard prefix is a way to distinguish the type of an id. For example, any id starting with 'empl' is an employee id. Of course the side effect is that the type shows up twice if proper URI's are being created. For example, a proper url would be something like /employees/empladaf45d0001afwqpfmw13411
Because all employee id's start with empl, every such url will begin with /employees/empl.
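As a sketch of the format above, the following plain PHP builds and parses such an id. Since there is no standard, everything here is an assumption made for illustration: the function names, the fixed widths (4 hex characters for the server id, 8 for the counter, 8 random hex characters) and the idea that the counter is passed in by the caller.

```php
<?php
declare(strict_types=1);

/** Builds '<prefix><server id><counter><random code>' with fixed-width hex parts. */
function createPrefixedId(string $prefix, int $serverId, int $counter): string
{
    // Assumes $serverId fits in 16 bits and $counter in 32 bits.
    return sprintf('%s%04x%08x%s', $prefix, $serverId, $counter, bin2hex(random_bytes(4)));
}

/** Returns the server id and counter, or null if the id has the wrong prefix or shape. */
function parsePrefixedId(string $id, string $prefix): ?array
{
    $pattern = '/^' . preg_quote($prefix, '/') . '([0-9a-f]{4})([0-9a-f]{8})([0-9a-f]{8})$/';
    if (preg_match($pattern, $id, $matches) !== 1) {
        return null;
    }

    return ['serverId' => hexdec($matches[1]), 'counter' => hexdec($matches[2])];
}

// Server 3 creates its 42nd employee.
$employeeId = createPrefixedId('empl', 3, 42);
$parts = parsePrefixedId($employeeId, 'empl');
```

Parsing the same id with a different prefix, such as 'cust', returns null, which is how the prefix prevents id's from being mixed up.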
Advantages:
- Very fast id generation and searching in distributed servers
- Indexes well enough in the database.
- Not possible to mix up the wrong id's in the wrong situation because of the static prefix.
Disadvantages:
- There is no standard how a snowflake id is being constructed
- Could leak information, for example the autoincrement integer part.
- Id's are never short in the url and in the database.
Conclusion
There is no 'best' solution for which id to use. We can see there are basically four different types of id's:
- id's that are stored in the database most efficiently (hashed id, autoincrement integer)
- id's that are generated efficiently in a distributed system (uuid, ulid)
- id's that are generated from the data/server configuration in a distributed system (jwt, slugs, snowflake id)
- id's that are pleasant for end users to remember/see in a url (snowflake id, slugs)