How to scrape website metadata using LinkThumbnailer

LinkThumbnailer has come a long way since its first release in 2012. The gem lets you scrape a website's metadata, such as its title, description, images and even embedded videos. However, the real value of the gem is its ability to return the description that best represents what the website is about. Let's see how this works.


Today's websites have a lot of valuable content. There are now standards that help developers organize their data by enclosing it in logical and meaningful containers. For example, you would expect the descriptive content of a website to sit between <p> HTML tags, perhaps with a .description class. Unfortunately, those standards are only followed at the goodwill of the developer. There is nothing stopping you from creating a working website using only table tags.

Writing automated scraping scripts that work for all possible websites is a difficult task. You have to pay attention to the semantics of how websites are built and make some tradeoffs.

LinkThumbnailer's goal is to return a website's metadata, such as its title, description and images. Images are ranked against each other so that the ones the gem thinks are most relevant come first. The same goes for the description: the gem embeds a custom algorithm that helps detect the most relevant description of the website.

Let's see how the algorithm works to find the best matching description of a given website.

Description Length

What is the ideal description length? After some experiments, it seems that a good description length is between 100 and 120 characters. To illustrate that visually, let's look at some lorem ipsum text of various lengths:

Here is a text of length 56:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Here is a text of length 85:

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Here is a text of length 117:

Pellentesque lobortis, lacus vitae tristique tincidunt, velit sem posuere dolor, luctus iaculis sapien ex vitae orci.

Here is a text of length 143:

Fusce lobortis lacinia magna at placerat. Pellentesque non augue ac purus volutpat mattis sit amet at tortor. In hac habitasse platea dictumst.

Visually, the texts with a length of 85 and 117 are more likely to be good candidates for a sufficiently descriptive website description.

Starting from this idea, the gem ignores any string shorter than 50 characters. For all the other strings, we need a strategy to rank them against each other.
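The length filter can be sketched in a few lines of Ruby. This is only an illustration of the idea, not LinkThumbnailer's actual code: the constant and method names below are hypothetical, and the gem's real implementation may differ.

```ruby
# Hypothetical names, chosen for illustration only.
MIN_DESCRIPTION_LENGTH = 50 # threshold mentioned in the text

# Keep only the strings that are long enough to be graded.
def candidate_descriptions(strings)
  strings.map(&:strip).reject { |text| text.length < MIN_DESCRIPTION_LENGTH }
end

texts = [
  "Home",
  "Lorem ipsum dolor sit amet.",
  "Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus."
]

candidates = candidate_descriptions(texts)
puts candidates.length
```

Only the last string survives the filter, since the first two are shorter than 50 characters.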

This is where it gets tricky. Let's imagine we have a website with these two descriptions:

Pellentesque lobortis, lacus vitae tristique tincidunt, velit sem posuere dolor, luctus iaculis sapien ex vitae orci.

Fusce lobortis lacinia magna at placerat. Pellentesque non augue ac purus volutpat mattis sit amet at tortor. In hac habitasse platea dictumst.

The first one has 117 characters and the second one has 143. The first one seems more relevant because its length is closer to the ideal range. In other words, descriptions with a length close to 100 to 120 characters are more likely to be meaningful.

Back to Math 101: there is a great tool we can use in this situation, the reversed Gaussian function. Let's look at its shape:

The x axis represents the number of characters and the y axis represents a score. The lower the better. Now all we have to do is translate that graph along the x axis from 0 to 120 so that it is centered on 120. Here is what we end up with when zooming in on the interesting part:

As the number of characters gets closer to 120, the score gets closer to 0, meaning that the ideal description length is considered to be 120 characters here. A description of 119 characters would still be considered better than one of 140.

Let's have a quick look at the equation in Ruby:

Math.sqrt(2.0 * Math::PI ** 2) * Math.exp(-(description_length - 120.0) ** 2 / 2.0 * 0.005 ** 2)  
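To experiment with this expression, it can be wrapped in a small method. The sketch below assumes a hypothetical method name, length_score, and a hypothetical constant, IDEAL_LENGTH; neither is part of LinkThumbnailer's public API. Note that, as written, the expression returns its largest value at exactly 120 characters and falls off symmetrically on both sides.

```ruby
# Hypothetical wrapper around the expression above, for illustration only.
IDEAL_LENGTH = 120.0

def length_score(description_length)
  distance = description_length - IDEAL_LENGTH
  # Explicit parentheses: square the distance first, then negate it.
  Math.sqrt(2.0 * Math::PI ** 2) * Math.exp(-(distance ** 2) / 2.0 * 0.005 ** 2)
end

puts length_score(120)                       # value at the center of the curve
puts length_score(100) == length_score(140)  # same distance from 120, same score
puts length_score(85) < length_score(117)    # 117 is closer to 120 than 85
```

The explicit parentheses around the squared distance do not change the result. They only make the order of operations visible, since Ruby applies ** before the unary minus.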

Now that we have a score for each description length, the gem can compute the probability for the description to be considered good. This is done with the following code:

x = "some text".length
# Gaussian-shaped score centered on 120 characters
y = Math.sqrt(2.0 * Math::PI ** 2) * Math.exp(-(x - 120.0) ** 2 / 2.0 * 0.005 ** 2)

# Normalize by the score obtained for x = 120
p = y / 4.442882938158366

The 4.442882938158366 value is just the score computed for x = 120 characters. Let's compute the probability for our previous examples:

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Probability of being good: 98.48%

Pellentesque lobortis, lacus vitae tristique tincidunt, velit sem posuere dolor, luctus iaculis sapien ex vitae orci.

Probability of being good: 99.98%

Fusce lobortis lacinia magna at placerat. Pellentesque non augue ac purus volutpat mattis sit amet at tortor. In hac habitasse platea dictumst.

Probability of being good: 99.34%
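The three examples can also be ranked programmatically. In the sketch below, length_probability is a hypothetical helper name, not the gem's API. Because the constant factor cancels out when the score is divided by its value at 120 characters, the normalized probability reduces to the exponential term alone.

```ruby
# Hypothetical helper, for illustration only.
# Normalized score: the leading constant cancels, leaving the exponential.
def length_probability(text)
  Math.exp(-((text.length - 120.0) ** 2) / 2.0 * 0.005 ** 2)
end

descriptions = [
  "Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.",
  "Pellentesque lobortis, lacus vitae tristique tincidunt, velit sem posuere dolor, luctus iaculis sapien ex vitae orci.",
  "Fusce lobortis lacinia magna at placerat. Pellentesque non augue ac purus volutpat mattis sit amet at tortor. In hac habitasse platea dictumst."
]

# Highest probability first.
ranked = descriptions.sort_by { |text| -length_probability(text) }

ranked.each do |text|
  puts format("%3d characters: %.4f", text.length, length_probability(text))
end

best = ranked.first
```

The ranking depends only on the distance from 120 characters, so the order is 117, then 143, then 85.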

Here we can see that the description with a length of 143 is considered better than the one with a length of 85, because 143 is closer to 120. The winner is the one with a length of 117. Just for the demo, let's look at a final one:

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Pellentesque lobortis, lacus vitae tristique tincidunt, velit sem posuere dolor, luctus iaculis sapien ex vitae orci. Maecenas orci leo, condimentum vel felis in, volutpat mollis tellus. Donec congue arcu eu porta tincidunt. Vestibulum ullamcorper gravida magna. Donec ac suscipit lorem, ac feugiat sapien. Morbi vulputate varius dui, eget fermentum sem pulvinar non. Praesent sit amet quam nec dolor tristique efficitur. Cras luctus odio nec tellus dapibus, vitae molestie augue bibendum. Mauris feugiat tellus eget nibh auctor rutrum. Nulla euismod ex et lacus dapibus mollis. Aenean malesuada, est in maximus condimentum, nulla augue tincidunt mauris, a viverra diam justo id velit. Suspendisse ut odio a dolor lobortis semper. Aliquam laoreet turpis ornare, fermentum erat ut, efficitur ligula.

This text has 884 characters. Its probability of being good is 0.06%.

This is another great characteristic of the Gaussian function: you can choose how steep you want it to be. Here, the longer the text is, the lower its probability gets.
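To see the effect of steepness, the 0.005 constant from the equation can be turned into a parameter. This is a sketch under assumptions: the keyword name steepness is hypothetical, and the alternative values are arbitrary picks for comparison, not settings the gem exposes.

```ruby
# `steepness` is a hypothetical name for the 0.005 constant in the equation.
def length_probability(length, steepness: 0.005)
  Math.exp(-((length - 120.0) ** 2) / 2.0 * steepness ** 2)
end

# Compare how harshly an 884-character text is penalized.
[0.002, 0.005, 0.01].each do |steepness|
  probability = length_probability(884, steepness: steepness)
  puts format("steepness %.3f: %.6f", steepness, probability)
end
```

A larger value makes the curve narrower, so long texts drop toward zero faster, while a text of exactly 120 characters keeps a probability of 1 in every case.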

What we've learned

We learned how LinkThumbnailer's length grader works and how it finds the best description length. We saw that combining a Gaussian curve with probabilities is a great way to rank the lengths of website descriptions against each other. Of course, LinkThumbnailer has many other techniques built in. It will, for example, compute the link density, the position of the text in the website and much more. I strongly recommend you give the gem a try if you ever need to scrape website metadata for your business. Feel free to head over to the demo to give it a try.
