Improving Popularity Rankings for Better Threat Intelligence, Part 1

Written by

Noy Aizenberg and Congcong Xing

January 13, 2023

Noy Aizenberg is a Software Engineer at Akamai Technologies.

Congcong Xing is a Data Analyst at Akamai Technologies.

Introduction

Threat intelligence is the fuel that powers the robust cybersecurity services that protect mobile devices and internet users everywhere. But today’s internet threats are sophisticated and adaptable, continuously changing their complexion to evade security defenses. This makes it difficult to deliver threat intelligence that covers the diverse threat landscape, is accurate enough to avoid false positives, and has the agility to deter fast-changing exploits. 

Security researchers need complex algorithms, infrastructure, and data to uncover and validate today’s exploits quickly. Algorithms can be created, and infrastructure can be built, but data can be hard to come by. Data availability is governed by technical, legal, commercial, and privacy considerations.  

Alexa Top 1 Million 

The Alexa Top 1 Million list of popular websites was widely considered the industry standard and was used across a large swath of internet industries to help research and develop products and systems. In December 2021, after the Alexa list had been available for more than 25 years, Amazon announced the impending cessation of the service. In May 2022, the service was retired permanently.

The importance of minimizing false positives

As it did for many companies, the announcement of the cessation of the Alexa list caught our attention. Akamai security researchers relied on the list to help identify popular domains among our users so we could make better decisions to remove false positives from blocklists. At the same time, we recognized the list's limitations.

Popular domains among internet users in different parts of the world aren’t necessarily popular in worldwide lists. Data biased toward users in the United States won’t reflect popularity among Japanese users, for example. Also, domains for DNS servers won’t appear on popularity lists that primarily rely on data based on HTTP transactions. This led to initiatives to create internal popularity lists to minimize the probability of false positives due to regional preferences.

False positives can be extremely disruptive to the user experience, which is especially important for our ISP and MNO customers. Inadvertent blocking of legitimate web resources does not make for happy subscribers or productive workers.

Akamai: global visibility and massive scale

Akamai is one of the largest providers of content delivery, compute, and security services in the industry and provides services via its massive global network. As a result, Akamai has unique visibility into internet activity and can utilize anonymized DNS and other data in its analysis of popular domains. Additional anonymized traffic from service provider partners adds hundreds of billions more queries to fuel our analytical tools.

This large volume of data, along with other data from Akamai sources, equips us to make informed inferences about domain popularity. Our global presence also allows analyses to be carried out at a regional level to incorporate nuanced differences among countries and languages — and major differences between residential and business users — into the process for allowlisting domains.  

Improving threat list accuracy

Our threat list production process has guardrails to ensure popular domains do not appear on threat lists. Every candidate entry passes a number of allowlisting checks before it is added to our lists. These checks include the domain's traffic ranking, its popularity ranking, static allowlists for high-profile domains, and so forth. We also apply unconditional allowlisting rules for domains relating to all of the major internet properties, such as Google, Facebook, and Twitter.
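To make the guardrail idea concrete, here is a minimal Python sketch of how such checks might be chained. It is not Akamai's implementation: the class name, the rank cutoff, and the example domains are all hypothetical, and only three of the checks named above (popularity ranking, static allowlist, unconditional rules) are modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AllowlistChecks:
    """Hypothetical container for allowlisting inputs; every value is illustrative."""
    popularity_rank: dict[str, int] = field(default_factory=dict)  # domain -> rank, 1 = most popular
    static_allowlist: set[str] = field(default_factory=set)        # hand-curated high-profile domains
    unconditional_suffixes: tuple[str, ...] = ()                   # major properties, always allowed
    rank_cutoff: int = 10_000                                      # assumed threshold, not from the article

    def is_allowlisted(self, domain: str) -> bool:
        domain = domain.lower().rstrip(".")
        # Unconditional rule: the domain is, or sits under, a major property.
        for suffix in self.unconditional_suffixes:
            if domain == suffix or domain.endswith("." + suffix):
                return True
        # Static allowlist: exact match only.
        if domain in self.static_allowlist:
            return True
        # Popularity check: ranked at or above the cutoff.
        rank = self.popularity_rank.get(domain)
        return rank is not None and rank <= self.rank_cutoff


def filter_threat_candidates(candidates, checks):
    """Return the candidates that no allowlist check protects, in input order."""
    return [d for d in candidates if not checks.is_allowlisted(d)]
```

In this sketch a candidate is dropped as soon as any single check protects it, which mirrors the goal of keeping popular domains off threat lists even if one data source is missing or stale.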

Our teams constantly review and enhance source data for these allowlist checks, and although the Alexa list was an important source, we have also identified the benefits of other ranking systems. To improve the accuracy of our allowlisting processes even more, several years ago we began to develop our own ranking system. 

Malicious and legitimate signals 

It’s common research practice within the cybersecurity industry to use VirusTotal (acquired by Google) as part of the analysis to determine whether a website is malicious or benign. This practice is a part of every cybersecurity endeavor because verification and validation are essential to definitively determine whether the resources being blocked are in fact malicious.

In addition to malicious signals, researchers typically obtain legitimate signals as well for all artifacts they’re studying. VirusTotal gathers malicious signals from more than 70 security vendors and legitimate signals from a smaller set of sources that provide information on the popularity of domains. This is useful because, in general, it's assumed only a very small portion of extremely popular domains are malicious, so it’s safe to start with an assumption they’re legitimate. 

The legitimate signals appearing on VirusTotal are daily lists of the most popular domains as observed by domain popularity ranking products: Alexa, Majestic, Statvoo, and Umbrella (Figure 1).

Fig. 1: VirusTotal popularity ranking product results for google.com [Note: Quantcast has not published new data since April 1, 2020.]

The different ranking products use different methods for gathering data:

  • Alexa ranks websites based on how often a website’s domain is typed in by toolbar users.

  • Majestic ranks websites based on structural properties rather than popularity with actual visitors.

  • Umbrella includes any type of domain observed at their public DNS resolvers, including internal, non-web domains. 

  • Statvoo does not offer insights about the metrics they use to create their lists.

Tranco, another provider of popularity lists, mathematically aggregates Alexa, Majestic, and Umbrella, but it does not currently appear in VirusTotal.
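As a rough illustration of what aggregating several rankings can look like, the Python sketch below applies a Dowdall-style rule, in which a domain at position r in a list earns 1/r points and the points are summed across lists. The choice of this rule is an assumption made for illustration (the text above says only that the aggregation is mathematical), and the function name and sample domains are hypothetical.

```python
from collections import defaultdict


def dowdall_aggregate(ranked_lists):
    """Combine ranked lists (best first) into one ranking.

    Assumed scoring rule: position r in a list is worth 1/r points;
    a domain missing from a list earns nothing from that list.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for position, domain in enumerate(ranked, start=1):
            scores[domain] += 1.0 / position
    # Highest score first; ties are broken by name so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    list_a = ["google.com", "youtube.com", "example.org"]
    list_b = ["youtube.com", "google.com", "cdn.example"]
    list_c = ["google.com", "cdn.example", "youtube.com"]
    print(dowdall_aggregate([list_a, list_b, list_c]))
```

A rule of this kind rewards domains that rank well on several lists, so an entry that appears on only one source tends to fall behind entries the sources agree on.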

Assessing popularity lists

Recent research has identified shortcomings of the popularity lists from Alexa, Umbrella, and Majestic, including a lack of reliability and the inclusion of entries that are malicious, invalid, internal to an organization (such as .local), or not actually popular. 
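A simple way to screen a list for the invalid and internal entries mentioned above is sketched below in Python. The suffix set is a deliberately small, hypothetical sample, and the syntax check follows ordinary hostname rules only; a production check would also validate the top-level domain against the current root zone.

```python
import re

# Hypothetical sample of suffixes commonly seen on internal networks.
INTERNAL_SUFFIXES = {"local", "localhost", "lan", "internal", "home", "corp"}

# One hostname label: 1-63 letters, digits, or hyphens, not starting or ending with a hyphen.
LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def classify_entry(name):
    """Return 'internal', 'invalid', or 'candidate' for a popularity-list entry."""
    name = name.lower().rstrip(".")
    labels = name.split(".")
    # A single label or an over-long name cannot be a public domain.
    if len(labels) < 2 or len(name) > 253:
        return "invalid"
    if labels[-1] in INTERNAL_SUFFIXES:
        return "internal"
    # Underscores and empty labels fail here; they occur in DNS but not in hostnames.
    if not all(LABEL.match(label) for label in labels):
        return "invalid"
    # An all-numeric last label means an IP address, not a domain.
    if labels[-1].isdigit():
        return "invalid"
    return "candidate"
```

Entries labeled 'candidate' have only passed a syntax screen; whether they are actually popular or benign still has to be decided by other signals.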

In a 2019 research article called Clustering and the Weekend Effect: Recommendations for the Use of Top Domain Lists in Security Research, the authors provided evidence that a phenomenon called the “weekend effect” is strongly present in the Alexa and Umbrella lists. The weekend effect shows up as changes in domain rankings between the workweek and the weekend, as shown in Figure 2.

Fig. 2: The weekend effect within the top 10 most popular domains in Alexa and Umbrella

The researchers analyzed five measures — list stability, domain extensions, invalid domains, website categories, and clusters — and reported interesting results.

  • Highly ranked domains in Alexa are more stable than in Umbrella, where changes within the top 10 domains occur on a regular basis. 

  • The weekend effect changes the geographical diversity of Alexa and Umbrella (for example, on weekends Alexa loses domains from European countries and gains domains from Russia and India).

  • The Alexa and Umbrella popularity lists are dominated by office traffic during the workweek, and leisure traffic during the weekend. 

  • Umbrella also contains domains used internally in corporate networks.

  • Alexa and Umbrella rank domains with equivalent traffic in alphabetical order.
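The workweek versus weekend shift described in these findings can be measured directly from a series of daily lists. The Python sketch below is one possible way to do so, not the method used in the cited paper; the function name and the sample domains are hypothetical.

```python
from datetime import date
from statistics import mean


def weekend_shift(daily_lists):
    """Compare each domain's mean rank on weekdays with its mean rank on weekends.

    daily_lists maps a datetime.date to that day's list of domains, best first.
    Returns domain -> (mean weekday rank, mean weekend rank) for every domain
    seen on at least one weekday and at least one weekend day.
    """
    weekday, weekend = {}, {}
    for day, ranked in daily_lists.items():
        # date.weekday() returns 5 for Saturday and 6 for Sunday.
        bucket = weekend if day.weekday() >= 5 else weekday
        for position, domain in enumerate(ranked, start=1):
            bucket.setdefault(domain, []).append(position)
    return {
        d: (mean(weekday[d]), mean(weekend[d]))
        for d in weekday.keys() & weekend.keys()
    }


if __name__ == "__main__":
    lists = {
        date(2023, 1, 13): ["office.example", "video.example"],  # a Friday
        date(2023, 1, 14): ["video.example", "office.example"],  # a Saturday
    }
    print(weekend_shift(lists))
```

A large gap between the two means for a domain suggests its rank is driven by office or leisure traffic rather than by steady popularity, which is worth knowing before using a single day's list as an allowlisting signal.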

In another research article from 2019, Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation, the researchers introduced Tranco and evaluated the problems in competitive products. Some of their findings included the following:

  • Since January 2018, the Alexa popularity list has been based on data for a single day, and half of the list changes every day.

  • Umbrella contains invalid domains and subdomains — only 49% of the entries are real.

  • Majestic contained 2,162 malicious domains.

  • Quantcast has been unavailable since April 1, 2020.
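A statement such as "half of the list changes every day" corresponds to a simple churn measurement between two consecutive daily lists. The following Python sketch shows one way to compute it; the function name and sample entries are hypothetical, and the cited paper may define stability differently.

```python
def daily_churn(previous, current):
    """Fraction of entries in `current` that did not appear in `previous`.

    Both arguments are lists of domains for consecutive days. The order of
    entries is ignored; only membership is compared.
    """
    if not current:
        return 0.0
    seen_yesterday = set(previous)
    new_entries = sum(1 for domain in current if domain not in seen_yesterday)
    return new_entries / len(current)


if __name__ == "__main__":
    monday = ["a.example", "b.example", "c.example", "d.example"]
    tuesday = ["a.example", "b.example", "x.example", "y.example"]
    print(daily_churn(monday, tuesday))
```

Tracking this value over time gives a quick sense of how stable a list is before it is used as a legitimacy signal.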

Introducing AkaRank

Research papers show today’s popularity lists don’t capture geographic differences very well. Global web properties are popular everywhere, but local languages, websites, and preferences will influence popularity locally. As a large global company, our security research teams have learned we can’t rely on popularity lists with geographical biases. 

We’ve been working to improve popularity rankings for many years to neutralize the biases of the other lists, and we are now formally introducing AkaRank, Akamai’s popularity list. In the next post, we’ll discuss how AkaRank compares with the alternatives covered in this blog. 

Summary

Threat intelligence is the fuel that powers cybersecurity. Accurate threat intelligence that avoids blocking legitimate web resources is critical to a positive and productive user experience. Security researchers need data to uncover and validate today’s exploits quickly. Popularity rank is an important factor in assessing the legitimacy of domains.

The demise of Alexa forced the industry to assess alternatives, and it’s becoming clear there are noticeable limitations in all of them. Akamai’s AkaRank popularity list improves popularity rankings to neutralize the biases of other lists, and we continuously refine it to ensure users can always access legitimate resources.

For real-time security research updates, follow us on Twitter.


