NcNLabs – Digging deeper, looking for exploits and malware in dark nets with scrapy

Written by  on April 29, 2018 

Last month I was accepted to give a talk-workshop on open source intelligence at NcNLabs in Barcelona. NcNLabs is an initiative of noconname, the oldest computer security conference in Spain. What I wanted to show there is how easy it can be to monitor and analyze websites and applications with simple tools, even if those sites are underground hacking communities on the dark web. I also wanted to talk about how useful it can be to scan online sites to gather threat intelligence, and about the role of threat intel in an organization’s overall security strategy. I decided to use scrapy and show an example of how to extract threat intel from underground hacking forums.

Github repos:

Clone them before we start 🙂


So scrapy is really simple to use. We can play with it easily using its shell. The scrapy shell is really useful for, for example, testing our expressions (CSS or XPath) before putting them into the bot, since launching a full scraper bot is a slow and complicated way to check whether an expression works. We start the shell with:

scrapy shell

Now that we are in a scrapy shell, the thing to know is that scrapy is simple at heart: it downloads HTML from a web server and parses it for content, as HTML is just raw text. We can fetch the HTML of some URI with:

fetch("url")

And if we’ve done it right, we get our HTML and it gets stored in the “response” variable. We can run


To see what we have (in the web browser)

And if we just want to see the full text that scrapy is about to parse we just do:

print(response.text)

It will show the full HTML. Then, using CSS or XPath selectors, we’ll parse it for data.

Let’s test our first expression:


Let’s analyze our expression. With .css we apply a CSS expression to our response and look for content matching it, and with extract we pull out the matching text. With .title we look for the HTML elements that have the “title” class, and with ::text we take the text inside the tag. So with <div class="title"> SomeTitle <… we’ll get “SomeTitle”. If we test this expression we basically get all the titles of the posts in the “dankmemes” subreddit.

And that’s it. The other way to get custom content is to use XPath selectors. A selector may look like this:


That selector does basically the same thing. As this is not a general scrapy course or tutorial, I’ll stop right here. Look for tutorials or the documentation if you want to go deeper with selectors and CSS.

For real work with scrapy we have to create our own project. So we do:

scrapy startproject demo1

And with that we have our own project. We can define and write custom bots that look for different kinds of content and perform different operations on it: storing it in databases, calling certain functions when certain content is detected, and more. It is whatever you want, as in a regular python project, but with the features of scrapy.

If we look at the structure of our project we can see:

So basically, in the settings file we define some general properties of our project, such as database information, proxy options and so on; in the middlewares file we define the middleware programs and actions to be used; and in the pipelines file we define the actions that process the content as a pipe. For now, the really interesting stuff is in “spiders”, where we will create our robots.

A simple bot inside spiders dir may look like the one I have here:

Here we have to pay attention to the name variable, as this is the name of our bot and we’ll call it by this name. If a domain is not in allowed_domains, scrapy will not make any request to it, and start_urls are the URLs scrapy starts requesting.

Then we have the parse function. Scrapy passes it the response variable (from requesting the start_urls), and we work with that content in parse. Here I’ve used a couple of expressions:

titles = response.css('.title.may-blank::text').extract()

url = response.css('.title.may-blank::attr(href)').extract()

These are easy to understand. In titles we look for a couple of classes, as I explained before. In url we look for the title class and the may-blank class on the same tag, and then we take what’s inside the href attribute of the tag, so we get the URL of the post.

We can run our bot with:

scrapy crawl botname

We can do a lot with scrapy. In the previous example we were able to print something to the screen, but this is pretty useless. What if we want to keep a CSV record of those sexy memes for later use or analysis?

We’ll look at the file and add the following
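For reference, CSV export in scrapy is normally configured through the feed export settings. A minimal sketch of such a settings fragment, assuming Scrapy 2.1 or later and the dankmemes.csv file name used in this post, could be:

```python
# settings.py (fragment)

# Scrapy 2.1 and later: feed exports are declared with the FEEDS setting.
# Each key is an output path, each value holds the options for that feed.
FEEDS = {
    "dankmemes.csv": {"format": "csv"},
}

# Older releases used two separate settings instead:
# FEED_FORMAT = "csv"
# FEED_URI = "dankmemes.csv"
```

For anything to land in the file, the spider has to yield items (for example plain dicts) from its parse function rather than just printing them.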

And then we edit our bot like this:

After running that we see a dankmemes.csv file in our project folder:

But as I said before, we can do a lot with scrapy as it’s just python! For example, we can import libraries like wget and just download those memes!

What we see in this other example is that we added a second function alongside our parse function. This is common practice. In the first function we access the subreddit posts; in the second we visit every post found and download the image.

We enter the post with:

yield scrapy.Request(url=posturl, callback=self.scan_post)

We search for the image using:

img = response.css('img.preview::attr(src)').extract()[0]

And we test that with:


So, now that we know the very basics of how scrapy works, let’s test it with a more complete example. As security researchers we can use scrapy for, for example, parsing the web for hidden content, looking for webapp versions, searching for e-mail addresses or information leaks, and much more. I’ve decided to use it as a monitor for dark web forums as a proof of concept.

What I did here was pick a forum and access its website. I wanted to extract information about the latest posts, so first I created a user on the forum. We can tell scrapy to log in to web applications, so it can parse internal pages, with:

return [FormRequest.from_response(response,
formdata={'username': username, 'password': password,

It’s that easy. After that, scrapy will act as a registered user on the forum.

Then after the login I wanted to get subforum information:


I created filters for subforums with:

subforums = '//td/strong/a/@href'

Then I looked for posts with:

threads = '//tr/td/div/span/a/@href'

And then I inspected posts with:

pagtitle = '//title/text()'
postdate = '//td[@style="white-space: nowrap; text-align: center; vertical-align: middle;"]/span[@class="smalltext"]/text()'

But at that point I realized that the forum is hosted on a dark site (it’s also accessible through the clear net, but the risks of running this kind of test from your clear IP are considerable), so I needed to tell scrapy to use tor.

This can be done by editing the config file like this:

And the middlewares file like this:
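A middleware for this could look roughly like the sketch below. It rests on an assumption that is not spelled out in this post: scrapy’s built-in proxy support speaks HTTP, while tor exposes a SOCKS port (9050 by default), so an HTTP-to-SOCKS bridge such as Privoxy (default port 8118) sits in between. The class name and the module path in the comment are hypothetical.

```python
from types import SimpleNamespace

# Assumption: Privoxy (or a similar bridge) listens here and forwards to tor.
TOR_HTTP_PROXY = "http://127.0.0.1:8118"


class TorProxyMiddleware:
    """Hypothetical downloader middleware that proxies every request."""

    def process_request(self, request, spider):
        # Scrapy's downloader honours the "proxy" key in request.meta.
        request.meta["proxy"] = TOR_HTTP_PROXY
        return None  # let scrapy continue processing the request


# The settings file would then enable it, for example:
# DOWNLOADER_MIDDLEWARES = {"demo1.middlewares.TorProxyMiddleware": 350}

# Offline check with a stand-in object instead of a real scrapy Request.
fake_request = SimpleNamespace(meta={})
TorProxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta)
```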

Also, the pipelines file goes like this:

And of course you should have tor installed, as well as a manager like vidalia. You can do that with:

apt-get install tor vidalia

If you get stuck look at some tutorial like this one:

I’ve also decided to store all the information scraped from the forum in a MongoDB database. I just wrote

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['tor1']

At the beginning of my bot file. Then “inserts” to that DB can be done with:

forumkey = {'pag_url': url}
forumdata = {'date': date, 'subforum': subforum, 'post_title': title, 'post_url': mainurl, 'pag_num': str(pagnum), 'content_hash': identifier, 'html': html}
subforums = self.db.zerodayforum
subforums.update_one(forumkey, {'$set': forumdata}, upsert=True)
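Because the upsert is keyed on pag_url, scraping the same page twice updates the existing document instead of creating a duplicate. The sketch below shows how the key and the data could be assembled. The helper name is hypothetical, and computing content_hash as a SHA-256 of the page HTML is my assumption for illustration, since the way the identifier is built is not shown here.

```python
import hashlib


def build_forum_record(url, html, date, subforum, title, mainurl, pagnum):
    """Hypothetical helper: return the (key, data) pair for the upsert."""
    # Assumption: the identifier is a SHA-256 of the raw page HTML, so a
    # changed hash on a later crawl means the page content changed.
    identifier = hashlib.sha256(html.encode("utf-8")).hexdigest()
    forumkey = {"pag_url": url}
    forumdata = {
        "date": date,
        "subforum": subforum,
        "post_title": title,
        "post_url": mainurl,
        "pag_num": str(pagnum),
        "content_hash": identifier,
        "html": html,
    }
    return forumkey, forumdata


forumkey, forumdata = build_forum_record(
    url="http://forum.example/showthread.php?tid=7&page=1",
    html="<html>hello</html>",
    date="04-29-2018",
    subforum="Exploits",
    title="Some thread",
    mainurl="http://forum.example/showthread.php?tid=7",
    pagnum=1,
)
print(forumkey, forumdata["content_hash"])

# With a pymongo collection at hand, the pair is then stored with:
# collection.update_one(forumkey, {"$set": forumdata}, upsert=True)
```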

And if we launch the bot we get:

It works!

But what do we do with all this data? What I did was write a custom API with Flask for accessing that information using JSON and HTTP requests.
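The idea can be sketched in a few lines of Flask. This is not my actual API: the /search route, the q parameter and the in-memory POSTS list (standing in for the MongoDB collection) are all assumptions made so the example runs on its own.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the MongoDB collection (made-up records).
POSTS = [
    {"post_title": "New exploit for ExampleCMS", "subforum": "Exploits",
     "post_url": "http://forum.example/showthread.php?tid=7"},
    {"post_title": "Selling ransomware builder", "subforum": "Malware",
     "post_url": "http://forum.example/showthread.php?tid=9"},
]


@app.route("/search")
def search():
    # Case-insensitive substring match on the post title.
    term = request.args.get("q", "").lower()
    hits = [p for p in POSTS if term and term in p["post_title"].lower()]
    return jsonify(hits)


# Offline check with Flask's built-in test client (no server needed).
with app.test_client() as client:
    exploit_hits = client.get("/search?q=exploit").get_json()
    empty_hits = client.get("/search").get_json()

print(exploit_hits)
```

With a real database behind it, the list comprehension would be replaced by a query against the collection.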

Here you have the whole project:

And the api:

With that whole system, searching for exploits in the dark web is just as easy as:

And of course we can search for malware as well: