DarkCrawler: Crawling Deep-Web

Deep-Web: The ultimate frontier. The invisible part of the internet, blah-blah-blah. I am sure you got here because you know exactly what the deep-web is. What you want to know is how far this can go and how much you can learn from it.

You have to agree with me: the deep-web is an ocean of information. There are many ways to explore it; I have tried most of them and they didn’t work for me. So I developed DarkCrawler, a piece of software written in Java that takes ranges of IPs, builds a pool of addresses, visits each one and saves part of the response. I wrote it back at university as a side project because I was bored to death. It is still an excellent tool to play with, so I start it up every now and then.

The crawler uses raw TCP sockets for its requests. The port is configurable, so anything from an SSH server to a web server or a telnet service can be recognized with a single TCP request. Communicating at the transport layer has drawbacks: there is waiting time that could otherwise be avoided, and the crawler gets no built-in handling of higher-level protocols. I was too lazy to package it as a jar file, as a proper tool would be, because I only wanted to get the scanner up and running.
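To make the single-request idea concrete, here is a minimal Python sketch (not the crawler's actual Java code) that opens one TCP connection, optionally sends a probe, and keeps only the first bytes of the reply. The name `grab_banner`, its parameters and their defaults are assumptions made for illustration.

```python
import socket


def grab_banner(host, port, probe=b"", max_bytes=256, timeout=3.0):
    """Connect over TCP, optionally send a probe, return up to max_bytes of the reply.

    Hypothetical helper: returns b"" when the connection fails or nothing
    arrives before the timeout.
    """
    try:
        # create_connection applies the timeout to connect() and to later reads
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if probe:
                sock.sendall(probe)
            return sock.recv(max_bytes)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts
        return b""
```

A service such as SSH usually speaks first, so an empty probe is enough. A web server waits for the client, so it needs a probe such as `b"HEAD / HTTP/1.0\r\n\r\n"`. When the wrong choice is made, the call simply sits until the timeout expires, which is one source of the waiting time mentioned above.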

The crawler can save its results and upload them to a database through ODESA, DarkCrawler’s search engine. Once the crawl finishes, the user can query ODESA for specific strings.

Find the vein:
The crawler needs a list of IP ranges to start crawling. A good place to find some is https://www.ip2location.com/. Here is a simple script I have used to download ranges from ip2location and build the file that feeds the crawler:

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"}

r = requests.get("https://lite.ip2location.com/...your-country...-ip-address-ranges", headers=headers)
r.raise_for_status()  # stop early on an HTTP error

# The ranges are listed as rows of the page's table body
soup = BeautifulSoup(r.text, "lxml")
tbody = soup.find("tbody")
rows = tbody.find_all("tr")

# Write one "start-stop" range per line; the file is closed automatically
with open("_ranges.txt", "w") as f:
    for row in rows:
        cols = row.find_all("td")
        start = cols[0].text.strip()
        stop = cols[1].text.strip()
        iplen = int(cols[2].text.replace(",", ""))
        # Keep only ranges with at least 128 addresses
        if iplen >= 128:
            print(start, stop, iplen)
            f.write(start + "-" + stop + "\n")

$ java DarkCrawler .txt
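The script above writes one `start-stop` pair per line. As a rough sketch of how such a file could be expanded into a pool of addresses, here is a Python version using the standard `ipaddress` module. The crawler itself is written in Java, and the function names `parse_range` and `iter_pool` are hypothetical, so treat this as an illustration of the idea only.

```python
import ipaddress


def parse_range(line):
    """Parse one 'start-stop' line into a pair of IPv4Address objects."""
    start, stop = line.strip().split("-")
    return ipaddress.IPv4Address(start), ipaddress.IPv4Address(stop)


def iter_pool(lines):
    """Yield every address of every range, both endpoints included."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        start, stop = parse_range(line)
        # IPv4 addresses convert to integers, so a range is a simple count
        for value in range(int(start), int(stop) + 1):
            yield str(ipaddress.IPv4Address(value))
```

Passing an open file works because a file object iterates line by line, for example `for ip in iter_pool(open("_ranges.txt")): ...`. Using a generator keeps memory flat even when a range holds millions of addresses.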

The crawler starts by reading every .txt file in the folder Ranges. It then parses each file and starts scanning. Make sure the configuration file (main.cfg) is set up before letting it run. Through the configuration file you can set the number of threads in the thread pool, how many bytes of each response to save, and more. The crawler saves its results in the folder Data. Using the reader tool we can read some of the results.

$ java reader 0 Data/FUX-YPort80_0.bin

Here are some of the results:

9X.184.219.XXX alphapd
9X.184.208.XXX alphapd
9X.184.203.XXX webserver/1.0
9X.184.203.XXX Microsoft-IIS/6.0
9X.184.203.XXX DNVRS-Webs
9X.184.198.XXX cisco-IOS
9X.184.193.XXX lighttpd/1.4.35
9X.184.193.XXX nginx/1.10.2
9X.184.208.XXX alphapd/2.1.8
9X.184.203.XXX AvigilonGateway/1.0 Microsoft-HTTPAPI/2.0
9X.184.203.XXX NET-DK/1.0

The exact addresses are masked for obvious reasons. Use Shodan if you want more.

What you see here is a set of IP cameras, an old version of Microsoft's web server (I smell working exploits), a Cisco router, more CCTV cameras, and other services I don’t recognize from the signature alone.

Pushing the results into ODESA lets me navigate them gracefully. The ODESA frontend shows the port, title, headers, the server identification, the country where the IP is registered (ISP), and a possible fingerprint if one is found (using the SHA-1 of the response). The backend is written in PHP/MySQL and the frontend is built in pure JavaScript/CSS/HTML. If you start about Angular, Spring or anything like that, go fuck yourself; I was developing this while I was learning Java.
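For a feel of what the server identification and the fingerprint could look like, here is a small Python sketch that works on the saved bytes of one response. It is not ODESA's code: the names `server_header` and `fingerprint` are hypothetical, and hashing the whole saved response is an assumption, since only "sha1 of the response" is known.

```python
import hashlib


def server_header(raw):
    """Return the value of the Server header in a raw HTTP response, or None."""
    head = raw.split(b"\r\n\r\n", 1)[0]  # keep only the header section
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"server":
            return value.strip().decode("latin-1")
    return None


def fingerprint(raw):
    """SHA-1 hex digest of the saved response bytes (assumed fingerprint input)."""
    return hashlib.sha1(raw).hexdigest()
```

Two devices that return byte-identical responses get the same digest, which makes it easy to group them. Responses that embed a date or a session value will differ on every request, so a fingerprint like this only matches static banners.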

Using ODESA I am able to search through hundreds of thousands of responses.

Here are some of my findings for a sample of 20k addresses:

Many HIKVISION IP Cameras – Default Password??
Solar-Power Control System
Audio Interface of RX-V673 – In Total Control, no password
Philips Home Automation Control System
Wireless Broadband Platform – Personal Data removed (name, location, mac address)

I was able to register and log in to a national service where I shouldn’t have been able to. I also found pfSense firewalls, Cisco routers, many DVRs, DiskStations, collaboration platforms, email server login pages, a Dreambox admin interface, router admin interfaces, a plant monitoring system, and a Paradox alarm system panel.

The thing with DarkCrawler is that you don’t have restrictions. Firstly, I have unlimited access to my data. Secondly, the founder of Shodan admitted publicly that the CIA and other parties filter the results. My results are genuine and uncensored.