How to Make a Simple Website Crawler

TheBitDoodler's Byte
3 min read · Nov 12, 2019


“If you spend more on coffee than on IT security, you will be hacked. What’s more, you deserve to be hacked”

― Richard Clarke

When I started exploring the cybersecurity field, I had zero knowledge of how things work. Going from zero to building a website crawler was as satisfying as finding a major bug in a web application after sleepless nights.

I’ll share my experience, which may help cybersecurity enthusiasts who are new to this field, like I was a year ago. I will try to keep it very clear.

There are countless websites nowadays, and each one can expose a large number of URLs and subdomains.

As a bug bounty hunter or web application penetration tester, you need to know about all the URLs associated with a particular website, and it is not practical to open a browser and manually type every possible combination of URLs that might exist. This is where a web crawler comes in.

I’ve written this web crawler in Python. As we all know, Python is easy to understand and comes with a bundle of libraries that make the job easier, though here I’m only using the requests module.

To start with the coding part, we first have to understand how the web crawler works. Let’s visualize it.

The idea is that we pass a test URL to a user-defined function named response(). Now the question is: what will the test URL be? It will be a word from a wordlist of likely subdomain names, combined with the experimental (target) URL.

e.g. www/mail/… + xyz.com

This test URL is passed as an argument to the response() function, which makes a request for that particular link. If the subdomain exists, the function returns a valid response; otherwise the invalid subdomain is discarded (for this I’ve used exception handling).
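As a concrete (hypothetical) example, with a tiny three-word list and the xyz.com domain from above, the test URLs would be built like this:

# Hypothetical illustration: building test URLs from a small wordlist
# and the experimental domain from the example above (xyz.com).
subdomain_words = ["www", "mail", "ftp"]   # placeholder wordlist entries
target_URL = "xyz.com"                     # placeholder target domain

for word in subdomain_words:
    test_URL = word + "." + target_URL
    print(test_URL)   # www.xyz.com, mail.xyz.com, ftp.xyz.com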

First of all, we have to import the requests module:

import requests

After this, let’s create the response() function:

def response(URL):
    # Try to request the test URL over HTTPS; if the host does not
    # exist or the URL is malformed, fall through and return None.
    try:
        return requests.get("https://" + URL)
    except requests.exceptions.ConnectionError:
        pass
    except requests.exceptions.InvalidURL:
        pass

In the function above, the request is made, and the requests.exceptions.ConnectionError and requests.exceptions.InvalidURL exceptions are silently passed so that we do not have to bother about invalid URLs; for those, the function simply returns None.
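As a quick sanity check (the hostnames here are only illustrative), the function returns a Response object when the host resolves and None when the exception handler swallows the error:

# Quick check of the behaviour (hostnames are only illustrative).
res = response("www.example.com")
print(res)   # e.g. <Response [200]> if the host resolves

res = response("no-such-subdomain.example.com")
print(res)   # None -- the ConnectionError was swallowed by the except block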

Now we have to open the subdomain wordlist and concatenate each word with the experimental (target) URL. To do so, follow the code below:

with open("/root/Downloads/subdomains.list","r") as wordlist:
for line in wordlist:
word = line.strip()
test_URL = word + "." + target_URL
res = response(test_URL)
if res:
print("[+] Discovered a new subdomain -->" + test_URL)

I hope you’ve understood the basic concept behind a web crawler.

To download the full project, go to my GitHub profile.

Thank you very much for giving your precious time.
