Web Crawl Like a Pro

Jared
9 min read · Aug 1, 2023


Welcome back! Let’s talk about web crawling. If there’s one thing hackers love, it’s crawling websites to find security holes to exploit. As a hacker you want to take advantage of everything that benefits you, so let’s begin. First, let’s understand what URLs are, what websites are, and how they work.

What are URLs?

URLs are like addresses for websites and online resources. Just like how your home address tells people where to find your house, a URL tells computers where to find a specific website or file on the internet.

Here’s a breakdown of a URL:

  1. The scheme: A URL usually starts with “http://” or “https://” to show how to access the resource. It’s like saying “Hey, use this method to reach the website!”
  2. The domain name: This is the name of the website, like “example.com” or “google.com.” It’s the place you want to visit, just like the name of a store or a house.
  3. The path: This is like the directions inside the website. It tells the computer exactly which page or file you want to see, like a specific product in a store or a particular picture in a photo album.
  4. The query string (optional extras): Sometimes extra details are added to the end of the URL. These can help you do specific things on the website, like searching for something or going to a particular section of a page.

So, when you type a URL into your web browser and hit Enter, the browser uses this information to find the website’s location and then shows you the content you wanted to see.

For example:

If you want to watch a cute cat video on YouTube, the URL might look like this (we’ll pull it apart with a short Python sketch right after the breakdown):

https://www.youtube.com/watch?v=abcd1234
  • “https://” tells the computer to use a secure method to access the website.
  • “www.youtube.com” is the name of the website (YouTube).
  • “/watch?v=abcd1234” is the path that takes you to the specific video you want to watch.
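
If you’d like to see these pieces programmatically, Python’s built-in urllib.parse module can split a URL apart. Here’s a quick sketch using the example URL above:

from urllib.parse import urlparse, parse_qs

# the example YouTube URL from above
url = "https://www.youtube.com/watch?v=abcd1234"
parts = urlparse(url)

print(parts.scheme)           # "https" -> how to reach the site
print(parts.netloc)           # "www.youtube.com" -> the website's name
print(parts.path)             # "/watch" -> which page on the site
print(parse_qs(parts.query))  # {'v': ['abcd1234']} -> the optional extras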

What are websites?

Websites are online locations on the internet that contain information, images, videos, and other resources that people can access using web browsers like Google Chrome, Mozilla Firefox, or Safari. Think of websites as virtual places where you can find all sorts of content and interact with various services.

Imagine the internet as a vast digital world, and each website is like a building or a house within that world. Each website has its unique address (URL) that helps people find and visit it.

Websites can serve various purposes:

  1. Informational Websites: These websites provide information about specific topics, like news websites, educational sites, or Wikipedia.
  2. E-commerce Websites: These websites are online stores where you can buy products and services, like Amazon, eBay, or Etsy.
  3. Social Media Websites: These platforms allow you to connect and interact with friends, share posts, pictures, and videos. Examples include Facebook, Twitter, and Instagram.
  4. Blogs and Personal Websites: Blogs are like online journals where individuals or groups share their thoughts, experiences, and expertise on various topics.
  5. Entertainment Websites: These sites offer fun and entertainment, like streaming platforms (Netflix, YouTube), online games, or comic websites.
  6. Business Websites: Many companies and organizations have their websites to showcase their products, services, and contact information.
  7. Government and Institutional Websites: These websites provide official information and services from governments, universities, and other institutions.

When you type a website’s URL into your web browser, the browser uses the internet to locate the website’s server (a powerful computer that hosts the website) and retrieves the content stored there. The browser then displays the website’s content on your screen, allowing you to navigate and interact with the information presented.
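
The first step in “locating the website’s server” is a DNS lookup, which turns the name into an IP address. Here’s a tiny sketch in Python (the hostname is just an example and the answer will vary):

import socket

# ask DNS which IP address currently hosts this name
hostname = "www.youtube.com"
ip_address = socket.gethostbyname(hostname)
print(f"{hostname} resolves to {ip_address}")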

Websites have become an essential part of our daily lives, offering a wealth of knowledge, entertainment, and services just a click away. They enable us to connect globally, explore new ideas, and access a vast array of resources at our fingertips.

What is HTTP/HTTPS?

HTTP stands for “Hypertext Transfer Protocol,” and HTTPS stands for “Hypertext Transfer Protocol Secure.” HTTPS encrypts the traffic between your browser and the server, which prevents threat actors like hackers from reading it if they intercept it. HTTP has no encryption layer, so its traffic is more vulnerable to being intercepted by cybercriminals. HTTP/HTTPS is used worldwide for communication on the internet, which brings us back to the first part of the URL mentioned earlier, the “http://” or “https://” scheme; without a web server speaking that protocol, you can consider your (or anyone’s) website offline. In networking terms, HTTP is normally served on port 80 (sometimes 8080), while HTTPS is served on port 443.
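
To make the port mapping concrete, here’s a small sketch using Python’s standard http.client module: plain HTTP defaults to port 80 and HTTPS to port 443 (example.com is just a stand-in host):

import http.client

# plain HTTP: unencrypted, default port 80
plain = http.client.HTTPConnection("example.com", 80, timeout=10)
plain.request("HEAD", "/")
print("HTTP on port 80   ->", plain.getresponse().status)
plain.close()

# HTTPS: encrypted with TLS, default port 443
secure = http.client.HTTPSConnection("example.com", 443, timeout=10)
secure.request("HEAD", "/")
print("HTTPS on port 443 ->", secure.getresponse().status)
secure.close()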

What is a web crawler?

A web crawler is a program (run by an attacker or anyone else) that sends a stream of web requests to an “HTTP/HTTPS” server, one for each path it wants to check. Like many protocols, HTTP uses various methods (commands) such as GET, POST, PUT, DELETE, HEAD, OPTIONS, PATCH, TRACE, CONNECT, LINK, and UNLINK. What do they do, you ask? (There’s a short Python sketch after this list that tries a few of them.)

  1. GET: This command is used to retrieve data from the server. When you visit a website or click on a link, your browser sends a GET request to the server to fetch the page’s content. Important to note: this is the method we will use to crawl directories from a web URL.
  2. POST: POST is used to send data to the server to create a new resource or submit data to be processed. For example, when you submit a form on a website, the data is often sent using a POST request.
  3. PUT: PUT is used to update an existing resource on the server. It sends the updated data to replace the current resource with the new data.
  4. DELETE: DELETE is used to remove a resource from the server. It instructs the server to delete the specified resource.
  5. HEAD: This command is similar to GET, but it only retrieves the headers of the response without the actual content. It’s often used to check if a resource has been modified or to get metadata about a resource.
  6. OPTIONS: OPTIONS requests the server to provide information about the communication options available for a specific resource, such as which HTTP methods are allowed.
  7. PATCH: PATCH is used to partially update a resource on the server. It sends only the changes that need to be applied, rather than replacing the entire resource.
  8. TRACE: TRACE is a diagnostic method that echoes the received request back to the client. It’s used for debugging and testing purposes.
  9. CONNECT: CONNECT is used with proxy servers to establish a network connection to a resource on the server through the proxy.
  10. LINK: LINK is used to establish one or more relationships between resources.
  11. UNLINK: UNLINK is used to remove relationships between resources.
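
To see a few of these methods in action, here’s a short sketch using Python’s requests library against httpbin.org, a public test service (used here purely for demonstration):

import requests

url = "https://httpbin.org/anything"  # echo endpoint that accepts any method

# GET fetches the full response body - the method crawlers rely on most
get_resp = requests.get(url, timeout=10)
print("GET    ", get_resp.status_code, len(get_resp.content), "bytes")

# HEAD returns headers only, a cheap way to probe whether a path exists
head_resp = requests.head(url, timeout=10)
print("HEAD   ", head_resp.status_code, head_resp.headers.get("Content-Type"))

# OPTIONS asks which methods the server allows for this resource
options_resp = requests.options(url, timeout=10)
print("OPTIONS", options_resp.status_code, options_resp.headers.get("Allow"))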

What are status codes?

HTTP response codes are like messages that a web server sends back to your web browser or application when you try to access a website or use an online service. They help you understand if your request was successful or if there’s a problem.

Imagine you’re ordering something online, and the website sends you a message after you click the “buy” button. If everything goes well, you’ll get a message saying “Order placed successfully” (that’s like a “200 OK” status code). But if you make a mistake, like entering the wrong shipping address, you might get a message saying “Oops, something went wrong” (that’s like a “400 Bad Request” status code).

There are different types of messages for various situations. For example, if a website moved to a new address, you might get a message saying “This page is no longer here, go to the new website” (that’s like a “301 Moved Permanently” status code). Or if you try to access a page that doesn’t exist, you might get a message saying “Sorry, we can’t find what you’re looking for” (that’s like a “404 Not Found” status code).

These response codes help hackers spot interesting paths on a website: if a path returns status 200 or 302, it exists and is worth visiting. That’s why we will use a tool (or write our own) that crawls a URL and checks the response code for each directory in its path.
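
Checking a status code yourself only takes a couple of lines with the requests library; the host and paths below are just illustrative:

import requests

# compare a path that exists with one that almost certainly doesn't
for path in ("/", "/this-page-should-not-exist"):
    resp = requests.get("https://example.com" + path, timeout=10, allow_redirects=False)
    # expect roughly 200 for "/" and 404 for the made-up path
    print(path, "->", resp.status_code)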

Now that you know all of the details, it’s time to actually get started. We will be using a tool called ffuf, a great web fuzzing tool that reads a wordlist of directory names and tries each one against the URL. There are many web crawling tools out there, but this is the one we’ll use.

What are web directories/pages?

Web directories are like organized phonebooks for websites on the internet. Just like you’d find phone numbers and addresses in a regular phonebook, web directories have lists of websites categorized by different topics or industries.

Back in the early days of the internet, before search engines became super smart, people used web directories to find cool websites. You could go to a directory related to your interest, like “Tech Stuff” or “Travel Tips,” and browse through the websites listed there. It was a handy way to discover new content.

Website owners could submit their sites to these directories, hoping to get more visitors and improve their online presence. But, over time, search engines like Google got really good at finding the best websites for you, so people started using them more and more.

So, in a nutshell, web directories are like old-school phonebooks for websites, giving you a structured way to explore the web and find interesting web pages.

In the context of crawling, though, “directories” also means the folders in a URL’s path that lead you to the page you want to visit; every website is organized this way. You can probably see why hackers love them: they help map out a target. Maybe there are bugs on a website, so where do you find them? On any page you can visit. Bugs are often well hidden, which is why crawling for pages is important when hunting for flaws.

(1) Install ffuf from your terminal

sudo apt install ffuf

OR

go install github.com/ffuf/ffuf/v2@latest

Now that it’s installed, we can crawl https://www.dac.gov.za

(2) Run the following in your terminal

ffuf -u https://www.dac.gov.za/FUZZ -mc 200,302 -w /usr/share/wordlists/dirb/common.txt

Here’s what each part of the command means:

  • -u sets the target URL; FUZZ marks the spot in the URL’s path where each wordlist entry will be tried.
  • -mc 200,302 keeps only responses with status code 200 or 302.
  • -w points to the wordlist to use for the crawl; in this case we are using the file “common.txt”.

After the scan finished, ffuf reported the following directories on https://www.dac.gov.za, each returning a status code of 200:

index.html, index.php, and user. With the paths appended, the full URLs are https://www.dac.gov.za/index.html, https://www.dac.gov.za/index.php, and https://www.dac.gov.za/user

When viewing one of them in the browser, this is what we got:

Cool, now we know there’s stuff we can view under /user. You can also use a much larger wordlist if you want. For the URL we used there weren’t many results, but many other URLs will have more; this is only an example.

(3) Let’s make our own crawler, shall we?

Coding time! We’re going to make a little web crawler in Python; this one will crawl for admin panels. If you didn’t know, admin panels are just pages too (obviously, lol), so they live on URL paths like /admin.php, /wp-login.php, /webmail, /cpanel, /login.php, etc.

Here’s the Python script we will be using.
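
A minimal sketch of it looks like this (it assumes the requests library is installed; swap in whatever panel list and target URL you are allowed to test):

import requests

# common admin-panel paths to try, taken from the list above
panels = ["/admin.php", "/wp-login.php", "/webmail", "/cpanel", "/login.php"]

# target site - only crawl sites you have permission to test
base_url = "https://destinata.co.za"

for panel in panels:
    url = base_url + panel
    try:
        # send a GET request; don't follow redirects so 302s stay visible
        response = requests.get(url, timeout=10, allow_redirects=False)
    except requests.RequestException as error:
        print(f"[!] {url} failed: {error}")
        continue
    # a 200 (OK) or 302 (redirect) usually means the panel exists
    if response.status_code in (200, 302):
        print(f"[+] Panel found: {url} ({response.status_code})")

Save it as, say, panel_crawler.py and run it with python3 panel_crawler.py.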

The script imports the “requests” library to send web requests to the HTTP/HTTPS server and uses an array of directory pages as its wordlist. It loops through the list stored in the “panels” variable, and for each entry sends a GET request to the target URL https://destinata.co.za/ and reads the response code. If the server returns 200 or 302, the script reports that a panel was found. After executing the script, this is what we got:

We found /wp-login.php, /webmail, and /cpanel, and after viewing them we can confirm that the script works.

Boom it worked!

DISCLAIMER!

This tutorial is only meant to teach you how to crawl websites, that’s ALL! I can’t control what you decide to do with this knowledge. I did not exploit any flaws on the sites used in this tutorial!

Now that you know how to fuzz or crawl sites, cheers!
