GitHub - BakedSoups/USF-Search-Engine: Concurrent Go web crawler with SQLite, leveraging worker pools, batch processing, and optimized database transactions for efficient large-scale indexing. Supports thousands of goroutines with robust search functionality.

This is a search engine written entirely in Go. Crawl a website domain concurrently at the speed of light, then search for what you want.

Crawling the USF website

The crawler takes a chunk of the information retrieved from the website, then uploads that information into a SQLite database.

Crawling Concurrency

Crawling is deliberately split into chunks so that it scales easily: you can assign 100-1000 workers to crawl concurrently.

How this works

We have 2 types of workers:

Extract Worker
- processes a link and downloads it
- returns the contents and the links found

DB Worker
- gets information from extraction
- uploads this into the database
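The two worker types above can be sketched as a two-stage pipeline connected by channels. This is only a minimal illustration, not the repository's code: the `page` type, `fetchFunc`, the in-memory `store` (standing in for SQLite), and `runPipeline` are all hypothetical names, and the sketch does not feed discovered links back into the job queue.

```go
package main

import (
	"fmt"
	"sync"
)

// page is a hypothetical result type: one URL's content plus the links found on it.
type page struct {
	URL     string
	Content string
	Links   []string
}

// fetchFunc stands in for the real download-and-parse step.
type fetchFunc func(url string) page

// store stands in for the SQLite database (assumption: in-memory map for the demo).
type store struct {
	mu   sync.Mutex
	docs map[string]string
}

func (s *store) save(p page) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.docs[p.URL] = p.Content
}

// runPipeline starts nExtract extract workers and nDB DB workers,
// feeds them urls, and returns once every page has been stored.
func runPipeline(urls []string, fetch fetchFunc, s *store, nExtract, nDB int) {
	jobs := make(chan string)
	pages := make(chan page)

	var extractWG, dbWG sync.WaitGroup
	for i := 0; i < nExtract; i++ {
		extractWG.Add(1)
		go func() { // extract worker: download a link, pass the result on
			defer extractWG.Done()
			for u := range jobs {
				pages <- fetch(u)
			}
		}()
	}
	for i := 0; i < nDB; i++ {
		dbWG.Add(1)
		go func() { // DB worker: take extracted pages and store them
			defer dbWG.Done()
			for p := range pages {
				s.save(p)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	extractWG.Wait() // all downloads finished
	close(pages)
	dbWG.Wait() // all pages stored
}

func main() {
	s := &store{docs: map[string]string{}}
	fake := func(u string) page { return page{URL: u, Content: "body of " + u} }
	runPipeline([]string{"/a", "/b", "/c"}, fake, s, 2, 2)
	fmt.Println("stored pages:", len(s.docs))
}
```

Closing `pages` only after every extract worker has exited is what lets the DB workers finish cleanly without a separate done signal.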

I then set these parameters:

const (
	MAX_WORKERS    = 300  // maximum number of crawler workers
	MAX_DB_WORKERS = 300  // maximum number of database workers
	BATCH_SIZE     = 150  // documents per batch
	QUEUE_SIZE     = 9000 // buffer size of the job queue; determines how many jobs may be queued at once
)
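To illustrate what a "documents per batch" setting can mean in practice, here is a small sketch of a collector that groups incoming documents and flushes them one batch at a time. It is an assumption about the general technique, not the repository's implementation: `collectBatches`, the `flush` callback, and the small demo `batchSize` are hypothetical. In a real crawler the flush step would typically be a single database transaction.

```go
package main

import "fmt"

// batchSize plays the role of BATCH_SIZE, shrunk so the demo output stays short.
const batchSize = 3

// collectBatches drains docs and calls flush once per full batch,
// plus once more for any remainder when the channel is closed.
func collectBatches(docs <-chan string, size int, flush func([]string)) {
	batch := make([]string, 0, size)
	for d := range docs {
		batch = append(batch, d)
		if len(batch) == size {
			flush(batch)
			batch = make([]string, 0, size) // fresh slice so flushed data is not overwritten
		}
	}
	if len(batch) > 0 {
		flush(batch)
	}
}

func main() {
	docs := make(chan string, 7)
	for i := 0; i < 7; i++ {
		docs <- fmt.Sprintf("doc-%d", i)
	}
	close(docs)

	collectBatches(docs, batchSize, func(b []string) {
		// Stand-in for "insert these rows in one transaction".
		fmt.Println("flushing", len(b), "documents")
	})
}
```

Larger batches mean fewer transactions but more memory held per worker, which is one reason the value needs tuning per site.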

This function then uses these global parameters to assign the workers their tasks so they run concurrently:

func crawlFullyConcurrent(db *sql.DB, seedURL string) (int, error)

Scale

With this system of concurrent crawling, this app can scale very well. It's just a matter of tuning the parameters for the website.

Drawbacks

QUEUE_SIZE is a major bottleneck: if the website being crawled has more than 9000 links, the program crashes.
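The README does not say exactly how the crash occurs, so the following is only one possible mitigation, not the project's current behavior. The sketch uses a non-blocking send (`select` with `default`) so that a full queue parks extra URLs in an overflow list instead of blocking or failing. The `frontier` type and its methods are hypothetical names.

```go
package main

import "fmt"

// frontier is a hypothetical bounded job queue with an overflow list.
type frontier struct {
	queue    chan string
	overflow []string // not goroutine-safe as written; guard with a mutex in concurrent code
}

func newFrontier(size int) *frontier {
	return &frontier{queue: make(chan string, size)}
}

// push enqueues url if the channel has room; otherwise it parks the URL
// in overflow. It reports whether the URL went straight into the queue.
func (f *frontier) push(url string) bool {
	select {
	case f.queue <- url:
		return true
	default:
		f.overflow = append(f.overflow, url)
		return false
	}
}

// refill moves parked URLs back into the channel while there is room.
func (f *frontier) refill() {
	for len(f.overflow) > 0 {
		select {
		case f.queue <- f.overflow[0]:
			f.overflow = f.overflow[1:]
		default:
			return // queue is full again
		}
	}
}

func main() {
	f := newFrontier(2) // tiny stand-in for QUEUE_SIZE
	for _, u := range []string{"/a", "/b", "/c"} {
		fmt.Println(u, "queued directly:", f.push(u))
	}
	<-f.queue // a worker takes one job, freeing a slot
	f.refill()
	fmt.Println("queued:", len(f.queue), "overflow:", len(f.overflow))
}
```

The trade-off is that memory use is no longer capped by the channel size, so a very large site would still need some upper limit or an on-disk queue.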

Init: Install Go, then build and run

go build
go run .

Once it's done crawling, visit:

http://localhost:8080
