Umar Getagazov 2971603710 Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
cmd Use the real crawler UA at /about 2022-07-11 13:13:05 +02:00
config Fix searchut typo in the config file path 2022-07-11 13:17:16 +02:00
crawler Put domain labels minus eTLD into the text index 2022-07-11 17:48:46 +02:00
database database: add middleware 2022-07-09 13:52:55 +02:00
graph API: Implement search resolver 2022-07-09 15:48:03 +02:00
import/mediawiki mediawiki: don't parse until we know we want it 2022-07-11 14:35:22 +02:00
query web: add search results page 2022-07-09 17:48:52 +02:00
static Dark theme 2022-07-11 13:17:02 +02:00
templates Highlight result title in bold 2022-07-11 13:16:47 +02:00
.gitignore Add Makefile 2022-07-09 18:14:00 +02:00
config.example.ini sh-api: expand top-level server riggings 2022-07-09 15:39:04 +02:00
COPYING Initial commit 2022-07-08 19:46:11 +02:00
go.mod web: add search results page 2022-07-09 17:48:52 +02:00
go.sum web: add search results page 2022-07-09 17:48:52 +02:00
gqlgen.yml API: Implement search resolver 2022-07-09 15:48:03 +02:00
Makefile Add Makefile 2022-07-09 18:14:00 +02:00
README.md Add README.md 2022-07-08 20:55:55 +02:00
schema.sql schema.sql: set default exclusion list to {} 2022-07-11 17:48:36 +02:00

WIP

Why is this crawling my site?

This crawler is still under development. It respects the robots.txt Disallow and Crawl-delay directives. If it's annoying you anyway, email sir@cmpwn.com and I'll knock it off.