Umar Getagazov
fde8b75efd
Drop crawl schedule-related fields
They were unused.
2022-07-11 17:50:44 +02:00
Umar Getagazov
a7e6fba60f
Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
2971603710
Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. The eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
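The label extraction described in the commit above can be illustrated with a minimal sketch. It is not the project's code: the function name `index_labels` and the tiny `SAMPLE_SUFFIXES` set are hypothetical, and the set only stands in for the real Public Suffix List, whose wildcard and exception rules are ignored here.

```python
# Hypothetical, heavily truncated stand-in for the Public Suffix List.
# A real implementation would load the full list from publicsuffix.org.
SAMPLE_SUFFIXES = {"org", "com", "uk", "co.uk"}


def index_labels(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the labels of hostname that remain after dropping the eTLD."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for start in range(len(labels)):
        if ".".join(labels[start:]) in suffixes:
            return labels[:start]
    # No known suffix matched: keep every label.
    return labels


print(index_labels("docs.harelang.org"))
```

Under these assumptions, "docs.harelang.org" yields the separate terms "docs" and "harelang", which is what lets a "harelang" query match.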
Drew DeVault
e44770b9b7
schema: add "source" column to page
2022-07-10 10:13:11 +02:00
Drew DeVault
01b2b1349b
crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
Drew DeVault
9790813a55
Track pages with JavaScript and total crawl time
2022-07-10 09:12:07 +02:00
Drew DeVault
e15dffd86b
Handle Retry-After as timestamp
2022-07-09 19:16:48 +02:00
Drew DeVault
c15f968a28
crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
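The two commits above concern re-scheduling after HTTP 429 and reading Retry-After as a timestamp. HTTP allows that header to carry either a number of seconds or an HTTP-date. A rough sketch of handling both forms follows. It assumes "timestamp" in the commit subject means the HTTP-date form, and the helper name `retry_delay` is hypothetical, not taken from the crawler.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(value, now=None):
    """Return seconds to wait for a Retry-After value, or None if unparsable."""
    now = now or datetime.now(timezone.utc)
    value = value.strip()
    # Form 1: a non-negative integer number of seconds.
    if value.isdecimal():
        return int(value)
    # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means the request can be retried immediately.
    return max(0, int((when - now).total_seconds()))
```

A crawler could add the returned delay to the current time to pick the next fetch time for the affected page or host, and fall back to a default back-off when the result is None.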
Drew DeVault
6978b602f4
Handle canonical URLs
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11
2022-07-09 19:06:28 +02:00
Drew DeVault
baf82f9bb8
crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
759ad758af
crawler: improve index settings
2022-07-09 18:57:39 +02:00
Drew DeVault
d6bc032d24
crawler: respect robots.txt
2022-07-08 20:30:09 +02:00
Drew DeVault
eb6769c904
crawler: follow links regardless of readability
2022-07-08 20:13:32 +02:00
Drew DeVault
050694c4f2
Initial commit
2022-07-08 19:46:11 +02:00