Drew DeVault
c2c75b565a
Display search query time on results page
2022-07-13 15:34:29 +02:00
Drew DeVault
3e62c61e8f
main.css: ellipsize overflow in excerpt
2022-07-13 15:17:53 +02:00
Drew DeVault
c364bc316f
import/xkcd: new importer
2022-07-13 14:54:40 +02:00
Drew DeVault
cc0c144528
Switch search to rum rankings (<=>)
...
Turns out we don't want to order by desc here
2022-07-13 14:23:08 +02:00
Drew DeVault
8941c46191
Use ts_rank_cd rather than <=>
...
We may want to evaluate this more later but for now I need to reduce the
number of independent variables while testing indexing changes
2022-07-13 11:27:55 +02:00
Drew DeVault
1c2252bc83
.gitignore: add sh-admin
2022-07-13 11:27:55 +02:00
Drew DeVault
778b4c41c1
Use RUM operators for ranking
2022-07-13 10:29:10 +02:00
Drew DeVault
731950a326
crawler: trim excerpt
...
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/38
2022-07-13 10:26:22 +02:00
Drew DeVault
9473f3b49b
import/*: fix page_size issues
2022-07-13 10:24:26 +02:00
Drew DeVault
69a9e20a0a
sh-admin: new command
2022-07-13 10:20:57 +02:00
Drew DeVault
69cf99e367
schema: add default for domain tags
2022-07-13 10:20:27 +02:00
Umar Getagazov
cbd3732deb
Store page size in the database
...
Implements: https://todo.sr.ht/~sircmpwn/searchhut/33
2022-07-13 10:14:37 +02:00
Umar Getagazov
7f555e21f5
web: match alert's dark theme colors with sr.ht
2022-07-13 10:14:35 +02:00
Taavi Väänänen
00a37d0b48
import/mediawiki: use namespace IDs for filtering
...
Updates the mediawiki importer to use the namespace IDs for filtering
instead of matching for the beginning of the article title. This better
supports other language versions and non-Wikipedia wikis.
Signed-off-by: Taavi Väänänen <hi@taavi.wtf>
2022-07-13 10:14:30 +02:00
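
To illustrate the namespace-ID filtering described in 00a37d0b48, here is a minimal sketch, not the importer's actual code. It assumes a simplified dump with no XML namespace on the tags (real MediaWiki exports qualify every tag with one), and the names iter_articles and WANTED_NAMESPACES are hypothetical. Namespace 0 is the main article namespace and 6 is the file namespace, whatever the wiki's language.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: only main-namespace pages (ID 0) are wanted.
WANTED_NAMESPACES = {0}


def iter_articles(stream, wanted=WANTED_NAMESPACES):
    """Yield (title, text) for pages whose <ns> ID is in `wanted`."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        ns = int(elem.findtext("ns", default="-1"))
        if ns in wanted:
            yield elem.findtext("title"), elem.findtext("revision/text", default="")
        # Drop the page's children so memory use stays flat on large dumps.
        elem.clear()


# Simplified sample: the German "Datei:" prefix would slip past a filter
# that only matches titles starting with "File:", but its namespace ID is 6.
SAMPLE = """<mediawiki>
  <page><title>Hare</title><ns>0</ns><revision><text>A language.</text></revision></page>
  <page><title>File:Logo.png</title><ns>6</ns><revision><text>An image.</text></revision></page>
  <page><title>Datei:Logo.png</title><ns>6</ns><revision><text>Ein Bild.</text></revision></page>
</mediawiki>"""

titles = [title for title, _text in iter_articles(io.BytesIO(SAMPLE.encode()))]
```

Filtering on the numeric ID keeps the check independent of localized or wiki-specific namespace names.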
Drew DeVault
82d73c6e31
schema: use rum index
...
https://github.com/postgrespro/rum
2022-07-13 10:13:54 +02:00
Drew DeVault
ed9031a3a3
API: add index size to stats
2022-07-11 21:38:29 +02:00
Drew DeVault
53eefd6787
crawler: fix log message
2022-07-11 21:31:20 +02:00
Drew DeVault
19a9a3a3b5
sh-index: add -u flag to add URLs to schedule
...
This is useful for indexing parts of sites which are not reachable from
the index page.
2022-07-11 20:57:59 +02:00
Drew DeVault
009b2b31d4
web: add total pages indexed to home page
2022-07-11 20:40:53 +02:00
Drew DeVault
13d5f95eab
import/mediawiki: drop File: pages
2022-07-11 20:22:35 +02:00
Drew DeVault
74b26cecfa
import/mediawiki: more improvements
2022-07-11 19:30:57 +02:00
Haelwenn (lanodan) Monnier
5689b79e13
import/cve.org: truncate content for excerpt
2022-07-11 19:11:37 +02:00
Haelwenn (lanodan) Monnier
062e63437a
import/cve.org: New importer
2022-07-11 17:53:58 +02:00
Umar Getagazov
fde8b75efd
Drop crawl schedule-related fields
...
They were unused.
2022-07-11 17:50:44 +02:00
Umar Getagazov
a7e6fba60f
Rank authoritative websites and index pages higher
...
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
72649f0f0e
Truncate page titles and URLs in search results
...
Implements: https://todo.sr.ht/~sircmpwn/searchhut/25
2022-07-11 17:48:50 +02:00
Umar Getagazov
2971603710
Put domain labels minus eTLD into the text index
...
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
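
The label splitting described in 2971603710 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: SAMPLE_SUFFIXES is a tiny hypothetical stand-in for the Public Suffix List, and index_labels is a made-up name. It also ignores the list's wildcard and exception rules.

```python
from typing import List, Set

# Hypothetical stand-in; a real implementation would load the full
# Public Suffix List rather than hard-code a few entries.
SAMPLE_SUFFIXES = {"org", "com", "co.uk"}


def index_labels(hostname: str, suffixes: Set[str] = SAMPLE_SUFFIXES) -> List[str]:
    """Return the hostname's labels with the longest known suffix removed."""
    labels = hostname.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before a bare "uk" would be.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return labels[:i]
    # Unknown suffix: fall back to dropping only the last label.
    return labels[:-1]


labels = index_labels("docs.harelang.org")
```

Each returned label would then be added to the text index on its own, so a query for "harelang" can match the page.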
Drew DeVault
c6777e21a7
schema.sql: set default exclusion list to {}
2022-07-11 17:48:36 +02:00
Drew DeVault
5848adfea0
mediawiki: don't parse until we know we want it
2022-07-11 14:35:22 +02:00
Drew DeVault
4567044626
import/mediawiki: delete elements when done
...
To avoid blowing up memory usage
2022-07-11 14:27:21 +02:00
Umar Getagazov
5471687556
Add per-domain page exclusion mechanism
2022-07-11 13:20:31 +02:00
Umar Getagazov
ef32533b75
Fix searchut typo in the config file path
2022-07-11 13:17:16 +02:00
Umar Getagazov
3b056cc0b4
Dark theme
...
Colors taken from the dark theme of SourceHut services; some of them
tweaked for contrast.
Implements: https://todo.sr.ht/~sircmpwn/searchhut/24
2022-07-11 13:17:02 +02:00
Drew DeVault
50fd2562f5
Highlight result title in bold
2022-07-11 13:16:47 +02:00
Umar Getagazov
dda780c694
UI fixups for f449fe8
...
Mostly returning the look to the previous state, code formatting, and
adjusting the look of the search results label.
2022-07-11 13:13:09 +02:00
Umar Getagazov
67c60ef5c1
Use the real crawler UA at /about
2022-07-11 13:13:05 +02:00
Umar Getagazov
3bc5cd9689
Responsive UI
...
Implements: https://todo.sr.ht/~sircmpwn/searchhut/20
2022-07-11 13:08:37 +02:00
Rohan Kumar
f449fe8a32
Semantic/a11y markup improvements
...
- Make search results an <ol> with an ARIA label. If more elements are
ever present on the SERP (e.g. settings), the <ol> should be placed
inside a <section> and its label should move to that section too.
- Remove list-style and padding from the <ol> in the stylesheet
- Add the "search" ARIA role to the search form.
- Make search result titles headings. This is established convention
that assistive-technology users are already familiar with from other
engines.
- Add an indicator for "N search results found". This is where the list
label comes from.
- Exclude the brand name from machine translation.
2022-07-10 15:03:04 +02:00
Drew DeVault
76bc26d639
Adding missing /about bits
2022-07-10 15:02:55 +02:00
Umar Getagazov
7a67438e9c
Add favicon
2022-07-10 15:02:28 +02:00
Drew DeVault
c367bbddd3
Add about page
2022-07-10 13:07:00 +02:00
Drew DeVault
c8762965ac
import/mediawiki: initial commit
2022-07-10 11:11:18 +02:00
Drew DeVault
e44770b9b7
schema: add "source" column to page
2022-07-10 10:13:11 +02:00
Drew DeVault
d30cdbf52e
crawler: fix interval input
2022-07-10 09:55:30 +02:00
Drew DeVault
01b2b1349b
crawler: compute checksum and make unique
...
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
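
One way to picture the checksum-based uniqueness in 01b2b1349b is the sketch below. The hash function (SHA-256), the in-memory dictionary, and the names content_checksum and PageStore are all assumptions made for illustration; the commit itself does not say how the checksum is computed, and the real uniqueness constraint presumably lives in the database schema.

```python
import hashlib


def content_checksum(body: bytes) -> str:
    """Hex digest of the page body (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


class PageStore:
    """In-memory stand-in for a table with a unique checksum column."""

    def __init__(self):
        self._url_by_checksum = {}

    def add(self, url: str, body: bytes) -> bool:
        """Store the page; return False if identical content already exists."""
        digest = content_checksum(body)
        if digest in self._url_by_checksum:
            return False
        self._url_by_checksum[digest] = url
        return True


store = PageStore()
first = store.add("https://example.org/", b"<p>hello</p>")
duplicate = store.add("https://example.org/index.html", b"<p>hello</p>")
```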
Drew DeVault
9790813a55
Track pages with JavaScript and total crawl time
2022-07-10 09:12:07 +02:00
Drew DeVault
e15dffd86b
Handle Retry-After as timestamp
2022-07-09 19:16:48 +02:00
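
For context on e15dffd86b: HTTP allows Retry-After to carry either a number of seconds or an HTTP-date. The sketch below shows one way to handle both forms. It is an illustration only; retry_delay is a hypothetical name, and the crawler's own handling may differ.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(value: str, now: datetime) -> float:
    """Return how many seconds to wait for a Retry-After header value.

    `now` must be timezone-aware. Malformed dates raise an exception
    from parsedate_to_datetime; callers should decide on a fallback.
    """
    value = value.strip()
    if value.isdigit():
        # delay-seconds form, e.g. "120"
        return float(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means the request can be retried immediately.
    return max(0.0, (when - now).total_seconds())


now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
delay = retry_delay("Wed, 21 Oct 2015 07:28:00 GMT", now)
```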
Drew DeVault
c15f968a28
crawler: re-schedule after HTTP 429
...
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
Drew DeVault
6978b602f4
Handle canonical URLs
...
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11
2022-07-09 19:06:28 +02:00
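
As a rough illustration of what canonical URL handling (6978b602f4) can involve, the sketch below looks for a <link rel="canonical"> element and resolves it against the page URL. It is an assumption about the general technique, not the crawler's implementation; CanonicalFinder and canonical_url are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated token list and is case-insensitive.
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.href = attributes["href"]


def canonical_url(page_url: str, html: str) -> str:
    """Return the declared canonical URL, or page_url if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.href is None:
        return page_url
    # The href may be relative, so resolve it against the fetched URL.
    return urljoin(page_url, finder.href)


resolved = canonical_url(
    "https://example.org/a?utm=1",
    '<html><head><link rel="canonical" href="/a"></head></html>',
)
```

Indexing under the resolved URL avoids storing the same page once per tracking parameter or alias.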
Drew DeVault
baf82f9bb8
crawler: perform HEAD before GET
...
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00