67 commits

Author SHA1 Message Date
Drew DeVault
1c2252bc83 .gitignore: add sh-admin 2022-07-13 11:27:55 +02:00
Drew DeVault
778b4c41c1 Use RUM operators for ranking 2022-07-13 10:29:10 +02:00
Drew DeVault
731950a326 crawler: trim excerpt
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/38
2022-07-13 10:26:22 +02:00
Drew DeVault
9473f3b49b import/*: fix page_size issues 2022-07-13 10:24:26 +02:00
Drew DeVault
69a9e20a0a sh-admin: new command 2022-07-13 10:20:57 +02:00
Drew DeVault
69cf99e367 schema: add default for domain tags 2022-07-13 10:20:27 +02:00
Umar Getagazov
cbd3732deb Store page size in the database
Implements: https://todo.sr.ht/~sircmpwn/searchhut/33
2022-07-13 10:14:37 +02:00
Umar Getagazov
7f555e21f5 web: match alert's dark theme colors with sr.ht 2022-07-13 10:14:35 +02:00
Taavi Väänänen
00a37d0b48 import/mediawiki: use namespace IDs for filtering
Updates the mediawiki importer to use the namespace IDs for filtering
instead of matching for the beginning of the article title. This better
supports other language versions and non-Wikipedia wikis.

Signed-off-by: Taavi Väänänen <hi@taavi.wtf>
2022-07-13 10:14:30 +02:00
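A minimal sketch of the idea in the entry above: filtering pages by numeric namespace ID instead of by title prefix. This is not the importer's actual code. The sample XML, `WANTED_NAMESPACES`, and `article_titles` are made up for illustration, and the sample omits the XML namespace that real MediaWiki dumps declare on the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dump. Real dumps carry an xmlns on <mediawiki>,
# so tag names would need to be namespace-qualified.
SAMPLE = """<mediawiki>
  <page><title>Hare</title><ns>0</ns></page>
  <page><title>Talk:Hare</title><ns>1</ns></page>
  <page><title>Datei:Hase.png</title><ns>6</ns></page>
</mediawiki>"""

# Assumption: only the main article namespace (ID 0) is wanted.
WANTED_NAMESPACES = {0}


def article_titles(xml_text, wanted=WANTED_NAMESPACES):
    """Return titles of pages whose <ns> value is in `wanted`."""
    root = ET.fromstring(xml_text)
    titles = []
    for page in root.iter("page"):
        ns = int(page.findtext("ns", default="-1"))
        if ns in wanted:
            titles.append(page.findtext("title"))
    return titles


print(article_titles(SAMPLE))  # ['Hare']
```

A prefix check for "File:" would miss the localized "Datei:" page in the sample, while the namespace ID (6) is the same in every language version.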
Drew DeVault
82d73c6e31 schema: use rum index
https://github.com/postgrespro/rum
2022-07-13 10:13:54 +02:00
Drew DeVault
ed9031a3a3 API: add index size to stats 2022-07-11 21:38:29 +02:00
Drew DeVault
53eefd6787 crawler: fix log message 2022-07-11 21:31:20 +02:00
Drew DeVault
19a9a3a3b5 sh-index: add -u flag to add URLs to schedule
This is useful for indexing parts of sites which are not reachable from
the index page.
2022-07-11 20:57:59 +02:00
Drew DeVault
009b2b31d4 web: add total pages indexed to home page 2022-07-11 20:40:53 +02:00
Drew DeVault
13d5f95eab import/mediawiki: drop File: pages 2022-07-11 20:22:35 +02:00
Drew DeVault
74b26cecfa import/mediawiki: more improvements 2022-07-11 19:30:57 +02:00
Haelwenn (lanodan) Monnier
5689b79e13 import/cve.org: truncate content for excerpt 2022-07-11 19:11:37 +02:00
Haelwenn (lanodan) Monnier
062e63437a import/cve.org: New importer 2022-07-11 17:53:58 +02:00
Umar Getagazov
fde8b75efd Drop crawl schedule-related fields
They were unused.
2022-07-11 17:50:44 +02:00
Umar Getagazov
a7e6fba60f Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
72649f0f0e Truncate page titles and URLs in search results
Implements: https://todo.sr.ht/~sircmpwn/searchhut/25
2022-07-11 17:48:50 +02:00
Umar Getagazov
2971603710 Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
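The label splitting described in the entry above could look roughly like the following. It is an illustrative sketch only: `SAMPLE_SUFFIXES` is a tiny hand-written stand-in for the Public Suffix List (the real list is much larger and also has wildcard and exception rules), and `indexable_labels` is a hypothetical name.

```python
# Hypothetical stand-in for the Public Suffix List, for illustration only.
SAMPLE_SUFFIXES = {"org", "com", "uk", "co.uk"}


def indexable_labels(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the hostname's labels with the eTLD removed."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, then progressively shorter ones.
    for start in range(len(labels)):
        if ".".join(labels[start:]) in suffixes:
            return labels[:start]
    # Unknown suffix: fall back to dropping only the last label.
    return labels[:-1]


print(indexable_labels("docs.harelang.org"))  # ['docs', 'harelang']
```

With these labels in the text index, a query for "harelang" can match a page hosted on docs.harelang.org.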
Drew DeVault
c6777e21a7 schema.sql: set default exclusion list to {} 2022-07-11 17:48:36 +02:00
Drew DeVault
5848adfea0 mediawiki: don't parse until we know we want it 2022-07-11 14:35:22 +02:00
Drew DeVault
4567044626 import/mediawiki: delete elements when done
To avoid blowing up memory usage
2022-07-11 14:27:21 +02:00
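One common way to keep memory bounded while reading a large XML dump, sketched here under assumptions rather than taken from the importer: parse incrementally and clear each element once it has been handled. `iter_page_titles` is a hypothetical helper, and the tag names assume a dump without an XML namespace.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_titles(stream):
    """Yield page titles from an XML stream, clearing each page when done."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            # Drop the element's children and text so they can be freed.
            elem.clear()


dump = io.BytesIO(
    b"<mediawiki><page><title>A</title></page>"
    b"<page><title>B</title></page></mediawiki>"
)
print(list(iter_page_titles(dump)))  # ['A', 'B']
```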
Umar Getagazov
5471687556 Add per-domain page exclusion mechanism 2022-07-11 13:20:31 +02:00
Umar Getagazov
ef32533b75 Fix searchut typo in the config file path 2022-07-11 13:17:16 +02:00
Umar Getagazov
3b056cc0b4 Dark theme
Colors taken from the dark theme of SourceHut services; some of them
tweaked for contrast.

Implements: https://todo.sr.ht/~sircmpwn/searchhut/24
2022-07-11 13:17:02 +02:00
Drew DeVault
50fd2562f5 Highlight result title in bold 2022-07-11 13:16:47 +02:00
Umar Getagazov
dda780c694 UI fixups for f449fe8
Mostly returning the look to the previous state, code formatting, and
adjusting the look of the search results label.
2022-07-11 13:13:09 +02:00
Umar Getagazov
67c60ef5c1 Use the real crawler UA at /about 2022-07-11 13:13:05 +02:00
Umar Getagazov
3bc5cd9689 Responsive UI
Implements: https://todo.sr.ht/~sircmpwn/searchhut/20
2022-07-11 13:08:37 +02:00
Rohan Kumar
f449fe8a32 Semantic/a11y markup improvements
- Make search results an <ol> with an ARIA label. If more elements are
  ever present on the SERP (e.g. settings), the <ol> should be placed
  inside a <section> and its label should move to that section too.
- Remove list-style and padding from the <ol> in the stylesheet
- Add the "search" ARIA role to the search form.
- Make search result titles headings. This is established convention
  that assistive-technology users are already familiar with from other
  engines.
- Add an indicator for "N search results found". This is where the list
  label comes from.
- Exclude the brand name from machine translation.
2022-07-10 15:03:04 +02:00
Drew DeVault
76bc26d639 Adding missing /about bits 2022-07-10 15:02:55 +02:00
Umar Getagazov
7a67438e9c Add favicon 2022-07-10 15:02:28 +02:00
Drew DeVault
c367bbddd3 Add about page 2022-07-10 13:07:00 +02:00
Drew DeVault
c8762965ac import/mediawiki: initial commit 2022-07-10 11:11:18 +02:00
Drew DeVault
e44770b9b7 schema: add "source" column to page 2022-07-10 10:13:11 +02:00
Drew DeVault
d30cdbf52e crawler: fix interval input 2022-07-10 09:55:30 +02:00
Drew DeVault
01b2b1349b crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
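The entry above does not say which hash is used or exactly what is hashed. As a rough illustration of checksum-based deduplication, the sketch below assumes SHA-256 over the page text, with an in-memory `PageStore` standing in for a database table that has a unique checksum column. All names here are hypothetical.

```python
import hashlib


def content_checksum(text):
    """Hex digest of the page text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class PageStore:
    """In-memory stand-in for a table with a unique checksum column."""

    def __init__(self):
        self._url_by_checksum = {}

    def add(self, url, text):
        """Store the page; return False if identical content already exists."""
        digest = content_checksum(text)
        if digest in self._url_by_checksum:
            return False
        self._url_by_checksum[digest] = url
        return True


store = PageStore()
print(store.add("https://example.org/a", "hello"))        # True
print(store.add("https://example.org/a?ref=x", "hello"))  # False
```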
Drew DeVault
9790813a55 Track pages with JavaScript and total crawl time 2022-07-10 09:12:07 +02:00
Drew DeVault
e15dffd86b Handle Retry-After as timestamp 2022-07-09 19:16:48 +02:00
Drew DeVault
c15f968a28 crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
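The two entries above concern HTTP 429 responses and the Retry-After header, which may carry either a number of seconds or an HTTP date. A minimal sketch of turning either form into a delay follows. `retry_delay_seconds` is a hypothetical helper, not the crawler's code, and a malformed value simply raises an exception here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(header_value, now):
    """Seconds to wait for a Retry-After value (delta-seconds or HTTP-date)."""
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return float(value)
    # HTTP-date form, e.g. "Sat, 09 Jul 2022 17:01:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (when - now).total_seconds())


now = datetime(2022, 7, 9, 17, 0, 0, tzinfo=timezone.utc)
print(retry_delay_seconds("120", now))                            # 120.0
print(retry_delay_seconds("Sat, 09 Jul 2022 17:01:00 GMT", now))  # 60.0
```

A crawler could add this delay to the current time to pick the next attempt for the affected URL.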
Drew DeVault
6978b602f4 Handle canonical URLs
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11
2022-07-09 19:06:28 +02:00
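The entry above gives no detail on how canonical URLs are handled. One plausible approach, shown purely as an assumption, is to look for a `<link rel="canonical">` element and resolve its `href` against the page URL. `CanonicalFinder` and `canonical_url` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.href = attr["href"]


def canonical_url(page_url, html):
    """Return the declared canonical URL, or page_url if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return urljoin(page_url, finder.href) if finder.href else page_url


html = '<html><head><link rel="canonical" href="/a"></head></html>'
print(canonical_url("https://example.org/a?utm=1", html))  # https://example.org/a
```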
Drew DeVault
baf82f9bb8 crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
759ad758af crawler: improve index settings 2022-07-09 18:57:39 +02:00
Drew DeVault
35a4faa05b sh-index: fetch user agent from config 2022-07-09 18:14:06 +02:00
Drew DeVault
2ec534d63a Add Makefile 2022-07-09 18:14:00 +02:00
Drew DeVault
3535309004 web: add link to index from search page 2022-07-09 18:07:46 +02:00
Drew DeVault
b41abd9376 main.css: change URL color in results 2022-07-09 17:51:05 +02:00