Ideally I'd love an interface to a large database including the text in each webpage that has been crawled. But something like Google the way it was 10 years ago will still work. Are there any such search engines left? Even ddg is switching to semantic/NLP results.
I must completely disagree with the other posters claiming that keyword searches are not useful. For niche research, they are extremely helpful or even necessary. Google and Bing have reached the point where it is impossible to do real, niche academic research on them. For instance, I had a very specific thing I was trying to look up involving medicine, religion, and Marco Polo.
Try searching for "marco polo doctors" on Google and witness it giving very counterintuitive, one-sided results that may align with the current zeitgeist of interest from people searching Google, but diverge completely from the aim of literal, precise keyword search needed by academics. I worked to hone the search down: looking up Kublai Khan, doctors, atheism brings up blogspam articles on doctors and atheism, but scant results on the 13th-century Mongol emperor's religious medical interests. Trying to narrow the search further by including variations on Cambulac, Cambaliech, trying to find any info beyond surface-level on John of Montecorvino and his retinue... all is impossible with search engines in 2018.
https://news.ycombinator.com/item?id=16153840
For me, the niche is repair information and in particular, identifying IC part numbers and finding datasheets. Searching "service manual" now invariably brings up useless user's manuals, and searching too many times for IC part numbers gets you CAPTCHA-banned.
(Somewhat understandably, part numbers tend to look like semirandom bot-queries, but it's still a horrible experience to be called a bot just because you're actually after more information than the average user.)
Keyword-based would be a great step forward(!), but something like "grep for the Web" would be ideal. I remember many decades ago learning how to use boolean operators and such, since nearly all search engines of the time provided such functionality. Now the mainstream ones which have a big enough index to be effective also have removed much of that functionality and try very hard to limit you from using it. For another example, try using "site:" searches multiple times with Google --- another way to get rapidly banned.
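To make "grep for the Web" concrete, here's a toy sketch of boolean keyword search over an inverted index (Python, purely illustrative; a real engine needs a web-scale index and a real query parser, and the sample documents here are made up):

    # Toy boolean keyword search over a tiny in-memory inverted index.
    from collections import defaultdict

    docs = {
        1: "setup obscure program on linux with nginx",
        2: "setup obscure program on windows",
        3: "obscure program service manual",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def AND(*terms):
        return set.intersection(*(index[t] for t in terms))

    def NOT(hits, *terms):
        return hits - set.union(*(index[t] for t in terms))

    # roughly: setup obscure program "linux" "nginx" -apache -windows
    print(NOT(AND("setup", "obscure", "program", "linux", "nginx"), "apache", "windows"))
    # -> {1}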
Interestingly enough, I can find web crawling as a service and search engine as a service separately, but not both together?
However, they are quite pricey. Maybe a solution one can self-host would be a nicer alternative.
A couple of these (e.g., Blekko) popped up 5-10 years ago. I don't think any made it far.
It's especially annoying when you search "keyword1 keyword2" then "keyword1 keyword2 keyword3" and get the same results, just with a "Missing terms: keyword3" note below each (and more often than not, an alternative search will find what I'm looking for, so it's not just a case of there being nothing to match all three).
Edit: missed "note".
Not strictly related to your comment, but similarly frustrated.
These days it's fairly easy to code up an n-gram index and host it on a moderate server. If you have a corpus of documents you can see how well it works. The simplest corpus is to just download all of Wikipedia. Last time I checked it fit on a 2 TB disk drive.
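As a rough sketch of what I mean (toy Python; a production index over a Wikipedia dump needs on-disk posting lists, compression, and a proper tokenizer):

    # Minimal word n-gram inverted index (unigrams + bigrams), purely illustrative.
    from collections import defaultdict

    def ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    index = defaultdict(set)

    def add_document(doc_id, text):
        tokens = text.lower().split()
        for n in (1, 2):
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)

    def search(phrase):
        return index.get(phrase.lower(), set())

    add_document("Marco Polo", "marco polo travelled to the court of kublai khan")
    add_document("Kublai Khan", "kublai khan was a mongol emperor")
    print(search("kublai khan"))  # -> {'Marco Polo', 'Kublai Khan'}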
You could also use the Common Crawl database and index it as you would like, or talk to the Internet Archive project for some sort of collaborative project. However I will warn you that of the people who have done this experiment (and it was a new hire task at both Blekko and Google where I worked) folks quickly discovered it wasn't very useful.
Also, did you mean every single Wikipedia site or just the English one? I'm thinking just the English one is sufficient.
I still don't understand why Wikipedia articles aren't translated according to a common script, at least where there are featured translations (marked with a star).
For anyone curious:
https://dumps.wikimedia.org/backup-index.html
https://dumps.wikimedia.org/enwiki/20180320/
The article text contains Wikipedia markup which is a bit difficult to remove but not impossible, there are some existing projects for doing that. DBPedia would have the raw text, but it's not nearly as current.
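If it helps, the mwparserfromhell library exists for exactly this; a minimal sketch:

    # Strip Wikipedia markup from dump text using mwparserfromhell
    # (pip install mwparserfromhell).
    import mwparserfromhell

    wikitext = "'''Foobar''' is a [[placeholder name|placeholder]]{{citation needed}} used in programming."
    plain = mwparserfromhell.parse(wikitext).strip_code()
    print(plain)  # bold marks, link syntax and templates are stripped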
http://dumps.wikimedia.your.org/other/cirrussearch/
These are structured with a JSON string for each doc roughly like https://en.m.wikipedia.org/wiki/Foobar?action=cirrusdump
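If I remember right, the cirrus dumps are gzipped Elasticsearch bulk files, i.e. alternating metadata/document JSON lines, so reading them looks roughly like this (the file name is hypothetical and the field names are taken from the cirrusdump example above):

    # Rough sketch for iterating over a cirrussearch dump.
    import gzip, json

    def cirrus_docs(path):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for meta_line, doc_line in zip(fh, fh):  # pair up alternating lines
                doc = json.loads(doc_line)
                yield doc.get("title"), doc.get("text")

    # for title, text in cirrus_docs("enwiki-20180320-cirrussearch-content.json.gz"):
    #     index_document(title, text)  # index_document is whatever you plug in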
Our complaint is that the results we get from our queries simply do not match the queries we are making.
I'm imagining that "the frontier" a crawler needs to crawl is actually a distributed queue and that crawlers are massively parallelized. I'm also imagining the frontier is bucketed by its indexing frequency, i.e. daily, weekly, monthly, etc. Is that close? Might you have any resources on how this is architected at large search providers?
[1] https://venturebeat.com/2013/03/01/how-google-searches-30-tr...
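I don't know how the big providers actually architect it, but the frequency-bucketed frontier I'm imagining might look something like this toy sketch (purely illustrative; ignores politeness, retries, and deduplication):

    # Toy crawl frontier bucketed by recrawl frequency.
    import heapq, time

    RECRAWL_SECONDS = {"daily": 86400, "weekly": 604800, "monthly": 2592000}

    class Frontier:
        def __init__(self):
            self.heap = []  # entries of (next_crawl_time, url, bucket)

        def add(self, url, bucket="monthly", when=None):
            heapq.heappush(self.heap, (when or time.time(), url, bucket))

        def pop_due(self):
            if self.heap and self.heap[0][0] <= time.time():
                due, url, bucket = heapq.heappop(self.heap)
                # reschedule the URL according to its bucket
                self.add(url, bucket, when=due + RECRAWL_SECONDS[bucket])
                return url
            return None

    frontier = Frontier()
    frontier.add("https://news.ycombinator.com/", "daily")
    print(frontier.pop_due())  # -> https://news.ycombinator.com/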
Certain pages get crawled more often than others. As a website owner, you can tell Google how often your content changes, which they will use as a clue for how often to recrawl your site.
If you're really big (like for example Reddit), you actually lose that control in Google dashboard -- they fully control the crawl rate. From experience, I can tell you that they are crawling large sites, like Reddit, continuously. Even 7 years ago, when I last worked there, they were crawling Reddit so much that we had to set up a separate server infrastructure just to respond to Google's requests, because their access patterns were so different from every other user.
The reality is that there are roughly 5 to 15 billion pages that are nominally "not spam" and not duplicates. Literally 99.9% of the internet is crap. So finding web pages has long since switched from 'crawling every page that is accessible from the web' to 'only surfacing those pages that have something of value on them.' That was the fundamental thesis for the founding of Blekko and it is still true today.
That said, a cluster of ~100,000 threads and a couple of petabytes of storage attached with sufficient bandwidth to keep the threads busy can deal with what is out there. If you can create a hash space for strings that is sufficiently uniform, you can evenly spread the load of crawling every URI you discover across the cluster.
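A back-of-the-envelope version of that hashing step (illustrative only; the shard count is made up):

    # Spread crawl load by hashing each URI into a uniform space of shards.
    import hashlib

    NUM_SHARDS = 100_000  # e.g. one shard per crawl thread

    def shard_for(uri):
        digest = hashlib.sha1(uri.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % NUM_SHARDS

    print(shard_for("https://en.wikipedia.org/wiki/Marco_Polo"))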
As you crawl, you take the new pages you discover and apply your ranking algorithm to them; each scores a value between 0 (never index) and 1 (always index).
At which point you can dial the 'rankable' threshold from 0 (index everything) to 1 (index only the must-rank pages) to set the size of the index you can tolerate.
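In toy Python, that dial amounts to something like (the scores here are made up):

    # Keep only pages whose rank score clears the 'rankable' threshold.
    def should_index(score, rankable_threshold):
        # score: 0 = never index, 1 = always index
        # rankable_threshold: 0 = index everything, 1 = only the must-rank pages
        return score >= rankable_threshold

    pages = {"good-page": 0.92, "blogspam": 0.11, "ok-page": 0.55}
    print([url for url, s in pages.items() if should_index(s, 0.5)])
    # -> ['good-page', 'ok-page']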
I think what OP and several people in this thread actually want is Google search minus synonym matching, plus the ability to specify advanced syntax like AND and NEAR queries. I believe that would go a long way toward satisfying someone who says they just want "keyword search".
[1]: http://symbolhound.com/
I can use it via DuckDuckGo too: https://duckduckgo.com/bang?q=symbolhound
For example, if you type "marco polo doctor's -doctor -who" or "marco polo doctors group:science".
Google operators for example: https://en.wikipedia.org/wiki/Google_Search#Search_syntax
Cheat Sheet: https://www.searchlaboratory.com/wp-content/uploads/2012/11/...
I notice that one of the pages you link to is from 2012. Again, the old ways just don't cut it any more. Since it doesn't even mention Google's "Verbatim" search option, it suggests to me that while its content might be technically correct, it's useless to hand-wave towards it as the cure for contemporary complaints about Google's results.
This certainly was not like this 10 years ago.
Unless it came back at some point, it wasn't very recent - it was intentionally removed in 2011 due to their using "+" for their social network.
What you get in return is a bunch of slashed-out "Linux"s and "Nginx"s and a bunch of "How to set up obscure program... on Windows" and "How to set up obscure program... on Apache". It's downright infuriating when trying to learn some of the tools I cannot do without, ones where the documentation is spotty but the user forums/mailing lists/etc. are top notch, even for Linux and Nginx. But you won't know that, even if you specifically type: 'setup obscure program "Linux" "Nginx" -apache -windows'.
It has changed. You don't have the right to find what you're actually looking for. You have the privilege to only look in places Google approves.
Verbatim mode is a little better, but still not as direct as it was 10 years ago.
This is literally my whole business, and I wrote about this here: https://austingwalters.com/is-search-solved/
It hints at why search is broken.
Essentially, search providers target the general case, meaning if you type "whales" it'll bring you to Wikipedia. This is because probably 80% of the time you're looking for Wikipedia. They use NLP to interpret a query like "I want to know about whales", because that works in the general case. If you want an exact match, put the query in quotes and it'll look for that exact phrase.
Now my business is actually the reverse - not looking for the general case, but identifying the niche - i.e. "what an expert would want":
https://projectpiglet.com/
This lets me build a financial advisor which is averaging over 100% YoY returns because it identifies and tracks specific topics (as opposed to just Wikipedia changes). So if you go on there and type "Iran", you'll get a lot of search results about Iran, but also about Israel, Jordan and the like, because it identifies Iran as being associated with them in the graph. This works great for investing, because you want to know about the related topics (that you may not even have realized you wanted to know about).
Now, that's NLP. But it works for my customers, because exact matches typically are not what people want. They want the niche, the general, or occasionally, as in your case, an exact match (if they can think of the right words). Luckily you have quotes: in my system, if you type "search phrase" I assume you mean exactly that and look for an exact match. But I still apply NLP to the results, because that's where the value is.
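Not the real system, obviously, but the quotes behaviour amounts to something like this toy sketch:

    # Split a query into exact-match phrases (quoted) and fuzzy/NLP terms (the rest).
    import re

    def parse_query(query):
        phrases = re.findall(r'"([^"]+)"', query)            # exact-match parts
        remainder = re.sub(r'"[^"]+"', " ", query).split()   # everything else
        return phrases, remainder

    print(parse_query('I want to know about "marco polo" doctors'))
    # -> (['marco polo'], ['I', 'want', 'to', 'know', 'about', 'doctors'])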
You can watch my 1-minute pitch in which I mention why Kozmos' search functionality is more interesting: https://youtube.com/watch?v=ETjeEz5Dk_M
We have academic/researcher users who love Kozmos. My goal is to improve search with deep learning, while keeping the sorting algorithm the same (like count). I'm walking towards this goal slowly but surely!
P.S. If you're into this topic and live in Europe (Berlin), let's have coffee!
If you have some loose change on you... a bit of processing on 71 TB of data... and you've got yourself an index precisely like you want it.
Anyway, without "some" NLP no search engine is going to be very useful.
You need to know how to tokenize, at a minimum. For many languages, this is not as trivial as it is for English.
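A quick illustration (toy Python): whitespace tokenization is passable for English but useless for, say, Chinese, where one crude fallback is character n-grams:

    # Whitespace tokens vs. character bigrams.
    def whitespace_tokens(text):
        return text.split()

    def char_ngrams(text, n=2):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(whitespace_tokens("marco polo doctors"))  # -> ['marco', 'polo', 'doctors']
    print(whitespace_tokens("马可波罗"))             # -> ['马可波罗']  (one useless "token")
    print(char_ngrams("马可波罗"))                   # -> ['马可', '可波', '波罗']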
http://symbolhound.com/
I don't like it, but I'm sure that for the average query using quotes, ignoring them does a better job of finding what the user is looking for. That's why Google does this.
On the other hand, there's a lot of observation bias: I'm much less likely to notice Google is ignoring quotes when it works out well. It's mostly in the frustrating cases that I really notice and remember noticing that Google sometimes ignores quotes.
As for what black magic allows them to sometimes determine that you don't really want quotes, your guesses would probably be as good as mine. I'm just pretty sure that on average, this black magic has a positive impact on search quality.
[1] https://productforums.google.com/forum/#!topic/websearch/6gH...
https://duck.co/help/results/syntax
(Of course, since I only use Google when DDG's results are poor, I would expect to see Google's results be superior a lot of the time, irrespective of whether their results are generally better or worse.)
keyword1 w5 keyword2 [...]
[find pages where keyword1 is within 5 words' distance of keyword2]
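A toy sketch of how such a proximity operator can be evaluated against word positions (illustrative Python, not any particular engine's implementation):

    # "keyword1 w5 keyword2": match if the keywords occur within 5 words of each other.
    def within(text, kw1, kw2, distance=5):
        tokens = text.lower().split()
        pos1 = [i for i, t in enumerate(tokens) if t == kw1]
        pos2 = [i for i, t in enumerate(tokens) if t == kw2]
        return any(abs(a - b) <= distance for a in pos1 for b in pos2)

    doc = "kublai khan asked the christian doctors at his court about their faith"
    print(within(doc, "khan", "doctors"))  # -> True (4 words apart)
    print(within(doc, "khan", "faith"))    # -> False (10 words apart)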