Hey mate, should search by PMID work? E.g. 35982160 is the PMID for "Rare coding variation provides insight into the genetic architecture and phenotypic context of autism", but I'm not seeing this publication at all in the search results...
Related: the NIST TREC (Text REtrieval Conference) has had several competitions over the years related to improving the searchability of medical data: https://www.trec-cds.org/
If you have novel ideas in this area, you should consider participating. https://trec.nist.gov/
1. How much did it cost to embed all those vectors and how many articles did you process? PMC is quite large.
2. Could you elaborate a little more on your approach to ranking articles? I'm familiar with semantic search via embeddings, but did you weight those with impact factors/citations? Like how does one even calculate that?
I'm curious how the search result rankings work. It doesn't look like they're based on date or number of citations, but they seem to be deterministic (they persist over multiple searches). I did a keyword search using one word.
It uses a vector search approach. Your query is embedded in a vector space using a language model and we find the closest vector to the query from the PubMed papers. This is a good summary of the techniques: https://learn.microsoft.com/en-us/azure/search/vector-search.... There are a couple more tricks but this is the gist.
The nice part is that this approach allows you to find papers relevant to your question. E.g., you can ask "Can secondhand smoke cause AMD?" and the very first few papers answer your question (https://pubmedisearch.com/share/Can%20secondhand%20smoke%20c...). The more specific the question, the better. :)
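For anyone who wants to see the gist in code, here is a minimal sketch of the "embed the query, find the closest vectors" step described above. It is not the site's actual pipeline: `toy_embed` is a hypothetical word-count stand-in for a real language-model embedding, and the paper titles are made up for illustration. Only the nearest-neighbour ranking by cosine similarity is the point. Nothing in it is random, which is also why the same query returns the same ranking every time.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Hypothetical stand-in for a language-model embedding:
    a word-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, papers, vocab, k=2):
    """Return the k (similarity, title) pairs closest to the query, best first."""
    query_vec = toy_embed(query, vocab)
    scored = [(cosine(query_vec, toy_embed(title, vocab)), title) for title in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up titles, for illustration only.
papers = [
    "secondhand smoke and macular degeneration risk",
    "autism genetic architecture and rare coding variation",
    "smoke exposure and lung function in children",
]
vocab = sorted({word for title in papers for word in title.lower().split()})

results = search("secondhand smoke macular degeneration", papers, vocab)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

A real system would swap `toy_embed` for a trained embedding model and the linear scan for an approximate nearest-neighbour index, but the ranking idea stays the same.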
Glad you like it! I did this as a mini-project within our startup MediSearch (https://medisearch.io/) & the search pipeline is custom tuned for the problem.
Some of these GPT engines maintain their own vector DB to do semantic search, others are directly hooked into Bing / Google. So pubmedisearch.com would be one component of a GPT-based engine. We actually have a GPT-based engine here: https://medisearch.io/.
Yes, that's where I am these days. I don't even think of venturing outside of Postgres, except for things like Redis etc., where there are mature and established options for specific use cases.
https://pubmedisearch.com/share/Do%20some%20individuals%20wi...
Anyhow, love the idea.
2. We do weight those ... it is a lot of trial and error and you have to have good & exhaustive benchmarks.
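On the "how does one even calculate that" part: the thread gives no details of the actual weighting, so the following is only one plausible sketch, not the method used here. It blends the embedding similarity with a log-scaled citation count. The function name `blended_score`, the weight `alpha`, the cap `max_citations`, and the candidate numbers are all assumptions made up for illustration; in practice they would be tuned against benchmarks, as the reply above says.

```python
import math


def blended_score(similarity, citations, alpha=0.8, max_citations=1000):
    """Blend semantic similarity with a citation prior.

    similarity    -- query/paper similarity, assumed to lie in [0, 1]
    citations     -- raw citation count for the paper
    alpha         -- hypothetical weight on similarity (the rest goes to the prior)
    max_citations -- hypothetical count at which the prior saturates at 1.0
    """
    # Log scaling keeps a handful of heavily cited papers from dominating.
    prior = min(math.log1p(citations) / math.log1p(max_citations), 1.0)
    return alpha * similarity + (1 - alpha) * prior


# Made-up candidates, for illustration only.
candidates = [
    {"title": "Paper A", "similarity": 0.80, "citations": 0},
    {"title": "Paper B", "similarity": 0.78, "citations": 500},
]

ranked = sorted(
    candidates,
    key=lambda c: blended_score(c["similarity"], c["citations"]),
    reverse=True,
)
for c in ranked:
    print(c["title"], round(blended_score(c["similarity"], c["citations"]), 3))
```

With these made-up numbers, the slightly less similar but well-cited Paper B moves ahead of the uncited Paper A. Whether that trade-off is desirable is exactly the kind of thing a benchmark has to decide.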
Why are some papers labeled "High Quality Article"? How do you determine that?
Out of curiosity what model(s) are you using to generate the embeddings?
Edit: One suggestion: in the results list, please make the headings links to the articles, too.