Machine Learning: Full-Text Search in JavaScript – Relevance Scoring (2015)

(burakkanber.com)

153 points | by octosphere 1860 days ago

8 comments

diegolo 1860 days ago
"a general search can be machine learning" I don't get this sentence: Machine learning is about building a mathematical model of sample data, known as "training data".
If you want to talk about machine learning and search, you should probably talk about learning to rank (https://en.m.wikipedia.org/wiki/Learning_to_rank).
- snotrockets 1860 days ago
  I'd argue that you're too restrictive in your definition. For example, unsupervised clustering has no sample training data.
  The usual definition (due to Mitchell) is that machine learning is a system s.t. its performance on a given task improves with experience.
  - thegginthesky 1859 days ago
    Actually, any unsupervised method, including clustering, still has training data. The only difference is it doesn't have a target y variable in the training set to minimize the error metric, hence the name unsupervised.
    But the definition you mention is right. Still, any dataset that you use to fit your model is your training set, even if you don't have a train/test split or the like, because it is the data you trained your model on.
    - snotrockets 1858 days ago
      K-means has no "training data" per se.
inertiatic 1860 days ago
Search is now machine learning? Interesting introduction to the topic otherwise.
- softwaredoug 1860 days ago
  I would say this isn't machine learning, but relevance in general is an interesting topic to apply supervised learning to. Of course, the training data is the hard part.
  An article on the topic: https://opensourceconnections.com/blog/2017/08/03/search-as-... (disclaimer: I wrote it)
  - inertiatic 1859 days ago
    I also work in this field, so I do have an idea of what's possible if you apply machine learning techniques to improve relevance rankings.
    But to my intuition, basic search doesn't feel like a machine learning task. After reading some of the responses to my post, however, I'm trying to come up with a meaningful reason why I wouldn't consider IDF to be machine learning, given that it is updated as more documents enter the corpus and your system "learns" to re-rank existing result sets based on these new documents.
- Cybiote 1860 days ago
  These categorizations tend to be arbitrary and inconsistent because what counts as intelligence is probably subjective. People consider kNN and naive Bayes to be machine learning. One is "just" sorting and the other is just counting. The learning in naive Bayes, and even in some higher-order Bayesian networks, is of a similar flavor to the count structures generated for IR.
  Because we understand how these algorithms work, we can always reduce them to just this or that. Prediction in many linear classifiers is just dot products. ReLU neural networks are just lots of clamped dot products. Random projections on simple count data can generate word embeddings.
  Whether something gets called mere model-fitting, super-scaling, or compression, or instead AI, AI art, and machine learning, will depend on the field it originated from. It's indisputable that the algorithms are so reducible, but I tend to think that we should care more about functional capabilities, compared against an appropriate subset of a known intelligence's abilities, than about details of implementation.
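To make the "just dot products" and "just sorting" reductions concrete, here is a rough Python sketch. All function names, weights, and data points below are made up for illustration; nothing here comes from the article.

```python
def dot(w, x):
    """Plain dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def linear_predict(w, b, x):
    """Linear classifier: the sign of a dot product plus a bias."""
    return 1 if dot(w, x) + b > 0 else -1

def relu_layer(W, b, x):
    """One ReLU layer: one dot product per row, clamped at zero."""
    return [max(0.0, dot(row, x) + bi) for row, bi in zip(W, b)]

def knn_predict(examples, query, k=3):
    """kNN: sort labelled points by squared distance to the query, then vote."""
    nearest = sorted(
        examples,
        key=lambda e: sum((a - c) ** 2 for a, c in zip(e[0], query)),
    )[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Hypothetical toy data.
points = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_predict(points, (0.2, 0.1), k=3)
hidden = relu_layer([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], [2.0, 0.5])
decision = linear_predict([1.0, -1.0], 0.0, [2.0, 0.5])
```

None of these functions does any fitting; the "learning" part would be whatever produces the weights or the stored examples.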
- ma2rten 1860 days ago
  In general, search can be machine learning. Google certainly uses machine learning as part of its ranking.
  I guess you can make the argument that even tf-idf as described in the article is a form of unsupervised machine learning because you obtain ("learn") the idf from the data.
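A small Python sketch of what "obtaining the idf from the data" amounts to. The function name `idf`, the toy corpus, and the plain log(N / df) formula are assumptions for illustration; the article may use a different variant.

```python
import math

def idf(term, corpus):
    """Plain inverse document frequency: log(N / df), or 0.0 if the term is unseen."""
    df = sum(1 for doc in corpus if term in doc.split())
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df)

# Hypothetical toy corpus: each string is one document.
corpus = ["the cat sat", "the dog barked", "a cat purred"]
before = idf("cat", corpus)      # 2 of 3 documents contain "cat"

corpus.append("the cat slept")   # a new document enters the corpus
after = idf("cat", corpus)       # 3 of 4 documents contain "cat"
```

The value for "cat" drops once the new document arrives, so the statistic does track the corpus. Whether that counts as learning is the question being argued in this thread.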
  - jahewson 1860 days ago
    TF-IDF is a feature extracted from the data, much like a simple count of words, but it is not learned. It is simply computed. An example of learned features is word embeddings, which have to be trained on data to obtain them.
    If you want to apply machine learning to search then you need clickstream data, embeddings, or learned feature weights.
    - yorwba 1860 days ago
      Word embeddings are also "simply computed." If you use GloVe, then the vectors are obtained by factoring a matrix of co-occurrence counts.
      The difference between machine learning and "simple" feature extraction is mostly just in the choice of metaphors used to describe the computation, not in any fundamental properties.
      - ma2rten 1860 days ago
        Right. Naive Bayes is considered to be a machine learning algorithm, but also consists of just "simple counting".
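For anyone who wants to see the "simple counting" spelled out, here is a minimal multinomial Naive Bayes sketch in Python. The function names, the add-one smoothing choice, and the four toy examples are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """'Training' is counting: labels, and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return label_counts, word_counts

def predict(text, label_counts, word_counts):
    """Pick the label with the highest log prior plus smoothed log likelihoods."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for w in text.split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
examples = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
label_counts, word_counts = train(examples)
```

Calling `predict("buy cheap pills", label_counts, word_counts)` then only looks up the stored counts; there is no iterative optimization anywhere.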
    - drongoking 1860 days ago
      Your distinction between TFIDF as simply computed vs embeddings as learned is odd and artificial. Both are computations from data, but TFIDF has an understandable closed form and word embeddings do not. As for machine learning, it has to do with improvement and doesn't even necessarily need data.
      - jahewson 1859 days ago
        I think you’re right. It’s the improvement in the learning process that’s the important bit. TF-IDF lacks that.
- pilooch 1860 days ago
  Search is often formulated as 'learning to rank'.
  - inertiatic 1859 days ago
    AFAIK learning to rank refers to things more advanced than simple TFIDF.
humbleMouse 1860 days ago
Reading your site on my phone and it reloads every 10 seconds. Annoying.
rajangdavis 1860 days ago
Curious to see how this might compare against Postgres's full-text search.
The text search vector type is pretty much a poor man's bag-of-words model (with stop-word removal and some lemmatization), but instead of counts, you get the positions where the words occur.
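A rough Python sketch of that idea, positions instead of counts. This is not how Postgres implements it: the tiny stop-word list and the function name `positional_vector` are made up for illustration, and lemmatization is skipped entirely.

```python
from collections import defaultdict

# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and"}

def positional_vector(text):
    """Map each non-stop word to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    for i, token in enumerate(text.lower().split(), start=1):
        word = token.strip(".,!?")
        # Stop words are dropped but still advance the position counter.
        if word and word not in STOP_WORDS:
            positions[word].append(i)
    return dict(positions)

vec = positional_vector("The cat sat on the mat. The cat slept.")
```

The counts of a regular bag of words are still recoverable as the length of each position list, so this representation carries strictly more information.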
eggie5 1860 days ago
He generated query-document features. Now he just needs to collect relevance labels for the documents; then he can learn a ranker à la LTR.
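As a sketch of that last step, here is a tiny pairwise perceptron ranker in Python. It is one simple way to do learning to rank, not the article's method; the two features (say, a tf-idf score and a title-match flag), the relevance grades, and the function names are all hypothetical.

```python
def score(w, x):
    """Ranking score: dot product of learned weights and document features."""
    return sum(wk * xk for wk, xk in zip(w, x))

def train_pairwise(examples, epochs=10, lr=1.0):
    """Pairwise perceptron: nudge weights until better documents score higher."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for xi, yi in examples:
            for xj, yj in examples:
                if yi > yj:  # document i is labelled more relevant than j
                    diff = [a - b for a, b in zip(xi, xj)]
                    if score(w, diff) <= 0:  # pair is mis-ordered or tied
                        w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return w

# Hypothetical (features, relevance label) pairs for a single query.
examples = [
    ([3.0, 1.0], 2),
    ([2.0, 0.0], 1),
    ([0.5, 0.0], 0),
    ([1.0, 1.0], 1),
]
w = train_pairwise(examples)
ranked = sorted(examples, key=lambda e: score(w, e[0]), reverse=True)
```

With real data, the labels would come from human judgments or clickstream signals, which is the hard part mentioned elsewhere in this thread.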
ElD0C 1860 days ago
(2015)
magma17 1860 days ago
relevance==frequency?
anything is ML now...
4FNET7 1860 days ago
thanks --