"a general search can be machine learning" — I don't get this sentence. Machine learning is about building a mathematical model from sample data, known as "training data".
Actually, any unsupervised method, including clustering, still has training data. The only difference is that it doesn't have a target y variable in the training set against which to minimize an error metric, hence the name unsupervised.
But the definition you mention is right. Still, any dataset you use to fit your model is your training set, even if you don't have a train/test split or the like, because you used it to train your model.
I would say this isn't machine learning, but relevance in general is an interesting area in which to apply supervised learning. Of course, the training data is the hard part.
I also work in this field, so I have an idea of what's possible if you apply machine learning techniques to improve relevance rankings.
But to my intuition, basic search doesn't feel like a machine learning task.
After reading some of the responses to my post, however, I'm trying to come up with a meaningful reason why I wouldn't consider IDF to be machine learning, given that it is updated as more documents enter the corpus and the system "learns" to re-rank existing result sets based on these new documents.
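The "IDF updates as the corpus grows" point can be made concrete with a tiny sketch (the smoothed IDF formula here is one common variant, not necessarily the one any particular engine uses):

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency over a list of token lists."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1

docs = [["cat", "sat"], ["cat", "ran"]]
before = idf("cat", docs)
docs.append(["dog", "ran"])   # a new document without "cat" arrives
after = idf("cat", docs)
# "cat" is now rarer relative to the corpus, so its weight rises
# and existing result sets would be re-scored accordingly
print(before < after)
```

Whether you call that re-weighting "learning" is exactly the question at hand.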
These categorizations tend to be arbitrary and inconsistent because what counts as intelligence is probably subjective. People consider kNN and naive Bayes to be machine learning, yet one is "just" sorting and the other is just counting. The learning in naive Bayes, and even in some higher-order Bayesian networks, is of a similar flavor to the count structures generated for IR.
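The "naive Bayes is just counting" claim is easy to demonstrate: a minimal sketch (toy data and Laplace smoothing are my choices, not from the source) where "training" is nothing but tallying frequencies, much like building an IR index:

```python
from collections import Counter, defaultdict
import math

def train_nb(examples):
    """'Training' is nothing but counting word and class frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts

def predict_nb(words, class_counts, word_counts, vocab_size):
    """Pick the class with the highest smoothed log-probability."""
    best, best_score = None, float("-inf")
    total = sum(class_counts.values())
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + vocab_size
        for w in words:
            # Laplace (add-one) smoothing for unseen words
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

examples = [(["free", "money"], "spam"), (["meeting", "notes"], "ham"),
            (["free", "offer"], "spam"), (["project", "notes"], "ham")]
cc, wc = train_nb(examples)
vocab = {w for words, _ in examples for w in words}
print(predict_nb(["free", "money"], cc, wc, len(vocab)))  # spam
```

Everything the model "knows" lives in those two count tables.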
Because we understand how these algorithms work, we can always reduce them to "just" this or that. Prediction in many linear classifiers is just dot products. ReLU neural networks are just lots of clamped dot products. Random projections on simple count data can generate word embeddings.
Whether something gets labeled mere model-fitting, super-scaling, compression, AI, AI art, or machine learning will depend on the field it originated from. It's indisputable that the algorithms are reducible this way, but I tend to think we should care more about functional capabilities, measured against an appropriate subset of a known intelligence's abilities, than about details of implementation.
In general, search can be machine learning. Google certainly uses machine learning as part of its ranking.
I guess you can make the argument that even tf-idf as described in the article is a form of unsupervised machine learning because you obtain ("learn") the idf from the data.
TF-IDF is a feature extracted from the data, much like a simple word count, but it is not learned; it is simply computed. An example of learned features is word embeddings, where you must train on data to obtain them.
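The "simply computed" point is that TF-IDF is one closed-form expression per (term, document) pair, with no fitting loop. A minimal sketch (this is the plain tf × idf variant; real engines use smoothed or normalized forms):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf_idf(term, doc, docs):
    """Plain tf * idf: a direct formula, nothing is fitted."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

print(round(tf_idf("cat", docs[0], docs), 3))
```

Contrast this with an embedding model, where the vectors only exist after an optimization procedure has run over the corpus.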
If you want to apply machine learning to search then you need clickstream data, embeddings, or learned feature weights.
Word embeddings are also "simply computed." If you use GloVe, the vectors are obtained by factoring a matrix of co-occurrence counts.
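To illustrate the point that embeddings are a computation over counts: a toy sketch using truncated SVD of log counts as a crude stand-in for GloVe's weighted least-squares factorization (the vocabulary and counts here are invented):

```python
import numpy as np

# Toy word-word co-occurrence counts (rows and columns share one vocabulary).
vocab = ["cat", "dog", "sat", "ran"]
counts = np.array([[0, 1, 2, 1],
                   [1, 0, 2, 2],
                   [2, 2, 0, 0],
                   [1, 2, 0, 0]], dtype=float)

# Factor the log-count matrix; GloVe's actual objective is a weighted
# least-squares fit, but the spirit is the same: dense vectors from counts.
u, s, _ = np.linalg.svd(np.log1p(counts))
embeddings = u[:, :2] * s[:2]   # 2-dimensional word vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with similar co-occurrence profiles should land near each other.
print(cos(embeddings[0], embeddings[1]))
```

Nothing here is more than linear algebra on a count table, which is the commenter's point.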
The difference between machine learning and "simple" feature extraction is mostly just in the choice of metaphors used to describe the computation, not in any fundamental properties.
Your distinction between TF-IDF as simply computed and embeddings as learned is odd and artificial. Both are computations over data; the difference is that TF-IDF has an understandable closed form and word embeddings do not.
As for machine learning, it has to do with improvement, and it doesn't even necessarily need a fixed dataset (reinforcement learning, for instance, can generate its own experience).
Curious to see how this might compare against Postgres's full-text search.
The text search vector type is pretty much a poor man's bag-of-words model (with stop-word removal and some lemmatization), but instead of counts, you get the positions where the words occur.
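A rough Python analogue of that structure, assuming a toy stop-word list and a deliberately crude stand-in for real lemmatization (Postgres's to_tsvector uses proper dictionaries; this is only a sketch of the shape of the data):

```python
STOP_WORDS = {"the", "a", "is", "in", "of"}

def to_tsvector(text):
    """Map each surviving lexeme to the positions where it occurs,
    after stop-word removal and (very crude) stemming."""
    vector = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        word = word.rstrip("s")   # stand-in for real lemmatization
        if word and word not in STOP_WORDS:
            vector.setdefault(word, []).append(pos)
    return vector

# Positions, not counts, are stored for each lexeme,
# which is what makes phrase and proximity queries possible.
print(to_tsvector("The cat chases the cats"))
```

Keeping positions rather than counts is what distinguishes this from a plain bag-of-words vector.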
If you want to talk about machine learning and search, you should probably talk about learning to rank (https://en.m.wikipedia.org/wiki/Learning_to_rank).
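For a flavor of what learning to rank means in its simplest (pointwise) form, here is a sketch, with invented features and labels, that fits feature weights so a scoring function predicts relevance; real systems use pairwise or listwise objectives such as RankNet or LambdaMART:

```python
import numpy as np

# Each row: features for a (query, document) pair,
# e.g. [tf-idf score, exact title match] -- invented for illustration.
X = np.array([[0.9, 1.0], [0.2, 0.0], [0.7, 0.0], [0.1, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])   # relevance labels, e.g. from clicks

# Pointwise learning to rank: logistic regression on the features.
w = np.zeros(2)
for _ in range(500):
    pred = 1 / (1 + np.exp(-X @ w))        # logistic scoring function
    w -= 0.5 * X.T @ (pred - y) / len(y)   # gradient step on log loss

# Documents are then re-ranked by the learned score.
scores = X @ w
print(scores)
```

The "learning" is exactly the part basic TF-IDF search lacks: the weights change in response to labeled outcomes.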
The usual definition (due to Mitchell) is that machine learning is a system whose performance on a given task improves with past experience.
An article on the topic, https://opensourceconnections.com/blog/2017/08/03/search-as-... (disclaimer I wrote it...)
anything is ML now...