Forgive my ignorance, but is there something like word2vec but for images, an "image2vec"? For text processing, word2vec is one of the best approaches. So couldn't you describe the features of images using neural networks and then vectorize them the way word2vec does for words?
It's been a couple of years since my computer vision course (it was my favorite course in university), but isn't SIFT a bit '99? Aren't there better methods now, such as neural networks, for feature description?
Recent research in deep manifold traversal may interest you. This method uses deep neural networks to approximate the manifold of natural images, learns transformations that traverse the manifold from one image class to another (e.g. from images of young people to images of old people), and then lets you transform any source image into a new, automatically generated image by mapping it to the manifold and performing the transformation. So, for example, the paper's authors learn the transformation from images of young people to old people, and are then able to generate realistic-looking images of celebrities with older facial features (as well as darken hair colors, change skin tones, and even colorize black-and-white images). This is somewhat analogous to the way word2vec vectors allow you to do things like queen = king + (woman - man), with an image's projection onto the approximated manifold playing the role of a word's word2vec vector.
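As a toy illustration of that vector-arithmetic analogy (these are hand-picked 3-d vectors, not real word2vec embeddings):

```python
import numpy as np

# Toy, hand-picked 3-d "embeddings" (not real word2vec vectors);
# the dimensions loosely encode (royalty, maleness, person-ness).
vecs = {
    "king":  np.array([0.9, 0.9, 1.0]),
    "queen": np.array([0.9, 0.1, 1.0]),
    "man":   np.array([0.1, 0.9, 1.0]),
    "woman": np.array([0.1, 0.1, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

With real embeddings the arithmetic only lands *near* the answer, so the nearest-neighbor lookup at the end is essential; the same nearest-neighbor step is what an image's manifold projection would need, too.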
I think that, in addition to histograms and features like SIFT, SURF, and DAISY that permit searching with an image as the query, it would be beneficial to build a text index from the class labels at the very top of a neural network. Although, you're right, some features from the layers just below could be used too, as I understand it.
You could then use classical text indexing on those labels, perhaps with a topic model like LDA. An image with a plane in it would be indexed under "plane" via the output of the neural network, but would also come first, or near the top of the results, when using "flight" as the query via the topic model.
Ditto for word2vec or paragraph2vec over those words; the benefit is that you can bring the relational knowledge contained in the textual training data (Wikipedia or something else) to bear on the problem. E.g. a golf club and a baseball glove might not be correlated in the neural network that annotated the images, but might be correlated in the text-based model trained on Wikipedia, so a query for "sport" might bring up both images.
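A minimal sketch of that query-expansion idea, using made-up embedding vectors and a tiny label index (with a real system the vectors would come from a word2vec model trained on Wikipedia):

```python
import numpy as np

# Hypothetical word embeddings (stand-ins for word2vec vectors).
emb = {
    "sport":    np.array([1.0, 0.9, 0.0]),
    "golf":     np.array([0.9, 0.8, 0.1]),
    "baseball": np.array([0.8, 0.9, 0.1]),
    "plane":    np.array([0.0, 0.1, 1.0]),
}

# Image index: image id -> labels produced by the image classifier.
index = {
    "img1.jpg": ["golf"],
    "img2.jpg": ["baseball"],
    "img3.jpg": ["plane"],
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, threshold=0.8):
    # Expand the query: any image whose label embedding is close to
    # the query embedding counts as a match, even if the classifier
    # never emitted the query word itself.
    q = emb[query]
    return sorted(img for img, labels in index.items()
                  if any(cosine(q, emb[l]) > threshold for l in labels))

print(search("sport"))  # ['img1.jpg', 'img2.jpg']
```

Neither image was labeled "sport" by the classifier; the match comes entirely from the text-side embedding similarity.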
The big players like Google already have caption generation that captures relationships between objects.
I think that, depending on the need, neural network feature detection/description might be overkill.
What is great about SIFT and the more modern ORB, BRISK, and AKAZE is that they are fast, and given an appropriate implementation they can work just as well as a neural network. I haven't researched NN computer vision a whole lot, but it seems it might be slower at feature detection/description than the traditional approaches. If that's the case, then NNs won't work that well for live/near-live video processing.
I'm also pretty ignorant here, but if I understand word vectors correctly, they are a trained model that results in the ability to predict a word from its surrounding context. I can imagine extending that idea to images, i.e. in such a way that the color of a target pixel can predict the colors of surrounding pixels. I wonder if tools like Photoshop and Gimp don't already have similar algorithms for some advanced effects. In any case, it seems several layers of abstraction down the stack from the problem of attaching meaning to images. Isn't that more what the BoW approach tries to do?
There are tons more local features proposed since SIFT, but SIFT is, arguably, still among the best. Deep features, on the other hand, are more like global features. Maybe a combination of the two could give the best results?
For large datasets the simple bag-of-words approach actually is not that great, since for a given set of features you have to compare each one against the whole vocabulary. A more modern approach uses a vocabulary tree to represent your bag of words. The vocabulary tree significantly reduces the amount of matching that has to be done for each individual feature.
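A minimal sketch of why the tree helps, with hand-picked centroids over toy 2-d descriptors (a real vocabulary tree is built by hierarchical k-means over SIFT/ORB descriptors and is much deeper):

```python
import numpy as np

# Two-level vocabulary "tree": 3 coarse centroids, then 3 fine
# centroids per branch, for 9 visual words total. A flat BoW lookup
# compares a descriptor against all 9 words; the tree compares
# against only 3 + 3. At realistic sizes (e.g. 10^6 words) the gap
# between k^depth and k*depth comparisons is enormous.
coarse = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
fine = {
    0: np.array([[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
    1: np.array([[9.0, 0.0], [10.0, 1.0], [11.0, 0.0]]),
    2: np.array([[0.0, 9.0], [1.0, 10.0], [0.0, 11.0]]),
}

def quantize(desc):
    """Map a descriptor to a visual-word id by descending the tree."""
    branch = int(np.argmin(np.linalg.norm(coarse - desc, axis=1)))
    leaf = int(np.argmin(np.linalg.norm(fine[branch] - desc, axis=1)))
    return branch * 3 + leaf  # word id in [0, 8]

print(quantize(np.array([10.2, 0.9])))  # 4: branch 1, leaf 1
```

Once every descriptor is quantized to a word id like this, indexing and retrieval proceed exactly as with a flat vocabulary, just much faster.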
In my [limited] experience with CBIR and image-based search, I found that using a perceptually uniform color space (such as one of the CIE Lab variants) was more effective than a purely geometric color space (such as RGB or HSV), as color similarities in the latter may not make much sense to a human.
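A rough sketch of the standard sRGB-to-Lab conversion (D65 white point), so distances can be taken in the perceptual space; libraries like scikit-image (`skimage.color.rgb2lab`) do this for you in practice:

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIE L*a*b* (D65 white)."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma curve
    lin = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    # Linear RGB -> XYZ (sRGB/D65 matrix)
    m = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = m @ lin
    # Normalize by the D65 reference white, apply the Lab nonlinearity
    xyz /= np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > 0.008856, np.cbrt(xyz), 7.787 * xyz + 16.0 / 116.0)
    return np.array([116.0 * f[1] - 16.0,
                     500.0 * (f[0] - f[1]),
                     200.0 * (f[1] - f[2])])

def delta_e(c1, c2):
    """Euclidean distance in Lab (CIE76 delta-E) between two sRGB colors."""
    return float(np.linalg.norm(srgb_to_lab(c1) - srgb_to_lab(c2)))

print(srgb_to_lab([1.0, 1.0, 1.0]))  # roughly [100, 0, 0]
```

Euclidean distance in Lab (delta-E) tracks perceived color difference far better than Euclidean distance in RGB, which is the point the comment is making.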
Does anyone have a solution to this where the input (search) is also a vector as opposed to a single color, but would still allow for exact color matches?
Bucketing means that you can't get the granularity of a specific shade of a color.
I put quite a bit of similar effort into image retrieval using Elasticsearch before. While it is nice and convenient, what I found was that Elasticsearch was too slow at larger scale (millions of images with a dictionary of millions of visual words in the BoW model). Maybe there were some steps I didn't do right, but I gave up.
Really cool. But maybe it's just me: aren't butterflies kind of hard to distinguish from each other? Looking at his search result page, I couldn't really tell whether the results were good or not, because they all looked similar shape-wise. The only difference is color, and even then it's fairly subtle.
The strategy does seem to assume a highly normalized set of images that only differ in color and contour.
I wonder if pHash wouldn't make this a lot more effective. Anyone tried building an ES-usable distance function for pHash?
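For a sense of what such a distance function would compute: here's a sketch of a perceptual hash in the pHash family plus the Hamming distance ES would need to score on. Real pHash takes a DCT of the downscaled image; this is the simpler "average hash" cousin (shrink to 8x8 by block-averaging, threshold at the mean, pack into 64 bits):

```python
import numpy as np

def average_hash(gray):
    """64-bit average hash of a 2-d grayscale array (aHash, pHash's simpler cousin)."""
    h, w = gray.shape
    # Crop to multiples of 8, then block-average down to 8x8
    small = gray[:h - h % 8, :w - w % 8].reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a, b):
    """Bit distance between two 64-bit hashes; this is the scoring
    function an ES plugin/script would need to expose."""
    return bin(a ^ b).count("1")

g = np.tile(np.arange(64, dtype=float), (64, 1))  # horizontal gradient
print(hamming(average_hash(g), average_hash(g.T)))  # 32 of 64 bits differ
```

Since the hash is just 64 bits, one indexing trick is to store it as several short substrings in ES and require at least one exact substring match before re-ranking candidates by full Hamming distance.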