First off, I want to make the disclaimer that this project is very much an MVP and the landing page/front end is quite ugly.
I'm sharing tingy, a service that lets you upload a video and query it using a text description. Each upload allows two queries, which can be any English text you wish.
Here's an example: you have 4 hours' worth of security footage, and you know that at some point during those 4 hours someone stole a bike. You could query the video for "person riding bike".
I'm looking for some test users - please reach out here if you would like to trial tingy. I would be happy to set you up with a free account.
Very interesting. I do this with my security cameras right now but mainly to find out who left poop in front of my house or who left the trash can behind my garage door. The second use case is much easier than the first since a lot of dogs walk past my house, a lot of dogs poop in front of my house, but not a lot of dogs walk by that leave poop behind (or sometimes the poop is out of view).
It's great in theory, but the real world creates harder challenges. The real answer is to keep track of all the dogs that pass by and then compare the before/after scenes for differences. I'm doing this with my garage door camera for detecting water (for a completely different problem).
I could see some users wanting an ongoing subscription, such as a security office that looks through footage regularly to investigate thefts. But I have to think that in a lot of cases there will be people who just want to search through a video or two on a one-time basis. Why not a per-video, non-subscription pricing option?
I can't speak to the economics on your side, but from a consumer point of view, the main issue I'm having is: how do I know this thing will even work? I don't want to shell out $10 and wait through processing time only to find out it doesn't.
Maybe one way is to let users upload a video for free, interactively search the first 10s for free, and then require payment to look through the rest of the video. If I felt confident that it was going to work, and I had a real reason to search through the video, I would probably pay between $1 and $5, depending on the length of the video. Much more than that and I'd rather just scrub through it myself. I guess the length of the video is roughly proportional to how much I'm willing to pay to avoid scrubbing through it myself.
The other thing is that it's much easier to scrub through and find something visual than something in the audio. If I want to find a bike, I can just scrub through at 10x and look for bikes; a bike isn't going to appear in a single frame and disappear in the blink of an eye. If it also searched the audio, it would be worth even more.
Nice to see someone pursue this. I built something similar at a hackathon in NYC a few years back and won. I remember the tricky part was that in addition to uploading videos, I wanted people to just paste a URL, like a YouTube video, and process that. But the Vision API only worked with videos already sitting in a storage bucket. Spent the whole night building a workaround.
Interesting idea. As others have suggested it would really help if you have an accompanying video to support the claims.
A few thoughts/questions here:
1. What markets and use-cases were you thinking of when building out this MVP? The applications could be broad, but it seems like you expect CLIP to handle bespoke queries and hope that it returns a relevant result. Also, it might be interesting to test a search for something that doesn't exist in the video: can you handle that well enough (assuming it's just a simple threshold you're picking to identify relevant search results)?
2. Licensing is something that has always piqued my curiosity when it comes to ML-based apps. Do you have a sense of the commercial-use terms for models such as CLIP, especially when the datasets they were probably trained on were not licensed for commercial use? This also applies to the raw video data uploaded by the user.
Markets: one of the groups I had in mind:
- production companies with large video archives (this would require more tooling)
I am unsure whether to focus on one of these groups or to go for a more generic tool. I'll add a video demo to the landing page. So far, in all the tests I've performed, the ML model generalizes well enough to cover this range of uses.
Licensing: I need to research this further. I'm also not sure how the licensing changes given that I've fine-tuned the model on my own data.
Re: licensing, the world of startups is somewhat of a wild-west these days with folks offering pre-trained models as-a-service without really thinking about the licensing implications (both on the dataset and model front). Huggingface is a classic example, and they seem to suggest that it's perfectly OK to fine-tune and use commercially (https://github.com/huggingface/transformers/issues/3357#issu...), but I'm not certain that their lawyers would put it the same way.
Pre-trained CLIP gets you 95% of the way there, so you're correct, fine-tuning isn't necessary to test the market. The one downside of pre-trained CLIP is that it hasn't been trained on still frames from videos. These have different noise characteristics and contain considerably more motion blur than the average image used for training.
This looks dope! Any chance you'd be willing to share a bit about the core tech underneath? I.e., assuming this is neural-net based, which architecture/paper/repo did you use? Did you have to do any training/fine-tuning? Etc.
I totally understand if you'd like to keep some/all of this secret, but I thought it's worth a shot :)
ML: a fine-tuned CLIP model. Each video frame is embedded using CLIP, and the image embedding is then compared against the text query embedding.
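To make the comparison step concrete, here's a rough sketch of ranking frames against a query, assuming the frames and the query have already been embedded with CLIP. The function name, `top_k` parameter, and use of cosine similarity are my own illustration, not necessarily exactly what the service does:

```python
import numpy as np

def rank_frames(frame_embeddings: np.ndarray, query_embedding: np.ndarray, top_k: int = 5):
    """Rank frames by cosine similarity between image and text embeddings.

    frame_embeddings: (n_frames, dim) array of CLIP image embeddings.
    query_embedding:  (dim,) CLIP text embedding for the query.
    Returns (indices, scores) of the top_k best-matching frames.
    """
    # L2-normalize both sides so the dot product equals cosine similarity
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = frames @ query
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]
```

The top-scoring frame indices can then be mapped back to timestamps given the frame-sampling rate.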
Architecture: everything is serverless on AWS Lambda. The basic flow is: video uploaded to storage, a Lambda converts the video to still frames, ML inference runs on each frame, and the inference results are aggregated into the customer-facing output.
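The aggregation step at the end might look something like this: turning per-frame similarity scores into timestamped match ranges. This is a minimal sketch under my own assumptions; the function name, the similarity `threshold`, and the frame-sampling `fps` value are illustrative, not details from the actual service:

```python
def aggregate_hits(scores, fps=1.0, threshold=0.3):
    """Group consecutive above-threshold frame scores into (start_sec, end_sec) ranges.

    scores:    per-frame similarity scores, in frame order.
    fps:       rate at which frames were sampled from the video (assumed).
    threshold: assumed similarity cutoff for calling a frame a match.
    """
    ranges = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # open a new match range
        elif s < threshold and start is not None:
            ranges.append((start / fps, (i - 1) / fps))  # close the range
            start = None
    if start is not None:  # range still open at end of video
        ranges.append((start / fps, (len(scores) - 1) / fps))
    return ranges
```

An empty result list would also be the natural way to answer a query for something that never appears in the video, per the thresholding question above.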