Getting image and video orientation correct, which should be trivial, is decidedly not. In writing PhotoStructure, I discovered:
1) Firefox respects the Orientation tag via the standard CSS image-orientation property, but Chrome doesn't
2) That videos sometimes use "Rotation", not "Orientation" (and Rotation is encoded in degrees, not the crazy EXIF rotation enum). Oh, and some manufacturers use "CameraOrientation" instead, just for the LOLs 
3) That the embedded preview images in JPEGs, RAW files, and videos are sometimes rotated correctly and sometimes not, depending on the Make and Model of the camera that produced the file. If Orientation or Rotation says anything other than "not rotated," you can't trust what's in that bag of bits.
And after all that, editing (simply rotating) the photo in different applications has different results: some change only the Orientation tag, others change the actual pixel data, and some seem to change both, so the image is still incorrect when opened in other viewers. Then there's the embedded thumbnail (but that is rarely used). The result is a mess.
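For reference, the "crazy EXIF rotation enum" mentioned above has exactly eight values. A quick sketch of the mapping (the descriptions follow exiftool's conventions for the Orientation tag):

```python
# The eight values of the EXIF Orientation tag (0x0112) and the transform
# a viewer must apply to display the stored pixels upright.
EXIF_ORIENTATION = {
    1: "normal (no transform needed)",
    2: "mirror horizontally",
    3: "rotate 180 degrees",
    4: "mirror vertically",
    5: "mirror horizontally, then rotate 270 degrees clockwise",
    6: "rotate 90 degrees clockwise",
    7: "mirror horizontally, then rotate 90 degrees clockwise",
    8: "rotate 270 degrees clockwise",
}

# Video "Rotation" metadata, by contrast, is plain degrees: 0, 90, 180, 270.
```

Values 2, 4, 5, and 7 (the mirrored ones) are rarely produced by real cameras, which is one reason software handling them is so poorly tested.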
I'm interested in your PhotoStructure application, just subscribed to the beta!
When you rotate a photo or video in PhotoStructure, you can have it persist rotation/orientation by updating the file directly (PhotoStructure uses exiftool under the hood), but it's not the default out of concern for unknown bugs that may invalidate the original file in some way.
By default it just writes the new orientation to an XMP or MIE sidecar. The downside of this approach is that most applications don't respect sidecars.
This has caused me quite a bit of trouble with my image gallery as well. I've taken to just using exiftool to remove all rotation from my photos, which breaks half of them, and then manually hard-rotating them using ImageMagick, which is technically not a lossless operation if I'm understanding it correctly.
I really don't understand why it was decided that most photo viewing applications would honor EXIF rotation, but web browsers would not.
This would (maybe) help with a common thing my wife does when taking videos: Start recording in portrait mode, then realize you did that, and rotate the phone 90 degrees to get widescreen video (but without restarting the recording).
When you play it back on a phone (with auto-orientation mode on), starting from holding the phone in portrait mode (as you normally do):
* it starts playing back as portrait, which looks fine
* the video rotates (because the camera was physically rotated), so now you're watching a widescreen video that's 90 degrees off
* Your natural reaction is to flip the phone 90 degrees to make down "down" again, but this changes the phone into widescreen mode, and because it thinks it's playing a portrait-style video, it changes to portrait-in-widescreen mode, and now the video is again tilted 90 degrees but 1/3 the size with huge black bars on either side
If you play it back on a computer/TV, you get the same end result: a widescreen video that's rotated 90 degrees, and 1/3 the size with huge black bars on either side.
Can't help you with the video-taking technique, but when playing back the videos: if you hold the phone so that the video matches the screen, then rotate it so the screen faces upward, you can then spin it so it looks the right way up, as long as you keep the screen vertical enough.
One of my pet peeves in ML/stats/data science is people who hardly look at their data. Unless there are privacy reasons not to, you really need to look at some data. You'll learn so much more from looking at a few hundred samples than you will from metrics alone. You'll get a feel for how complex the problem is, or whether something simple will do. Check your assumptions. You might even realize that your images are sideways.
The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns. Luckily, your brain is pretty good at this. One time I discovered that the data contained duplicate examples. Another time I found corrupted images / labels. I look for data imbalances and biases. I will typically also pay attention to my own process for classifying the data, which hints at the kinds of architectures we’ll eventually explore. As an example - are very local features enough or do we need global context? How much variation is there and what form does it take? What variation is spurious and could be preprocessed out? Does spatial position matter or do we want to average pool it out? How much does detail matter and how far could we afford to downsample the images? How noisy are the labels?
It's not directly related to this topic, but my favorite example of this sentiment is Anscombe's Quartet, four sets of datapoints that have (almost) the same statistical values but a very obviously different layout when simply viewed together on a graph.
Plus an animated version that includes a T-Rex.
That’s what my first numeric mentor taught me. You have to look at raw data. The first question he’d ask any programmer was “did you look at the data, the actual data?”. He was a PhD in Physics and his approach really stuck with me.
But it’s not always straightforward to “look at data”
Just looking at data and descriptive statistics is one of the first things a person is taught in machine learning, data science and statistics coursework. It’s a major skill in the field that is emphasized all the time.
Practitioners frequently do cursory data analysis and data exploration to gain insight into the data, corner cases and which modeling approaches are plausible.
Just to give some examples, Bayesian Data Analysis (Gelman et al), Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill), Doing Bayesian Data Analysis (Kruschke), Deep Learning (Goodfellow, Bengio, Courville), Pattern Recognition and Machine Learning (Bishop) and the excellent practical blog post  by Karpathy all list graphical data checking, graphical goodness of fit investigation, descriptive statistics and basic data exploration as critical parts of the model building workflow.
If you are seeing people produce models without this, it’s likely because companies try to have engineers do this work, or hire from bootcamps or other sources that don’t produce professional statisticians with grounding in a proper professional approach to these problems.
When people mistakenly think models are commodities you can copy paste from some tutorials, and don’t require real professional specialization, then yes, you get this kind of over-engineered outcome with tons of “modeling” that’s disconnected from the actual data or stakeholder problem at hand.
I've tried working like that and it makes a massive difference. Being able to visualize all the intermediate stages is valuable, because they are basically never right until you isolate and refine them individually.
This is a byproduct of hiring bootcamp grads or tasking a modeling project to engineers who read some tutorials. People think they can scan a few Jupyter notebooks and then professionally solve statistics problems.
People wonder why it’s hard and expensive to hire ML engineers... because they actually solve these problems with craft. Meaning, they systematically grow understanding of the data, start with simple models, and have well articulated reasons explaining cases when complexity is justified.
Yes, thank you. This can't be stated enough or loudly enough! You absolutely have to manually check, then check it again -- even with privacy issues, find a way to be non-identifiable and check it. Data is such a huge factor in whether your project will fail or not, yet too many people don't give it the respect (and dare I say love) it deserves.
I made an image upload widget that provided a preview, and when users selected the "take a picture" option on their phones, I showed the preview with a blob link and CSS background-image property. The images were showing sideways on some phones.
I looked at the EXIF data of those photos and of course Orientation: 90 showed up.
It was easy to fix on the backend when processing the images, but I struggled to do it in a performant way on the front-end. One solution involved reading the EXIF data with JS and rotating with canvas and .toBlob(), but it proved too slow for large 10MB photos, as it blocks the main UI thread.
One thing I thought of is just reading the orientation and then using CSS transforms to rotate the preview, but I never got around to trying it.
This shows up even in basic websites! My partner, who is an artist, ended up with some of her portfolio broken seemingly at random in different browsers. For some of the images, "up" wasn't visually obvious, and she'd also rotated some of them in various apps (Preview, Adobe things, etc.), so there wasn't a good way to just change everything at once. She ended up having to do a ton of work stripping EXIF data on an image-by-image basis.
Basically entirely because of this ridiculous issue where Chrome refuses to respect EXIF tags.
Wait, I know it's a big company, but you're at the same company...
Your second approach is almost exactly what I've done in an iPhone video editing app I've been working on to deal with video previews.
Rather than reencode the video just to preview it, I can just apply the same transforms (scale, rotation, translation) to the view that's displaying the preview. I then mask that view with another view of the same size so it doesn't go outside the edges.
Of course I still need to encode the video with those transforms if I want it to show up in their camera roll later.
If you don't need to actually modify the image data, you shouldn't be doing it at all. The correct solution is to specify a rotational transform so the hardware will do the heavy lifting on the GPU, where this kind of computation belongs.
The faster a CPU finishes, the faster the CPU can sleep, which is how you save battery life. The more cores you use the lower the CPU frequency can be, which saves power, since frequency increases do not use power linearly.
> The faster a CPU finishes, the faster the CPU can sleep, which is how you save battery life.
> The more cores you use the lower the CPU frequency can be, which saves power, since frequency increases do not use power linearly.
More cores in use has little to do with frequency and more to do with heat. More heat means more thermal throttling which lowers frequency. Lower frequency means that the CPU doesn't sleep sooner.
Yup. That's exactly why I don't want them. Why should I execute something which doesn't, and shouldn't, have anything to do with rendering page content?
Using all your cores for the same workload would mean it finishes faster or finishes in the same time with significantly lower frequency. It saves power and heat. Your example would mean using more cores for the same amount of time, which makes no sense in this comparison.
Do you also buy single core computers to save power?
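The nonlinearity being argued about above comes from the usual dynamic-power model P ≈ C·V²·f; since voltage has to rise roughly with frequency, power grows roughly as the cube of frequency. A toy calculation under that assumption (a back-of-envelope rule of thumb, not a datasheet-accurate model):

```python
# Toy model: dynamic power P = C * V^2 * f, with V scaling linearly with f,
# giving P proportional to f^3.
def relative_power(freq_ratio: float) -> float:
    """Power relative to baseline when frequency is scaled by freq_ratio."""
    return freq_ratio ** 3

# Two cores at half the frequency: roughly the same throughput, but
# 2 * (0.5 ** 3) = 0.25x the dynamic power of one core at full speed.
two_slow_cores = 2 * relative_power(0.5)
one_fast_core = 1 * relative_power(1.0)
print(two_slow_cores, one_fast_core)  # 0.25 1.0
```

This is the "race to sleep" versus "wide and slow" tradeoff in miniature; real chips complicate it with static leakage, thermal throttling, and imperfect parallel scaling.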
Have run into this myself when building the same "preview upload" feature. It's annoying that it's not something that is just handled by the browser, it feels like it should be a supported feature of the "img" tag or something.
> One solution involved reading the EXIF data with JS and rotating with canvas and .toBlob()
At my previous job I did the same thing, although I never noticed a significant slowdown. I also made the file size smaller since we wanted to have predictable upload times and mitigate excessive usage of storage space.
The other reason was that the EXIF data was weird on some devices and the back-end library didn't rotate them correctly.
It might not be up to the app? If the phone has hardware-accelerated JPEG compression, then potentially the image will already be compressed in the 'wrong' orientation before the app gets its hands on it. So rotating the image data could involve re-compressing the image again, leading to quality loss. Or if you choose to get the raw sensor data instead, and rotate the data before doing the compression yourself, you might lose out on the hw acceleration entirely.
(I've not developed any camera apps, so this is just a guess!)
"... It can also perform some rearrangements of the image data, for example turning an image from landscape to portrait format by rotation.
jpegtran works by rearranging the compressed data (DCT coefficients), without ever fully decoding the image. Therefore, its transformations are lossless: there is no image degradation at all, which would not be true if you used djpeg followed by cjpeg to accomplish the same conversion. ..."
Because the data sources are unrelated. The camera sensor is hooked up to the ISP which is hooked up to a hardware JPEG encoder. This is necessary in order to get those hyper fast shots off.
You'll notice that an orientation sensor is nowhere in that list. So what happens is the camera hardware spits out a JPEG. The app then combines it with the orientation sensor reading and produces the EXIF headers. It could choose to decode, rotate, and re-encode, but that's slow (~100ms) and hurts shot-to-shot latency. And it loses quality. And, hey, since everything supports EXIF orientation anyway, why bother?
Or it could simply rotate without decoding or re-encoding, which has the added advantage of being lossless.
Obviously it's still added processing time and (probably more importantly) development time, so it's generally not worth bothering, however it's important to point out that JPEG rotation can (in the case of 90 degree increments) be done losslessly.
The phone is accurately recording the image, as well as the orientation. The bug happens when the EXIF information is stripped. Sure, phone apps could add an option to physically rotate the image, as a bug workaround, but it's not surprising that they don't do this.
It is surprising, because it seems like such an obvious fix to the problem of stripped EXIF data. Couldn't be that hard to implement a user setting which tells the app to rotate the file itself before saving.
This is a major oversight for the companies who develop camera apps. The major ones even have whole teams dedicated to that single app.
I'm not sure how good your algorithm is if it only works in a specific orientation. A slightly tilted image can already cause problems for such an algorithm, and many people have a hard time taking a picture with less than a few degrees of tilt.
Depends. Based on the linked article, the picture can be completely upside down if it was taken in landscape mode. The article I linked is capable of rotating it upright regardless. After rotating to the right orientation, left and right are suddenly perfectly workable.
Of course using exif data for such rotations is easier, but a tilted picture of a tilted sign can create a lot of tilt that human vision copes with fine but an orientation dependent network cannot.
If the algorithm fails with a rotated image, then I would claim it's a bad algorithm that is over-fit and not at all generalising what it has learnt. What about slight rotation? Where does it start failing, and would a human?
Also, making an algorithm for detecting rotated images should be easy if it affects the results so much.
One of the common image processing libraries does take EXIF rotation into account: OpenCV. I would tend to use that over manually rotating with the code from the article. Beware, though, that OpenCV cannot open quite as many formats as Pillow; most notably it cannot open GIFs due to patent issues. You can get a prebuilt wheel of OpenCV by pip installing the package opencv-python
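For Pillow users, recent versions ship an equivalent one-liner, ImageOps.exif_transpose. A self-contained sketch (building a tagged JPEG in memory so there's no file dependency; the tiny 4x2 image is just for illustration):

```python
import io
from PIL import Image, ImageOps

# Build a tiny JPEG tagged Orientation=6 ("rotate 90 degrees CW to display").
exif = Image.Exif()
exif[0x0112] = 6  # 0x0112 is the EXIF Orientation tag
buf = io.BytesIO()
Image.new("RGB", (4, 2), "red").save(buf, format="JPEG", exif=exif.tobytes())
buf.seek(0)

img = Image.open(buf)
upright = ImageOps.exif_transpose(img)  # applies the tag to the pixels

print(img.size, upright.size)  # (4, 2) (2, 4)
```

Note that exif_transpose returns a transposed copy and removes the Orientation tag from it, so saving the result won't double-rotate in EXIF-aware viewers.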
There's a big photography site I use that treats the EXIF inconsistently. I rotate the image in my editor to work on it, then save and upload. In some contexts it looks OK, but in other contexts the site rotates the image again and it's wrong. I don't want to strip the EXIF because it has interesting information such as the camera model and lens, and the exposure settings. My editor doesn't correct the EXIF rotation setting, so I have no choice but to use a utility to strip that single value from the file before I upload it.
This is functionally a coordinate system problem. Thankfully, it's pretty easy here. Just wait until we start getting more models for 3D data. I worked with 3D data for the longest time, and it was incredibly painful — different libraries can use wildly different coordinate systems (e.g. I've seen the up-direction be +z, -z, +y, and -y). At that point, it's nontrivially difficult to even figure out what the right way to convert between coordinate systems is.
The last time I measured ImageNet JPEGs with EXIF orientation metadata, the number of affected images was actually quite small (< 100, out of a dataset of 1.28M).
There are also some duplicates, but altogether it seems fairly "clean."
Data augmentation can have unwanted consequences. For example, horizontal or vertical flipping, what could be more harmless? You can still recognize stuff when it's upside-down, can't you? It's a great data augmentation... Unless, of course, your dataset happens to involve text or symbols in any way, such as the numbers '6' and '9', or left/right arrows in traffic signs, or such obscure letters as 'd' and 'b'.
> Unless, of course, your dataset happens to involve text or symbols in any way, such as the numbers '6' and '9', or left/right arrows in traffic signs, or such obscure letters as 'd' and 'b'.
If your dataset consists of nothing but isolated 'd's and 'p's in unknown orientation, you won't be able to classify them correctly because that is an impossible task. But it would be more common for your dataset to consist of text rather than isolated letters, and in the context of surrounding text it's easy to determine the correct orientation, and therefore to discriminate 'd' from 'p'.
So it's not a problem, except when it is. Good to know.
Incidentally, how does that work for mirroring, when all that surrounding text gets mirrored too? (Consider the real example of the lolcats generated by Nvidia's StyleGAN, where the text captions are completely wrong, and will always be wrong, because it looks like Cyrillic - due to the horizontal dataflipping StyleGAN has enabled by default...)
I'd say you're not making the problem "harder" - rather, requiring that object detection be orientation-agnostic makes the problem exactly as hard as the problem actually is. Allowing the network to train only on images with a known, fixed, (correct) orientation makes the object detection too _easy_, so the network results will likely fail if you feed it any real-world data.
i.e. you should be training your image application with skew/stretch/shrink/rotation/color-palette-shift/contrast-adjust/noise-addition/etc. applied to all training images if you want it to be useful for anything other than getting a high top-N score on the validation set.
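A concrete sketch of that kind of augmentation pipeline using Pillow (the transforms and ranges here are illustrative choices, not a recommendation; flips are deliberately left out per the '6'/'9' caveat upthread):

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image, rng: random.Random) -> Image.Image:
    """Apply a small random rotation plus contrast and brightness jitter.
    Deliberately no mirroring: flips silently corrupt text and symbols."""
    out = img.rotate(rng.uniform(-10, 10), resample=Image.BILINEAR)
    out = ImageEnhance.Contrast(out).enhance(rng.uniform(0.8, 1.2))
    out = ImageEnhance.Brightness(out).enhance(rng.uniform(0.8, 1.2))
    return out

# rotate() without expand=True crops to the original canvas, so the
# augmented image keeps its shape and can be batched as-is.
augmented = augment(Image.new("RGB", (32, 32), "gray"), random.Random(0))
print(augmented.size)  # (32, 32)
```

In a real pipeline you'd also want the noise, skew, and stretch transforms the comment lists, and you'd apply a fresh random draw per epoch rather than a fixed seed.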
Yeah... this article really looks like blame-shifting to me. I'm imagining a future wherein we have bipedal murderbots, but we're still training AI with "properly oriented" images... a bot trips over a rock and starts blasting away at mis-identified objects.
Well, even humans find it more difficult to process upside-down faces or objects. There's power in assumptions that are correct 99% of the time.
Regardless, the article isn't really shifting blame so much as explaining what's happening in the real world, with the real tools. The tools don't care about EXIF. Consumer software uses EXIF to abstract over reality. A lot of people playing with ML don't know about either.
You may not know whose face it is, but you do know that it's a face, right? It's like saying you'd know someone is in front of you if he were standing up in your clear view, but the moment he's lying on a couch in the centre of your vision, you couldn't tell with certainty whether the thing on the couch is a human being.
I could tell, but I'm under the impression that my 4-month-old kid still has problems with that, or at least did when she was 2 months old.
I think this is closer to the performance we should expect of current neural models - a few months old child, not an adult. NNs may be good at doing stuff similar to what some of our sight apparatus does, but they're missing additional layers of "higher-level" processing humans employ. That, and a whole lot of training data.
When I built a consumer site that processed tons of photos I ran into this all the time. Ended up doing it all myself by parsing the exif data and doing the rotation. Also ended up writing some pretty extensive resize code that worked much better than what was built-in and more like the great scaling effects you see in Preview.app.
>Most Python libraries for working with image data like numpy, scipy, TensorFlow, Keras, etc, think of themselves as scientific tools for serious people who work with generic arrays of data. They don’t concern themselves with consumer-level problems like automatic image rotation — even though basically every image in the world captured with a modern camera needs it.
What a snotty attitude. The tools are already complex enough without taking on the responsibility of parsing the plethora of ways a JPEG can be "rotated". This thread is a testament to the non-triviality of the issue, and I certainly don't want a matrix manipulation or machine learning library to bloat up and have opinions on how to load JPEGs just so someone careless out there can save a couple of lines.
I ran into this issue while developing https://www.faceshapeapp.com where a user uploads a photo and a face detector is run on top of the picture. Shortly after launching, about 10% of users complained of rotated images; after some debugging, I discovered that it was due to EXIF rotation.
Would it really be too much of a burden for phones to just save photos at the correct orientation now? I understand the hardware limitations that were present in 2007, but surely these can't still be a factor?
Is there any good reason to save a photo in an orientation that does not match the orientation the device is being held in? Shouldn’t up be up? If that results in an NxM photo, save it NxM. If it results in an MxN photo, save it that way!
The only edge case is a camera pointed straight up or straight down. Or a camera in space.
Correct orientation is relative to the photographer, not to the ground. Most of the times the photographer is in the usual, vertical orientation but sometimes you really want to take a shot at an angle, and you have to fight with your phone to do it correctly. I really dislike these kinds of "convenience" optimizations.
Would it really be too much of a burden for python image libraries to decode jpegs to the correct orientation? I understand lack of knowledge about exif in the 2000s, but surely these days that can't still be a problem?
Well, only if the input image is a JPEG.
The information is stored in a tag, you can remove it with any kind of language. There's also a C++ library for it: https://github.com/mayanklahiri/easyexif
I just could not stop laughing reading this. It reminds me that my professor once showed us a slide of the coastline of Africa, which none of us recognized until he rotated it to the correct orientation.
I should add that this very problem also exists with video. What makes it worse is that "smartphone" video apps started regularly using video orientation metadata years before desktop video apps even seemed to acknowledge that it was a thing.
Wikipedia says EXIF was released in 1995 (https://en.wikipedia.org/wiki/Exif). If you were shooting with a DSLR, say 6 megapixels at 8 bits per color channel, the raw output would be 18MB in size (https://en.wikipedia.org/wiki/Kodak_DCS_400_series). In order to rotate this raw 6MP image you would need 36MB of RAM (input and output buffers of the same size, non-overlapping). Then, after the rotation, you could perform the JPEG compression, so that the rotation is lossless. Finally, you could store the JPEG image to disk.
36MB of RAM just for raw image buffers would have been quite expensive in 1995. Simply tagging some extra data onto the image to say which orientation it should be presented in takes almost no extra memory or processing within the camera, some big desktop PC could easily rotate the uncompressed JPEG to perform a "lossless" rotation after the fact (ie: uncompress JPEG in wrong orientation, rotate, present to user).
Technically, you wouldn't need a full 18MB for the output buffer so long as you perform the JPEG in-line with the rotation and are willing to deal with slicing the image into swaths. So in theory you could get away with like a 1MB output buffer but then your rotation time would depend on your JPEG timing and you couldn't take another picture with the main raw buffer until rotation and JPEG were both complete. It's a tradeoff, time versus memory.
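The buffer arithmetic in this sub-thread, sketched out (treating "6 MP" as 3000x2000 and the raw buffer as 8-bit-per-channel RGB; both are assumptions, since the comments don't pin down exact dimensions):

```python
# Back-of-envelope memory budget for rotating a raw ~6 MP frame in-camera.
width, height = 3000, 2000        # ~6 megapixels
bytes_per_pixel = 3               # 8 bits per color channel, RGB

frame = width * height * bytes_per_pixel
print(frame / 1e6)                # 18.0 MB per uncompressed buffer

# Naive rotate: separate, non-overlapping input and output buffers.
print(2 * frame / 1e6)            # 36.0 MB

# Swath approach: rotate and JPEG-encode slice by slice into a small buffer.
swath = width * 16 * bytes_per_pixel   # e.g. a 16-row output swath
print((frame + swath) / 1e6)           # just over 18 MB: trades time for memory
```

The swath trick is the time-versus-memory tradeoff described above: the main raw buffer stays occupied until both rotation and JPEG encoding finish, hurting shot-to-shot latency.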
Surely someone can create a neural network that reorients the image, even if the EXIF orientation is wrong (JPEG lets you do this without re-encoding). Sounds like it should be a very simple problem for the vast majority of "regular" images.
Or how about just using the damn EXIF information to orient it correctly, as the article outlines? The actual article is far less interesting than what the title seems to imply: while the title suggests a problem with computer vision, this is more of a programmer logic error.
EXIF info is often missing or wrong. Photos that have been passed around, going through various services, apps, scanners, screenshots, etc. along the way, frequently have their EXIF info stripped (e.g., for privacy when exported or uploaded), written incorrectly originally (e.g., you scan an old B&W print, put it in sideways to fit in the scanner, and the scanner invisibly includes EXIF data assuming it was correctly oriented), or rewritten incorrectly (e.g., stripped EXIF is replaced with default orientation).
I hold the - apparently controversial - view that if your ML algorithm cannot detect a clean picture of a goose when you rotate it by 90°, your ML algorithm is garbage. How is this of any use when it's so easily fooled?
It's kind of funny, but it makes sense. If you think about it, the gold standard for computer vision is humans, but people are pretty bad at reading upside-down text (if you're on mobile you can try it, but don't forget to lock orientation first!). The same is true of upside-down images, as the famous "Thatcher effect" illusion demonstrates: http://thatchereffect.com/ shows it pretty clearly.
I think it contributed the warning that the title is clickbait. It's very clear from the discussion in this thread that people who've only read the title are discussing computer vision and machine learning strategies to solve this problem, when the article actually describes an almost elementary programming issue.
Nonsense. Image recognition engines are very capable of detection even at the most extreme angles. Sure, it won’t rotate the image for you, but it will certainly tell you there is a goose in it. What tricks image recognition technology more than anything is lighting.
At the core, something is converting this JPEG or otherwise-encoded data into raw pixel data, and this process MUST account for the orientation.
Either the app reads the image and converts it before passing it to the CV/ML/AI library, in which case the conversion step needs to respect this tag and either carry it along or apply it to the converted data; or the CV/ML/AI library receives the encoded image data directly, in which case it needs to check for this tag itself.
Those are the two options: either the CV/ML/AI library sees the tag and should honor it, or it doesn't, and whatever is stripping the tag away shouldn't be doing that.