
I'm currently working on a DPhil in HCT at the University of Sussex. This section of the website is an ongoing 'learning diary', for me to write my thoughts and notes on various courses and my thesis.


Image manipulation

About 99% of images we see in magazines will have been digitally manipulated, and there is a range of software available for doing that.

Digital images are made up of pixels, so complex processing can be applied by manipulating the values of those pixels, either individually or by applying a formula to all of them.

Statistical operations

The intensity histogram of an image gives us the distribution of intensity values across it. We can then create a bi-level threshold by picking a point on the histogram and giving every pixel with an intensity value below that point a value of 0 (black) and everything else a value of 1 (white). Thresholds are normally chosen at minima in the histogram. You can also tell from the intensity histogram what kind of image you have: a dark image will have peaks at the low end of the chart, while a light one will have peaks at the high end. A low contrast picture will have all of its intensity values in a small range, while a high contrast picture will have a more spread out histogram. To increase the contrast you can stretch the intensity histogram out (known as linear contrast stretching), or lighten a dark image by shifting the peaks.
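Both operations above are simple enough to sketch in a few lines. This is a minimal illustration assuming a greyscale image held as a list of lists of 0-255 intensity values (I've scaled the black/white outputs to 0 and 255 rather than 0 and 1, so the result is itself a displayable image); the threshold of 128 is an arbitrary choice, not a histogram minimum.

```python
# Bi-level thresholding and linear contrast stretching on a greyscale
# image stored as a list of lists of intensity values (0-255).

def threshold(image, level=128):
    """Map every pixel below `level` to 0 (black), the rest to 255 (white)."""
    return [[0 if p < level else 255 for p in row] for row in image]

def contrast_stretch(image):
    """Linearly stretch the intensity range so it fills the full 0-255 scale."""
    lo = min(p for row in image for p in row)
    hi = max(p for row in image for p in row)
    if hi == lo:
        return [row[:] for row in image]  # flat image: nothing to stretch
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in image]

img = [[60, 70], [80, 200]]
print(threshold(img))         # [[0, 0], [0, 255]]
print(contrast_stretch(img))  # [[0, 18], [36, 255]]
```

A real implementation would pick the threshold from the histogram minima rather than hard-coding it, but the pixel-level operation is exactly this.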

Gamma correction

When an intensity is passed to an output device, there is a non-linear relationship between the intensity value and the output value. So if the intensity increases by 10 times, the output intensity may increase by more than that... say 100 times, for example. So a nice linear grey scale may come out looking not very linear. Gamma correction is applied to correct this.

Intensity = Voltage^gamma, where gamma is a value for the particular device. It will vary between devices, but is generally between 2.3 and 2.4.
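Since the display raises its input to the power gamma, the correction is just to raise the value to 1/gamma first, so the two powers cancel. A quick sketch, using the gamma of 2.4 mentioned above and values normalised to the 0.0-1.0 range:

```python
# Gamma correction: the display produces intensity ~ voltage**gamma,
# so we pre-correct each value with the inverse power 1/gamma.

def gamma_correct(value, gamma=2.4):
    """Pre-correct a normalised intensity so the displayed output is linear."""
    return value ** (1.0 / gamma)

# A mid-grey of 0.5 must be sent as a much higher voltage to appear mid-grey:
corrected = gamma_correct(0.5)
print(round(corrected, 3))         # ~0.749
print(round(corrected ** 2.4, 3))  # the display's power response recovers 0.5
```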

Pixel group processing 

Uses the values of the neighbouring pixels to determine the final value of a given pixel, done by applying a 'convolution matrix' to the values. Depending on the values used in the matrix, this can give smoothing, sharpening, edge detection or noise removal. So, you take a 3x3 block of pixels and calculate the value of the one in the middle by applying a 3x3 set of weights to that block.

You can average out the values by applying a constant 1/9 weight across the matrix, which gives you a smoothing effect. It's a simple thing to implement, but gives some pretty crude results. An alternative is to apply a weighting based on a Gaussian distribution, known as Gaussian smoothing or blur. There's a good outline of the procedure here. (I'm pretty sure I did all about Gaussian distributions in my engineering degree - there are faint bells going off, but nothing particularly useful!)
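The 3x3 group processing described above can be sketched like this. The uniform 1/9 kernel gives the crude mean smoothing; swapping in different weights would give sharpening or edge detection. For simplicity this sketch leaves the border pixels untouched rather than worrying about edge handling.

```python
# 3x3 pixel group processing: replace each interior pixel with the
# weighted sum of its 3x3 neighbourhood.

MEAN_KERNEL = [[1 / 9] * 3 for _ in range(3)]  # constant 1/9 weights: smoothing

def convolve3x3(image, kernel):
    """Apply a 3x3 kernel; border pixels are left unchanged for simplicity."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
            out[y][x] = int(round(acc))
    return out

# A single bright pixel gets spread out across its neighbourhood:
img = [[0] * 5 for _ in range(5)]
img[2][2] = 90
smoothed = convolve3x3(img, MEAN_KERNEL)
print(smoothed[2][2])  # 10 -- the spike is averaged with its 8 zero neighbours
```

Gaussian smoothing is the same loop with a bell-shaped set of weights instead of the constant 1/9.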



Image formats

Compuserve developed the Graphics Interchange Format (GIF). There are 2 versions; the 89A version is more useful, allowing for animated GIFs and transparent backgrounds. GIFs work on a restricted palette, so are best for a restricted colour range. They are great for line art or icons. However, be wary of using animated GIFs. They can look really naff and add nothing.

JPEG was formed by a working group from ISO and CCITT (since renamed to ITU-T) called the Joint Photographic Experts Group. This is the dominant format for true-colour images. It can achieve 15:1 to 30:1 compression rates, but compression and unpacking are relatively slow. There were many incompatible coding schemes, but standardisation has occurred.

The stages of JPEG compression are as follows:

  1.  Conversion from RGB to luminance/chrominance colour space (because the human eye is less sensitive to chrominance change than brightness).
  2. Colour information is then sub-sampled (lossy). Pixels in 2x2 cells are stored as 4 intensity values and 2 colour difference values, requiring 6 values per group instead of 12.
  3. The image is then split into blocks of 8x8 and a Discrete Cosine Transform (DCT) is applied. Intensity data is transformed to frequency data. This process is reversible, so this stage is lossless.
  4. The frequency information is then quantized, which is lossy. It is particularly applied to high frequency chrominance information, as the eye can't detect this. It may remove some values and increase the occurrence of others. E.g. the values 78, 79, 80, 81, 82 could all be replaced by 80.
  5. The data is then subjected to Huffman coding (as mentioned in the compression entry!).

It is possible to equate the quality setting to the compression ratio. The default is a 75% quality setting, at a 12:1 compression ratio. As the compression ratio gets higher, the 8x8 blocks become increasingly noticeable in the final image.
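The quantisation step (stage 4 above) is just divide-and-round: coefficients are divided by a step size and rounded, so nearby values collapse together. A tiny sketch, reproducing the 78-82 example from the list; the step size of 5 is an arbitrary illustration, not a value from a real JPEG quantisation table.

```python
# JPEG-style quantisation: divide each frequency coefficient by a step
# size and round. Information is lost because nearby values collapse
# to the same code; bigger steps mean more compression and more loss.

def quantise(coeffs, step=5):
    return [round(c / step) for c in coeffs]

def dequantise(quantised, step=5):
    return [q * step for q in quantised]

coeffs = [78, 79, 80, 81, 82]
q = quantise(coeffs)
print(q)              # [16, 16, 16, 16, 16]
print(dequantise(q))  # [80, 80, 80, 80, 80] -- all five values became 80
```

Five different values now need only one code, which is exactly what makes the final Huffman stage so effective.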

There is now a (relatively) new bitmapped graphics format called Portable Network Graphics (PNG). This is similar to the GIF format, and has been approved by the WWW consortium as a standard. GIF is subject to patent rules because the LZW compression that it uses is under patent. PNG uses lossless compression and handles full colour images, giving better quality than GIF. It doesn't support animation (but that may not be a big problem!). Apparently there may be a new version that does support animation.




Compression

This follows on from the colour model discussion in the bitmap image entry. If we are going to store high quality, memory intensive images, we may need some way to compress the data so that it takes up less space.

There are two ways to do this: lossy and lossless. Lossy throws away information. Lossless (guess) doesn't. In the lecture it was demonstrated as two different ways to tidy a room. You could either tidy it by restacking everything so that it takes up less space (but doesn't actually reduce the amount of stuff) or you can tidy by throwing out all the stuff you don't need any more (empty pizza boxes were the example used). Lossless compression restacks the information, but the original information is always still available. Lossy compression throws away the information we say we don't need any more.

Run Length Encoding (RLE) is a very basic form of lossless compression. It stores the data in a more appropriate form. So if (for example) you have a long string of 1s and 0s as your data like this: 11111111111111111100000111111111, it could be stored as 18,5,9. That takes it from 4 bytes (32 bits) to 3 bytes. This method of encoding rarely gives compression ratios of better than 2:1.
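The encoding side of that example is a few lines of code. This sketch records the length of each alternating run; a real decoder would also need to know which bit the string starts with.

```python
# Run Length Encoding of a bitstring: store the lengths of consecutive
# runs instead of the raw bits.

def rle_encode(bits):
    """Return the lengths of the alternating runs in the string."""
    runs = []
    count = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1      # still in the same run
        else:
            runs.append(count)
            count = 1       # run changed: start counting again
    runs.append(count)
    return runs

print(rle_encode("11111111111111111100000111111111"))  # [18, 5, 9]
```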

You can get more complex with lossless compression and use Huffman coding. That replaces common pixel values with a short code, and less common values with a longer code. It may result in some individual segments taking more memory to store, but these should be infrequently used, and this can give from a 27% to maybe a 40% reduction in size. This does rely on an unequal distribution of pixel values, so trying to compress an equally-distributed image may actually result in an increase in size. It is used as the final stage in the compression of JPEGs.
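A sketch of how the Huffman code table gets built, using the standard heap-based tree construction: repeatedly merge the two least frequent groups, adding a bit to each code at every merge, so frequent values end up with short codes.

```python
# Huffman coding over pixel values: build the code table by repeatedly
# merging the two lowest-frequency groups on a min-heap.

import heapq
from collections import Counter

def huffman_codes(data):
    """Build a table mapping each symbol to a bitstring code."""
    freq = Counter(data)
    if len(freq) == 1:  # degenerate case: only one symbol in the data
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far})
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # two least frequent groups...
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, counter, merged))  # ...merged back in
        counter += 1
    return heap[0][2]

# A very common pixel value gets a shorter code than a rare one:
pixels = [200] * 8 + [90] * 3 + [13]
codes = huffman_codes(pixels)
print(codes)
```

With that distribution the value 200 gets a 1-bit code while 13 gets 2 bits, which is where the saving comes from; with an equal distribution every code would be the same length and there would be no saving at all.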

Then we have Lempel-Ziv-Welch (LZW) compression. This is probably the most widely used lossless compression - used in Unix compression and GIF. It works very well for text documents but much less well for noisy images. It is based on a similar idea to Huffman coding, but instead of individual colours it replaces whole repeating sequences with a code that points to a data dictionary entry.
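The compression side of LZW can be sketched quite compactly: start with a dictionary of all single characters, and every time a sequence is seen that isn't in the dictionary yet, output the code for the longest known prefix and add the new sequence.

```python
# LZW compression: repeated sequences are replaced by codes pointing
# into a dictionary that is built up as the data is read.

def lzw_compress(data):
    """Compress a string to a list of dictionary codes."""
    dictionary = {chr(i): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                    # keep extending the match
        else:
            output.append(dictionary[current])     # emit longest known prefix
            dictionary[candidate] = next_code      # learn the new sequence
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

# Repetitive input compresses well: 9 characters become 5 codes.
print(lzw_compress("ABABABABA"))  # [65, 66, 256, 258, 257]
```

The decompressor rebuilds the same dictionary as it reads the codes, so the dictionary itself never needs to be transmitted.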

Lossy compression works on removing the information that we don't need. It works by taking advantage of the properties of human vision and hearing, and tries to remove data where it won't be noticed. In the case of images, that is normally colour information (the eye is not so good at recognising tiny changes in colour) and in sound it's the bits you can't really hear.  


Bitmapped Images

Digital pictures are made up of a series of pixels (or picture elements). The resolution of an image is the number of pixels in the x and y direction. With a digital camera, the quality of the lens is more important than the number of pixels available. A 2 megapixel camera with a really good lens can take better pictures than a 7 megapixel camera with a less good lens.

Bitmapped images are also known as raster images. Apparently a raster is a grid square. A second definition for raster is the parallel lines that form the scan pattern on display screens or TVs. They are good at representing real-world images, where there is a complex variation in colours, shades and shapes. (As opposed to vector graphics.)

The size that the image will be displayed at affects the resolution that should be used to store it. So a small image or thumbnail can be stored at a really low resolution and still get across the data. If that same image was displayed at a larger size it would look all rubbish and pixelated. This is the problem with lots of 'show large image' links - where clicking just increases the size to show a nasty grainy image and gives no extra information. Getting this right is a major concern in multimedia presentations, because you want to avoid things looking crap.

Resolution is often specified in terms of Dots Per Inch (DPI). Printers are normally very high resolution at 600-1200 dpi, scanners from 300-3600 dpi, and monitors are 70-204 dpi. Interesting that the iPod Nano has the highest definition screen ever, because they need to squeeze more detail and information out of a very small display area. Smaller pixels allow finer detail, and provide a more readable (and vivid) display. One side effect of the difference in resolution between a monitor and a printer is that an image that fills the screen on a 72dpi monitor will look tiny on a 600dpi printer. To avoid this being an issue, most image formats allow you to specify the resolution in pixels per inch.

A 4:3 aspect ratio is normal for PAL TV format. That's the same for a lot of screen resolutions, and standard sizes like 320x240, 640x480, 800x600... 1280x1024 is an exception at 5:4, so circles drawn on a screen at that resolution will not be circles when displayed on a 4:3 screen. Digital TV and widescreen stuff have introduced a whole new range of aspect ratios.

One thing to remember is that a graphics package such as Photoshop is better at resizing than a browser, so it's better to resize photos to the right size and save them that way, rather than using the width and height tags to do it on the fly.

The simplest data model used in images is the true colour image data model. Each pixel has a 24 bit value, containing an RGB value (or HSV, or YUV - basically three pieces of information).

Alternatively a palette colour model can be used. The palette colour has an 8 bit pixel index, which refers to a colour in a palette. The palette stores the full RGB (or HSV or YUV) value for that colour. GIFs use this system.

True colour vs. palette: True colour gives a very high quality image, but takes 3 times the memory to store. Palette limits the colour range used, but is cheaper storage-wise.  
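The "3 times the memory" claim is easy to check with a back-of-envelope calculation. This assumes a 640x480 image and a 256-entry palette of 24-bit colours (the palette table itself has to be stored too, but it's tiny next to the pixel data).

```python
# Storage comparison: 24-bit true colour vs 8-bit palette indices
# plus a 256-entry palette of 24-bit colours, for a 640x480 image.

WIDTH, HEIGHT = 640, 480
pixels = WIDTH * HEIGHT

true_colour_bytes = pixels * 3            # 3 bytes (24 bits) per pixel
palette_bytes = pixels * 1 + 256 * 3      # 1-byte index per pixel + palette table

print(true_colour_bytes)  # 921600 bytes (900 KB)
print(palette_bytes)      # 307968 bytes (~301 KB)
print(round(true_colour_bytes / palette_bytes, 2))  # 2.99 -- roughly 3x
```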



Seminar 4: Phidgets begin

We had our introductory session on the phidgets today. The bits and pieces look quite fun, and actually programming them looks pretty easy. I don't think that's going to be the hard part. I think the tricky bit is going to be finding the right problem and solving it elegantly and in a universal manner.

We got into groups, and I'm working with Lizzie and Charlie again after Yves has been told he can't continue taking the course. We had a brief brainstorm session after the seminar, and tried a couple of different ways to come up with a problem to solve. First we limited our problem space to the home. Then we started thinking about problems that people with particular disabilities may have around the home.

That wasn't actually terribly fruitful. We found that a lot of the problems we were identifying that way were quite specific to the disability, and people without it wouldn't need the solution. It just didn't really seem very universal. So we switched it around. We started trying to think of regular, everyday kinds of problems, and thinking about how we could solve them for different types of user. And that did it. We hit upon the idea of things you need that are constantly getting lost. Like keys, wallet, phone, glasses etc. (in my case you could add work pass and train ticket to the list). There are solutions available already where you can get your keys to emit a noise (no good for deaf users).

We thought rather than get the item to emit a noise or light or anything we could use RFID tags to attach to the item, and have a handheld device that you could carry around that would alert you when you are near a tagged object. I think the idea is sound, but some of the design features are going to need some thinking about to get right. I had a quick blast of some questions that crossed my mind on the way into Brighton later:

  1. How do we alert someone? A screen change (or light) and/or noise.
  2. How do we tell which object we're close to?
  3. Or how close we are? We could either use different colours to indicate closeness, or different intensities of colour, or a flashing pattern which speeds up as we get closer. The noise could change pitch, volume or rhythm. How we monitor different objects will affect which solution we choose I think. So if we use a different note for each tag that's close, we probably don't want to change the pitch as we get close to something because the chord produced might be horrible. Multiple different rhythms of either flashes or noises might get very difficult too, but how easy would different volumes be to detect? Do the RFIDs even support this kind of use, or is it a binary "signal on/off" type of thing?
  4. What if we're close to 2 or more tagged objects?
  5. Does the output need to accurately reflect the object (e.g. a noise like a phone ringing and a picture of a phone when you're close to the phone) or can we keep it simple and just have the tag related to a note and screen section, and let the user learn which is which? Do we actually need to differentiate at all, or can we just tell the user they are close to an object, and let them find out which one and how many?
  6. How do we deal with objects that have been picked up already, and are presumably close to the hand-held all the time?
  7. If we are representing multiple different objects, do we allow the user to select one to focus on? If so, how do we manage that? On the screen it would be easy to select something, but can we be clever with the noise too? (E.g. if we're using a different note for each item, can we let the users slide something up and down to select between high and low pitches? How do we go between searching for a single item and many? Is that over complicating things?)
So yeah. We've got a few things to think about. How we demo it will also be interesting. But I think we'll end up having to carry a laptop around to demo. I'm going to look up some stuff on torches for the blind that people were talking about a couple of seminars ago, see how they represent distance. Give it a go, anyway.