Optimizing Image Analysis for Speed
Rule #1: Do not compute what you do not need.
- Use image resolution well fitted to the task. The higher the resolution, the slower the processing.
- Use the inRoi input of image processing to compute only the pixels that are needed in further processing steps.
- If several image processing operations occur in sequence in a confined region, then it might be better to use CropImage first.
- Do not overuse images of types other than UInt8 (8-bit).
- Do not use multi-channel images when there is no color information being processed.
- If some computations can be done only once, move them before the main program loop, or even to a separate program. A typical structure of the "Main" macrofilter that implements this advice consists of two macrofilters: the first one is responsible for once-only computations, and the second is a Task implementing the main program loop.
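The last point can be illustrated outside of FabImage Studio with a minimal Python sketch. It is not FabImage code: the function names build_gamma_lut and process_frames and the gamma-correction task are assumptions chosen only for illustration. The lookup table plays the role of the once-only computation, and the loop plays the role of the Task.

```python
def build_gamma_lut(gamma):
    """Once-only computation: a 256-entry lookup table for 8-bit pixels."""
    return bytes(round(255 * (v / 255) ** gamma) for v in range(256))


def process_frames(frames, gamma=0.5):
    """Apply the same gamma correction to every frame."""
    lut = build_gamma_lut(gamma)      # computed once, before the main loop
    results = []
    for frame in frames:              # main loop: only per-frame work remains
        results.append(frame.translate(lut))
    return results


# Two tiny hypothetical "frames" of 8-bit pixels.
frames = [bytes([0, 64, 255]), bytes([16, 128, 200])]
corrected = process_frames(frames)
```

Rebuilding the table inside the loop would give the same result, but the cost of 256 power computations would then be paid on every iteration.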
Rule #2: Prefer simple solutions.
- Do not use Template Matching if simpler techniques such as Blob Analysis or 1D Edge Detection would suffice.
- Prefer pixel-precise image analysis techniques (Region Analysis) and the Nearest Neighbour (instead of Bilinear) image interpolation.
- Consider extracting higher level information early in the program pipeline – for example it is much faster to process Regions than Images.
Rule #3: Mind the influence of the user interface.
- Note that in the development environment, displaying data in the preview windows takes a lot of time. Choose Program » Previews Update Mode » Disable Visualization to get performance closer to what you can expect in the runtime environment.
- In the runtime environment use the VideoBox control for image display. It is highly optimized and can display hundreds of images per second.
- When using VideoBox controls, prefer the SizeMode: Normal setting, especially if the image to be displayed is large. Also consider using DownsampleImage or ResizeImage.
- Prefer the Update Data Previews Once an Iteration option.
- Mind the Diagnostic Mode. Turn it off whenever you need to test speed.
- Pay attention to the information provided by the Statistics window. Before optimizing the program, make sure that you know what really needs optimizing.
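The idea behind the last point, measuring before optimizing, applies to any pipeline. The following Python sketch is not the Statistics window and uses no FabImage API; StageTimer and the two example stages are hypothetical names used only to show per-stage timing.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates wall-clock time per named processing stage."""

    def __init__(self):
        self.totals = defaultdict(float)   # stage name -> total seconds
        self.calls = defaultdict(int)      # stage name -> number of calls

    def measure(self, name, func, *args):
        """Run func(*args), record its duration under name, return its result."""
        start = time.perf_counter()
        result = func(*args)
        self.totals[name] += time.perf_counter() - start
        self.calls[name] += 1
        return result

    def slowest(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)

    def report(self):
        """(name, total milliseconds, calls) tuples, slowest stage first."""
        rows = [(n, t * 1000.0, self.calls[n]) for n, t in self.totals.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


timer = StageTimer()
image = bytes(range(256)) * 100                      # hypothetical 8-bit data
ordered = timer.measure("sort", sorted, image)       # a deliberately heavy stage
count = timer.measure("count", len, ordered)         # a trivial stage
report = timer.report()
```

Only the stage at the top of such a report is worth optimizing first.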
Rule #4: Mind the influence of the data flow model.
Data flow programming allows for creating high-speed machine vision applications nearly as well as standard C++ programming. This, however, holds only under the assumption that high-level tools are used and that image analysis is the main part of the program. For low-level programming tasks, on the other hand, such as using many simple filters to process large numbers of pixels, points or small blobs, all interpreted languages will perform significantly slower than C++.
- For performance-critical low-level programming tasks consider User Filters.
- Prefer formula blocks over arithmetic filters like AddIntegers or DivideReals.
- Use a lower number of higher level filters (e.g. RotatePath) instead of a big number of low level filters or formulas (e.g. calculating coordinates of all individual points of the path).
- Avoid using low-level filters (such as MergeDefault or ChooseByPredicate) with non-primitive types such as Image or Region. These filters perform a full copy of at least one of the input objects. Prefer using Variant Step Macrofilters instead.
- Mind the connections with conversions (the arrow head with a dot). They involve additional computations, which in some cases (e.g. RegionToImage) might take some time. If the same conversion is used many times, then it might be better to use the converting filter directly.
- A sequence of filters with array connections may produce a lot of data on the outputs. If only the final result is important, then consider extracting a macrofilter that will be executed in array mode as a whole, so that all the connections inside of it are basic.
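The difference described in the last point can be sketched in Python as an analogy. This is not how FabImage Studio is implemented; the three arithmetic steps and the names stagewise, process_one and fused are assumptions standing in for a chain of filters.

```python
def stagewise(values):
    """Each step runs over the whole array and keeps a full intermediate array."""
    scaled = [v * 2 for v in values]        # intermediate array 1
    shifted = [v + 1 for v in scaled]       # intermediate array 2
    squared = [v * v for v in shifted]      # final result
    return squared


def process_one(v):
    """The same three steps for a single element (the extracted macrofilter)."""
    scaled = v * 2
    shifted = scaled + 1
    return shifted * shifted


def fused(values):
    """Run the whole chain once per element; only the final array is kept."""
    return [process_one(v) for v in values]


result = fused([1, 2, 3])
```

Both functions produce the same output, but the second one never holds the intermediate arrays, which is what matters when the elements are large objects.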
Common Optimization Tips
Apart from the above general rules, there are also some common optimization tips related to specific filters and techniques. Here is a check-list:
- Template Matching: Do not mark the entire object as the template region, but only mark a small part having a unique shape.
- Template Matching: Prefer high pyramid levels, i.e. leave inMaxPyramidLevel set to Auto, or set it to a high value such as 4 to 6.
- Template Matching: Prefer inEdgePolarityMode set not to Ignore and inEdgeNoiseLevel set to Low.
- Template Matching: Use as high values of the inMinScore input as possible.
- Template Matching: If you process high-resolution images, consider setting the inMinPyramidLevel to 1 or even 2.
- Template Matching: When creating template matching models, try to limit the range of angles with the inMinAngle and inMaxAngle inputs.
- Template Matching: Do not expect high speed when allowing rotations and scaling at the same time. Model creation can also take a long time or even fail with an "out of memory" error.
- Template Matching: Consider limiting inSearchRegion. It might be set manually, but sometimes it also helps to use Region Analysis techniques before Template Matching.
- Template Matching: Decrease inEdgeCompleteness to achieve higher speed at the cost of lower reliability. This might be useful when the pyramid cannot be made higher due to loss of information.
- Do not use these filters in the main program loop: CreateEdgeModel1, CreateGrayModel, TrainOcr_MLP, TrainOcr_SVM.
- If you always transform images in the same way, consider filters from the Image Spatial Transforms Maps category instead of the ones from Image Spatial Transforms.
- Do not use image local transforms with arbitrary shaped kernels: DilateImage_AnyKernel, ErodeImage_AnyKernel, SmoothImage_Mean_AnyKernel. Consider the alternatives without the "_AnyKernel" suffix.
- SmoothImage_Median can be particularly slow. Use Gaussian or Mean smoothing instead, if possible.
Application Warm-Up (Advanced)
An important practical issue in industrial applications with triggered cameras is that the first iteration of a program must often already be executed at full speed. There are, however, additional computations performed in the first iterations that have to be taken into account:
- Memory buffers (especially images) for output data are allocated.
- Memory buffers get loaded to the cache memory.
- External DLL libraries get delay-loaded by the operating system.
- Modern CPU mechanisms, such as branch prediction, get trained.
- Connections with external devices (e.g. cameras) get established.
- Some filters, especially ones from 1D Edge Detection and Shape Fitting, precompute some data.
These issues result both from the simplified data-flow programming model and from the modern architectures of computers and operating systems. Some, but not all, of them can be solved with the use of FabImage Library (see: When to use FabImage Library?). There is, however, an idiom that might also be useful with FabImage Studio. It is called "Application Warm-Up" and consists of performing one or several iterations on recorded test images before the application switches to the operational stage. This can be achieved with a "GrabImage" variant macrofilter.
Such a macrofilter starts its operation in the "WarmUp" variant, where it initializes the camera and produces a test image loaded from a file (which has exactly the same resolution and format as the images acquired from the camera). Then it switches to the "Work" variant, where the standard image acquisition filter is used. There is also an additional output, outIsWarmingUp, that can be used, for example, to suppress the output signals in the warming-up stage.
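The control flow of this idiom can be sketched in plain Python. This is only an analogy, not FabImage code: the GrabImage class, the camera callable and the run function are hypothetical, and camera initialization is not modelled.

```python
class GrabImage:
    """Hypothetical stand-in for a two-variant image source."""

    def __init__(self, camera, test_images):
        self._camera = camera                  # any callable returning an image
        self._test_images = list(test_images)  # recorded images for warm-up
        self._index = 0

    def __call__(self):
        """Return (image, is_warming_up)."""
        if self._index < len(self._test_images):
            image = self._test_images[self._index]
            self._index += 1
            return image, True                 # "WarmUp" variant
        return self._camera(), False           # "Work" variant


def run(grab, inspect, iterations):
    """Run the full pipeline on every image, but report only real results."""
    outputs = []
    for _ in range(iterations):
        image, warming_up = grab()
        result = inspect(image)                # executed in both stages
        if not warming_up:                     # suppress outputs during warm-up
            outputs.append(result)
    return outputs


camera = lambda: b"\x10\x20"                   # pretend camera frame
grab = GrabImage(camera, [b"\x00\x00"] * 2)    # two warm-up iterations
outputs = run(grab, sum, 5)                    # 2 warm-up + 3 working iterations
```

The important design point is that the inspection code is identical in both stages, so buffers, caches and precomputed data are already in place when the first real image arrives.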
Configuring Parallel Computing
The filters of FabImage Studio internally use multiple threads to utilize the full power of multi-core processors. By default they use as many threads as there are physical processors. This is the best setting for the majority of applications, but in some cases another number of threads might result in faster execution. If you need maximum performance, it is advisable to experiment with the ControlParallelComputing filter, trying both higher and lower numbers of threads. In particular:
- If the number of threads is higher than the number of physical processors, then it is possible to utilize the Hyper-Threading technology.
- If the number of threads is lower than the number of physical processors (e.g. 3 threads on a quad-core machine), then the system has at least one core available for background threads (like image acquisition, GUI or computations performed by other processes), which may improve its responsiveness.
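Such an experiment can be organized as in the following Python sketch. It does not call any FabImage API: candidate_thread_counts and pick_fastest are hypothetical helpers, measure_ms stands for whatever procedure you use to run the program with a given thread count and time it, and the timings in the usage lines are made-up illustrative numbers, not measurements.

```python
def candidate_thread_counts(physical_cores, logical_cores):
    """Thread counts worth benchmarking, following the two cases above."""
    candidates = {
        max(1, physical_cores - 1),   # leave one core for acquisition, GUI, etc.
        physical_cores,               # the default setting
        logical_cores,                # more threads than cores (Hyper-Threading)
    }
    return sorted(candidates)


def pick_fastest(candidates, measure_ms):
    """Time every candidate with the supplied callback and return the best one."""
    timings = {n: measure_ms(n) for n in candidates}
    best = min(timings, key=timings.get)
    return best, timings


# Illustrative use on a quad-core machine with 8 logical processors.
fake_timings = {3: 12.5, 4: 11.0, 8: 11.8}    # made-up values in milliseconds
best, timings = pick_fastest(candidate_thread_counts(4, 8), fake_timings.get)
```

In practice measure_ms should average many iterations on the target machine, because single runs are noisy.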
Configuring Image Memory Pools
Among significant factors affecting filter performance is memory allocation. Most of the filters available in FabImage Studio re-use their memory buffers between consecutive iterations which is highly beneficial for their performance. Some filters, however, still allocate temporary image buffers, because doing otherwise would make them less convenient in use. To overcome this limitation, there is the filter ControlImageMemoryPools which can turn on a custom memory allocator for temporary images.
There is also a way to pre-allocate image memory before the first iteration of the program starts. For this purpose use the InspectImageMemoryPools filter at the end of the program and, after the program has been executed, copy its outPoolSizes value to the input of a ChargeImageMemoryPools filter executed at the beginning. In some cases this will improve the performance of the first iteration.
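The inspect-then-charge workflow can be illustrated with a toy buffer pool in Python. This is a conceptual sketch only: BufferPool and its methods are hypothetical and do not reflect how the FabImage memory pools are implemented or what format outPoolSizes has.

```python
from collections import defaultdict


class BufferPool:
    """Toy pool that re-uses byte buffers of identical size."""

    def __init__(self):
        self._free = defaultdict(list)     # size -> idle buffers
        self._created = defaultdict(int)   # size -> buffers ever allocated
        self.allocations = 0               # total number of fresh allocations

    def _allocate(self, size):
        self.allocations += 1
        self._created[size] += 1
        return bytearray(size)

    def acquire(self, size):
        """Return an idle buffer of this size, allocating only if none exists."""
        if self._free[size]:
            return self._free[size].pop()
        return self._allocate(size)

    def release(self, buf):
        self._free[len(buf)].append(buf)

    def inspect(self):
        """Pool sizes observed so far (recorded at the end of a run)."""
        return dict(self._created)

    def charge(self, pool_sizes):
        """Pre-allocate the recorded buffers before the first iteration."""
        for size, count in pool_sizes.items():
            for _ in range(count):
                self.release(self._allocate(size))


def iteration(pool):
    """One program iteration needing two large and one small temporary image."""
    buffers = [pool.acquire(640 * 480), pool.acquire(640 * 480),
               pool.acquire(320 * 240)]
    for buf in buffers:
        pool.release(buf)


recording = BufferPool()
iteration(recording)                    # first run allocates everything
sizes = recording.inspect()             # recorded pool sizes

production = BufferPool()
production.charge(sizes)                # done before the first iteration
before = production.allocations
iteration(production)
allocations_in_first_iteration = production.allocations - before
```

After charging, the first iteration finds every buffer it needs in the pool, which is the effect the pre-allocation is meant to achieve.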
Using GPGPU/OpenCL Computing
Some filters of FabImage Studio allow computations to be moved to an OpenCL-capable device, such as a graphics card, in order to speed up execution. After proper initialization, OpenCL processing is performed completely automatically by suitable filters without changing their use pattern. Refer to the "Hardware Acceleration" section of the filter documentation to find out which filters support OpenCL processing and what their requirements are. Be aware that the resulting performance after switching to an OpenCL device may vary and may not always be a significant improvement over CPU processing. The actual performance of the filters must always be verified on the target system by proper measurements.
To use OpenCL processing in FabImage Studio the following is required:
- a processing device installed in the target system supporting OpenCL C language in version 1.1 or greater,
- a proper and up-to-date device driver installed in the system,
- a proper OpenCL runtime software provided by its vendor.
OpenCL processing is supported for example in the following filters: RgbToHsi, HsiToRgb, ImageCorrelationImage, DilateImage_AnyKernel.
To enable OpenCL processing in filters, an InitGPUProcessing filter must be executed at the beginning of the program. Please refer to that filter's documentation for further information.
When to use FabImage Library?
FabImage Library is a separate product for C++ programmers. The performance of the functions it provides is roughly the same as that of the filters provided by FabImage Studio. There are, however, some important cases when the overall performance of the compiled code is better.
Case 1: High number of simple operations
There is an overhead of about 0.004 ms on each filter execution in Studio. That value may seem very small, but if we consider an application which analyzes 50 blobs in each iteration and executes 20 filters for each blob, then it may sum up to a total of 4 ms (50 × 20 × 0.004 ms). This may already be non-negligible. If this is only a small part of a bigger application, then User Filters might be the right solution. If, however, this is how the entire application works, then the library should be used instead.
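The estimate above is easy to repeat for your own application. The helper below is a hypothetical Python sketch; only the 0.004 ms figure and the 50 × 20 example come from the text.

```python
def studio_overhead_ms(objects_per_iteration, filters_per_object,
                       overhead_per_filter_ms=0.004):
    """Total per-iteration overhead of filter executions, in milliseconds."""
    executions = objects_per_iteration * filters_per_object
    return executions * overhead_per_filter_ms


# The example from the text: 50 blobs, 20 filters executed for each blob.
overhead = studio_overhead_ms(50, 20)
```

Comparing this number with the time budget of a single iteration shows whether the overhead matters for a given application.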
Case 2: Memory re-use for big images
Each filter in FabImage Studio keeps its output data on the output ports. Consecutive filters do not re-use this memory, but instead create new data. This is very convenient for effective development of algorithms, as the user can see all intermediate results. However, if the application performs complex processing of very big images (e.g. from 10 megapixel or line-scan cameras), then the issue of memory re-use might become critical. FabImage Library may then be useful, because only at the level of C++ programming can the user have full control over the memory buffers.
FabImage Library also makes it possible to perform in-place data processing, i.e. modifying directly the input data instead of creating new objects. Many simple image processing operations can be performed in this way. Especially the Image Drawing functions and image transformations in small regions of interest may get a significant performance boost.
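The difference between creating new objects and in-place processing can be shown with a language-neutral sketch, written here in Python rather than with the FabImage Library API. The functions threshold_copy and threshold_in_place are hypothetical, and a flat bytearray stands in for an 8-bit image.

```python
def threshold_copy(pixels, level):
    """Out-of-place: allocates and returns a new buffer, input stays intact."""
    return bytearray(255 if p >= level else 0 for p in pixels)


def threshold_in_place(pixels, level, start=0, stop=None):
    """In-place: overwrites pixels[start:stop] directly, allocating nothing."""
    stop = len(pixels) if stop is None else stop
    for i in range(start, stop):
        pixels[i] = 255 if pixels[i] >= level else 0


image = bytearray([10, 200, 90, 130])
binary = threshold_copy(image, 128)        # new object, image is unchanged

threshold_in_place(image, 128, 1, 3)       # modify only a small region
```

With the in-place variant the cost is proportional to the size of the region of interest, not to the size of the whole image, and no output buffer has to be allocated.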
Case 3: Initialization before first iteration
Filters of FabImage Studio get initialized in the first iteration. This is for example when the image memory buffers are allocated, because before the first image is acquired, the filters do not know how much memory they will need. Sometimes, however, the application can be optimized for specific conditions and it is important that the first iteration is not any slower. On the level of C++ programming this can be achieved with preallocated memory buffers and with separated initialization of some filters (especially for 1D Edge Detection and Shape Fitting filters, as well as for image acquisition and I/O interfaces). See also: Application Warm-Up.