Vision in iOS 11 has everything you need to create an app that recognizes text characters in real time. You don't need deep technical coding knowledge – navigating the feature is quite simple. What's more, the implementation is seamless.
The Vision framework enables you to easily implement any task that involves computer vision. The framework performs face and face-landmark detection, barcode recognition, image registration, general feature tracking, and text detection. Vision also allows you to use custom Core ML models for tasks like classification or object detection.
VNDetectTextRectanglesRequest is an image-analysis request that finds regions of visible text in an image; it returns each piece of detected text as a rectangular bounding box with an origin and size.
If you are used to Swift and have been programming for a while, you are probably wondering what the use of Vision is when there are other frameworks, such as Core Image and AVFoundation, with similar features. Vision is more accurate and more straightforward, and it is available on a variety of Apple platforms. However, using Vision may require more processing power and processing time.
To use Vision for text detection, you need Xcode 9 and a device running iOS 11.
Creating a Camera with AVCapture
First, you need to create a camera with AVCapture; you do this by initializing an AVCaptureSession object to perform real-time or offline capture. After that, connect the session to the device.
To save the time of building your app's UI, consider beginning with a starter project; this will let you focus on learning the Vision framework.
- Open your starter project. The views in the storyboard should already be set up for you.
- In ViewController.swift, look for the code section with the functions and outlets.
- Under the imageView outlet, declare an AVCaptureSession – this is used whenever you want actions performed on a live stream.
- Set the AVCaptureSession and the AVMediaType to video, since the camera needs to run continuously.
- Define the input and output devices.
- The input is what the camera sees, and the output is the video in the kCVPixelFormatType_32BGRA format.
- Finally, in viewDidLoad, add a sublayer containing the video to imageView and start the session. You also need to set the frame of the layer.
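Putting the steps above together, a minimal setup might look like the following sketch. The class and outlet names follow a typical starter project, and `startLiveVideo` is an illustrative function name:

```swift
import AVFoundation
import UIKit

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
    @IBOutlet weak var imageView: UIImageView!

    // Coordinates the flow of data from the camera input to the video output.
    var session = AVCaptureSession()

    func startLiveVideo() {
        session.sessionPreset = .photo
        guard let captureDevice = AVCaptureDevice.default(for: .video),
              let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }

        let deviceOutput = AVCaptureVideoDataOutput()
        // Deliver frames as 32BGRA pixel buffers, processed off the main thread.
        deviceOutput.videoSettings =
            [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA]
        deviceOutput.setSampleBufferDelegate(self, queue: DispatchQueue.global(qos: .default))

        session.addInput(deviceInput)
        session.addOutput(deviceOutput)

        // Show the live camera feed as a sublayer of imageView and start capturing.
        let previewLayer = AVCaptureVideoPreviewLayer(session: session)
        previewLayer.frame = imageView.bounds
        imageView.layer.addSublayer(previewLayer)
        session.startRunning()
    }
}
```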
Call the function in the viewWillAppear method.
As the bounds are not yet finalized at that point, override the viewDidLayoutSubviews() method to update the layer's bounds.
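The two lifecycle overrides might look like this, assuming a `startLiveVideo()` helper that configures the session as outlined above:

```swift
override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    startLiveVideo()   // begin the capture session once the view is about to appear
}

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    // The view's bounds are final here, so resize the preview layer to match.
    imageView.layer.sublayers?.first?.frame = imageView.bounds
}
```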
Since the release of iOS 10, an additional entry in Info.plist is needed that gives a reason for using the camera: set Privacy - Camera Usage Description.
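In the raw Info.plist XML, the entry (shown in Xcode as Privacy - Camera Usage Description) looks like this; the description string is only an example:

```xml
<key>NSCameraUsageDescription</key>
<string>This app uses the camera to detect text in real time.</string>
```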
Text Detection: How the Vision Framework Works
There are three steps to implementing Vision on the app.
- Requests – this is when you ask the framework to detect something for you.
- Handlers – this is what you want the framework to do after the request is called.
- Observations – this is what you do with the data the framework supplies from a request.
Ideally, you create a text request as a VNDetectTextRectanglesRequest. This is a kind of VNRequest that finds bounding boxes around text. When the framework completes the request, it calls your completion function – for example, detectTextHandler. If you also want to know the exact frame of each character recognized, set reportCharacterBoxes = true.
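A sketch of the request setup; `detectTextHandler` is an illustrative name for the completion function:

```swift
import Vision

// The request calls detectTextHandler when Vision finishes analyzing a frame.
var textRequest = VNDetectTextRectanglesRequest(completionHandler: detectTextHandler)

func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNTextObservation] else {
        print("No text detected")
        return
    }
    // Each observation is one region of text; characterBoxes holds a
    // VNRectangleObservation for every recognized character frame.
    for region in observations {
        print(region.boundingBox, region.characterBoxes?.count ?? 0)
    }
}

// Report a bounding box for every character, not just whole text regions.
textRequest.reportCharacterBoxes = true
```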
After that, define the observations that contain all the results of the VNDetectTextRectanglesRequest, and remember to hook Vision into the camera output. Since Vision exposes high-level APIs, working with it is safe and straightforward.
The function checks whether the CMSampleBuffer exists and was put out by the AVCaptureOutput. You should then create a requestOptions variable as a dictionary of type [VNImageOption: Any]. VNImageOption is a type that carries properties and data from the camera. Finally, create the VNImageRequestHandler and perform the text request.
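The delegate callback described above might be sketched like this, assuming the `textRequest` created earlier:

```swift
import AVFoundation
import Vision

extension ViewController {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Make sure the frame carries an image buffer before handing it to Vision.
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // Pass along camera metadata if it is attached to the frame.
        var requestOptions: [VNImageOption: Any] = [:]
        if let cameraData = CMGetAttachment(sampleBuffer,
                                            key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                            attachmentModeOut: nil) {
            requestOptions = [.cameraIntrinsics: cameraData]
        }

        // Run the text request against this frame.
        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                        orientation: .right,
                                                        options: requestOptions)
        try? imageRequestHandler.perform([textRequest])
    }
}
```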
Drawing Borders Around the Text Detected
You can start by having the framework draw two boxes: one for every letter it detects and one for every word. A word box is the combination of all the character boxes your request finds.
- Define the points on your view to help you position the boxes.
- After that, create a CALayer and use the VNRectangleObservation to define its constraints, making the process of outlining the box easier.
You now have all your functions laid out.
To connect the dots, begin by having your code run asynchronously. You should then check whether a region exists within the results from your VNTextObservation.
You can now call your function, which draws a box around the region. Then check whether there are character boxes within the region, and call the function that draws a box around each letter.
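The word-outlining function could be sketched as follows; `highlightWord` is an illustrative name, and note that Vision's coordinates are normalized with a bottom-left origin:

```swift
import UIKit
import Vision

func highlightWord(box: VNTextObservation) {
    guard let boxes = box.characterBoxes else { return }

    // Take the extremes of all character boxes to enclose the whole word.
    var minX: CGFloat = 1, maxX: CGFloat = 0
    var minY: CGFloat = 1, maxY: CGFloat = 0
    for characterBox in boxes {
        minX = min(minX, characterBox.bottomLeft.x)
        maxX = max(maxX, characterBox.bottomRight.x)
        minY = min(minY, characterBox.bottomRight.y)
        maxY = max(maxY, characterBox.topRight.y)
    }

    // Convert normalized coordinates (bottom-left origin) to view points.
    let frame = CGRect(
        x: minX * imageView.frame.size.width,
        y: (1 - maxY) * imageView.frame.size.height,
        width: (maxX - minX) * imageView.frame.size.width,
        height: (maxY - minY) * imageView.frame.size.height)

    let outline = CALayer()
    outline.frame = frame
    outline.borderWidth = 2
    outline.borderColor = UIColor.red.cgColor
    imageView.layer.addSublayer(outline)
}

// Inside the request's completion handler, draw on the main thread:
DispatchQueue.main.async {
    for region in observations {
        self.highlightWord(box: region)
    }
}
```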
After that, create a requestOptions variable. You can now create a VNImageRequestHandler object and perform the text request you created.
Finally, run your Vision code on the live stream. You will need to take the video output and convert it to a CMSampleBuffer.
- Always try to crop the image and process only the section you need. This will reduce processing time and memory footprint.
- Turn on language correction when dealing with non-numeric characters, and turn it off when dealing with numeric characters.
- Include validation for recognized number strings to confirm accuracy and avoid showing false values to the user.
- The document camera controller is the best companion for text recognition since image quality plays a significant role in text recognition.
- Consider setting a minimum text height to increase performance.
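Several of these tips correspond to properties on the newer VNRecognizeTextRequest API (available on iOS 13 and later); a hedged sketch for readers on current systems:

```swift
import Vision

// VNRecognizeTextRequest returns recognized strings, not just rectangles.
let recognizeRequest = VNRecognizeTextRequest { request, error in
    guard let results = request.results as? [VNRecognizedTextObservation] else { return }
    for observation in results {
        if let candidate = observation.topCandidates(1).first {
            print(candidate.string)
        }
    }
}
recognizeRequest.recognitionLevel = .accurate
recognizeRequest.usesLanguageCorrection = true   // turn off for purely numeric text
recognizeRequest.minimumTextHeight = 0.05        // skip text under 5% of the image height
```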
With Vision, you have everything you need for text recognition. Since Vision is easy to use and quick to implement, using it is almost like playing with Lego. Try testing your app on different objects, fonts, lighting conditions, and text sizes. You can also impress yourself by combining Vision with Core ML.