ID Document Segmentation @Dyos
Between 2019 and 2021, I held the position of Lead Machine Learning Scientist at Dyos Technology GmbH, later known as AICOR Verwaltungs GmbH. My role involved leading a multidisciplinary team of scientists, machine learning engineers, and developers in creating Eli-Ident, an automated KYC product. Our objective was to create a product that enables “strong customer authentication” through video for financial institutions like payment providers or banks, as specified by BaFin. This product was developed exclusively using proprietary algorithms and ML-based models and evolved into qundo.de.
Goal
Build an ID document detection, classification and segmentation algorithm.
Context
In an automated KYC process, users present ID documents such as ID cards, passports, or driving licenses. The system then uses OCR to extract key data, such as personal information, from the images. The process starts with ID document segmentation, in which each pixel of an image is classified as either part of the document or not.
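To make the per-pixel classification concrete, here is a minimal sketch in plain Python. It assumes a segmentation model has already produced a per-pixel probability that the pixel belongs to the document; the probability values, the 0.5 threshold, and the name `probabilities_to_mask` are illustrative assumptions, not details of the production system.

```python
from typing import List

# Hypothetical types: a probability map holds P(pixel belongs to the document),
# a mask holds the final label per pixel (1 = document, 0 = background).
ProbMap = List[List[float]]
Mask = List[List[int]]


def probabilities_to_mask(prob_map: ProbMap, threshold: float = 0.5) -> Mask:
    """Label each pixel as document (1) or background (0)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# Toy 4x4 "image": a 2x2 document in the middle of the frame.
prob_map = [
    [0.02, 0.10, 0.08, 0.01],
    [0.05, 0.91, 0.88, 0.03],
    [0.04, 0.95, 0.97, 0.06],
    [0.01, 0.07, 0.09, 0.02],
]
mask = probabilities_to_mask(prob_map)
for row in mask:
    print(row)
```

The threshold is a tunable assumption: lowering it keeps more uncertain border pixels in the document region, raising it trims them.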
The effectiveness of the OCR algorithm depends heavily on the accuracy of this segmentation. Extracting text from well-scanned documents with clear, straight lines and high contrast is straightforward, but it becomes more complex with colored text on intricate backgrounds or documents with curved lines. Precise segmentation is critical for correctly aligning the document borders, which in turn is essential for accurate text recognition.
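As a rough illustration of how a segmentation mask feeds the alignment step, the sketch below derives the document's bounding box from a binary mask and crops the image to it. This is a deliberate simplification: an axis-aligned box only stands in for border alignment, and the write-up does not say how the production system aligned the borders. The helper names `mask_bounding_box` and `crop` are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), all inclusive


def mask_bounding_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the smallest box containing every document pixel, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # no document pixels found in the mask
    return rows[0], min(cols), rows[-1], max(cols)


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the boxed region out of an image stored as a list of pixel rows."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
image = [
    [10, 11, 12, 13, 14],
    [20, 21, 22, 23, 24],
    [30, 31, 32, 33, 34],
    [40, 41, 42, 43, 44],
]
box = mask_bounding_box(mask)
document = crop(image, box)
```

A single misclassified pixel far from the document would stretch this box, which is one way to see why segmentation errors propagate into the text recognition stage.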
Additionally, the system enhances user experience by automatically detecting when a document is presented to the camera of a device, such as a mobile phone, and identifying the type of document (ID card, passport, or driving license). These tasks, while simpler, are integral to the seamless operation of the KYC process.
Challenges
I was tasked with creating an ID document segmentation algorithm capable of operating effectively in uncontrolled settings. This task posed several substantial challenges:
- Background complexity: Documents placed on complex or noisy backgrounds make segmentation difficult, as the model struggles to differentiate between the document and the background patterns.
- Document diversity: ID documents come in various types with different shapes and content. The model must generalize well across document types, not just the ones it was trained on.
- Document conditions: Worn-out, creased, or partially covered documents present additional challenges for segmentation algorithms, affecting their ability to accurately outline the document.
- Environmental conditions:
  - Lighting variations: Changes in lighting can affect the visibility of document edges and features, complicating the segmentation task.
  - Camera quality: Different camera qualities and settings (like focus and exposure) can significantly affect the image quality and thus the model’s performance.
Since no available dataset captured these complexities, we had to build our own comprehensive dataset that accurately reflects these conditions. In addition, we had to handle potential biases introduced by the data collection procedure.
Tasks and contribution
- Enhanced data curation:
  - I created a unique image dataset by organizing the data collection process (definition and physical collection)
  - I organized the annotation process (extensive labeling requirement definition and quality control) by setting up an external team
- Document segmentation model training, validation and selection
- Document classification model
- Document detection model
- Managed the CI/CD pipeline and the model life cycle
The final document segmentation model achieved over 99% binary accuracy on a quality-controlled test set.
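For readers unfamiliar with the metric, the sketch below shows one common reading of binary accuracy for segmentation: the fraction of pixels whose predicted label matches the annotated label. That reading is an assumption, since the exact evaluation code is not part of this write-up. Intersection over union (IoU) is included only for comparison and is not a reported result; it is a useful companion because accuracy can be dominated by background pixels when the document covers a small part of the frame. The masks are toy data.

```python
from typing import List

Mask = List[List[int]]  # 1 = document pixel, 0 = background


def binary_accuracy(pred: Mask, truth: Mask) -> float:
    """Fraction of pixels where the predicted label equals the annotated label."""
    total = correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += int(p == t)
    return correct / total


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of the document regions (1.0 if both are empty)."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += int(p == 1 and t == 1)
            union += int(p == 1 or t == 1)
    return intersection / union if union else 1.0


# Toy example: the prediction misses one of the four document pixels.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(binary_accuracy(pred, truth))  # 15 of 16 pixels match -> 0.9375
print(iou(pred, truth))              # 3 shared of 4 document pixels -> 0.75
```

In this toy case a single missed pixel costs about six points of accuracy but 25 points of IoU, which illustrates why the two metrics are often read together.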