Machine Learning & Model Building (OCR Tool) Internship (Remote)

Applications are closed for this internship.

Actively hiring

Machine Learning & Model Building (OCR Tool)

IIT Bombay

Work from home

Start Date

Immediately

Duration

6 Months

Stipend

₹ 1,000 /month

APPLY BY

16 Dec' 22

614 applicants

About the work from home job/internship

Selected intern's day-to-day responsibilities include:

1. Work on data cleaning, pre-processing, and image collection
2. Write well-designed, testable, and efficient code using software development best practices
4. Engage in container creation and deployment in Docker
5. Stay plugged into emerging technologies/industry trends and apply them to operations and activities
6. Develop the next generation OCR technology to allow our users to generate line-wise bounding boxes for the scanned documents
7. Develop and train ML models to perform OCR on Indic languages (Sanskrit, Hindi, and Marathi)
8. Develop models for object detection to identify specific regions in an input image
9. Work on the pipeline of OCR text correction to understand the ground scenario (converting scanned text to digital text with manual correction of OCRed text)
10. Debug and resolve issues using open communities like Stack Overflow and GitHub

Skill(s) required

Deep Learning, English Proficiency (Spoken), English Proficiency (Written), Image Processing, Machine Learning, Neural Networks, OpenCV

Who can apply

Only those candidates can apply who:

1. are available for the work from home job/internship

2. can start the work from home job/internship between 16th Nov'22 and 21st Dec'22

3. are available for a duration of 6 months

4. have relevant skills and interests

Perks

Certificate, Letter of recommendation, Flexible work hours, 5 days a week

Additional information

Optical character recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit operations on the text in images, and creating data for other applications such as machine translation, speech recognition, and the enhancement of dictionaries and language models. OCR for Indian languages is particularly challenging because of their rich inflection.

Using open-source and commercial OCR systems, we have observed word error rates (WER) of around 20-50% on printed documents in four different Indic languages. Moreover, even an OCR system with accuracy as high as 90% is of limited use unless it is aided by a mechanism to identify errors. So we began by developing 'OpenOCRCorrect', an end-to-end framework for error detection and correction in Indic OCR. Our models outperform state-of-the-art results in 'Error Detection in Indic-OCR' for six Indic languages with varied inflections, and we solved the out-of-vocabulary problem for 'Error Correction in Indic-OCR' in our ICDAR-2017 conference paper. We further improved the results using sub-word embeddings in our ICDAR-2019 conference paper. A demo video of the framework is available at https://www.youtube.com/watch?v=u9bqUDrGugc
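
The word error rate quoted above can be made concrete with a short sketch. WER is commonly computed as the word-level edit distance (substitutions, insertions, and deletions) between the OCR output and the ground truth, divided by the number of ground-truth words. The function name `word_error_rate`, the whitespace tokenization, and the sample strings are illustrative assumptions; this is not code from OpenOCRCorrect, nor necessarily the exact evaluation used in the papers listed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace, which is a
    simplification for real Indic-language evaluation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the OCR output
                curr[j - 1] + 1,     # extra word in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sample: one of the four ground-truth words is misrecognized.
ground_truth = "this is scanned text"
ocr_output = "this is scaned text"
print(word_error_rate(ground_truth, ocr_output))
```

Because the distance is normalized by the reference length only, WER can exceed 100% when the OCR output contains many spurious words.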

To install the software, you can go to https://github.com/rohitsaluja22/OpenOCRCorrect and follow the instructions given in https://www.youtube.com/watch?v=0hcdlF-zn8E.

There is an immediate need to preserve Indian texts in soft copy. Currently, we are targeting Sanskrit. Although the OCR tools available online do a decent job on English texts, they are not optimized for Indic languages, so our goal is to develop an OCR model for these languages. The model should detect text with the highest possible accuracy and draw a bounding box around each line of text. The second step in digitizing such texts is spelling correction and formatting of the text detected by the OCR models.
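
As a rough illustration of what line-wise bounding boxes mean, the sketch below uses a classical horizontal projection profile: rows that contain ink are grouped into bands, and each band becomes one box. This is only a baseline under strong assumptions (an already binarized, deskewed page given as nested lists with 1 for ink), and the function name `line_bounding_boxes` is hypothetical. It is not the detection model described in this posting.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def line_bounding_boxes(binary: List[List[int]]) -> List[Box]:
    """Return one bounding box per run of consecutive rows containing ink.

    Assumption: `binary` is a deskewed, binarized page where 1 marks ink
    and 0 marks background.
    """
    boxes: List[Box] = []
    top = None  # first row of the text line currently being scanned
    # A trailing empty row acts as a sentinel that closes the last line.
    for y, row in enumerate(binary + [[]]):
        has_ink = any(row)
        if has_ink and top is None:
            top = y
        elif not has_ink and top is not None:
            band = binary[top:y]
            ink_columns = [x for r in band for x, v in enumerate(r) if v]
            boxes.append((min(ink_columns), top, max(ink_columns), y - 1))
            top = None
    return boxes


# Tiny made-up page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
print(line_bounding_boxes(page))  # [(1, 1, 3, 2), (0, 4, 1, 4)]
```

Real scanned documents usually break these assumptions through skew, noise, and touching lines, which is one reason learned detection models are used instead.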

1. ICDAR 2019 Post-OCR competition: Our team 'CLAM' secured 2nd position in the multilingual Post-OCR competition at ICDAR'19. Our model achieved the highest corrections (44%) in Finnish, significantly higher than the overall winner (8% in Finnish). The final report is available at https://drive.google.com/file/d/15mxNO-M9PiXBnffi7MOa8wUw33nj1xBp/view?usp=drive_open and the poster at https://drive.google.com/file/d/1uuBWu1LQ1QZ49SCgLBoB1er4HpWSzmcx/view

2. ICDAR2019: You can read the paper here - https://www.cse.iitb.ac.in/~rohitsaluja/PID6011473.pdf

3. ICDAR2017: You can read the paper here - https://ieeexplore.ieee.org/document/8269944

4. ICDAR-OST 2017:
(A) OpenOCRCorrect: you can read the paper here - https://ieeexplore.ieee.org/abstract/document/8270254
(B) Source code for our framework is available here - https://github.com/rohitsaluja22/OpenOCRCorrect.

Number of openings

About IIT Bombay

The Indian Institute of Technology, Bombay (IITB) is one of the fifteen higher institutes of technology in the country, set up to provide facilities for higher education, research, and training in various fields of science and technology. Professor Ganesh Ramakrishnan (Department of CSE) and Professor Ramasubramanian (Department of Humanities and Social Sciences) are working to significantly speed up the digitization of Sanskrit texts. Building on the OCR and post-editing technologies developed at IIT Bombay, they are now seeking the participation of the community of Sanskrit lovers, software developers, machine learning enthusiasts, project managers, etc.

Activity on Internshala

Hiring since December 2013

418 opportunities posted

107 candidates hired