Data Labelling And Digitization Of Text On OCR Application work from home job/internship at IIT Bombay
Data Labelling And Digitization Of Text On OCR Application
Start Date
Immediately
Duration
6 Months
5000-8000 /month
Apply By
9 Jul' 20
Part time allowed
About IIT Bombay
The Indian Institute of Technology, Bombay (IITB) is one of the fifteen higher institutes of technology in the country, set up with the objective of making facilities available for higher education, research, and training in various fields of science and technology. With the same mission and vision, Prof. Ganesh Ramakrishnan is gearing up to take rural India a leap ahead. For his outstanding contributions, he has also been awarded the IBM Faculty Award 2011.
About the work from home job/internship
The selected intern's day-to-day responsibilities include:

1. Working on the pipeline of OCR text correction (converting scanned text to digital text, with manual correction of OCRed text)
2. Working on the text annotation of the scanned documents
3. Fixing mistakes in the scanned documents
4. Matching the formatting (alignment, boldness, and font) of the OCRed text to the scanned images
5. Aiding in testing the software and tools
6. Maintaining documentation and presentations, and drafting project proposals and reports as and when needed
7. Working on a minimum of 5 books of 200 pages every month

Note: This internship does not involve programming or technical knowledge; it primarily deals with manual text correction. Secondary aspects are software testing and software installation.
Skill(s) required
MS-Word, MS-Excel, English Proficiency (Spoken), English Proficiency (Written), Hindi Proficiency (Spoken), Hindi Proficiency (Written)
Who can apply

Only those candidates can apply who:

1. are available for the work from home job/internship

2. can start the work from home job/internship between 10th Jun'20 and 15th Jul'20

3. are available for a duration of 6 months

4. have relevant skills and interests

Other requirements

1. You will need to use your own laptop; you can work from home, or from campus once the COVID situation improves

2. Any graduate or undergraduate from any stream who is comfortable using the software on Windows/Linux can apply for this position

3. Must be self-motivated and work proactively to complete the tasks within the assigned time frame

4. Expertise in reading Indic (Sanskrit and Hindi) languages and English

5. Knowledge of software installation on Windows and Ubuntu

6. Knowledge of MS-Office

7. Hands-on experience in data labeling is an added bonus

8. Ability to use custom-made tools for annotation and spell-checking efficiently

Certificate, Letter of recommendation, Flexible work hours, 5 days a week
Additional Information

Demo video for our framework is at (MUST WATCH for applying candidates). To install the software, you can go to and follow the instructions given in

The internship will start in June. It is an in-office internship, but candidates can work from home until the lockdown is lifted and it is safe to commute to the campus.

There is an immediate need to keep soft copies of preserved Indian texts. Optical Character Recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit options in the images/text, and creating a database for other applications like Machine Translation, Speech Recognition, and enhancing dictionaries and language models.

OCR in Indian languages is quite challenging due to their richness in inflections. Using open-source and commercial OCR systems, we have observed Word Error Rates (WER) of around 20-50% on printed documents in four different Indic languages. Moreover, even an OCR system with an accuracy as high as 90% is not useful unless it is aided by a mechanism to identify errors. So, we started with the problem of developing "OpenOCRCorrect", an end-to-end framework for Error Detection and Correction in Indic-OCR. In our ICDAR-2017 conference paper, our models outperform state-of-the-art results in “Error Detection in Indic-OCR” for six Indic languages with varied inflections, and we have solved the Out of Vocabulary problem for “Error Correction in Indic-OCR”. We further improve the results with the help of sub-word embeddings in our ICDAR-2019 conference paper.

Currently, we are targeting Sanskrit. Even with good OCR accuracy, the detected text needs a lot of improvement. The second step in the digitization of such texts is therefore spelling correction and formatting of the text detected by the OCR models. Hence, the selected candidate’s task would be correcting the generated OCR text in accordance with the scanned images of the 500 texts.
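
To give a feel for the Word Error Rate figures quoted above, here is a minimal Python sketch. It assumes WER is computed in the usual way, as the word-level edit distance (substitutions, insertions, and deletions) between the OCR output and the manually corrected text, divided by the number of words in the corrected text. The function name `word_error_rate` is hypothetical and is not part of OpenOCRCorrect.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference  : the manually corrected (ground-truth) text
    hypothesis : the raw OCR output
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word missing in OCR output
                curr[j - 1] + 1,                  # extra word in OCR output
                prev[j - 1] + substitution_cost,  # match or wrong word
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    corrected = "the quick brown fox jumps"
    ocr_output = "the quick brown fax jumps"
    # One wrong word out of five reference words.
    print(word_error_rate(corrected, ocr_output))
```

Under this definition, a WER of 20% means that roughly one word in five needs manual correction, which is the kind of work described in the responsibilities above.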

Number of openings
