This repository was archived by the owner on Mar 12, 2022. It is now read-only.

Meeting notes: Fall 2016 to Spring 2017

Peter Broadwell edited this page Aug 25, 2017 · 1 revision

2016-10-18

Tasks:

  • Research open source tools we can use (Scribe or another tool?).
  • Set up AWS so we can start deploying a backend.
  • Settle on the backend we want to use (Rails?, Node?, etc.).

General Notes:

  • Build an OCR module.
  • Workflow we want:
    • Books aren't catalogued yet, and they are far away in Wilshire.
    • There are two stacks of books: one is material they really want to catalogue, and the other is all the books they can't even read.
    • They would still like to catalogue both; we could be working with either pile, and we need to decide what the priorities are.
    • Someone there will take the photos; they should be of the inside cover with the title page, which has fewer graphics.
    • The title page has good formatting, and it is pretty easy to guess the layout.
    • They also want pictures of the last page, to know how long the book is.
  • We are trying to build an assembly-line process: someone takes the photographs first, the photos go onto the server, and then they get fed into the crowdsourcing interface.
  • The same pipeline could catalogue upcoming new books and acquisitions.
    • When new books come in, the library takes pictures and inputs the data manually.
    • Currently they get the book and type it into a pretty old terminal system.
    • After a book is catalogued, it is ready to go onto the shelves.
  • Since we work only with images of the most important pages, someone can go down and photograph a book really quickly instead of holding it until the cataloguing is done.
  • People who can translate might not have access to the building, but we still need their help; that's where the online transcription interface comes in.
    • Once a book is catalogued, it is ready to go onto the shelves; the idea of this project is to make that process easier and faster.
  • Johnny is assuming the catalog information is stored in the database, so we have to make Voyager and the application talk to each other.
    • Voyager is a SQL database with Java behind the scenes.
    • If you have a bunch of book records, you can just output them as XML files and do a bulk upload; this should be pretty easy once we have the data in the right format.
    • Some of the new books when they come in are given a lightweight catalog record.
  • Special case: a book comes in with a number already on it, but when we look it up in the database, there is no metadata. We have to update that!
  • Images are already taken and you just take more images and transcribe or check the OCR on the title page.
    • Volunteers right now have to look at the physical book, but we just want them to look at the images of the book.
    • It will be automatic but volunteers double check if the identification is correct or not.
    • Have some way of associating photos with the book.
      • The photos they take are of that book, but it is NOT easy to tell which book a photo came from.
  • Type the information into the system and print out the labels right away.
    • They do a bunch of books at once and complete the whole process in one sitting: type in everything and wait for the label printing to finish that day, so they can remember where each book is.
    • There is no confusion about where a book goes, but it has slow throughput.
    • We don't want to get too bogged down in the OCR images and end up building a system that won't work for them.
    • Have a system that can recognize one image at a time.
    • How much work can be done when someone takes a photo?
      • Put a sticker on that book and say that sticker is 1, for example.
  • We don't want to be developing those apps, but we might want to advise on it if that is the way it goes.
    • Some processing instead of just associating those images with the books.
    • The main challenge is that all the writing is NOT in English; if the app were sophisticated enough, it could do OCR right when they take the photo.
      • The problem is that most OCR engines parse for English!
  • Scribe is made for things you can OCR.
    • Ideally you could highlight a region and say "OCR this, I think it is Cyrillic" (or some other language), for example.
  • This technology could be used for many other purposes, such as image tagging.
  • If we could auto-OCR the images, we could have a crowdsourcing interface and detect text blocks.
    • This could be highlighted on the page you are seeing.
  • Does Scribe have its own OCR engine, or does it use third-party OCR?
    • It provides a framework for tagging images, and we could build this functionality on top.
    • The idea is to say "I think this text is in this region" and make interfaces where people can power through large quantities and verify someone else's transcription.
  • Look up open source projects that would best suit our needs! (Should we use Scribe or what do we need to do? Web or mobile?)
    • OCR is a bit expensive, but it can run on smartphones.
  • Play around with machine learning algorithms to learn which text block is the title.
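One of the tasks above is outputting Voyager book records as XML for bulk upload. A minimal serializer sketch follows; the element names (`records`, `record`, `title`, `author`) are placeholders, and the real import format (likely MARCXML) would need to be confirmed with the cataloguers.

```python
# Sketch: serialize lightweight catalog records to XML for a bulk upload.
# The element names here are placeholders, NOT a confirmed Voyager schema.
import xml.etree.ElementTree as ET

def records_to_xml(books):
    """Serialize a list of {'title': ..., 'author': ...} dicts to an XML string."""
    root = ET.Element("records")
    for book in books:
        rec = ET.SubElement(root, "record")
        for field in ("title", "author"):
            el = ET.SubElement(rec, field)
            el.text = book.get(field, "")
    return ET.tostring(root, encoding="unicode")
```

Writing one such file per batch of photographed books would match the "just output them as XML files and bulk upload" workflow described above.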

2016-11-01

  • If we want to develop a Java backend, we would use a text-OCR library.

    • Originally in C++, but we can use a Java wrapper.
  • We don't know yet whether we are doing iPhone app development.

    • Assuming we are doing a phone app, we would rather use a Swift OCR tool.
  • How we get the pictures is still a question; we want to make the pipeline high-throughput to take advantage of economies of scale in crowdsourcing.

  • They have other pilot projects where people catalog a bunch of books at once, but they haven't done crowdsourcing to handle other languages.

  • They have tried to make workflows involving tools and software packages with barcode scanners to make things faster.

    • Communication Studies has a big collection of television news recordings going back to the 70s; they need to digitize the tapes, and they have an app for that.
    • Customized version of File Maker Pro.
      • Have pre-printed barcodes that stick on the tapes and that goes up to the server.
  • Talk to the cataloguers to see what barcodes they use.

    • For now, continue to experiment and assume we need some sort of OCR.
    • Some tagging transcription interface like Scribe.
      • It makes handling the images easier and gives a bunch of functions you can call, like a drag-and-drop layout.
      • It doesn't have a recognition engine; if our OCR recognizes a language, we will process it that way.
  • Let's say something is a different language, how will it recognize that?

    • Support the most popular languages so we can handle most of the books.
    • It will take a lot of time to train Tesseract to learn these new languages.

Tesseract

  • We probably don't want to run it on all 50 languages; it needs to find the text, but it won't necessarily OCR it correctly.
    • This is where the Scribe interface will come in.
    • The Scribe interface just gives you the text; one idea is to run Tesseract first, since Scribe lets you draw a box around the text.
    • If we can work with the Scribe API to draw boxes around Tesseract's regions, each box would contain what Tesseract thinks the text is.
    • At this point, the user will verify this manually and write in what the actual text is.
    • Tesseract will highlight these regions; if a particular language matters, run OCR on the region again.
      • You can use this to rerun OCR with the right language.
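The box-drawing idea assumes we can get word regions out of Tesseract. Tesseract's TSV output (`tesseract image.png stdout tsv`) already carries per-word boxes and confidences; below is a sketch of parsing that output, with the OCR invocation itself left out.

```python
# Sketch: pull word bounding boxes out of Tesseract's TSV output.
# Column names match the TSV header Tesseract emits; conf == -1 marks
# structural rows (blocks/lines) rather than recognized words.
import csv
import io

def parse_boxes(tsv_text, min_conf=0):
    """Return [(text, (left, top, width, height))] for confident word rows."""
    boxes = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        conf = float(row["conf"])
        if conf >= min_conf and row["text"].strip():
            boxes.append((row["text"],
                          tuple(int(row[k]) for k in ("left", "top", "width", "height"))))
    return boxes
```

These (text, box) pairs are what would be handed to Scribe as pre-drawn regions for a volunteer to verify.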

Layout of the project:

  • Image ingest scripts in Python
  • Tesseract OCR in Java or C++
  • Scribe in Ruby on Rails

It is probably better to expose it as a web service so they can go directly to the web location; we would have a very simple interface for this, e.g.:

http://ec2.aws.com:8134/do_ocr
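A minimal sketch of what such a `do_ocr` endpoint could look like, as a stdlib-only WSGI app. Here `run_ocr` is a stub standing in for the real Tesseract call, and the route name just mirrors the URL above; none of this is an agreed interface.

```python
# Sketch: a bare-bones /do_ocr web endpoint (stdlib WSGI, no framework).
# run_ocr is a placeholder; a real implementation would hand the uploaded
# bytes to Tesseract (or a wrapper library) and return its output.
import json

def run_ocr(image_bytes):
    # Placeholder for the actual OCR call.
    return {"text": "", "bytes_received": len(image_bytes)}

def app(environ, start_response):
    if environ["PATH_INFO"] == "/do_ocr" and environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(size)
        payload = json.dumps(run_ocr(body)).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [payload]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

Served with `wsgiref.simple_server.make_server("", 8134, app).serve_forever()`, a client could POST an image with `curl --data-binary @cover.jpg http://host:8134/do_ocr` and get processed results back, matching the "send an image, get results" goal below.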

After server setup

  • Set it up so we can send an image and get some processed results sent back. Afterwards, we want to set up Scribe to annotate some basic images.

2016-11-08

  • Peter set up the AWS server
  • Keep working on the infrastructure and the best way to set these things up.
  • Top priority is getting Scribe working for now.

Java backend

  • Probably using Spring framework.
  • GitHub functionality already provided.

It expects you to crop around the text, and there are three sources that look promising.

  • Find something that can make this project work.
  • Assuming that images are already assigned to one type.

Scribe Rails

  • Try to get it working in a simple state first, then worry about optimization afterwards!

2017-01-19

  • For Matt + Rails team, work on a dummy page and try to get it annotating stuff purely on the Scribe side.

  • The OCR server is running; we just need to enhance it to return bounding boxes. Feed that data into Scribe, so the OCR acts as the first "person" to transcribe the page. The librarian can then see the bounding boxes drawn around the text.

  • Need some logic to get images of books so we don't have to necessarily upload them.

    • Run some script using the API and insert that data and put the images themselves into the Scribe workflow.
  • Java server won't give info such as author, title, etc.

    • When you configure a Scribe service, you give it a config file.
    • Scribe can give bounding boxes; image decomposition could produce them automatically (hopefully in the near future?).
  • Johnny's machine learning libraries:

    • Weka - machine learning library in Java
  • Show off Scribe interface ASAP and make it cool!


2017-02-02

Java Team Updates: Kenta set up API so that we can send requests to an API Endpoint.

  • Curl a local file on your computer and send it to the server
  • It runs every language that is available, and we can return a JSON object.
  • Check scribeAPI for dimensions and things like that.
  • Output is tab-delimited.

Realistic goals for Rails team:

  • Learn inner workings of the sample project
  • Change UI to fit our needs of the books
    • Author, Title, (TBD)

Try to set up Scribe so that everything will be transcribed.

  • Build a machine learning classifier? (Future plan?)
  • Going through this API, everything would already have a box around it but wouldn't yet be labeled as title, author, or publisher.
  • Get data from OCR into Scribe in the first place.
  • To troubleshoot the importing, set up a test site, see where the database changes.
  • Figure out the output step: Scribe does something to output the results, and we want to look at those files now.

Possible future goals:

  • Add in image detection and classification tools.

2017-02-17

Kenta

  • API is done
  • Testing would be nice.
  • Switching over to Scribe and figuring out how to set it up.
  • The Spring server is on Ubuntu, so viewing images there should be fine.

Matthew

  • Updated content
  • Changed parameters to our needs
  • Need arbitrary images to test before running OCR
  • Ultimate goal is to get OCR working

Peter

  • Uploaded test book covers
  • We need to figure out how to run the backend server and actually look at the images.
  • Images are on the server now to test with; we were wondering how to actually see them once we put them into Scribe.
  • Build infrastructure so that the images live somewhere on the server, and use the REST service so that we can OCR them and other things can access them.
  • Figure this out by reverse-engineering the Scribe documentation.
  • How should these boxes go into internal storage? Probably into the Mongo database.

Goals:

  • Try to put a demo on EC2.
  • Figure out if scribeAPI has an open API to handle requests from Spring server.

2017-03-02

Machine Learning:

  • Johnny should lead the charge because he does this for his research
  • If we want to mess around with it on our own, we can start with easy experiments against the Google Image API.
  • Try for the remainder of the quarter to get some initial direction from Johnny.
  • Come up with a machine classifier to guess what the bounding boxes are.
    • Will involve training data to get locations, sizes, and number of characters.
    • To have some degree of success, we need to bootstrap training data: a dataset of book covers plus metadata about the books, which Amazon provides.
      • If you run OCR on this cover image, then you know the name of the author is here on the image.
  • Depending on how much money and time, you can do it yourself, you can pay someone to label it for you, or bootstrap from existing datasets
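As a toy illustration of the classifier idea, here is a hand-written scorer over the features mentioned (box location and size). The "big and near the top" heuristic and its weighting are assumptions for illustration, not trained values; the eventual goal is to learn this from labelled data.

```python
# Sketch: guess which bounding box on a cover is the title.
# Heuristic only (assumed, not trained): title boxes tend to be large
# and near the top of the page.
def title_score(box, page_height):
    """box = (left, top, width, height, text); higher score = more title-like."""
    left, top, width, height, text = box
    size = width * height
    position = 1.0 - top / page_height  # nearer the top scores higher
    return size * position

def guess_title(boxes, page_height):
    """Return the text of the most title-like box."""
    return max(boxes, key=lambda b: title_score(b, page_height))[4]
```

A trained model would replace `title_score` with a classifier fitted on labelled (location, size, character count) features, as discussed above.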

Online books with full text are another way of keeping track of the books, authors, and titles.

  • These book cover images can have an ID that we can match to a record somewhere else, and we should have all of this data combined together.
  • Tesseract is already pretty good at guessing the language, but how would you augment the machine learning model to take multiple languages into account?
    • If it is hard to segment the letters out, this would be challenging.

Training Sets

  • Figure out how to train on the data so that the machine learning is very solid.

Feel free to put in tasks for this.


2017-03-10

Updates

  • Photo uploading - Jeff doesn't have image uploading working yet, but he found the corresponding code; there are three methods of loading the subjects and redirecting them to our own database.
  • Jasmine - calling the Tesseract API. Where do you call it from?
    • How do we actually invoke this?
    • Do it before Rails gets involved at all: use a script or server interface to run it on a bunch of images.
    • Pre-analyze everything; it is NOT real-time analysis anymore, so batch processing in bash or Python should be fine.
    • Alternatively, run it in real time: when a new image is uploaded, we could call it and pass the result into Scribe.
  • How would we know if there is a new file showing up?
    • Not sure yet. Might be another REST call to do this.
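The batch (non-real-time) option could look like the sketch below: scan an image directory, skip anything already processed, and run a supplied OCR function on the rest. The sidecar-`.txt` convention for marking processed images is an assumption for illustration, not an agreed format.

```python
# Sketch: batch pre-analysis over an image directory.
# An image is considered "new" if it has no .txt sidecar yet (assumed
# convention); ocr is any callable taking a path and returning text.
from pathlib import Path

def ocr_new_images(image_dir, ocr, exts=(".jpg", ".png", ".tif")):
    """Run ocr(path) on each unprocessed image; write results to sidecars."""
    processed = []
    for img in sorted(Path(image_dir).iterdir()):
        sidecar = img.with_suffix(".txt")
        if img.suffix.lower() in exts and not sidecar.exists():
            sidecar.write_text(ocr(img))
            processed.append(img.name)
    return processed
```

Run from cron (matching "a script that keeps going every day and checks" in the later notes), this stays idempotent: rerunning only touches images without sidecars.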

Johnny - if we have 10,000 images, we could store it in a database and use the interface to load the image and do annotations on those images.

  • Technically we don't even need the REST API but it is nice to have!
  • Get that working first and then eventually put real-time updates.

Peter - have it decoupled and then have metadata analysis with annotation interface and hopefully we can use it.

Ankur - we need to know what the Tesseract response actually means. It's a bunch of slashes and numbers; figure out what it actually means.

  • Scribe doesn't have to call Tesseract anymore. It just needs information processed by Tesseract.
  • Figure out where the database is stored.

We should know what Scribe does to zoomify the image and figure out what they are putting on s3.

Spring is a Java-based server sort of like Tomcat. Where does it put its files?

  • The project has a resource folder where we put all the files.

Publicly accessible Box URLs?

  • We want a folder with a bunch of images in it that is accessible through a hyperlink!
  • We should have a Scribe interface and we can use a link to find wherever the file is and choose to run Tesseract on them.

Need the test images somewhere to make it accessible. Put it on a Box or Google Folder.

Machine Learning

  • Use Scribe to annotate these and figure out Title, Author, Publisher, etc.
  • Should have Cataloging page and a Title page and possibly the number of pages.
  • If humans come in, give them the box to do it manually.
  • Could possibly use Clarifai API
  • Perform classification tasks as a proof-of-concept, which hopefully could gain traction.
  • Data project at first and then it will become a software engineering project.

Software Frameworks

  • Once they finish the work, we run the algorithm in Java before we store results into the database.
  • Either a separate Python service, or we can build the classification in Java (which would be done using Weka).

TODO:

  • Figure out where all the metadata is going to and do more digging through the source code.

2017-04-11

Ankur Updates

  • Script that keeps going every day and checks
  • Added the feature where it OCRs each new image it finds, but the OCR crashes half the time and we have to restart the OCR service.
    • It is mostly the OCR crashing, so no issue with Scribe.
    • Scribe also crashes sometimes, but it's the OCR that fails about half the time.
  • Good to troubleshoot it a bit and see if there were any log files that would be helpful.
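Since the OCR call fails intermittently, the script could wrap it in a simple retry, as sketched here. This is only a sketch of the idea; a real version would also restart the OCR service between attempts and write to the log files mentioned above.

```python
# Sketch: retry wrapper for a crash-prone OCR call.
# ocr is any callable; if it raises, retry up to `retries` times before
# giving up. A production version would also restart the OCR service.
import time

def ocr_with_retry(ocr, image, retries=3, delay=0.0):
    last_error = None
    for attempt in range(retries):
        try:
            return ocr(image)
        except Exception as exc:  # OCR service died; wait and try again
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"OCR failed after {retries} attempts") from last_error
```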

Jasmine Updates

  • Get the OCR data into Scribe and writing the Ruby script for the machine learning part.
  • Figuring out how these massive codebases work and how to get your data into them without breaking too many things.

Matthew, Jeffrey, DaLarm Task Updates

  • devise gem is already used in Scribe but it is too much to extend for the purposes of our GUI.
  • Bulk uploader is a browser dependent feature.
    • Get it working for one image right now
  • Make login usable
  • Add a feature to specify the language.

Export format is not as important as what the mechanism is or what the interface is to kick off the process of exporting out of Scribe.

Jasmine

  • Get the metadata and then upload these photos to Scribe and she will talk to Johnny more and build a classifier for that.

Goals

  • See how much code we can reuse so this becomes a standard process: take whatever images you have and see if you can pull out the text.
  • Figure out if we can ingest more data into Scribe and Peter will go talk to the cataloguers to see if our project is something that is viable for them.

2017-04-25

Image tagging updates

  • Bottleneck of figuring out if we are going to use the OCR output at all.

Ankur

  • Looking for where to put in the OCR data, but had some trouble.

Jasmine

  • Started looking into it last week as well, but Jasmine was busy.

DaLarm, Jeffrey, Matthew

  • Image uploader: having issues with uploading but password authentication is good.

Peter

  • Looking at code to put into database that scribe uses and it isn't fun to work with....

Jasmine

  • If we use Django for an image uploader, would it be easier?

    • Look for ways to avoid fighting with scribe
  • New zooniverse framework might be more supportive than scribe and could be worth checking out.

  • We would not lose a whole lot if we switch over to it, but then again, it might not be easier to work with.

    • We need an API interface to work directly with scribe.
    • See how the JavaScript sends data back to a Scribe API endpoint, if possible; that's how it puts in all the marked zones. However, this is NOT user-friendly and is tough to reverse engineer.
  • Ruby and MongoDB frameworks abstract a lot of details, so we need to figure out how to fix these issues.

  • See if there is a way to hack the marking interface for scribe to call the OCR from there

  • Look at alternatives like zooniverse and see if there is a better option. Compare with our current pieces and we should investigate if we can call the OCR from within the marking interface directly (exploratory work).

    • Don't want to fight with the Ruby framework and the MongoDB anymore and try to work towards some kind of demo to show Todd.
  • What do we need to get to the training phase?

    • Jasmine: Just a lot of data.
    • Pretty much have to download it and use it to train the network or whatever model we are using.
  • From OCR, we would lose the region information, but people could draw boxes around things (which is especially good for demo purposes).

    • It would be fun to see drawings in real-time for the OCR demo.

Demo

  • Last week of May or first week of June.

2017-05-17

Highlights:

a) Ankur figured out a way to feed the OCR data into Scribe, by POSTing data to the classifications/ endpoint. The code is in scribeAPI/ocrBot.rb but it still needs some fine-tuning.

b) Jasmine, Ankur and Pete investigated how to export the marking/transcription data from Scribe. The options are

- to use the (experimental) data exports feature (https://github.com/zooniverse/scribeAPI/wiki/Data-Exports), or

- extract it from the DB (it’s in the scribe_api_development/subjects table), either via a client library or processing the data found at http://ec2-54-173-153-28.compute-1.amazonaws.com:8000/subjects
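For the DB-extraction route, a flattener over exported subject documents might look like the sketch below. The `annotations`/`field`/`value` names are guesses at the Scribe schema and would need checking against a real dump of the subjects collection.

```python
# Sketch: flatten exported Scribe subject documents for review/export.
# Field names ("annotations", "field", "value") are assumed, NOT a
# verified Scribe schema.
def flatten_subjects(subjects):
    """Return (subject_id, field, value) rows from subject documents."""
    rows = []
    for subj in subjects:
        for annotation in subj.get("annotations", []):
            rows.append((subj["_id"], annotation.get("field"), annotation.get("value")))
    return rows
```

With pymongo, this could be fed from something like `MongoClient().scribe_api_development.subjects.find()`, or from the JSON served at the /subjects URL above.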

c) We should plan to have Scribe and hopefully the OCR feature ready to demo at a show-and-tell in early June — library admins are excited about it, but need to see a viable product to give more funding to summer BuildUCLA and next year’s program.
