AVG KMS: Plan 2023Q4
The Plan
Primary Goals
- The initial goal is to build up our database with information about our documents.
- Every feature added in the future depends on this derived information.
- Because of this:
- The first feature will be a case-closed submission client and server
- Ensures all new closed case files are tagged and organized correctly
- A server that brokers all data requests about our Box store needs to be made.
- A client that allows colleagues to make sense of this data is then required.
- A database that stores all learned information about the documents is needed.
- These three components then will need to have their own features developed to:
- Increase utility of the system for consultants and managers.
- Analyze the data so models can be developed or integrated to automate further.
- Backfill old case files with a more usable organization structure and metadata.
Before the Really Fun Stuff
- Before we get into the really desirable features such as:
- Training LLMs (Large Language Models, e.g. ChatGPT/Copilot).
- Using AI embeddings to generate content.
- AI embeddings for suggestions on general case knowledge.
- The main features that should be developed before year's end:
- Submissions Feature
- Automated Tagging Feature
- Search Feature
- This plan puts first the features that implement the most prerequisites.
Features in Detail
Case Submission Feature MVP
Case Submission Feature: The Why
- Teams has an SDK (Software Development Kit) for installable plugins on Teams.
- It's easy to develop for with components we can compose relatively easily.
- Teams is a familiar interface for most people in this organization.
- Fewer onboarding problems
- Teams is therefore well suited as a client for the KMS (Knowledge Management System).
- In short it's a good first feature because:
- It requires all three components that will need to be developed.
- Normalizes the workflow and data for all future cases.
- Onboards the staff with an easy to use feature.
Case Submission Feature: The What
- Should be the first feature to implement.
- Will give an interface to properly submit closed case files into Box and feed the database with the correct metadata.
- Ensures all new case files are correctly submitted making our lives easier.
- Requires file handling and authentication functions that future features need.
- Requires a database with the first schemas needed both to submit case files and, later, to query them so the system is useful for consultants.
Case Submission Feature MVP: The How
- The case submission Feature will act as the client that consultants and managers use.
- The client will communicate with a Python server running on Azure Functions.
- This server is the backend to the knowledge management system or KMS-server.
- The KMS-server will...
- Handle requests from the Teams client.
- Submit new case files to Box with user-selectable metadata like tags.
- Update the database with file references, metadata such as tags and user.
- The Managed Database by Azure is there to store and return all derived data.
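The database update described above can be sketched as follows. This is a minimal illustration, not the actual schema: the table names, column names, and the `record_submission` helper are all assumptions, and the standard-library `sqlite3` module stands in for the Azure-managed PostgreSQL database so the example runs anywhere.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE documents (
    id           INTEGER PRIMARY KEY,
    box_file_id  TEXT NOT NULL UNIQUE,   -- reference to the file in Box
    file_name    TEXT NOT NULL,
    submitted_by TEXT NOT NULL           -- user who submitted the case file
);
CREATE TABLE document_tags (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    tag         TEXT NOT NULL,
    PRIMARY KEY (document_id, tag)
);
"""


def open_database(path=":memory:"):
    """Open a database (in memory by default) and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_submission(conn, box_file_id, file_name, submitted_by, tags):
    """Store one submitted file and its user-selected tags in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        cursor = conn.execute(
            "INSERT INTO documents (box_file_id, file_name, submitted_by) "
            "VALUES (?, ?, ?)",
            (box_file_id, file_name, submitted_by),
        )
        document_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO document_tags (document_id, tag) VALUES (?, ?)",
            [(document_id, tag) for tag in sorted(set(tags))],  # drop duplicates
        )
    return document_id


def tags_for(conn, box_file_id):
    """Return the tags stored for a Box file, in alphabetical order."""
    rows = conn.execute(
        "SELECT t.tag FROM document_tags t "
        "JOIN documents d ON d.id = t.document_id "
        "WHERE d.box_file_id = ? ORDER BY t.tag",
        (box_file_id,),
    )
    return [row[0] for row in rows]
```

Wrapping both inserts in one transaction means a failed tag insert cannot leave a document row without its tags; the real server would do this only after the upload to Box has succeeded.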
Case Submission Feature MVP: The plan
- Ekin has already prepared the UI for the Teams plugin
- Needs networking features to connect to the KMS server.
- Once the server is ready, one to two weeks should be needed to get the connection working.
- Marcus is working on the KMS server which will need these functions:
- Box API authentication, both as a service user and on behalf of individual Box users (done).
- Box API file management that can create and read documents, but NOT delete or modify them.
- The only Box modification this server will be capable of is updating metadata.
- Should be done in one week from the offsite.
- The HTTP communication portion of the server that brokers client requests.
- Will make use of the Box API functionality that's soon to be completed.
- Needs to be connected to the Teams Plugin Client
- Should take about two weeks from offsite.
- An Azure-managed PostgreSQL database to store all derived metadata about Box
- Working with Matti to set up the administration privileges needed to create the database.
- In short, this feature should be complete by end of October.
Automated Tagging Feature
Automated Tagging Feature: The Why
- There is an estimated 2 TB of case documents, the vast majority of them untagged.
- Tagging them manually in a timely manner is impossible.
- The KMS-server & database will have a trained model to create tag inferences.
- This means we can slowly start letting the server automatically tag old files.
- But first we should have some manual interventions in place.
- (Discuss this workflow)
Automated Tagging Feature: The What
- With a pre-existing set of properly tagged documents we can...
- Train a model that the KMS-server will run to suggest tags for old documents.
- The KMS-server will periodically classify a new batch of documents.
- Then the KMS-server will update its findings on the database.
- These suggested tags can live on the database till the accuracy is validated.
- Once we're happy with the results in the database, the server will update Box.
Automated Tagging Feature: The How
- All currently tagged documents need to have their text content stored in the database.
- This lets us train a model efficiently, without going through Box's API each time.
- Then that text content with the in-place tags will be used to train a model.
- This model is known as a classifier and will use the Naive Bayes algorithm.
- It essentially correlates word frequency with pre-defined labels or tags.
- It is still used today to classify spam and is a proven text classification algorithm.
- Relatively easy to implement algorithm.
- Once trained & validated, the model will be applied by the KMS-server in batches.
- The suggested tags will live on the database till we're happy with the accuracy.
- Some discussion needed on how we will determine "good enough" accuracy.
- This also includes potentially developing an approval interface in Teams.
- When the accuracy is good enough, KMS-server will update Box with tags.
- Will be done in batches to avoid going over daily budgeting of computation.
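To make the classifier step concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions rather than the planned implementation: the class name `NaiveBayesTagger` and its methods are hypothetical, each training document carries exactly one tag, and tokenization is a plain lowercase word split.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic words only (a deliberate simplification)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical single-tag classifier: correlates word frequency with tags."""

    def __init__(self):
        self.tag_doc_counts = Counter()           # documents seen per tag
        self.word_counts = defaultdict(Counter)   # word frequencies per tag
        self.vocabulary = set()

    def train(self, documents):
        """documents: iterable of (text, tag) pairs taken from tagged files."""
        for text, tag in documents:
            self.tag_doc_counts[tag] += 1
            for word in tokenize(text):
                self.word_counts[tag][word] += 1
                self.vocabulary.add(word)

    def scores(self, text):
        """Return a log-probability score per tag (higher is more likely)."""
        total_docs = sum(self.tag_doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        result = {}
        for tag, doc_count in self.tag_doc_counts.items():
            total_words = sum(self.word_counts[tag].values())
            score = math.log(doc_count / total_docs)  # prior for this tag
            for word in words:
                count = self.word_counts[tag][word]
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((count + 1) / (total_words + vocab_size))
            result[tag] = score
        return result

    def suggest_tag(self, text):
        """Return the highest-scoring tag; requires a trained model."""
        scores = self.scores(text)
        return max(scores, key=scores.get)
```

For example, after training on a handful of documents tagged `pricing` and `operations`, `suggest_tag("discount pricing analysis")` returns whichever tag's training text shared the most vocabulary with the query. Supporting several tags per document would need one such score per tag with a cutoff, which is left out here.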
Automated Tagging Feature: The Plan (Part 1)
- The Database
- Needs to be updated to include suggested tags vs manual ones.
- Needs to add text contents for each document used for training.
- This will be useful for other features in the future such as search.
- These database updates should take about a week.
- Training the Model
- With tags and text content we can start to train the model.
- Some rounds of iteration may be needed to improve accuracy before deployment.
- Might also need an approval interface and process that managers can use.
- If so, that will be another feature to be developed in early 2024.
- It's hard to gauge the development time of this part
- Probabilistic models need iteration till error rate is low enough.
- An early version could be developed in 1-2 weeks.
- ...but iterated improvements could be necessary.
- Discuss what "good enough" is & if manual approval should be implemented.
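Whatever "good enough" turns out to mean, the check itself is simple to express. The sketch below assumes accuracy is measured as the share of suggested tags that match the manual tags on a held-out set of already-tagged documents; the function names and the threshold parameter are hypothetical, and the threshold value is exactly the number still to be discussed.

```python
def accuracy(suggested_tags, manual_tags):
    """Fraction of suggested tags that match the manually assigned ones."""
    if len(suggested_tags) != len(manual_tags):
        raise ValueError("both lists must cover the same documents")
    if not manual_tags:
        raise ValueError("at least one held-out document is required")
    correct = sum(s == m for s, m in zip(suggested_tags, manual_tags))
    return correct / len(manual_tags)


def ready_for_box(suggested_tags, manual_tags, threshold):
    """True when measured accuracy reaches the agreed threshold.

    The threshold is an open decision in this plan; it is a parameter
    here rather than a fixed value.
    """
    return accuracy(suggested_tags, manual_tags) >= threshold
```

A single accuracy figure can hide tags that are consistently wrong, so a per-tag breakdown may be worth adding before suggested tags are written back to Box.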
Automated Tagging Feature: The Plan (Part 2)
- Deploy Model to KMS-Server
- The KMS-server is where this model will live in production.
- Batches of documents will be indexed, probably by date (most recent first).
- Batches will be sized to avoid exceeding daily budgetary limits on Azure.
- Daily, new batches will be created and tagged.
- Integrating the model into the server is roughly 1 week of development.
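The batching rule above (most recent first, capped by a daily budget) could look roughly like this. Everything here is an assumption for illustration: the `PendingDocument` fields, the idea of a per-document `cost_units` estimate, and the choice to skip a document that does not fit rather than stop the batch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PendingDocument:
    box_file_id: str
    modified: date
    cost_units: int  # hypothetical cost estimate, e.g. proportional to file size


def next_batch(pending, daily_budget_units):
    """Pick documents, most recent first, until the daily budget is used up."""
    batch = []
    spent = 0
    for doc in sorted(pending, key=lambda d: d.modified, reverse=True):
        if spent + doc.cost_units > daily_budget_units:
            continue  # too expensive for what is left today; try older, smaller ones
        batch.append(doc)
        spent += doc.cost_units
    return batch
```

Skipping instead of stopping fills the budget more completely, at the price of processing slightly out of date order. A document whose cost exceeds the whole daily budget would never be selected, so such files would need separate handling.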
Search Feature
Search Feature: The Why
- Currently Box-based search for case files is not an efficient or pleasant experience.
- Box is not context sensitive enough.
- Tags will certainly help, but only so much.
- Combining Box search with queries on our database will improve search.
Search Feature: The What
- The goal is to integrate a search engine (e.g. Elasticsearch or Algolia) with:
- The Box API that already exists for our Box files
- Our document derived database
- The search results of these two combined will be far more nuanced.
- A naive combination of the results is the initial target.
- Tons of room for iterated improvements later on via ranking and machine learning.
- The interface to search would be on the Teams Plugin as another UI.
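One way to read "a naive combination of the results" is to interleave the two ranked lists and drop duplicates. The sketch below does exactly that and nothing more; the function name, the use of Box file IDs as result keys, and the `limit` parameter are assumptions, and no ranking or scoring is attempted.

```python
from itertools import zip_longest


def combine_results(box_results, database_results, limit=10):
    """Interleave two ranked lists of Box file IDs, keeping first occurrences."""
    combined = []
    seen = set()
    # Walk both lists in step so each source contributes its best hits early.
    for pair in zip_longest(box_results, database_results):
        for file_id in pair:
            if file_id is not None and file_id not in seen:
                seen.add(file_id)
                combined.append(file_id)
    return combined[:limit]
```

With Box returning `["a", "b", "c"]` and the database returning `["b", "d"]`, the combined list is `["a", "b", "d", "c"]`. Later iterations could replace the interleaving with a weighted score per result.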
Search Feature: The How
- The server needs to integrate with the Box API to request search results.
- 1 week of development time
- A search engine
- Instead of spending time crafting a search engine...
- Try integrating one of the big name search engines as a service:
- Elasticsearch is the most prominent; another up-and-coming option is Algolia.
- Only problem is data custody, need to discuss.
- Otherwise we can evaluate embedded search engines, would take more time.
- Depending on our solution, evaluation could take as much as 3 weeks.
- Likewise, integration after evaluation could take as much as 2 weeks.
Roadmap
- These 3 features, involving 3 to 4 components, form a roadmap.
- Some evaluation could alter the timetable due to...
- Unknowns about external services such as Box, Teams SDK, Azure, etc.
- Non-deterministic nature of probabilistic models.
- So far the biggest time sink has been figuring out how to authenticate with Box.
- Focus will be on getting a functioning version out the door fast.
- Iteration to improve the experience can and probably should be made in 2024.
Future Topics: LLM
- Microsoft is promoting Copilot, which builds on ChatGPT, as a way to analyze documents.
- Very interesting possibility to decrease development and admin time.
- Probably not as smooth an integration experience as we'd like.
- Having derived and labelled data in our databases should help significantly.
- Results in a chat-bot experience with specialized correlations from case files.
- Beware of AI Hallucinations.
Future Topics: Iteration
- To know how to improve this system, actual user experiences will be needed.
- Give these features some time with staff and involve them in day-to-day workflows.
- The feedback should dictate what improvements to prioritize.
Future Topics: How about You?
- Ultimately this is a tool to improve workflows at this firm.
- A single employee can only speculate so much which workflow improvements are most needed.
- Discuss what future goals this project should evaluate and attempt to reach.
- The next component is the KMS server
- Something needs to manage the authentication & data transfers with Box
- Putting this on the server side makes most sense
- Batch jobs that need to happen in the background for data analysis
- The Teams Plugin will act as the client to this server.