This issue outlines the roadmap for DocArray v2 (this is an internal name, the actual version will still be 0.x.y).
If you want a general overview of why we are doing this rewrite and what our vision is, check out the project's README and this blog post.
But in a nutshell: We are building a library for representing, sending, and storing multimodal data, with a deep integration with pydantic, and multimodal ML and Neural Search as flagship use cases.
Roadmap
Below you can find the rough roadmap for this rewrite.
We plan to release alpha versions and dev update blogs for every milestone that we reach, with smaller updates along the way.
As we are at the beginning of this effort, the later stages of this roadmap are not fully fleshed out yet, so take this issue as a living document!
alpha-v0.1.0
Target timeline: Before end of year 2022
What's inside:
DocArray is a library that lets you represent, send, and work on multimodal data.
The first alpha version will tackle the basic aspects of all three of these, but with a limited feature set.
We consider the problems that are tackled in the first alpha version as essential to the future of DocArray.
The implementation will be divided into three different phases:
Data representation (Target timeline: Done)
i. Support basic data types for image and text data
Implement find() on DocumentArray level. Basic implementation that can perform search on root-level embeddings (no support for search on nested levels; this will come later) feat: find function #931
Use case: Machine learning, training (alpha-v0.0.n+1) (Target timeline: Dec 15 2022)
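To make the find() item in the list above more concrete, here is a minimal sketch of what a search over root-level embeddings could look like. It is not DocArray's implementation or API: the Doc class, the cosine helper, and the find signature are all hypothetical names chosen for illustration, and it uses only the Python standard library and a brute-force cosine-similarity scan.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not DocArray's Document class."""
    text: str
    embedding: List[float]  # root-level embedding only, no nesting


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find(docs: List[Doc], query: List[float], limit: int = 2) -> List[Tuple[Doc, float]]:
    """Return the `limit` documents most similar to `query`, best first."""
    scored = [(doc, cosine(doc.embedding, query)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


docs = [
    Doc("cat", [1.0, 0.0]),
    Doc("dog", [0.8, 0.6]),
    Doc("car", [0.0, 1.0]),
]
results = find(docs, query=[1.0, 0.1], limit=2)
top_texts = [doc.text for doc, _ in results]
```

The sketch only compares embeddings stored directly on each document, which mirrors the "root-level only" restriction in the roadmap item; searching embeddings of nested documents would need an extra traversal step.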
Here we just round off the stuff we will have started earlier.
Support tensor types for more ML frameworks (Target timeline: Mar 30 2023)
HuggingFace safe tensor
Scipy
Jax (potentially)
Sparse tensors for all of the above
Potential features
There are a number of features and use cases that we are thinking about, but are not yet sure if and how they should find their way into the library.
Even if we decide to implement these features, they might not make it into one of the alpha versions. But since we are laying the foundation for everything else to come, we want to consider these from the start.
If you have input on these, please let us know!
Support for MongoDB: This could be an interesting candidate for a Document Store backend; it does not have vector search capabilities, but the document-focused design seems like a good fit.
Support for S3 storage: This is another option, but it might not fit into our Document Store concept, since it is usually more of a source of data rather than something you continually work with and modify. We are interested to know about different usage patterns and ideas about how to integrate this into DocArray.
Native support for Jax: If you are a user of Jax, let us know! We are considering expanding our ML framework / tensor support to include it natively.
DocArray v2
i. Support basic data types for image and text data
str
Tensor for numpy tensors
ImageURI Type: ImageURI #784
TextURI Type: TextURI #785
Embedding Type: Embedding #786
ii. Provide pre-built Documents
Image Pre-built: Image #787
Text Pre-built: Text #788
iii. Basic implementation of DocumentArray
Use case: Vector search system (alpha-v0.0.1) (Target timeline: Dec 15 2022)
find() on DocumentArray level. Basic implementation that can perform search on root-level embeddings (no support for search on nested levels; this will come later) feat: find function #931
Use case: Machine learning, training (alpha-v0.0.n+1) (Target timeline: Dec 15 2022)
Nested data (alpha-v0.0.n+1) (Target timeline: Dec 31 2022)
alpha-v0.2.0
Target timeline: Feb 15 2023
The plan for the second alpha version (and following) is to iterate on the basic ideas introduced in alpha-v0.1.0.
For now this means:
Util methods (Target timeline: Feb 15 2023)
Support for more data types (Target timeline: Jan 15 2023)
Tensorflow support (Target timeline: Feb 15 2023)
Support for LegacyDocument.
alpha-v0.3.0
Target timeline: Feb 28 2023
Data visualization (Target timeline: Feb 28 2023)
More serialization options (Target timeline: Feb 28 2023)
Support parallel processing and array-like access on DocumentArray (Target timeline: Feb 28 2023)
alpha-v0.4.0
Target timeline: Mar 15 2023
This version will focus on introducing vector databases (and potentially other data storage options) into the library.
Implement DocumentStore class (Target timeline: Feb 28 2023) feat: hnswlib document index #1124
Support the following storage backends (already supported in legacy DocArray): (Target timeline: Mar 15 2023)
Nested access on Document, DocumentArray, and DocumentStore
find on nested data/documents feat: nested attribute access in find() #1176
Support for push()/pull() to hub
Release version
Target timeline: third week of April 2023
Support for reading from another data format:
Better dev life experience:
Post-release
Here we just round off the stuff we will have started earlier.
Potential features
There are a number of features and use cases that we are thinking about, but are not yet sure if and how they should find their way into the library.
Even if we decide to implement these features, they might not make it into one of the alpha versions. But since we are laying the foundation for everything else to come, we want to consider these from the start.
If you have input on these, please let us know!
You can start a discussion on GitHub Discussions, or join our Discord server.
Changelog:
id and move alpha-0.3.0 target date
id. Delay map, apply, etc...