Meta: DocArray v2 Roadmap

# DocArray v2

This issue outlines the roadmap for DocArray v2 (this is an internal name, the actual version will still be 0.x.y).

If you want a general overview of why we are doing this rewrite, and what our overall vision is, check out the project's [readme](https://github.com/docarray/docarray/tree/feat-rewrite-v2#readme) and [this blog post](https://github.com/docarray/notes/blob/main/blog/01-announcement.md).

**But in a nutshell**: We are building a library for representing, sending, and storing multimodal data, with a deep integration with pydantic, and multimodal ML and Neural Search as flagship use cases.

# Roadmap

Below you can find the rough roadmap for this rewrite.

We plan to release alpha versions and dev update blogs for every milestone that we reach, with smaller updates along the way.

As we are at the beginning of this effort, the later stages of this roadmap are not fully fleshed out yet, so take this issue as a living document!

## alpha-v0.1.0

**Target timeline:** Before end of year 2022

**What's inside:**

DocArray is a library that lets you represent, send, and work on multimodal data.
The first alpha version will tackle the basic aspects of all three of these, but with a limited feature set.

We consider the problems that are tackled in the first alpha version as essential to the future of DocArray.

The implementation will be divided into three different phases:

1. **Data representation** (**Target timeline:** Done)
    i. Support basic data types for image and text data       
- [x] `str`
- [x] `Tensor` for numpy tensors
- [x] `ImageURI` https://github.com/docarray/docarray/issues/784
- [x] `TextURI` https://github.com/docarray/docarray/issues/785
- [x] `Embedding` https://github.com/docarray/docarray/issues/786
    ii. Provide pre-built Documents
- [x] `Image` https://github.com/docarray/docarray/issues/787
- [x] `Text` https://github.com/docarray/docarray/issues/788
    iii. Basic implementation of DocumentArray

2. **Use case: Vector search system** (`alpha-v0.0.1`) (**Target timeline:** Dec 15 2022)
    - [x] Ensure compatibility with **FastAPI** https://github.com/docarray/docarray/issues/838
    - [x] Implement **`find()`** on DocumentArray level. Basic implementation that can perform search on root-level embeddings (no support for search on nested levels; this will come later) https://github.com/docarray/docarray/pull/931

3. **Use case: Machine learning, training** (`alpha-v0.0.n+1`) (**Target timeline:** Dec 15 2022)
    - [x] Support for **PyTorch** tensor data type https://github.com/docarray/docarray/issues/783
    - [x] Support for torch tensors and numpy in column-wise mode ("stacked mode") (other frameworks will follow later) https://github.com/docarray/docarray/pull/886
    - [x] Ensure compatibility with PyTorch modules and PyTorch Lightning

4. **Nested data** (`alpha-v0.0.n+1`) (**Target timeline:** Dec 31 2022)
    - [x] Nested access on DocumentArray ("access paths") https://github.com/docarray/docarray/issues/957
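
To make the idea of searching on root-level embeddings more concrete, here is a minimal sketch in plain Python. It does not use the DocArray API: the `Doc` class and the `find_top_k` helper are hypothetical names chosen for illustration, and cosine similarity is an assumed metric, not necessarily the one `find()` uses.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a document with a root-level embedding."""
    text: str
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_top_k(index: List[Doc], query: List[float], k: int = 1) -> List[Tuple[Doc, float]]:
    """Score every document against the query and return the k best matches."""
    scored = [(doc, cosine_similarity(doc.embedding, query)) for doc in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = [
    Doc("cat", [1.0, 0.0]),
    Doc("dog", [0.0, 1.0]),
    Doc("kitten", [0.9, 0.1]),
]
results = find_top_k(docs, query=[1.0, 0.0], k=2)
```

This brute-force scan only looks at the embedding stored directly on each document, which mirrors the "root-level only" limitation described above.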

## alpha-v0.2.0

**Target timeline:** Feb 15 2023

The plan for the second alpha version (and following) is to iterate on the basic ideas introduced in `alpha-v0.1.0`.
For now this means:

1. Util methods (**Target timeline:** Feb 15 2023)
    - [x] filter with query language #1051
    - [x] reduce #1076
    - [x] array-like access via `__getitem__` https://github.com/docarray/docarray/pull/1074

2. Support for **more data types** (**Target timeline:** Jan 15 2023)
    - [x] Video https://github.com/docarray/docarray/pull/972
    - [x] Audio https://github.com/docarray/docarray/pull/940
    - [x] 3D meshes https://github.com/docarray/docarray/pull/925
    - [x] support bytes field in current type

3. **Tensorflow** support (**Target timeline:** Feb 15 2023)
    - [x] Tensorflow tensor type https://github.com/docarray/docarray/pull/1064
    - [x] Tensorflow embedding type https://github.com/docarray/docarray/pull/1098

4. Support for LegacyDocument.
   - [x] Provide legacy Document https://github.com/docarray/docarray/pull/1090

## alpha-v0.3.0

**Target timeline:** Feb 28 2023

1. Data visualization (**Target timeline:** Feb 28 2023)
    - [x] Pretty print and summary of Document and DocumentArray https://github.com/docarray/docarray/pull/1043
    - Plotting for
      - [x] Image https://github.com/docarray/docarray/pull/1136
      - [x] Audio https://github.com/docarray/docarray/pull/1136
      - [x] 3D meshes https://github.com/docarray/docarray/pull/1113
      - [x] Video https://github.com/docarray/docarray/pull/1136
  
2. More serialization options (**Target timeline:** Feb 28 2023)
      - [x] base64
      - [x] bytes 
      - [x] bytes in streaming mode (see https://docs.docarray.org/fundamentals/documentarray/serialization/#from-to-bytes)
      - [x] json 


3. Support parallel processing and array-like access on DocumentArray (**Target timeline:** Feb 28 2023)
    - [x] map https://github.com/docarray/docarray/pull/1187
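
As a rough illustration of what a parallel `map` over a collection of documents involves, here is a sketch built only on the Python standard library. The `parallel_map` helper and its `max_workers` parameter are assumptions made for this example and do not reflect the signature of the `map` implemented in the linked pull request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(func: Callable[[T], R], items: Iterable[T], max_workers: int = 4) -> List[R]:
    """Apply func to every item using a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(func, items))


def to_upper(text: str) -> str:
    """Toy per-document transformation."""
    return text.upper()


processed = parallel_map(to_upper, ["image", "audio", "video"])
```

Threads suit I/O-bound work such as loading files from URIs; CPU-bound transformations would usually call for a process pool instead.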


## alpha-v0.4.0

**Target timeline:** Mar 15 2023

This version will focus on introducing vector databases (and potentially other data storage options) into the library.

1. Implement **`DocumentStore`** class (**Target timeline:** Feb 28 2023) https://github.com/docarray/docarray/pull/1124
2. Support the following storage backends (already supported in legacy DocArray): (**Target timeline:** Mar 15 2023)
    - [x] ElasticSearch
    - [x] Qdrant
    - [x] Weaviate
    - [x] HNSW + SQLite

3. Nested access on Document, DocumentArray, and DocumentStore
    - [x] `find` on nested data/documents https://github.com/docarray/docarray/pull/1176


4. Support for push()/pull() to hub

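
The nested access mentioned in this milestone can be pictured as resolving a path through nested documents. The sketch below uses plain dictionaries; the `get_by_path` and `traverse` helpers, and the `__` separator, are assumptions for illustration and are not taken from the DocArray API.

```python
from typing import Any, Dict, List


def get_by_path(doc: Dict[str, Any], path: str, sep: str = "__") -> Any:
    """Follow a separator-delimited path through nested dictionaries."""
    value: Any = doc
    for key in path.split(sep):
        value = value[key]  # raises KeyError if the path does not exist
    return value


def traverse(docs: List[Dict[str, Any]], path: str) -> List[Any]:
    """Collect the value at the same path from every document."""
    return [get_by_path(doc, path) for doc in docs]


docs = [
    {"title": "a", "image": {"url": "a.png", "embedding": [1.0, 0.0]}},
    {"title": "b", "image": {"url": "b.png", "embedding": [0.0, 1.0]}},
]
urls = traverse(docs, "image__url")
```

A nested `find` could then gather the embeddings at a path such as `image__embedding` and search over them the same way it searches root-level embeddings.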
## Release version

**Target timeline:** third week of April 2023

1. Support for reading from another data format:
   - [x] csv https://github.com/docarray/docarray/pull/1144
   - [x] pandas https://github.com/docarray/docarray/pull/1161

2. Better developer experience:
   - [ ] #1236
   - [ ] PyCharm plugin + fix PyCharm problem
   - [x] #1237

## Post-release
Here we round off the work that was started in earlier milestones.

1. Support tensor types for **more ML frameworks** (**Target timeline:** Mar 30 2023)
    - [ ] HuggingFace safe tensor
    - [ ] Scipy
    - [x] Jax (potentially)
    - [ ] Sparse tensors for all of the above


# Potential features

There are a number of features and use cases that we are thinking about, but are not yet sure if and how they should find their way into the library.

Even if we decide to implement these features, they might not make it into one of the alpha versions. But since we are laying the foundation for everything else to come, we want to consider these from the start.

*If you have input on these, please let us know!*

- **Support for MongoDB:** This could be an interesting candidate for a Document Store backend; it does not have vector search capabilities, but the Document focused design seems like a good fit.
- **Support for S3 storage:** This is another option, but it might not fit into our Document Store concept, since it is usually more of a source of data rather than something you continually work with and modify. We are interested to know about different usage patterns and ideas about how to integrate this into DocArray.
- **Native support for Jax**: If you are a user of Jax, let us know! We are considering expanding our ML framework / tensor support to include it natively.

You can start a discussion on [GitHub Discussions](https://github.com/docarray/docarray/discussions/categories/docarray-v2), or join our [Discord server](https://github.com/docarray/docarray/pull/new/docs-readme-discord).

**Changelog:**
- Jan 11/23: Move nested access to storage backend section and add Jina support
- Jan 26/23: De-prioritize map/batch/reduce/... operations and adjust timeline accordingly
- Jan 27/23: remove "Jina support" as it is a Jina concern, not a DocArray concern. Add access by `id` and move alpha-0.3.0 target date
- Feb 07/23: rearrange ROADMAP. Remove access by `id`. Delay `map, apply, etc...`
