andygrove (Member) commented Mar 2, 2025

This PR updates the README to better explain the features that we plan to support.

Rendered version: https://github.com/andygrove/datafusion-ray/blob/new-readme/README.md

robtandy (Contributor) commented Mar 2, 2025

Thank you, @andygrove!

A few suggestions:

  • Can we rename Greedy to Streaming? I think it's more communicative about how it functions and conveys its major feature in the name.

  • For the code snippet, I think it should read:

    import ray
    from datafusion_ray import DFRayContext
    
    ray.init()
    session = DFRayContext()
    df = session.sql("SELECT * FROM my_table WHERE value > 100")
    df.show()
  • I'm not sure about the trade-offs as written. I think it's possible that, depending on the query, a batch mode could be faster than a streaming mode for smaller queries due to less overhead. We'll have to implement the batch mode to define this more clearly.

  • We should indicate that the batch mode is planned for 0.2.0 and that 0.1.0 will include Streaming only.

robtandy (Contributor) commented Mar 2, 2025

For the code snippet, I forgot to include ray.init(runtime_env=df_ray_runtime_env).

I think we should have:

import ray
from datafusion_ray import DFRayContext, df_ray_runtime_env

ray.init(runtime_env=df_ray_runtime_env)
session = DFRayContext()
df = session.sql("SELECT * FROM my_table WHERE value > 100")
df.show()

df_ray_runtime_env is necessary to set up logging correctly in Ray workers.

andygrove (Member, Author) commented

I made the following updates:

  • Switched to Streaming and Batch terminology
  • Added a note that batch is not implemented yet, with a link to the tracking issue
  • Updated the code example

Do you have specific suggestions for updating the trade-offs?

andygrove merged commit 8e1a56a into apache:main on Mar 2, 2025 (1 check passed)
andygrove deleted the new-readme branch on March 2, 2025, 20:21