Plans are worthless, but planning is everything
ideas
- [work] ask about Databricks & SQLAlchemy
- [ml] "do you eat healthy?": recommend products, count calories
challenges
- [work][future][todo] Does aAA need a new repo for his work, or could it go into the commons repo for pipelines?
learnings
- Parquet files can be very compact
- ~4_000_000 records compressed to 40 MiB
- [databricks]
- External tables
- Volumes = mount + per-user management
- Capture and view data lineage with Unity Catalog
- Delta
- Delta Lake
- Open source project that enables building a Lakehouse architecture on top of data lakes.
- Provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.
- [https://docs.delta.io/latest/quick-start.html#language-python](https://docs.delta.io/latest/quick-start.html#language-python)
- [https://docs.delta.io/latest/index.html](https://docs.delta.io/latest/index.html)
- get a table's schema from the Databricks catalog with Delta Lake:
DeltaTable.forName(spark, f"{source_schema}.{tableName}").toDF().schema
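
Going back to the Parquet note in this list: a quick back-of-the-envelope check of what "4_000_000 records in 40 MiB" means per record. This is only a sketch of the arithmetic using the two figures from the note; the helper name `bytes_per_record` is made up for illustration, and the real ratio will depend on the data and the compression codec.

```python
# Figures taken from the note: roughly 4_000_000 records in a 40 MiB Parquet file.
RECORDS = 4_000_000
FILE_SIZE_MIB = 40

BYTES_PER_MIB = 1024 * 1024  # 1 MiB = 2**20 bytes


def bytes_per_record(size_mib: float, records: int) -> float:
    """Average on-disk bytes per record (hypothetical helper, for illustration only)."""
    return size_mib * BYTES_PER_MIB / records


avg = bytes_per_record(FILE_SIZE_MIB, RECORDS)
print(f"{avg:.2f} bytes per record")  # prints "10.49 bytes per record"
```

So on average each record takes about 10 bytes on disk, which is an average over the whole file and not a property of any single row.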
finds
- Blog Peter Baumgartner https://www.peterbaumgartner.com/blog/
- Ways I Use Testing as a Data Scientist https://peterbaumgartner.com/blog/testing-for-data-science/
Thanks for reading this ❤️
Love,
KK