SQL for beginners: an example in field trials
Table of contents
- 1) Why SQL matters in agriculture analytics
- 2) What is a table, row, column? What is a key?
- 3) The data story: a simple illustrative field trial dataset
- 4) Create the schema (PostgreSQL)
- 5) Insert a small sample dataset
- 6) SQL queries in increasing difficulty
- 7) Quick recap of common mistakes
In this post we move from SQL fundamentals to a practical farm trial workflow. We cover tables and keys, schema design, data loading, core queries, quality checks, and clear interpretation limits.
1) Why SQL matters in agriculture analytics
As a statistician working with agriculture and breeding data, I can tell you that SQL is one of the most useful skills you can learn if you work with yield, field trials, weather, soil, or farm performance.
If spreadsheets are your current home, think of SQL as the next level for at least two main reasons:
- A spreadsheet is great for a few tabs and a few thousand rows.
- SQL databases are better when data grows, when multiple people use the same data, and when you need repeatable analysis.
In agriculture, this happens quickly. Examples:
- One farm becomes multiple farms.
- One season becomes multi-year comparisons.
- One file becomes many sources (harvests, weather, field metadata, treatments).
Without SQL, people copy-paste and make accidental mistakes. With SQL, you ask clear questions and get reproducible answers.
Another reason this matters in field trial work is that your SQL query becomes a transparent analysis recipe. If someone asks “how did you compute that treatment average?”, you can show the exact logic instead of manually reconstructing spreadsheet steps.
A few real questions SQL helps you answer are:
- Which treatment had the best average yield in 2025?
- Did the response differ by region?
- Which fields are missing weather data?
- Are there duplicate harvest records?
I also want to prevent a common beginner frustration: queries that “kind of work” but return the wrong rows. I will show you exactly how to avoid the classic pitfalls (especially forgetting WHERE, confusing NULL, and using the wrong type of join).
2) What is a table, row, column? What is a key?
Before writing SQL, let us define the core concepts in plain language.
Table
A table is like a spreadsheet tab.
Example: the `fields` table.
It stores one type of thing (fields).
Row
A row is one record.
Example: one row in `fields` might be field_id = 101, field_name = 'North 01'.
Column
A column is one variable.
Example: `soil_type` is a column.
Primary key
A primary key is a column (or combination of columns) that uniquely identifies each row.
Example: `farm_id` in `farms`.
No duplicates, no missing values.
Think of it like a unique ID card.
Foreign key
A foreign key links one table to another.
Example: `fields.farm_id` points to `farms.farm_id`.
It enforces consistency: every field must belong to a real farm.
A possible analogy is to think about filing cabinets.
Primary key = unique folder number inside one cabinet. Foreign key = reference to a folder in another cabinet.
This is what allows us to connect tables correctly with joins.
Figure. Foreign key relationship example. fields.farm_id must match an existing farms.farm_id, creating a reliable one-to-many link for joins.
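To see referential integrity in action before writing any PostgreSQL, here is a minimal throwaway sketch using Python's built-in sqlite3 module (chosen only because it needs no database server; the table names mirror this tutorial, and the real PostgreSQL schema comes in section 4):

```python
import sqlite3

# Throwaway in-memory database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.execute("CREATE TABLE farms (farm_id INTEGER PRIMARY KEY, farm_name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE fields (
        field_id   INTEGER PRIMARY KEY,
        farm_id    INTEGER NOT NULL REFERENCES farms (farm_id),
        field_name TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO farms VALUES (1, 'Green Valley Farm')")
conn.execute("INSERT INTO fields VALUES (101, 1, 'North-01')")  # OK: farm 1 exists

try:
    conn.execute("INSERT INTO fields VALUES (102, 99, 'Ghost-01')")  # no farm 99
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)  # the foreign key blocks the orphan row
```

The second insert fails because farm 99 does not exist in the parent table: exactly the "every field must belong to a real farm" guarantee described above.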
3) The data story: a simple illustrative field trial dataset
Let us imagine a realistic but small trial program. You manage three farms in different regions. Each farm has fields. In each field, you test fertilizer treatments across seasons. You also collect daily weather at farm level.
Our dataset has six tables
- `farms`: one row per farm.
- `fields`: one row per field (each field belongs to a farm).
- `treatments`: one row per treatment level.
- `trial_plots`: one row per field/treatment/season plot assignment.
- `harvests`: one row per harvested plot result.
- `weather_daily`: one row per farm/day weather record.
This structure is very common in production analytics:
- “Entity” tables (farms, fields, treatments) describe the system.
- “Design/event” tables (trial_plots, harvests, daily weather) store assignments and repeated observations over time.
Note that this structure is not only “database theory”. It directly affects analysis quality because
- if farm/field metadata are separated cleanly from events, joins are simpler and less error-prone.
- if everything is in one giant table, duplicates and inconsistent updates become much more likely.
- good schema design is one of the easiest ways to reduce analysis mistakes before they happen.
3.1) How to design an SQL schema (using this farm example)
If you are just starting, schema design may feel abstract. A practical way to think is to answer these questions first:
- What decisions/questions do I need to answer?
- What “things” exist in my system?
- What events happen over time?
- What should be unique, and what should be linked?
In this farm trial example, we separate entities from events and focus on treatment response, missing records, and weather context.
Figure. Schema design mind map for this tutorial. Blue labels indicate the variables that link tables.
How each table looks (and how they connect)
farms (parent table)
| farm_id | farm_name | region |
|---|---|---|
| 1 | Green Valley Farm | North |
| 2 | Riverbend Farm | Central |
| 3 | Sunrise Farm | South |
fields (each field belongs to a farm through farm_id)
| field_id | farm_id | field_name | area_ha | soil_type |
|---|---|---|---|---|
| 1 | 1 | North-01 | 12.50 | clay_loam |
| 2 | 1 | North-02 | 10.00 | sandy_loam |
| 3 | 2 | Central-01 | 15.20 | silt_loam |
| 4 | 2 | Central-02 | 9.80 | clay |
| 5 | 3 | South-01 | 14.10 | sandy |
| 6 | 3 | South-02 | 11.40 | loam |
treatments (reference table for treatment definitions)
| treatment_id | treatment_name | fertilizer_kg_ha |
|---|---|---|
| 1 | Control | 0 |
| 2 | Nitrogen_Low | 60 |
| 3 | Nitrogen_Medium | 120 |
| 4 | Nitrogen_High | 180 |
trial_plots (design/event table linking field_id + treatment_id + season_year)
| plot_id | field_id | treatment_id | season_year | replicate_no |
|---|---|---|---|---|
| 1 | 1 | 1 | 2024 | 1 |
| 2 | 1 | 3 | 2024 | 1 |
| 3 | 1 | 1 | 2025 | 1 |
| 4 | 1 | 4 | 2025 | 1 |
| 10 | 3 | 2 | 2024 | 1 |
| 11 | 3 | 1 | 2025 | 1 |
| 12 | 3 | 3 | 2025 | 1 |
| 18 | 5 | 2 | 2024 | 1 |
| 24 | 6 | 3 | 2025 | 1 |
harvests (event table linked to trial_plots by plot_id)
| harvest_id | plot_id | yield_t_ha | harvest_date |
|---|---|---|---|
| 1 | 1 | 6.20 | 2024-10-05 |
| 2 | 2 | 7.00 | 2024-10-05 |
| 3 | 3 | 6.10 | 2025-10-06 |
| 4 | 4 | 7.40 | 2025-10-06 |
| 10 | 10 | 7.60 | 2024-10-10 |
| 11 | 11 | 6.90 | 2025-10-11 |
| 12 | 12 | 7.90 | 2025-10-11 |
| 18 | 18 | 5.60 | 2024-10-12 |
| 24 | 24 | 6.20 | 2025-10-13 |
weather_daily (event table linked to farms by farm_id)
| weather_id | farm_id | day | rainfall_mm | tmean_c |
|---|---|---|---|---|
| 1 | 1 | 2024-06-01 | 12.4 | 18.5 |
| 2 | 1 | 2024-06-02 | 0.0 | 19.1 |
| 4 | 1 | 2025-06-01 | 18.0 | 17.9 |
| 7 | 2 | 2024-06-01 | 8.1 | 20.0 |
| 8 | 2 | 2024-06-02 | 1.2 | 21.3 |
| 10 | 2 | 2025-06-01 | 10.5 | 19.4 |
| 11 | 2 | 2025-06-02 | 3.0 | NULL |
| 13 | 3 | 2024-06-01 | 2.0 | 23.8 |
| 16 | 3 | 2025-06-01 | 1.0 | 24.1 |
Link summary
- `fields.farm_id` links with `farms.farm_id`.
- `trial_plots.field_id` links with `fields.field_id`.
- `trial_plots.treatment_id` links with `treatments.treatment_id`.
- `harvests.plot_id` links with `trial_plots.plot_id`.
- `weather_daily.farm_id` links with `farms.farm_id`.
The core idea is that trial_plots stores treatment assignment at plot level, harvests stores measured outcomes for those plots, and weather_daily connects environmental context to the same farms.
Step A: define table grain first
Grain means “what one row represents”.
- Entity tables (`farms`, `fields`, `treatments`): one row per entity.
- Design/event tables (`trial_plots`, `harvests`, `weather_daily`): one row per assignment or time-based observation.
If grain is unclear, everything after that gets messy (duplicates, wrong joins, wrong counts).
Figure. Quick grain and uniqueness map showing what one row represents and the matching UNIQUE(...) rule.
Step B: choose primary keys and foreign keys
Use the same linking variables shown above and enforce them with primary keys (IDs in parent tables) and foreign keys (matching IDs in child/event tables). This gives you safe joins and enforces referential integrity. Performance note (PostgreSQL): foreign keys enforce integrity, but they do not automatically create indexes on the referencing columns. For large farm/event tables, it is often a good idea to index the foreign key columns you join on.
Step C: define business uniqueness with constraints
Primary keys identify rows technically, but you also need domain level uniqueness.
Examples for this project:
- in `trial_plots`, you usually expect one row per `field_id + treatment_id + season_year + replicate_no`.
- in `harvests`, you usually expect one row per `plot_id`.
- in `weather_daily`, you usually expect one row per `farm_id + day`.
That can be enforced with unique constraints:
ALTER TABLE trial_plots
ADD CONSTRAINT uq_trialplot_field_treat_year_rep
UNIQUE (field_id, treatment_id, season_year, replicate_no);
ALTER TABLE harvests
ADD CONSTRAINT uq_harvest_plot
UNIQUE (plot_id);
ALTER TABLE weather_daily
ADD CONSTRAINT uq_weather_farm_day
UNIQUE (farm_id, day);

This protects your analysis from duplicate event records.
Step D: pick data types and validation rules
Use types and constraints to prevent bad data early.
- `DATE` for dates (`harvest_date`, `day`).
- numeric types for quantities (`yield_t_ha`, `rainfall_mm`, `area_ha`).
- `NOT NULL` for required fields.
- `CHECK` for physical logic (for example, non-negative `yield_t_ha`).
Your schema is the first quality control layer.
Step E: separate raw ingestion from clean analytics
A practical production pattern:
- Load external files into staging tables.
- Validate and standardize.
- Insert into core relational tables (`farms`, `fields`, `trial_plots`, `harvests`, `weather_daily`).
- Build analysis views/CTEs from the clean core.
This keeps raw import problems away from final reporting.
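A minimal sketch of this staging pattern, assuming a single raw feed of harvest rows (SQLite via Python's sqlite3 stands in for PostgreSQL; in production you would typically COPY a CSV into the staging table, and use stricter validation than the crude numeric check shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- staging: everything arrives as text, nothing is rejected yet
    CREATE TABLE staging_harvests (plot_id TEXT, yield_t_ha TEXT, harvest_date TEXT);

    -- core: typed and constrained, the first quality-control layer
    CREATE TABLE harvests (
        plot_id      INTEGER NOT NULL,
        yield_t_ha   REAL    NOT NULL CHECK (yield_t_ha >= 0),
        harvest_date TEXT    NOT NULL
    );
""")

raw_rows = [
    ("1", "6.20", "2024-10-05"),   # clean
    ("2", "-1.0", "2024-10-05"),   # impossible negative yield
    ("3", "n/a",  "2025-10-06"),   # non-numeric junk
]
conn.executemany("INSERT INTO staging_harvests VALUES (?, ?, ?)", raw_rows)

# Promote only rows that pass validation; bad rows stay in staging for review.
conn.execute("""
    INSERT INTO harvests (plot_id, yield_t_ha, harvest_date)
    SELECT CAST(plot_id AS INTEGER), CAST(yield_t_ha AS REAL), harvest_date
    FROM staging_harvests
    WHERE yield_t_ha GLOB '*[0-9]*'          -- crude numeric check for the sketch
      AND CAST(yield_t_ha AS REAL) >= 0
""")
print(conn.execute("SELECT COUNT(*) FROM harvests").fetchone()[0])  # 1 clean row promoted
```

Only the clean row reaches the core table; the negative yield and the non-numeric entry remain in staging where they can be inspected and corrected.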
Step F: design for change
Field trial data evolves. Plan for that.
A practical habit is to sketch the schema on a piece of paper first. Writing it by hand is often the best way to think through table grain and links before coding.
Figure. Extension ready schema sketch with added event tables (for example pest_scouting and soil_tests).
- New treatment levels can be added without redesigning `trial_plots` or `harvests`.
- New farms/fields can be inserted without touching historical records.
- Additional event tables (for example pest scouting, soil tests) can be linked by keys.
Good schema design makes future extension easier and safer.
Mini design checklist (farm trial version)
- Is each table's grain explicit?
- Are primary and foreign keys defined?
- Are business-level uniqueness constraints present?
- Are required columns NOT NULL where appropriate?
- Are sanity checks (CHECK) defined for critical numeric values?
- Can you answer your main trial questions with straightforward joins?
4) Create the schema (PostgreSQL)
Below is a clean schema for the six required tables. I will use PostgreSQL syntax and keep it beginner-friendly.
-- Optional cleanup if you are re-running this tutorial script.
-- Drop child tables first (they depend on parent tables).
DROP TABLE IF EXISTS harvests;
DROP TABLE IF EXISTS trial_plots;
DROP TABLE IF EXISTS weather_daily;
DROP TABLE IF EXISTS fields;
DROP TABLE IF EXISTS treatments;
DROP TABLE IF EXISTS farms;
-- 1) Farms table
CREATE TABLE farms (
farm_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
farm_name TEXT NOT NULL,
region TEXT NOT NULL
);
-- 2) Fields table
CREATE TABLE fields (
field_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
farm_id INT NOT NULL,
field_name TEXT NOT NULL,
area_ha NUMERIC(8,2) NOT NULL,
soil_type TEXT NOT NULL,
CONSTRAINT fk_fields_farm
FOREIGN KEY (farm_id) REFERENCES farms (farm_id),
CONSTRAINT chk_positive_area
CHECK (area_ha > 0)
);
-- 3) Treatments table
CREATE TABLE treatments (
treatment_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
treatment_name TEXT NOT NULL,
fertilizer_kg_ha NUMERIC(8,2) NOT NULL,
CONSTRAINT chk_nonnegative_fert
CHECK (fertilizer_kg_ha >= 0)
);
-- 4) Trial plots table (plot level treatment assignment)
CREATE TABLE trial_plots (
plot_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
field_id INT NOT NULL,
treatment_id INT NOT NULL,
season_year INT NOT NULL,
replicate_no INT NOT NULL DEFAULT 1,
block_no INT,
CONSTRAINT fk_trialplots_field
FOREIGN KEY (field_id) REFERENCES fields (field_id),
CONSTRAINT fk_trialplots_treatment
FOREIGN KEY (treatment_id) REFERENCES treatments (treatment_id),
CONSTRAINT uq_trialplot_field_treat_year_rep
UNIQUE (field_id, treatment_id, season_year, replicate_no)
);

Note: block_no is optional and represents blocking (for example in an RCBD). If you do not use blocks yet, you can leave it NULL and ignore it in queries.
-- 5) Harvests table
CREATE TABLE harvests (
harvest_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
plot_id INT NOT NULL,
yield_t_ha NUMERIC(8,2) NOT NULL,
harvest_date DATE NOT NULL,
CONSTRAINT fk_harvests_plot
FOREIGN KEY (plot_id) REFERENCES trial_plots (plot_id),
CONSTRAINT uq_harvest_plot
UNIQUE (plot_id),
CONSTRAINT chk_nonnegative_yield
CHECK (yield_t_ha >= 0)
);
-- 6) Daily weather table
CREATE TABLE weather_daily (
weather_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
farm_id INT NOT NULL,
day DATE NOT NULL,
rainfall_mm NUMERIC(8,2) NOT NULL,
tmean_c NUMERIC(5,2),
CONSTRAINT fk_weather_farm
FOREIGN KEY (farm_id) REFERENCES farms (farm_id),
CONSTRAINT uq_weather_farm_day
UNIQUE (farm_id, day),
CONSTRAINT chk_nonnegative_rain
CHECK (rainfall_mm >= 0)
);
-- Optional indexes for faster joins on bigger datasets (PostgreSQL)
CREATE INDEX IF NOT EXISTS idx_fields_farm_id ON fields (farm_id);
CREATE INDEX IF NOT EXISTS idx_trial_plots_field_id ON trial_plots (field_id);
CREATE INDEX IF NOT EXISTS idx_trial_plots_treatment_id ON trial_plots (treatment_id);
CREATE INDEX IF NOT EXISTS idx_harvests_plot_id ON harvests (plot_id);
CREATE INDEX IF NOT EXISTS idx_weather_daily_farm_day ON weather_daily (farm_id, day);

Line by line explanation (key points)
- `GENERATED ALWAYS AS IDENTITY`: auto-incrementing IDs.
- `PRIMARY KEY`: unique row identity.
- `NOT NULL`: value must be present.
- `FOREIGN KEY (...) REFERENCES ...`: links tables and protects data integrity.
- `CHECK (yield_t_ha >= 0)`: prevents impossible negative yield values.
PostgreSQL vs Oracle notes (schema)
PostgreSQL and Oracle both support identity columns (GENERATED ... AS IDENTITY).
Older PostgreSQL tutorials often use SERIAL; Oracle does not use SERIAL.
TEXT is PostgreSQL specific; in Oracle you usually use VARCHAR2(...).
5) Insert a small sample dataset
Now we load enough rows to practice filtering, joining, grouping, and quality checks.
In production work this is not the most common way to load data. Most teams load SQL tables from CSV files, data pipelines, ETL or ELT tools, application jobs, or direct connectors from source systems. Here we use direct INSERT statements only to simplify the idea and keep the tutorial fully reproducible in one place.
-- Farms
INSERT INTO farms (farm_name, region) VALUES
('Green Valley Farm', 'North'),
('Riverbend Farm', 'Central'),
('Sunrise Farm', 'South');
-- Fields
INSERT INTO fields (farm_id, field_name, area_ha, soil_type) VALUES
(1, 'North-01', 12.50, 'clay_loam'),
(1, 'North-02', 10.00, 'sandy_loam'),
(2, 'Central-01', 15.20, 'silt_loam'),
(2, 'Central-02', 9.80, 'clay'),
(3, 'South-01', 14.10, 'sandy'),
(3, 'South-02', 11.40, 'loam');
-- Treatments
INSERT INTO treatments (treatment_name, fertilizer_kg_ha) VALUES
('Control', 0),
('Nitrogen_Low', 60),
('Nitrogen_Medium', 120),
('Nitrogen_High', 180);
Note: the harvests inserts below assume the trial_plots rows were inserted into an empty table in this exact order, so plot_id becomes 1..24.
If you change insert order or re-run partially, your IDs can differ.
-- Trial plots
INSERT INTO trial_plots (field_id, treatment_id, season_year, replicate_no) VALUES
(1, 1, 2024, 1),
(1, 3, 2024, 1),
(1, 1, 2025, 1),
(1, 4, 2025, 1),
(2, 1, 2024, 1),
(2, 3, 2024, 1),
(2, 1, 2025, 1),
(2, 4, 2025, 1),
(3, 1, 2024, 1),
(3, 2, 2024, 1),
(3, 1, 2025, 1),
(3, 3, 2025, 1),
(4, 1, 2024, 1),
(4, 2, 2024, 1),
(4, 1, 2025, 1),
(4, 3, 2025, 1),
(5, 1, 2024, 1),
(5, 2, 2024, 1),
(5, 1, 2025, 1),
(5, 3, 2025, 1),
(6, 1, 2024, 1),
(6, 2, 2024, 1),
(6, 1, 2025, 1),
(6, 3, 2025, 1);
-- Harvests
INSERT INTO harvests (plot_id, yield_t_ha, harvest_date) VALUES
(1, 6.20, '2024-10-05'),
(2, 7.00, '2024-10-05'),
(3, 6.10, '2025-10-06'),
(4, 7.40, '2025-10-06'),
(5, 5.50, '2024-10-07'),
(6, 6.40, '2024-10-07'),
(7, 5.40, '2025-10-07'),
(8, 6.80, '2025-10-07'),
(9, 7.10, '2024-10-10'),
(10, 7.60, '2024-10-10'),
(11, 6.90, '2025-10-11'),
(12, 7.90, '2025-10-11'),
(13, 6.80, '2024-10-09'),
(14, 7.20, '2024-10-09'),
(15, 6.70, '2025-10-10'),
(16, 7.60, '2025-10-10'),
(17, 4.90, '2024-10-12'),
(18, 5.60, '2024-10-12'),
(19, 4.70, '2025-10-12'),
(20, 5.90, '2025-10-12'),
(21, 5.30, '2024-10-13'),
(22, 5.90, '2024-10-13'),
(23, 5.10, '2025-10-13'),
(24, 6.20, '2025-10-13');
-- Weather daily (with some missing tmean_c to teach NULL handling)
INSERT INTO weather_daily (farm_id, day, rainfall_mm, tmean_c) VALUES
(1, '2024-06-01', 12.4, 18.5),
(1, '2024-06-02', 0.0, 19.1),
(1, '2024-06-03', 5.2, NULL),
(1, '2025-06-01', 18.0, 17.9),
(1, '2025-06-02', 2.5, 18.4),
(1, '2025-06-03', 0.0, 20.2),
(2, '2024-06-01', 8.1, 20.0),
(2, '2024-06-02', 1.2, 21.3),
(2, '2024-06-03', 0.0, 22.1),
(2, '2025-06-01', 10.5, 19.4),
(2, '2025-06-02', 3.0, NULL),
(2, '2025-06-03', 0.0, 20.1),
(3, '2024-06-01', 2.0, 23.8),
(3, '2024-06-02', 0.0, 24.5),
(3, '2024-06-03', 0.0, 25.0),
(3, '2025-06-01', 1.0, 24.1),
(3, '2025-06-02', 0.0, NULL),
(3, '2025-06-03', 0.0, 24.9);

6) SQL queries in increasing difficulty
From this point, every query has: the SQL code, a line-by-line explanation, and a small illustrative expected result.
6.1 SELECT columns and SELECT *
Query A specific columns
SELECT farm_id, farm_name, region
FROM farms;

Line by line
- `SELECT farm_id, farm_name, region`: choose exactly the columns you want.
- `FROM farms`: read rows from the `farms` table.
Expected result (illustrative)
| farm_id | farm_name | region |
|---|---|---|
| 1 | Green Valley Farm | North |
| 2 | Riverbend Farm | Central |
| 3 | Sunrise Farm | South |
Query B all columns with *
SELECT *
FROM treatments;

Line by line
- `SELECT *`: return all columns in the table.
- `FROM treatments`: source table.
Expected result (illustrative)
| treatment_id | treatment_name | fertilizer_kg_ha |
|---|---|---|
| 1 | Control | 0 |
| 2 | Nitrogen_Low | 60 |
| 3 | Nitrogen_Medium | 120 |
| 4 | Nitrogen_High | 180 |
Note that SELECT * is fine for quick exploration.
For analysis scripts, prefer explicit columns so your query is stable if table structure changes.
6.2 WHERE filters (numbers, text, dates) and AND/OR
Query C numeric filter
SELECT field_id, field_name, area_ha
FROM fields
WHERE area_ha >= 12.0;

Line by line
- `SELECT ...`: keep only the key columns for this question.
- `FROM fields`: table with field attributes.
- `WHERE area_ha >= 12.0`: keep only fields of at least 12 hectares.
Expected result (illustrative)
| field_id | field_name | area_ha |
|---|---|---|
| 1 | North-01 | 12.50 |
| 3 | Central-01 | 15.20 |
| 5 | South-01 | 14.10 |
Query D year + date + AND
SELECT
h.harvest_id,
p.field_id,
p.season_year,
h.yield_t_ha,
h.harvest_date
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
WHERE p.season_year = 2025
AND h.harvest_date >= '2025-10-10';

Line by line
- `p.season_year = 2025`: keep records from the 2025 season.
- `AND`: both conditions must be true.
- `h.harvest_date >= '2025-10-10'`: keep later harvest dates.
Expected result (illustrative)
| harvest_id | field_id | season_year | yield_t_ha | harvest_date |
|---|---|---|---|---|
| 12 | 3 | 2025 | 7.90 | 2025-10-11 |
| 16 | 4 | 2025 | 7.60 | 2025-10-10 |
| 20 | 5 | 2025 | 5.90 | 2025-10-12 |
(…more rows match this filter in the sample data.)
Query E OR condition
SELECT farm_id, farm_name, region
FROM farms
WHERE region = 'North'
OR region = 'South';

Line by line
- `region = 'North' OR region = 'South'`: either condition can match.
Expected result (illustrative)
| farm_id | farm_name | region |
|---|---|---|
| 1 | Green Valley Farm | North |
| 3 | Sunrise Farm | South |
Common mistake warning: OR can return many more rows than expected.
Always sanity-check row counts after broad filters.
PostgreSQL vs Oracle notes (WHERE)
Text comparison with = works similarly in both.
Date literals are safest as DATE 'YYYY-MM-DD' in Oracle.
PostgreSQL accepts 'YYYY-MM-DD' directly for a DATE column.
6.3 ORDER BY and LIMIT
Query F top yields
SELECT
h.harvest_id,
p.field_id,
p.season_year,
h.yield_t_ha
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
ORDER BY yield_t_ha DESC
LIMIT 5;

Line by line
- `ORDER BY yield_t_ha DESC`: sort from highest yield to lowest.
- `LIMIT 5`: keep only the first 5 rows after sorting.
Expected result (illustrative)
| harvest_id | field_id | season_year | yield_t_ha |
|---|---|---|---|
| 12 | 3 | 2025 | 7.90 |
| 10 | 3 | 2024 | 7.60 |
| 16 | 4 | 2025 | 7.60 |
| 4 | 1 | 2025 | 7.40 |
| 14 | 4 | 2024 | 7.20 |
PostgreSQL vs Oracle notes (ORDER BY/LIMIT)
- PostgreSQL: `LIMIT 5`.
- Oracle: `FETCH FIRST 5 ROWS ONLY` (after ORDER BY).
6.4 Aggregates COUNT, AVG, SUM, GROUP BY, HAVING
Figure. Query flow reminder: WHERE filters rows before grouping; HAVING filters groups after GROUP BY.
Query G average yield by treatment
SELECT
t.treatment_name,
COUNT(*) AS n_harvests,
ROUND(AVG(h.yield_t_ha), 2) AS avg_yield_t_ha,
ROUND(SUM(h.yield_t_ha), 2) AS total_yield_t_ha
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
GROUP BY t.treatment_name
ORDER BY avg_yield_t_ha DESC;

Line by line
- `COUNT(*)`: number of rows in each group.
- `AVG(...)`: average yield in each group.
- `SUM(...)`: total yield in each group.
- `JOIN trial_plots`: recover the treatment assignment for each harvested plot.
- `JOIN treatments`: bring in treatment names.
- `GROUP BY t.treatment_name`: one summary row per treatment.
- `ORDER BY avg_yield_t_ha DESC`: best average first.
Expected result (illustrative, computed from the sample inserts)
| treatment_name | n_harvests | avg_yield_t_ha | total_yield_t_ha |
|---|---|---|---|
| Nitrogen_High | 2 | 7.10 | 14.20 |
| Nitrogen_Medium | 6 | 6.83 | 41.00 |
| Nitrogen_Low | 4 | 6.58 | 26.30 |
| Control | 12 | 5.89 | 70.70 |
Do not rank treatments using only avg_yield_t_ha.
Also check n_harvests because very small sample sizes can look “best” by chance.
In this toy dataset, Nitrogen_High has high mean but only 2 rows, so uncertainty is larger.
In real programs, always inspect balance before drawing conclusions.
Query H HAVING after GROUP BY
SELECT
p.season_year,
COUNT(*) AS n_records,
ROUND(AVG(h.yield_t_ha), 2) AS avg_yield
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
GROUP BY p.season_year
HAVING AVG(h.yield_t_ha) > 6.0
ORDER BY p.season_year;

Line by line
- `GROUP BY p.season_year`: one row per season.
- `HAVING AVG(h.yield_t_ha) > 6.0`: filter groups after aggregation.
- `WHERE` filters rows before grouping; `HAVING` filters grouped results.
Expected result (illustrative, computed from the sample inserts)
| season_year | n_records | avg_yield |
|---|---|---|
| 2024 | 12 | 6.29 |
| 2025 | 12 | 6.39 |
Figure. Join row explosion: many-to-many joins multiply rows; aggregate to the matching grain before joining.
6.5 JOINs INNER JOIN and LEFT JOIN
Before the queries, two quick definitions:
- INNER JOIN keeps only rows that match in both tables.
- LEFT JOIN keeps all rows from the left table, even if there is no match on the right.
Figure. INNER JOIN keeps matched keys only. LEFT JOIN keeps all left table rows and fills unmatched right side values with NULL.
Query I INNER JOIN (matched rows only)
SELECT
h.harvest_id,
f.field_name,
fm.farm_name,
t.treatment_name,
p.season_year,
h.yield_t_ha
FROM harvests h
INNER JOIN trial_plots p
ON h.plot_id = p.plot_id
INNER JOIN fields f
ON p.field_id = f.field_id
INNER JOIN farms fm
ON f.farm_id = fm.farm_id
INNER JOIN treatments t
ON p.treatment_id = t.treatment_id
ORDER BY h.harvest_id;

Line by line
- Start from `harvests` because that is the main event table.
- Join to `trial_plots` to recover assignment metadata.
- Join to `fields` using `field_id`.
- Join to `farms` through `fields`.
- Join to `treatments` using `treatment_id`.
- Result: one rich analysis table with names instead of only IDs.
Expected result (illustrative)
| harvest_id | field_name | farm_name | treatment_name | season_year | yield_t_ha |
|---|---|---|---|---|---|
| 1 | North-01 | Green Valley Farm | Control | 2024 | 6.20 |
| 2 | North-01 | Green Valley Farm | Nitrogen_Medium | 2024 | 7.00 |
| 3 | North-01 | Green Valley Farm | Control | 2025 | 6.10 |
Query J LEFT JOIN (keep all fields even if no 2026 harvest)
SELECT
f.field_id,
f.field_name,
h.harvest_id,
p.season_year,
h.yield_t_ha
FROM fields f
LEFT JOIN trial_plots p
ON f.field_id = p.field_id
AND p.season_year = 2026
LEFT JOIN harvests h
ON p.plot_id = h.plot_id
ORDER BY f.field_id;

Line by line
- The left table is `fields`, so every field must appear.
- The right side starts with `trial_plots`, filtered to 2026 in the `ON` clause.
- Since we have no 2026 rows, the plot/harvest columns become `NULL`.
- This is useful to detect missing records.
Expected result (illustrative)
| field_id | field_name | harvest_id | season_year | yield_t_ha |
|---|---|---|---|---|
| 1 | North-01 | NULL | NULL | NULL |
| 2 | North-02 | NULL | NULL | NULL |
| 3 | Central-01 | NULL | NULL | NULL |
Common mistake warning
If you put p.season_year = 2026 in WHERE instead of ON, you can accidentally turn your LEFT JOIN into INNER JOIN behavior.
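This pitfall is easy to demonstrate on a two-row toy version of the schema; a throwaway sqlite3 session (standing in for PostgreSQL, where the ON-vs-WHERE behavior is the same) shows the row counts diverging:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fields (field_id INTEGER PRIMARY KEY, field_name TEXT);
    CREATE TABLE trial_plots (plot_id INTEGER PRIMARY KEY,
                              field_id INTEGER, season_year INTEGER);
    INSERT INTO fields VALUES (1, 'North-01'), (2, 'North-02');
    INSERT INTO trial_plots VALUES (1, 1, 2024), (2, 2, 2024);
""")

# Filter in ON: every field survives; unmatched plot columns become NULL.
on_rows = conn.execute("""
    SELECT f.field_id, p.plot_id
    FROM fields f
    LEFT JOIN trial_plots p
           ON f.field_id = p.field_id AND p.season_year = 2026
""").fetchall()

# Filter in WHERE: NULL season_year fails the comparison, so the
# unmatched fields are silently dropped (INNER JOIN behavior).
where_rows = conn.execute("""
    SELECT f.field_id, p.plot_id
    FROM fields f
    LEFT JOIN trial_plots p ON f.field_id = p.field_id
    WHERE p.season_year = 2026
""").fetchall()

print(len(on_rows), len(where_rows))  # 2 0
```

The ON version keeps both fields (with NULL plot columns, which is what you want for a missing-records check); the WHERE version returns nothing at all.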
6.6 CTE (WITH) to build a clean analysis table
A CTE (Common Table Expression) is a temporary named query. It helps you write readable SQL in steps.
Figure. CTE flow aggregate weather first, then join into an analysis ready table.
Query K build a clean table for yield analysis
WITH weather_by_farm_year AS (
SELECT
farm_id,
EXTRACT(YEAR FROM day)::INT AS season_year,
ROUND(SUM(rainfall_mm), 2) AS total_rain_mm,
ROUND(AVG(tmean_c), 2) AS avg_tmean_c
FROM weather_daily
GROUP BY farm_id, EXTRACT(YEAR FROM day)
),
analysis_table AS (
SELECT
h.harvest_id,
p.plot_id,
p.season_year,
h.harvest_date,
h.yield_t_ha,
f.field_id,
f.field_name,
f.area_ha,
f.soil_type,
fm.farm_id,
fm.farm_name,
fm.region,
t.treatment_id,
t.treatment_name,
t.fertilizer_kg_ha,
w.total_rain_mm,
w.avg_tmean_c
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN fields f
ON p.field_id = f.field_id
JOIN farms fm
ON f.farm_id = fm.farm_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
LEFT JOIN weather_by_farm_year w
ON fm.farm_id = w.farm_id
AND p.season_year = w.season_year
)
SELECT *
FROM analysis_table
ORDER BY harvest_id;

Line by line
- `weather_by_farm_year`: aggregate weather from daily to farm-year level.
- `analysis_table`: join harvest + field + farm + treatment + weather summary.
- `LEFT JOIN` with weather: keep harvest rows even if the weather summary is missing.
- Final `SELECT * FROM analysis_table`: inspect the clean combined dataset.
Expected result (illustrative)
| harvest_id | season_year | farm_name | field_name | treatment_name | yield_t_ha | total_rain_mm | avg_tmean_c |
|---|---|---|---|---|---|---|---|
| 1 | 2024 | Green Valley Farm | North-01 | Control | 6.20 | 17.60 | 18.80 |
| 2 | 2024 | Green Valley Farm | North-01 | Nitrogen_Medium | 7.00 | 17.60 | 18.80 |
| 3 | 2025 | Green Valley Farm | North-01 | Control | 6.10 | 20.50 | 18.83 |
PostgreSQL vs Oracle notes (CTE)
CTE syntax is very similar.
PostgreSQL casting often uses `::INT`; Oracle uses `CAST(... AS NUMBER)` style.
6.7 Basic statistics in PostgreSQL
We can compute basic descriptive statistics directly in SQL.
Query L mean, SD, min, max, median by treatment
SELECT
t.treatment_name,
COUNT(*) AS n,
ROUND(AVG(h.yield_t_ha), 2) AS mean_yield,
ROUND(STDDEV_SAMP(h.yield_t_ha), 3) AS sd_yield,
ROUND(MIN(h.yield_t_ha), 2) AS min_yield,
ROUND(MAX(h.yield_t_ha), 2) AS max_yield,
ROUND(
(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY h.yield_t_ha))::NUMERIC,
2
) AS median_yield
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
GROUP BY t.treatment_name
ORDER BY mean_yield DESC;

Line by line
- `AVG`: arithmetic mean.
- `STDDEV_SAMP`: sample standard deviation.
- `MIN` and `MAX`: range endpoints.
- `PERCENTILE_CONT(0.5) WITHIN GROUP`: median.
- Grouping by treatment gives one summary row per treatment.
Expected result (illustrative, computed from the sample inserts)
| treatment_name | n | mean_yield | sd_yield | min_yield | max_yield | median_yield |
|---|---|---|---|---|---|---|
| Nitrogen_High | 2 | 7.10 | 0.424 | 6.80 | 7.40 | 7.10 |
| Nitrogen_Medium | 6 | 6.83 | 0.802 | 5.90 | 7.90 | 6.70 |
| Nitrogen_Low | 4 | 6.58 | 0.974 | 5.60 | 7.60 | 6.55 |
| Control | 12 | 5.89 | 0.846 | 4.70 | 7.10 | 5.80 |
PostgreSQL vs Oracle notes (stats)
AVG, MIN, MAX, COUNT are equivalent.
STDDEV_SAMP exists in both.
PERCENTILE_CONT exists in both, but formatting/casting style differs slightly.
6.8 Handling missingness NULL, IS NULL, COALESCE
NULL means missing/unknown. It is not the same as zero.
Figure. IS NULL is correct for missing checks; = NULL is not; COALESCE is useful for fallback values.
Query M find missing temperature rows
SELECT weather_id, farm_id, day, rainfall_mm, tmean_c
FROM weather_daily
WHERE tmean_c IS NULL
ORDER BY weather_id;

Line by line
- `IS NULL` is the correct way to test for missing values.
- `tmean_c = NULL` is wrong and will not work as expected.
Expected result (illustrative)
| weather_id | farm_id | day | rainfall_mm | tmean_c |
|---|---|---|---|---|
| 3 | 1 | 2024-06-03 | 5.20 | NULL |
| 11 | 2 | 2025-06-02 | 3.00 | NULL |
| 17 | 3 | 2025-06-02 | 0.00 | NULL |
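The `= NULL` mistake fails silently, which makes it hard to notice; a tiny sqlite3 session (standing in for PostgreSQL, since NULL comparison rules are standard SQL) makes the difference visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather_daily (weather_id INTEGER, tmean_c REAL)")
conn.executemany(
    "INSERT INTO weather_daily VALUES (?, ?)",
    [(1, 18.5), (2, None), (3, 19.1)],  # Python None becomes SQL NULL
)

# `tmean_c = NULL` evaluates to NULL (never true) for every row: zero matches.
eq_count = conn.execute(
    "SELECT COUNT(*) FROM weather_daily WHERE tmean_c = NULL").fetchone()[0]

# IS NULL is the correct missingness test: one match.
is_count = conn.execute(
    "SELECT COUNT(*) FROM weather_daily WHERE tmean_c IS NULL").fetchone()[0]

print(eq_count, is_count)  # 0 1
```

No error is raised in either case, so `= NULL` can quietly report "no missing data" when data is in fact missing.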
Query N replace NULL with fallback value
SELECT
weather_id,
farm_id,
day,
rainfall_mm,
COALESCE(tmean_c, 20.0) AS tmean_filled_c
FROM weather_daily
ORDER BY weather_id;

Line by line
- `COALESCE(tmean_c, 20.0)`: if `tmean_c` is NULL, use `20.0`.
- Very useful for simple reporting, but be careful in scientific analysis.
Expected result (illustrative)
| weather_id | tmean_filled_c |
|---|---|
| 1 | 18.5 |
| 2 | 19.1 |
| 3 | 20.0 |
Warning: replacing NULL can hide data quality issues. For inference models, decide imputation strategy carefully.
PostgreSQL vs Oracle notes (NULL)
COALESCE works in both.
Oracle also has NVL, which is Oracle specific.
6.9 Quality checks for duplicates, impossible values, outliers
Query O duplicate checks in trial_plots
SELECT
field_id,
treatment_id,
season_year,
replicate_no,
COUNT(*) AS n_rows
FROM trial_plots
GROUP BY field_id, treatment_id, season_year, replicate_no
HAVING COUNT(*) > 1;

Line by line
- Group by the key combination that should be unique.
- `HAVING COUNT(*) > 1` returns only the duplicated combinations.
Expected result (illustrative)
| field_id | treatment_id | season_year | replicate_no | n_rows |
|---|---|---|---|---|
| (no rows) |
Query P impossible values
SELECT
h.harvest_id,
p.field_id,
p.treatment_id,
p.season_year,
h.yield_t_ha,
h.harvest_date
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
WHERE h.yield_t_ha < 0
OR h.yield_t_ha > 20;

Line by line
- Yield lower than 0 is impossible.
- Yield above 20 t/ha may be impossible for this crop context.
- These thresholds depend on your agronomic reality.
Expected result (illustrative)
| harvest_id | field_id | treatment_id | season_year | yield_t_ha | harvest_date |
|---|---|---|---|---|---|
| (no rows) |
Query Q outlier flags using IQR rule
WITH q AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY yield_t_ha) AS q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY yield_t_ha) AS q3
FROM harvests
),
bounds AS (
SELECT
q1,
q3,
(q3 - q1) AS iqr,
(q1 - 1.5 * (q3 - q1)) AS lower_bound,
(q3 + 1.5 * (q3 - q1)) AS upper_bound
FROM q
)
SELECT
h.harvest_id,
p.field_id,
p.season_year,
h.yield_t_ha,
b.lower_bound,
b.upper_bound
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
CROSS JOIN bounds b
WHERE h.yield_t_ha < b.lower_bound
OR h.yield_t_ha > b.upper_bound
ORDER BY h.yield_t_ha;

Line by line
- The first CTE computes the quartiles (`q1`, `q3`).
- The second CTE computes the IQR and the outlier bounds.
- The final query flags rows outside the bounds.
Expected result (illustrative)
| harvest_id | field_id | season_year | yield_t_ha | lower_bound | upper_bound |
|---|---|---|---|---|---|
| (maybe no rows in this tiny dataset) |
Important interpretation note
An outlier flag does not mean “wrong”. It means “investigate”.
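If you want to sanity-check the IQR bounds outside the database, the same rule takes only a few lines of Python. This sketch uses made-up yields and `statistics.quantiles` with `method="inclusive"`, which applies the same linear interpolation as PERCENTILE_CONT:

```python
import statistics

# Made-up yields, one suspiciously high value
yields = [5.2, 5.8, 6.0, 6.3, 6.9, 7.1, 12.0]

# Quartiles with linear interpolation, matching SQL's PERCENTILE_CONT
q1, _, q3 = statistics.quantiles(yields, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = [y for y in yields if y < lower or y > upper]
print(flagged)  # [12.0]
```

Here q1 = 5.9, q3 = 7.0, so the bounds are roughly 4.25 and 8.65, and only 12.0 is flagged for investigation.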
6.10 Mini analysis question
Question
Which fertilizer treatment improved yield the most, while controlling for farm/field as simply as possible? A simple strategy:
- For each `field_id` and `season_year`, find the control yield (`fertilizer_kg_ha = 0`).
- Compare non-control treatments against that same field-season control.
- Finally, average the gains by treatment and region.
Why this question is interesting in real field trials
- Raw treatment means mix the treatment effect with site/season differences.
- Field-season control subtraction is a simple way to reduce that bias.
- It still does not fully solve confounding, but it is already a stronger comparison than global raw means.
This is not a full causal model, but it is much better than comparing raw means alone. As a data analyst, keep in mind that although you can do useful summaries and basic statistics in SQL, statistics is not SQL’s main objective. SQL is mainly designed to:
- store and organize structured data safely,
- filter and join data efficiently,
- produce clean analysis tables.
For deeper statistical work (models, diagnostics, visualization, resampling, mixed models), tools like R are usually better because they are designed for statistical inference and scientific workflows.
A practical workflow I would use:
- Use SQL to extract and clean the analysis-ready table.
- Pull that table into R.
- Do advanced statistics and plots in R.
- Optionally write summarized outputs back to SQL tables.
When exporting SQL data for downstream analysis in R, do not use ROUND() in SQL.
Keep full numeric precision so model estimates, standard errors, and later summaries are not distorted by early rounding.
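A tiny Python sketch (made-up yields) shows why: rounding each value before averaging shifts the summary, while rounding only the final result does not:

```python
yields_t_ha = [6.24, 6.24, 6.24, 6.28]  # hypothetical plot yields

# Rounding early, as ROUND(yield_t_ha, 1) would do before export
early = round(sum(round(y, 1) for y in yields_t_ha) / len(yields_t_ha), 3)

# Keeping full precision and rounding only the final summary
late = round(sum(yields_t_ha) / len(yields_t_ha), 3)

print(early)  # 6.225  (biased by early rounding)
print(late)   # 6.25   (the true mean)
```

With one decimal of early rounding the error here is small, but in models it propagates into estimates and standard errors, which is why the export query should keep full precision.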
Example (R + PostgreSQL)
# install.packages(c("DBI", "RPostgres", "dplyr"))
library(DBI)
library(RPostgres)
con <- dbConnect(
RPostgres::Postgres(),
dbname = "farm_trials_db",
host = "localhost",
port = 5432,
user = "your_user",
password = "your_password"
)
sql_query <- "
WITH weather_by_farm_year AS (
SELECT
farm_id,
EXTRACT(YEAR FROM day)::INT AS season_year,
SUM(rainfall_mm) AS total_rain_mm,
AVG(tmean_c) AS avg_tmean_c
FROM weather_daily
GROUP BY farm_id, EXTRACT(YEAR FROM day)
)
SELECT
h.harvest_id,
p.season_year,
h.yield_t_ha,
f.field_name,
fm.farm_name,
fm.region,
t.treatment_name,
t.fertilizer_kg_ha,
w.total_rain_mm,
w.avg_tmean_c
FROM harvests h
JOIN trial_plots p ON h.plot_id = p.plot_id
JOIN fields f ON p.field_id = f.field_id
JOIN farms fm ON f.farm_id = fm.farm_id
JOIN treatments t ON p.treatment_id = t.treatment_id
LEFT JOIN weather_by_farm_year w
ON fm.farm_id = w.farm_id
AND p.season_year = w.season_year
"
analysis_df <- dbGetQuery(con, sql_query)
# Then continue in R:
# - descriptive stats
# - plots
# - models (for example: lm(), lme4::lmer(), etc.)
dbDisconnect(con)

This split keeps your workflow robust: SQL for data engineering, R for statistical depth.
Query R field season control adjusted gains
WITH baseline AS (
SELECT
p.field_id,
p.season_year,
AVG(h.yield_t_ha) AS control_yield_t_ha
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
WHERE t.fertilizer_kg_ha = 0
GROUP BY p.field_id, p.season_year
),
comparisons AS (
SELECT
fm.region,
fm.farm_name,
f.field_name,
p.field_id,
p.season_year,
t.treatment_name,
t.fertilizer_kg_ha,
h.yield_t_ha,
b.control_yield_t_ha,
(h.yield_t_ha - b.control_yield_t_ha) AS gain_vs_control_t_ha
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
JOIN baseline b
ON p.field_id = b.field_id
AND p.season_year = b.season_year
JOIN fields f
ON p.field_id = f.field_id
JOIN farms fm
ON f.farm_id = fm.farm_id
WHERE t.fertilizer_kg_ha > 0
)
SELECT
region,
treatment_name,
COUNT(*) AS n_field_seasons,
AVG(gain_vs_control_t_ha) AS avg_gain_t_ha,
MIN(gain_vs_control_t_ha) AS min_gain_t_ha,
MAX(gain_vs_control_t_ha) AS max_gain_t_ha
FROM comparisons
GROUP BY region, treatment_name
ORDER BY avg_gain_t_ha DESC;

Line by line
- `baseline`: one control value per field-season.
- `comparisons`: compute gain = treatment yield − control yield within the same field-season.
- Final aggregation by `region` and `treatment_name`.
- We report the average and range of gains.
Expected result (illustrative, computed from the sample inserts)
| region | treatment_name | n_field_seasons | avg_gain_t_ha | min_gain_t_ha | max_gain_t_ha |
|---|---|---|---|---|---|
| North | Nitrogen_High | 2 | 1.35 | 1.30 | 1.40 |
| South | Nitrogen_Medium | 2 | 1.15 | 1.10 | 1.20 |
| Central | Nitrogen_Medium | 2 | 0.95 | 0.90 | 1.00 |
| North | Nitrogen_Medium | 2 | 0.85 | 0.80 | 0.90 |
| South | Nitrogen_Low | 2 | 0.65 | 0.60 | 0.70 |
| Central | Nitrogen_Low | 2 | 0.45 | 0.40 | 0.50 |
The interpretation is that higher fertilizer rates show larger gains, especially in the North.
This is still a simple within-field comparison inside a toy dataset.
If your trial is randomized, difference vs control is a reasonable estimator, but missingness and design imbalance can still bias summaries.
Note that confounding means other factors may influence both treatment assignment and yield (for example soil class, rainfall timing, pest pressure, planting date). Even with field season control subtraction, we are not proving causality.
6.11 A practical interpretation check
Before deciding that “Treatment X is better”, check balance across regions and years. If one treatment appears mostly in stronger environments, it can look better even when its true effect is smaller.
Query S treatment representation by region and season
SELECT
fm.region,
p.season_year,
t.treatment_name,
COUNT(*) AS n_plots
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN fields f
ON p.field_id = f.field_id
JOIN farms fm
ON f.farm_id = fm.farm_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
GROUP BY fm.region, p.season_year, t.treatment_name
ORDER BY fm.region, p.season_year, t.treatment_name;

Line by line
- We count how many harvested plots appear for each treatment in each region season.
- This is not a yield summary; it is a design summary.
- If counts are highly unbalanced, interpret treatment means with caution.
Expected result (illustrative)
| region | season_year | treatment_name | n_plots |
|---|---|---|---|
| Central | 2024 | Control | 2 |
| Central | 2024 | Nitrogen_Low | 1 |
| Central | 2024 | Nitrogen_Medium | 1 |
| North | 2025 | Control | 2 |
| North | 2025 | Nitrogen_High | 2 |
Query T area weighted average yield by treatment
SELECT
t.treatment_name,
ROUND(
SUM(h.yield_t_ha * f.area_ha) / NULLIF(SUM(f.area_ha), 0),
2
) AS area_weighted_mean_yield_t_ha
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
JOIN fields f
ON p.field_id = f.field_id
JOIN treatments t
ON p.treatment_id = t.treatment_id
GROUP BY t.treatment_name
ORDER BY area_weighted_mean_yield_t_ha DESC;

Line by line
- Multiply each yield by field area (`yield * area`).
- Divide by the total area for that treatment.
- This gives an area-weighted mean, which can be more realistic for farm-level decisions.
Unweighted means treat each field equally. Weighted means give more influence to larger fields. Neither is always “right”; the right choice depends on your decision question. If you are making logistics or tonnage decisions, weighted summaries are often more relevant.
Trial note: in real field experiments, weighting should usually use plot area (or equal weights per plot/replicate), not the whole field area.
Here we use fields.area_ha only as a simple demo of weighted means.
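The arithmetic behind the weighted mean is easy to see with plain numbers. A minimal Python sketch with made-up areas and yields:

```python
# (area_ha, yield_t_ha) for three hypothetical fields under one treatment
fields = [(2.0, 5.0), (2.0, 6.0), (8.0, 8.0)]

# Unweighted mean: every field counts equally
unweighted = sum(y for _, y in fields) / len(fields)

# Area-weighted mean: larger fields pull the summary toward their yield
weighted = sum(a * y for a, y in fields) / sum(a for a, _ in fields)

print(round(unweighted, 2))  # 6.33
print(round(weighted, 2))    # 7.17  (the 8 ha field dominates)
```

The gap between 6.33 and 7.17 t/ha is exactly the kind of difference that matters when the question is total tonnage rather than per-field performance.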
6.12 Data manipulation essentials
So far we used SQL mostly to read and summarize data. In real projects, you will also need to correct, append, and safely maintain tables.
These commands are the most useful next step after SELECT, JOIN, and GROUP BY.
Query U UPDATE to correct values
UPDATE harvests
SET yield_t_ha = 6.95
WHERE harvest_id = 11;

Line by line
- `UPDATE harvests`: the target table to modify.
- `SET yield_t_ha = 6.95`: the new value.
- `WHERE harvest_id = 11`: only this row.

Important warning: never run UPDATE without WHERE unless you truly want to modify all rows.
Query V DELETE to remove wrong rows
DELETE FROM harvests
WHERE harvest_id = 999;

Line by line
- `DELETE FROM`: remove rows from a table.
- `WHERE ...`: define exactly which rows to remove.

Another important warning: always run the matching SELECT first to verify which rows will be deleted.
Query W INSERT ... SELECT to load transformed data
Assume weather_daily_staging is a temporary/staging table you created earlier (for example from a CSV import).
INSERT INTO weather_daily (farm_id, day, rainfall_mm, tmean_c)
SELECT
farm_id,
day,
rainfall_mm,
tmean_c
FROM weather_daily_staging;

Line by line
- Target columns are listed explicitly.
- Source rows come from another table or query.
- This is very common in ETL pipelines.
Query X upsert in PostgreSQL (INSERT ... ON CONFLICT)
INSERT INTO treatments (treatment_id, treatment_name, fertilizer_kg_ha)
OVERRIDING SYSTEM VALUE
VALUES (4, 'Nitrogen_High', 180)
ON CONFLICT (treatment_id)
DO UPDATE SET
treatment_name = EXCLUDED.treatment_name,
fertilizer_kg_ha = EXCLUDED.fertilizer_kg_ha;

Line by line
- Try to insert a row.
- If the key (`treatment_id`) already exists, update instead.
- Useful when syncing reference tables.
Note: OVERRIDING SYSTEM VALUE is needed here because treatment_id is an identity column defined as GENERATED ALWAYS.
Note that PostgreSQL commonly uses ON CONFLICT.
Oracle workflows often use MERGE for similar behavior.
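If you want to experiment with the upsert pattern without a PostgreSQL server, SQLite (bundled with Python) supports the same ON CONFLICT ... DO UPDATE syntax since SQLite 3.24. A minimal sketch using a simplified stand-in for the treatments table (no identity column, so no OVERRIDING SYSTEM VALUE is needed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE treatments (
        treatment_id INTEGER PRIMARY KEY,
        treatment_name TEXT,
        fertilizer_kg_ha REAL
    )
""")
con.execute("INSERT INTO treatments VALUES (4, 'Nitrogen_High', 160)")

# Upsert: key 4 already exists, so the row is updated instead of inserted
con.execute("""
    INSERT INTO treatments (treatment_id, treatment_name, fertilizer_kg_ha)
    VALUES (4, 'Nitrogen_High', 180)
    ON CONFLICT (treatment_id)
    DO UPDATE SET
        treatment_name = excluded.treatment_name,
        fertilizer_kg_ha = excluded.fertilizer_kg_ha
""")

row = con.execute("SELECT * FROM treatments WHERE treatment_id = 4").fetchone()
print(row)  # (4, 'Nitrogen_High', 180.0)
con.close()
```

Running the same statement twice is now safe: the first run inserts, every later run updates, which is exactly what you want when syncing a reference table.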
Query Y CASE WHEN for recoding inside queries
SELECT
field_id,
field_name,
area_ha,
CASE
WHEN area_ha >= 12 THEN 'large_field'
WHEN area_ha >= 10 THEN 'medium_field'
ELSE 'small_field'
END AS field_size_class
FROM fields;

Line by line
- `CASE` creates a derived category.
- Conditions are evaluated from top to bottom.
- Very useful for reporting and rule-based labels.
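The top-to-bottom evaluation is worth verifying for yourself. This sketch runs the same CASE expression in SQLite (bundled with Python) on made-up field areas; a 13 ha field satisfies both conditions, and the first matching branch wins:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fields (field_id INTEGER, area_ha REAL)")
con.executemany("INSERT INTO fields VALUES (?, ?)",
                [(1, 13.0), (2, 11.0), (3, 8.0)])

# 13 ha satisfies both area_ha >= 12 and area_ha >= 10, but only the
# FIRST matching branch fires, so it is labeled 'large_field'
rows = con.execute("""
    SELECT field_id,
           CASE
               WHEN area_ha >= 12 THEN 'large_field'
               WHEN area_ha >= 10 THEN 'medium_field'
               ELSE 'small_field'
           END AS field_size_class
    FROM fields
    ORDER BY field_id
""").fetchall()
print(rows)  # [(1, 'large_field'), (2, 'medium_field'), (3, 'small_field')]
con.close()
```

If you swapped the two WHEN branches, every field of 10 ha or more would be labeled 'medium_field', so order the most specific condition first.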
Query Z NOT EXISTS for missing link checks
SELECT
f.field_id,
f.field_name
FROM fields f
WHERE NOT EXISTS (
SELECT 1
FROM trial_plots p
JOIN harvests h
ON h.plot_id = p.plot_id
WHERE p.field_id = f.field_id
);

Line by line
- Start from all fields.
- Keep only fields with no matching harvest rows.
- This is one of the best QA patterns for relational data.
Safe workflow: transactions for batch edits
Figure. Safe edit timeline: start a transaction, make changes, check the result, then COMMIT or ROLLBACK.
BEGIN;
UPDATE harvests
SET yield_t_ha = yield_t_ha + 0.10
WHERE plot_id IN (
SELECT plot_id
FROM trial_plots
WHERE season_year = 2025
AND field_id IN (1, 2)
);
-- Check your result before finalizing:
SELECT
h.*
FROM harvests h
JOIN trial_plots p
ON h.plot_id = p.plot_id
WHERE p.season_year = 2025
AND p.field_id IN (1, 2);
-- If correct:
COMMIT;
-- If not correct:
-- ROLLBACK;

For any non-trivial UPDATE or DELETE, use a transaction. It gives you a safety net both while learning and in production.
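The same safety net can be rehearsed in SQLite (bundled with Python). With the sqlite3 module's default settings, the UPDATE implicitly opens a transaction, so `rollback()` plays the role of ROLLBACK in the pattern above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE harvests (harvest_id INTEGER PRIMARY KEY, yield_t_ha REAL)")
con.execute("INSERT INTO harvests VALUES (1, 6.0)")
con.commit()

# The UPDATE implicitly starts a transaction (sqlite3 default behavior)
con.execute("UPDATE harvests SET yield_t_ha = yield_t_ha + 0.10")
changed = con.execute("SELECT yield_t_ha FROM harvests").fetchone()[0]

# The check revealed a problem, so undo the whole batch
con.rollback()
after = con.execute("SELECT yield_t_ha FROM harvests").fetchone()[0]

print(round(changed, 2), after)  # 6.1 6.0
con.close()
```

The inspection between the UPDATE and the rollback is the key habit: you see the changed value inside the transaction, decide it is wrong, and the original data comes back untouched.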
7) Quick recap of common mistakes
Before you finish, keep these five checks in mind:
- Always confirm your `WHERE` clause before running a query.
- Use `IS NULL` for missing values, not `= NULL`.
- In a `LEFT JOIN`, avoid filtering right-table columns in `WHERE` unless that is exactly your intent.
- Keep row filters in `WHERE` and group filters in `HAVING`.
- Run quality checks before interpreting results.
If you are new to SQL and reached this point, you have built a strong base. You can now design a small relational schema, load data, join tables safely, summarize outputs, handle missingness, and ask agronomic questions with reproducible logic.