Current maintainer:     Bruce Momjian (pgman@candle.pha.pa.us)
Last updated:           Wed Aug 24 15:43:15 EDT 2005
The most recent version of this document can be viewed at
http://www.postgresql.org/docs/faqs.TODO.html.
A hyphen, "-", marks changes that will appear in the upcoming 8.1 release.
A percent sign, "%", marks items that are easier to implement.
Bracketed items, "[]", have more detail.
This list contains all known PostgreSQL bugs and feature requests. If
you would like to work on an item, please read the Developer's FAQ
first.
This would allow an application inheriting a pooled connection to know the queries prepared in the current session.
Currently SIGTERM of a backend can lead to lock table corruption.
Currently all schemas are owned by the super-user because they are copied from the template1 database.
You can use any of the master/slave replication servers to use a standby server for data warehousing. To allow read/write queries to multiple servers, you need multi-master replication like pgcluster.
Currently, if a variable is commented out, it keeps the previous uncommented value until a server restarted. Logically, a reload should set the same values as a server restart.
This would add a function to load the SQL table from pg_hba.conf, and one to writes its contents to the flat file. The table should have a line number that is a float so rows can be inserted between existing rows, e.g. row 2.5 goes between row 2 and row 3.
All objects in the default database tablespace must have default tablespace specifications. This is because new databases are created by copying directories. If you mix default tablespace tables and tablespace-specified tables in the same directory, creating a new database from such a mixed directory would create a new database with tables that had incorrect explicit tablespaces. To fix this would require modifying pg_class in the newly copied database, which we don't currently do.
This item is difficult because a tablespace can contain objects from multiple databases. There is a server-side function that returns the databases which use a specific tablespace, so this requires a tool that will call that function and connect to each database to find the objects in each database for that tablespace.
It could start with a random tablespace from a supplied list and cycle through the list.
Currently only full WAL files are archived. This means that the most recent transactions aren't available for recovery in case of a disk failure. This could be triggered by a user command or a timer.
Doing this will allow administrators to know more easily when the archive contins all the files needed for point-in-time recovery.
This is useful for checking PITR recovery.
This would allow server log information to be easily loaded into a database for analysis.
Currently NUMERIC _rounds_ the result to the specified precision. This means division can return a result that multiplied by the divisor is greater than the dividend, e.g. this returns a value > 10:
SELECT (10::numeric(2,0) / 6::numeric(2,0))::numeric(2,0) * 6;
  The positive modulus result returned by NUMERICs might be considered
  inaccurate, in one sense.
If the TIMESTAMP value is stored with a time zone name, interval computations should adjust based on the time zone rules.
'1:30' MINUTE TO SECOND as '1 minute 30 seconds', and interpret '1:30' as '1 hour, 30 minutes'
Currently large objects entries do not have owners. Permissions can only be set at the pg_largeobject table level.
This requires the TOAST column to be stored EXTERNAL.
Current CURRENT_TIMESTAMP returns the start time of the current transaction, and gettimeofday() returns the wallclock time. This will make time reporting more consistent and will allow reporting of the statement start time.
Some special format flag would be required to request such accumulation. Such functionality could also be added to EXTRACT. Prevent accumulation that crosses the month/day boundary because of the uneven number of days in a month.
For example, to_char('1 month', 'mon') is meaningless. Basically, most date-related parameters to to_char() are meaningless for intervals because interval is not anchored to a date.
Currently locale can only be set during initdb. No global tables have locale-aware columns. However, the database template used during database creation might have locale-aware indexes. The indexes would need to be reindexed to match the new locale.
Right now only one encoding is allowed per database.
We can only auto-create rules for simple views. For more complex cases users will still have to write rules.
UPDATE already allows this (UPDATE...FROM) but we need similar functionality in DELETE. It's been agreed that the keyword should be USING, to avoid anything as confusing as DELETE FROM a FROM b.
Currently only the owner can TRUNCATE a table because triggers are not called, and the table is locked in exclusive mode.
Currently, queries prepared via the libpq API are planned on first execute using the supplied parameters --- allow SQL PREPARE to do the same. Also, allow control over replanning prepared queries either manually or automatically when statistics for execute parameters differ dramatically from those used during planning.
Currently LISTEN/NOTIFY information is stored in pg_listener. Storing such information in memory would improve performance.
This would allow an informational message to be added to the notify message, perhaps indicating the row modified or other custom information.
When enabled, this would allow errors in multi-statement transactions to be automatically ignored.
This would include resetting of all variables (RESET ALL), dropping of temporary tables, removing any NOTIFYs, cursors, open transactions, prepared queries, currval()s, etc. This could be used for connection pooling. We could also change RESET ALL to have this functionality. The difficult of this features is allowing RESET ALL to not affect changes made by the interface driver for its internal use. One idea is for this to be a protocol-only feature. Another approach is to notify the protocol when a RESET CONNECTION command is used.
This is not SQL-spec but many DBMSs allow it.
Currently non-global system tables must be in the default database tablespace. Global system tables can never be moved.
This might require some background daemon to maintain clustering during periods of low usage. It might also require tables to be only paritally filled for easier reorganization. Another idea would be to create a merged heap/index data file so an index lookup would automatically access the heap data too. A third idea would be to store heap rows in hashed groups, perhaps using a user-supplied hash function.
To do this, determine the ideal cluster index for each system table and set the cluster setting during initdb.
This requires the use of a savepoint before each COPY line is processed, with ROLLBACK on COPY failure.
The proposed syntax is:
GRANT SELECT ON ALL TABLES IN public TO phpuser; GRANT SELECT ON NEW TABLES IN public TO phpuser;
This requires using the row ctid to map cursor rows back to the original heap row. This become more complicated if WITH HOLD cursors are to be supported because WITH HOLD cursors have a copy of the row and no FOR UPDATE lock.
Because WITH HOLD cursors exist outside transactions, this allows them to be listed so they can be closed.
This is useful for returning the auto-generated key for an INSERT. One complication is how to handle rules that run as part of the insert.
This is basically the same as SET search_path.
Currently only constants are supported.
This requires the cached PL/PgSQL byte code to be invalidated when an object referenced in the function is changed.
This would be used for checking if the server is up.
pg_ctl can not read the pid file because it isn't located in the config directory but in the PGDATA directory. The solution is to allow pg_ctl to read and understand postgresql.conf to find the data_directory value.
This would allow non-psql clients to pull the same information out of the database as psql.
This is probably best done by combining pg_dump and pg_dumpall into a single binary.
Document differences between ecpg and the SQL standard and information about the Informix-compatibility module.
Right now all deferred trigger information is stored in backend memory. This could exhaust memory for very large trigger queues. This item involves dumping large queues into files.
This is currently possible by starting a multi-statement transaction, modifying the system tables, performing the desired SQL, restoring the system tables, and committing the transaction. ALTER TABLE ... TRIGGER requires a table lock so it is not ideal for this usage.
If the dump is known to be valid, allow foreign keys to be added without revalidating the data.
This was used in older releases to dump referential integrity constraints.
System tables are modified in many places in the backend without going through the executor and therefore not causing triggers to fire. To complete this item, the functions that modify system tables will have to fire triggers.
This could allow SQL written for other databases to run without modification.
This can be done using dblink and two-phase commit.
The main difficulty with this item is the problem of creating an index that can span more than one table.
MIN/MAX queries can already be rewritten as SELECT col FROM tab ORDER BY col {DESC} LIMIT 1. Completing this item involves doing this transformation automatically.
For an index on col1,col2,col3, and a WHERE clause of col1 = 5 and col3 = 9, spin though the index checking for col1 and col3 matches, rather than just col1; also called skip-scanning.
Uniqueness (index) checks are done when updating a column even if the column is not modified by the UPDATE.
Rather than randomly accessing heap pages based on index entries, mark heap pages needing access in a bitmap and do the lookups in sequential order. Another method would be to sort heap ctids matching the index before accessing the heap rows.
This feature allows separate indexes to be ANDed or ORed together. This is particularly useful for data warehousing applications that need to query the database in an many permutations. This feature scans an index and creates an in-memory bitmap, and allows that bitmap to be combined with other bitmap created in a similar way. The bitmap can either index all TIDs, or be lossy, meaning it records just page numbers and each page tuple has to be checked for validity in a separate pass.
Such indexes could be more compact if there are only a few distinct values. Such indexes can also be compressed. Keeping such indexes updated can be costly.
One solution is to create a partial index on an IS NULL expression.
Currently no only one hash bucket can be stored on a page. Ideally several hash buckets could be stored on a single page and greater granularity used for the hash algorithm.
Posix_fadvise() can control both sequential/random file caching and free-behind behavior, but it is unclear how the setting affects other backends that also have the file open, and the feature is not supported on all operating systems.
O_DIRECT doesn't have the same media write guarantees as fsync, so it is in addition to the fsync method, not in place of it.
We could use a fixed row count and a +/- count to follow MVCC visibility rules, or a single cached value could be used and invalidated if anyone modifies the table. Another idea is to get a count directly from a unique index, but for this to be faster than a sequential scan it must avoid access to the heap to obtain tuple visibility information.
Currently indexes do not have enough tuple visibility information to allow data to be pulled from the index without also accessing the heap. One way to allow this is to set a bit to index tuples to indicate if a tuple is currently visible to all transactions when the first valid heap lookup happens. This bit would have to be cleared when a heap tuple is expired.
Larger local buffer cache sizes requires more efficient handling of local cache lookups.
Allow the background writer to more efficiently write dirty buffers from the end of the LRU cache and use a clock sweep algorithm to write other dirty buffers to reduced checkpoint I/O
One possible implementation is to start sequential scans from the lowest numbered buffer in the shared cache, and when reaching the end wrap around to the beginning, rather than always starting sequential scans at the start of the table.
For large table adjustements during vacuum, it is faster to reindex rather than update the index.
Moved tuples are invisible to other backends so they don't require a write lock. However, the read lock promotion to write lock could lead to deadlock situations.
This allows vacuum to target specific pages for possible free space without requiring a sequential scan.
Instead of sequentially scanning the entire table, have the background writer or some other process record pages that have expired rows, then VACUUM can look at just those pages rather than the entire table. In the event of a system crash, the bitmap would probably be invalidated.
This requires that more locks be acquired but this would reduce lock contention, improving concurrency.
On SMP machines, it is possible that locks might be released shortly, while on non-SMP machines, the backend should sleep so the process holding the lock can complete and release it.
i386-based SMP machines can generate excessive context switching caused by lock failure in high concurrency situations. This may be caused by CPU cache line invalidation inefficiencies.
This would prevent the overhead associated with process creation. Most operating systems have trivial process creation time compared to database startup overhead, but a few operating systems (WIn32, Solaris) might benefit from threading. Also explore the idea of a single session using multiple threads to execute a query faster.
It is unclear if this should be done inside the backend code or done by something external like pgpool. The passing of file descriptors to existing backends is one of the difficulties with a backend approach.
Currently, to protect against partial disk page writes, we write full page images to WAL before they are modified so we can correct any partial page writes during recovery. These pages can also be eliminated from point-in-time archive files.
If CRC check fails during recovery, remember the page in case a later CRC for that page properly matches.
This allows most full page writes to happen in the background writer. It might cause problems for applying WAL on recovery into a partially-written page, but later the full page will be replaced from WAL.
Currently fsync of WAL requires the disk platter to perform a full rotation to fsync again. One idea is to write the WAL to different offsets that might reduce the rotational delay.
Instead of guaranteeing recovery of all committed transactions, this would provide improved performance by delaying WAL writes and fsync so an abrupt operating system restart might lose a few seconds of committed transactions but still be consistent. We could perhaps remove the 'fsync' parameter (which results in an an inconsistent database) in favor of this capability.
Right now, if no index exists, ORDER BY ... LIMIT # requires we sort all values to return the high/low value. Instead The idea is to do a sequential scan to find the high/low value, thus avoiding the sort. MIN/MAX already does this, but not for LIMIT > 1.
CHECK constraints contain information about the distribution of values within the table. This is also useful for implementing subtables where a tables content is distributed across several subtables.
This would be beneficial when there are few distinct values.
Async I/O allows multiple I/O requests to be sent to the disk with results coming back asynchronously.
This would remove the requirement for SYSV SHM but would introduce portability issues. Anonymous mmap (or mmap to /dev/zero) is required to prevent I/O overhead.
Doing I/O to large tables would consume a lot of address space or require frequent mapping/unmapping. Extending the file also causes mapping problems that might require mapping only individual pages, leading to thousands of mappings. Another problem is that there is no way to _prevent_ I/O to disk from the dirty shared buffers so changes could hit disk before WAL is written.
This is probably not possible because 'gmake' and other compiler tools do not fully support quoting of paths with spaces.
This is possible if proper quoting is added to the makefiles for the install targets. Because PostgreSQL supports relocatable installs, it is already possible to install into a directory that doesn't contain spaces and then copy the install to a directory with spaces.
To fix this, the data needs to be converted to/from UTF16/UTF8 so the Win32 wcscoll() can be used, and perhaps other functions like towupper(). However, UTF8 already works with normal locales but provides no ordering or character set classes.