// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Implementation of the Transmission Control Protocol(TCP).
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Mark Evans, <evansmp@uhura.aston.ac.uk>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche, <flla@stud.uni-sb.de>
 *		Charles Hedrick, <hedrick@klinzhai.rutgers.edu>
 *		Linus Torvalds, <torvalds@cs.helsinki.fi>
 *		Alan Cox, <gw4pts@gw4pts.ampr.org>
 *		Matthew Dillon, <dillon@apollo.west.oic.com>
 *		Arnt Gulbrandsen, <agulbra@nvg.unit.no>
 *		Jorge Cwik, <jorge@laser.satlink.net>
 *
 * Fixes:
 *		Alan Cox	:	Numerous verify_area() calls
 *		Alan Cox	:	Set the ACK bit on a reset
 *		Alan Cox	:	Stopped it crashing if it closed while
 *					sk->inuse=1 and was trying to connect
 *					(tcp_err()).
 *		Alan Cox	:	All icmp error handling was broken
 *					pointers passed where wrong and the
 *					socket was looked up backwards. Nobody
 *					tested any icmp error code obviously.
 *		Alan Cox	:	tcp_err() now handled properly. It
 *					wakes people on errors. poll
 *					behaves and the icmp error race
 *					has gone by moving it into sock.c
 *		Alan Cox	:	tcp_send_reset() fixed to work for
 *					everything not just packets for
 *					unknown sockets.
 *		Alan Cox	:	tcp option processing.
 *		Alan Cox	:	Reset tweaked (still not 100%) [Had
 *					syn rule wrong]
 *		Herp Rosmanith	:	More reset fixes
 *		Alan Cox	:	No longer acks invalid rst frames.
 *					Acking any kind of RST is right out.
 *		Alan Cox	:	Sets an ignore me flag on an rst
 *					receive otherwise odd bits of prattle
 *					escape still
 *		Alan Cox	:	Fixed another acking RST frame bug.
 *					Should stop LAN workplace lockups.
 *		Alan Cox	:	Some tidyups using the new skb list
 *					facilities
 *		Alan Cox	:	sk->keepopen now seems to work
 *		Alan Cox	:	Pulls options out correctly on accepts
 *		Alan Cox	:	Fixed assorted sk->rqueue->next errors
 *		Alan Cox	:	PSH doesn't end a TCP read. Switched a
 *					bit to skb ops.
 *		Alan Cox	:	Tidied tcp_data to avoid a potential
 *					nasty.
 *		Alan Cox	:	Added some better commenting, as the
 *					tcp is hard to follow
 *		Alan Cox	:	Removed incorrect check for 20 * psh
 *	Michael O'Reilly	:	ack < copied bug fix.
 *	Johannes Stille		:	Misc tcp fixes (not all in yet).
 *		Alan Cox	:	FIN with no memory -> CRASH
 *		Alan Cox	:	Added socket option proto entries.
 *					Also added awareness of them to accept.
 *		Alan Cox	:	Added TCP options (SOL_TCP)
 *		Alan Cox	:	Switched wakeup calls to callbacks,
 *					so the kernel can layer network
 *					sockets.
 *		Alan Cox	:	Use ip_tos/ip_ttl settings.
 *		Alan Cox	:	Handle FIN (more) properly (we hope).
 *		Alan Cox	:	RST frames sent on unsynchronised
 *					state ack error.
 *		Alan Cox	:	Put in missing check for SYN bit.
 *		Alan Cox	:	Added tcp_select_window() aka NET2E
 *					window non shrink trick.
 *		Alan Cox	:	Added a couple of small NET2E timer
 *					fixes
 *		Charles Hedrick :	TCP fixes
 *		Toomas Tamm	:	TCP window fixes
 *		Alan Cox	:	Small URG fix to rlogin ^C ack fight
 *		Charles Hedrick	:	Rewrote most of it to actually work
 *		Linus		:	Rewrote tcp_read() and URG handling
 *					completely
 *		Gerhard Koerting:	Fixed some missing timer handling
 *		Matthew Dillon  :	Reworked TCP machine states as per RFC
 *		Gerhard Koerting:	PC/TCP workarounds
 *		Adam Caldwell	:	Assorted timer/timing errors
 *		Matthew Dillon	:	Fixed another RST bug
 *		Alan Cox	:	Move to kernel side addressing changes.
 *		Alan Cox	:	Beginning work on TCP fastpathing
 *					(not yet usable)
 *		Arnt Gulbrandsen:	Turbocharged tcp_check() routine.
 *		Alan Cox	:	TCP fast path debugging
 *		Alan Cox	:	Window clamping
 *		Michael Riepe	:	Bug in tcp_check()
 *		Matt Dillon	:	More TCP improvements and RST bug fixes
 *		Matt Dillon	:	Yet more small nasties remove from the
 *					TCP code (Be very nice to this man if
 *					tcp finally works 100%) 8)
 *		Alan Cox	:	BSD accept semantics.
 *		Alan Cox	:	Reset on closedown bug.
 *	Peter De Schrijver	:	ENOTCONN check missing in tcp_sendto().
 *		Michael Pall	:	Handle poll() after URG properly in
 *					all cases.
 *		Michael Pall	:	Undo the last fix in tcp_read_urg()
 *					(multi URG PUSH broke rlogin).
 *		Michael Pall	:	Fix the multi URG PUSH problem in
 *					tcp_readable(), poll() after URG
 *					works now.
 *		Michael Pall	:	recv(...,MSG_OOB) never blocks in the
 *					BSD api.
 *		Alan Cox	:	Changed the semantics of sk->socket to
 *					fix a race and a signal problem with
 *					accept() and async I/O.
 *		Alan Cox	:	Relaxed the rules on tcp_sendto().
 *		Yury Shevchuk	:	Really fixed accept() blocking problem.
 *		Craig I. Hagan  :	Allow for BSD compatible TIME_WAIT for
 *					clients/servers which listen in on
 *					fixed ports.
 *		Alan Cox	:	Cleaned the above up and shrank it to
 *					a sensible code size.
 *		Alan Cox	:	Self connect lockup fix.
 *		Alan Cox	:	No connect to multicast.
 *		Ross Biro	:	Close unaccepted children on master
 *					socket close.
 *		Alan Cox	:	Reset tracing code.
 *		Alan Cox	:	Spurious resets on shutdown.
 *		Alan Cox	:	Giant 15 minute/60 second timer error
 *		Alan Cox	:	Small whoops in polling before an
 *					accept.
 *		Alan Cox	:	Kept the state trace facility since
 *					it's handy for debugging.
 *		Alan Cox	:	More reset handler fixes.
 *		Alan Cox	:	Started rewriting the code based on
 *					the RFC's for other useful protocol
 *					references see: Comer, KA9Q NOS, and
 *					for a reference on the difference
 *					between specifications and how BSD
 *					works see the 4.4lite source.
 *		A.N.Kuznetsov	:	Don't time wait on completion of tidy
 *					close.
 *		Linus Torvalds	:	Fin/Shutdown & copied_seq changes.
 *		Linus Torvalds	:	Fixed BSD port reuse to work first syn
 *		Alan Cox	:	Reimplemented timers as per the RFC
 *					and using multiple timers for sanity.
 *		Alan Cox	:	Small bug fixes, and a lot of new
 *					comments.
 *		Alan Cox	:	Fixed dual reader crash by locking
 *					the buffers (much like datagram.c)
 *		Alan Cox	:	Fixed stuck sockets in probe. A probe
 *					now gets fed up of retrying without
 *					(even a no space) answer.
 *		Alan Cox	:	Extracted closing code better
 *		Alan Cox	:	Fixed the closing state machine to
 *					resemble the RFC.
 *		Alan Cox	:	More 'per spec' fixes.
 *		Jorge Cwik	:	Even faster checksumming.
 *		Alan Cox	:	tcp_data() doesn't ack illegal PSH
 *					only frames. At least one pc tcp stack
 *					generates them.
 *		Alan Cox	:	Cache last socket.
 *		Alan Cox	:	Per route irtt.
 *		Matt Day	:	poll()->select() match BSD precisely on error
 *		Alan Cox	:	New buffers
 *		Marc Tamsky	:	Various sk->prot->retransmits and
 *					sk->retransmits misupdating fixed.
 *					Fixed tcp_write_timeout: stuck close,
 *					and TCP syn retries gets used now.
 *		Mark Yarvis	:	In tcp_read_wakeup(), don't send an
 *					ack if state is TCP_CLOSED.
 *		Alan Cox	:	Look up device on a retransmit - routes may
 *					change. Doesn't yet cope with MSS shrink right
 *					but it's a start!
 *		Marc Tamsky	:	Closing in closing fixes.
 *		Mike Shaver	:	RFC1122 verifications.
 *		Alan Cox	:	rcv_saddr errors.
 *		Alan Cox	:	Block double connect().
 *		Alan Cox	:	Small hooks for enSKIP.
 *		Alexey Kuznetsov:	Path MTU discovery.
 *		Alan Cox	:	Support soft errors.
 *		Alan Cox	:	Fix MTU discovery pathological case
 *					when the remote claims no mtu!
 *		Marc Tamsky	:	TCP_CLOSE fix.
 *		Colin (G3TNE)	:	Send a reset on syn ack replies in
 *					window but wrong (fixes NT lpd problems)
 *		Pedro Roque	:	Better TCP window handling, delayed ack.
 *		Joerg Reuter	:	No modification of locked buffers in
 *					tcp_do_retransmit()
 *		Eric Schenk	:	Changed receiver side silly window
 *					avoidance algorithm to BSD style
 *					algorithm.
 *					This doubles throughput
 *					against machines running Solaris,
 *					and seems to result in general
 *					improvement.
 *	Stefan Magdalinski	:	adjusted tcp_readable() to fix FIONREAD
 *	Willy Konynenberg	:	Transparent proxying support.
 *	Mike McLagan		:	Routing by source
 *		Keith Owens	:	Do proper merging with partial SKB's in
 *					tcp_do_sendmsg to avoid burstiness.
 *		Eric Schenk	:	Fix fast close down bug with
 *					shutdown() followed by close().
 *		Andi Kleen	:	Make poll agree with SIGIO
 *	Salvatore Sanfilippo	:	Support SO_LINGER with linger == 1 and
 *					lingertime == 0 (RFC 793 ABORT Call)
 *	Hirokazu Takahashi	:	Use copy_from_user() instead of
 *					csum_and_copy_from_user() if possible.
 *
 * Description of States:
 *
 *	TCP_SYN_SENT		sent a connection request, waiting for ack
 *
 *	TCP_SYN_RECV		received a connection request, sent ack,
 *				waiting for final ack in three-way handshake.
 *
 *	TCP_ESTABLISHED		connection established
 *
 *	TCP_FIN_WAIT1		our side has shutdown, waiting to complete
 *				transmission of remaining buffered data
 *
 *	TCP_FIN_WAIT2		all buffered data sent, waiting for remote
 *				to shutdown
 *
 *	TCP_CLOSING		both sides have shutdown but we still have
 *				data we have to finish sending
 *
 *	TCP_TIME_WAIT		timeout to catch resent junk before entering
 *				closed, can only be entered from FIN_WAIT2
 *				or CLOSING.  Required because the other end
 *				may not have gotten our last ACK causing it
 *				to retransmit the data packet (which we ignore)
 *
 *	TCP_CLOSE_WAIT		remote side has shutdown and is waiting for
 *				us to finish writing our data and to shutdown
 *				(we have to close() to move on to LAST_ACK)
 *
 *	TCP_LAST_ACK		our side has shutdown after remote has
 *				shutdown.  There may still be data in our
 *				buffer that we have to finish sending
 *
 *	TCP_CLOSE		socket is finished
*/
/*
 * Current number of TCP sockets.
 */
struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
EXPORT_IPV6_MOD(tcp_sockets_allocated);
/*
 * Pressure flag: try to collapse.
 * Technical note: it is used by multiple contexts non atomically.
 * All the __sk_mem_schedule() is of this nature: accounting
 * is strict, actions are advisory and have some latency.
 */
unsigned long tcp_memory_pressure __read_mostly;
EXPORT_SYMBOL_GPL(tcp_memory_pressure);
	if (!READ_ONCE(tcp_memory_pressure))
		return;
	val = xchg(&tcp_memory_pressure, 0);
	if (val)
		NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURESCHRONO,
			      jiffies_to_msecs(jiffies - val));
}
EXPORT_IPV6_MOD_GPL(tcp_leave_memory_pressure);
/* Convert seconds to retransmits based on initial and max timeout */
static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
{
	u8 res = 0;

	if (seconds > 0) {
		int period = timeout;

		res = 1;
		while (seconds > period && res < 255) {
			res++;
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return res;
}

/* Convert retransmits to seconds based on initial and max timeout */
static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
{
	int period = 0;

	if (retrans > 0) {
		period = timeout;
		while (--retrans) {
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return period;
}
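
/*
 * Illustration only, not part of the kernel source: a minimal user-space
 * sketch of the two conversions above, assuming a plain unsigned char in
 * place of u8. The demo_* names are hypothetical. With an initial timeout
 * of 1 second the backoff schedule is 1, 2, 4, ... seconds, each step
 * capped at rto_max, so three transmissions span 1 + 2 + 4 = 7 seconds.
 */

```c
#include <assert.h>

/* Hypothetical user-space copy of secs_to_retrans(). */
static unsigned char demo_secs_to_retrans(int seconds, int timeout, int rto_max)
{
	unsigned char res = 0;

	if (seconds > 0) {
		int period = timeout;

		res = 1;
		while (seconds > period && res < 255) {
			res++;
			timeout <<= 1;		/* exponential backoff */
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return res;
}

/* Hypothetical user-space copy of retrans_to_secs(). */
static int demo_retrans_to_secs(unsigned char retrans, int timeout, int rto_max)
{
	int period = 0;

	if (retrans > 0) {
		period = timeout;
		while (--retrans) {
			timeout <<= 1;
			if (timeout > rto_max)
				timeout = rto_max;
			period += timeout;
		}
	}
	return period;
}

/* Round trip: 3 retransmits <-> 7 seconds when starting at 1 second. */
static void demo_retrans_check(void)
{
	assert(demo_retrans_to_secs(3, 1, 120) == 7);
	assert(demo_secs_to_retrans(7, 1, 120) == 3);
}
```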
/* Address-family independent initialization for a tcp_sock.
 *
 * NOTE: A lot of things set to zero explicitly by call to
 *       sk_alloc() so need not be done here.
 */
void tcp_init_sock(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	int rto_min_us, rto_max_ms;

	/* So many TCP implementations out there (incorrectly) count the
	 * initial SYN frame in their delayed-ACK and congestion control
	 * algorithms that we must have the following bandaid to talk
	 * efficiently to them.  -DaveM
	 */
tcp_snd_cwnd_set(tp, TCP_INIT_CWND);
/* There's a bubble in the pipe until at least the first ACK. */
tp->app_limited = ~0U;
tp->rate_app_limited = 1;
	/* See draft-stevens-tcpca-spec-01 for discussion of the
	 * initialization of these values.
	 */
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_clamp = ~0;
tp->mss_cache = TCP_MSS_DEFAULT;
static bool tcp_stream_is_readable(struct sock *sk, int target)
{
	if (tcp_epollin_ready(sk, target))
		return true;
	return sk_is_readable(sk);
}
/*
 *	Wait for a TCP event.
 *
 *	Note that we don't need to lock the socket, as the upper poll layers
 *	take care of normal races (between the test and the event) and we don't
 *	go look at any of the socket buffers directly.
 */
__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
	__poll_t mask;
	struct sock *sk = sock->sk;
	const struct tcp_sock *tp = tcp_sk(sk);
	u8 shutdown;
	int state;

	sock_poll_wait(file, sock, wait);

	state = inet_sk_state_load(sk);
	if (state == TCP_LISTEN)
		return inet_csk_listen_poll(sk);

	/* Socket is not locked. We are protected from async events
	 * by poll logic and correct handling of state changes
	 * made by other threads is impossible in any case.
	 */

	mask = 0;

	/*
	 * EPOLLHUP is certainly not done right. But poll() doesn't
	 * have a notion of HUP in just one direction, and for a
	 * socket the read side is more interesting.
	 *
	 * Some poll() documentation says that EPOLLHUP is incompatible
	 * with the EPOLLOUT/POLLWR flags, so somebody should check this
	 * all. But careful, it tends to be safer to return too many
	 * bits than too few, and you can easily break real applications
	 * if you don't tell them that something has hung up!
	 *
	 * Check-me.
	 *
	 * Check number 1. EPOLLHUP is _UNMASKABLE_ event (see UNIX98 and
	 * our fs/select.c). It means that after we received EOF,
	 * poll always returns immediately, making impossible poll() on write()
	 * in state CLOSE_WAIT. One solution is evident --- to set EPOLLHUP
	 * if and only if shutdown has been made in both directions.
	 * Actually, it is interesting to look how Solaris and DUX
	 * solve this dilemma. I would prefer, if EPOLLHUP were maskable,
	 * then we could set it on SND_SHUTDOWN. BTW examples given
	 * in Stevens' books assume exactly this behaviour, it explains
	 * why EPOLLHUP is incompatible with EPOLLOUT.	--ANK
	 *
	 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
	 * blocking on fresh not-connected or disconnected socket. --ANK
	 */
	shutdown = READ_ONCE(sk->sk_shutdown);
	if (shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
		mask |= EPOLLHUP;
	if (shutdown & RCV_SHUTDOWN)
		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;

	/* Connected or passive Fast Open socket? */
	if (state != TCP_SYN_SENT &&
	    (state != TCP_SYN_RECV || rcu_access_pointer(tp->fastopen_rsk))) {
		int target = sock_rcvlowat(sk, 0, INT_MAX);
u16 urg_data = READ_ONCE(tp->urg_data);
if (unlikely(urg_data) &&
READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq) &&
!sock_flag(sk, SOCK_URGINLINE))
target++;
if (tcp_stream_is_readable(sk, target))
mask |= EPOLLIN | EPOLLRDNORM;
		if (!(shutdown & SEND_SHUTDOWN)) {
			if (__sk_stream_is_writeable(sk, 1)) {
				mask |= EPOLLOUT | EPOLLWRNORM;
			} else {  /* send SIGIO later */
				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

				/* Race breaker. If space is freed after
				 * wspace test but before the flags are set,
				 * IO signal will be lost. Memory barrier
				 * pairs with the input side.
				 */
				smp_mb__after_atomic();
				if (__sk_stream_is_writeable(sk, 1))
					mask |= EPOLLOUT | EPOLLWRNORM;
			}
		} else
			mask |= EPOLLOUT | EPOLLWRNORM;

		if (urg_data & TCP_URG_VALID)
			mask |= EPOLLPRI;
	} else if (state == TCP_SYN_SENT &&
		   inet_test_bit(DEFER_CONNECT, sk)) {
		/* Active TCP fastopen socket with defer_connect
		 * Return EPOLLOUT so application can call write()
		 * in order for kernel to generate SYN+data
		 */
		mask |= EPOLLOUT | EPOLLWRNORM;
	}
	/* This barrier is coupled with smp_wmb() in tcp_done_with_error() */
	smp_rmb();
	if (READ_ONCE(sk->sk_err) ||
	    !skb_queue_empty_lockless(&sk->sk_error_queue))
		mask |= EPOLLERR;

return mask;
}
EXPORT_SYMBOL(tcp_poll);
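
/*
 * Illustration only, not part of the kernel source: a minimal user-space
 * sketch of the hang-up rule tcp_poll() applies above. The DEMO_* values and
 * demo_* names are hypothetical stand-ins for RCV_SHUTDOWN, SEND_SHUTDOWN and
 * SHUTDOWN_MASK, and only POLLHUP and POLLIN are modelled (the kernel also
 * sets EPOLLRDNORM and EPOLLRDHUP on a receive shutdown). The point it shows:
 * a write-side shutdown alone never raises POLLHUP; both directions must be
 * shut down, or the socket must be closed.
 */

```c
#include <assert.h>
#include <poll.h>

enum {
	DEMO_RCV_SHUTDOWN  = 1,
	DEMO_SEND_SHUTDOWN = 2,
	DEMO_SHUTDOWN_MASK = 3,
};

/* Returns the poll bits implied by the shutdown flags and closed state. */
static short demo_shutdown_mask(unsigned char shutdown, int closed)
{
	short mask = 0;

	if (shutdown == DEMO_SHUTDOWN_MASK || closed)
		mask |= POLLHUP;	/* hung up in both directions */
	if (shutdown & DEMO_RCV_SHUTDOWN)
		mask |= POLLIN;		/* reader must wake up and see EOF */
	return mask;
}

/* A half-close on the send side must stay silent. */
static void demo_shutdown_check(void)
{
	assert(demo_shutdown_mask(DEMO_SEND_SHUTDOWN, 0) == 0);
}
```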
int tcp_ioctl(struct sock *sk, int cmd, int *karg)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int answ;
	bool slow;

	switch (cmd) {
	case SIOCINQ:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;

		slow = lock_sock_fast(sk);
		answ = tcp_inq(sk);
		unlock_sock_fast(sk, slow);
		break;
	case SIOCATMARK:
		answ = READ_ONCE(tp->urg_data) &&
		       READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq);
		break;
	case SIOCOUTQ:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;

		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
			answ = 0;
		else
			answ = READ_ONCE(tp->write_seq) - tp->snd_una;
		break;
	case SIOCOUTQNSD:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;

static inline void tcp_mark_urg(struct tcp_sock *tp, int flags)
{
	if (flags & MSG_OOB)
		tp->snd_up = tp->write_seq;
}
/* If a not yet filled skb is pushed, do not send it if
 * we have data packets in Qdisc or NIC queues :
 * Because TX completion will happen shortly, it gives a chance
 * to coalesce future sendmsg() payload into this skb, without
 * need for a timer, and with no latency trade off.
 * As packets containing data payload have a bigger truesize
 * than pure acks (dataless) packets, the last checks prevent
 * autocorking if we only have an ACK in Qdisc/NIC queues,
 * or if TX completion was delayed after we processed ACK packet.
 */
static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
				int size_goal)
{
	return skb->len < size_goal &&
	       READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_autocorking) &&
	       !tcp_rtx_queue_empty(sk) &&
	       refcount_read(&sk->sk_wmem_alloc) > skb->truesize &&
	       tcp_skb_can_collapse_to(skb);
}
void tcp_push(struct sock *sk, int flags, int mss_now,
	      int nonagle, int size_goal)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;

	skb = tcp_write_queue_tail(sk);
	if (!skb)
		return;
	if (!(flags & MSG_MORE) || forced_push(tp))
		tcp_mark_push(tp, skb);

	tcp_mark_urg(tp, flags);

	if (tcp_should_autocork(sk, skb, size_goal)) {

		/* avoid atomic op if TSQ_THROTTLED bit is already set */
		if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
			set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
			smp_mb__after_atomic();
		}
		/* It is possible TX completion already happened
		 * before we set TSQ_THROTTLED.
		 */
		if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
			return;
	}

/**
 *  tcp_splice_read - splice data from TCP socket to a pipe
 * @sock:	socket to splice from
 * @ppos:	position (not valid)
 * @pipe:	pipe to splice to
 * @len:	number of bytes to splice
 * @flags:	splice modifier flags
 *
 * Description:
 *    Will read pages from given socket and fill them into a pipe.
 *
 **/
ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
			struct pipe_inode_info *pipe, size_t len,
			unsigned int flags)
{
	struct sock *sk = sock->sk;
	struct tcp_splice_state tss = {
		.pipe = pipe,
		.len = len,
		.flags = flags,
	};
	long timeo;
	ssize_t spliced;
	int ret;

	sock_rps_record_flow(sk);
	/*
	 * We can't seek on a socket input
	 */
	if (unlikely(*ppos))
		return -ESPIPE;

	ret = spliced = 0;

	lock_sock(sk);

	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
	while (tss.len) {
		ret = __tcp_splice_read(sk, &tss);
		if (ret < 0)
			break;
		else if (!ret) {
			if (spliced)
				break;
			if (sock_flag(sk, SOCK_DONE))
				break;
			if (sk->sk_err) {
				ret = sock_error(sk);
				break;
			}
			if (sk->sk_shutdown & RCV_SHUTDOWN)
				break;
			if (sk->sk_state == TCP_CLOSE) {
				/*
				 * This occurs when user tries to read
				 * from never connected socket.
				 */
				ret = -ENOTCONN;
				break;
			}
			if (!timeo) {
				ret = -EAGAIN;
				break;
			}
			/* if __tcp_splice_read() got nothing while we have
			 * an skb in receive queue, we do not want to loop.
			 * This might happen with URG data.
			 */
			if (!skb_queue_empty(&sk->sk_receive_queue))
				break;
			ret = sk_wait_data(sk, &timeo, NULL);
			if (ret < 0)
				break;
			if (signal_pending(current)) {
				ret = sock_intr_errno(timeo);
				break;
			}
			continue;
		}
		tss.len -= ret;
		spliced += ret;

		if (!tss.len || !timeo)
			break;
release_sock(sk);
lock_sock(sk);
/* In some cases, sendmsg() could have added an skb to the write queue,
 * but failed adding payload on it. We need to remove it to consume less
 * memory, but more importantly be able to generate EPOLLOUT for Edge Trigger
 * epoll() users. Another reason is that tcp_write_xmit() does not like
 * finding an empty skb in the write queue.
 */
void tcp_remove_empty_skb(struct sock *sk)
{
	struct sk_buff *skb = tcp_write_queue_tail(sk);

	if (skb && TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
		tcp_unlink_write_queue(skb, sk);
		if (tcp_write_queue_empty(sk))
			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
		tcp_wmem_free_skb(sk, skb);
	}
}
/* skb changing from pure zc to mixed, must charge zc */
static int tcp_downgrade_zcopy_pure(struct sock *sk, struct sk_buff *skb)
{
	if (unlikely(skb_zcopy_pure(skb))) {
		u32 extra = skb->truesize -
			    SKB_TRUESIZE(skb_end_offset(skb));

int tcp_wmem_schedule(struct sock *sk, int copy)
{
	int left;

	if (likely(sk_wmem_schedule(sk, copy)))
		return copy;

	/* We could be in trouble if we have nothing queued.
	 * Use whatever is left in sk->sk_forward_alloc and tcp_wmem[0]
	 * to guarantee some progress.
	 */
	left = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_wmem[0]) - sk->sk_wmem_queued;
	if (left > 0)
		sk_forced_mem_schedule(sk, min(left, copy));
	return min(copy, sk->sk_forward_alloc);
}
	if (!(READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_fastopen) &
	      TFO_CLIENT_ENABLE) ||
	    (uaddr && msg->msg_namelen >= sizeof(uaddr->sa_family) &&
	     uaddr->sa_family == AF_UNSPEC))
		return -EOPNOTSUPP;
	if (tp->fastopen_req)
		return -EALREADY; /* Another Fast Open is in progress */

	if (inet_test_bit(DEFER_CONNECT, sk)) {
		err = tcp_connect(sk);
		/* Same failure procedure as in tcp_v4/6_connect */
		if (err) {
			tcp_set_state(sk, TCP_CLOSE);
			inet->inet_dport = 0;
			sk->sk_route_caps = 0;
		}
	}
	flags = (msg->msg_flags & MSG_DONTWAIT) ? O_NONBLOCK : 0;
	err = __inet_stream_connect(sk->sk_socket, uaddr,
				    msg->msg_namelen, flags, 1);
	/* fastopen_req could already be freed in __inet_stream_connect
	 * if the connection times out or gets rst
	 */
	if (tp->fastopen_req) {
		*copied = tp->fastopen_req->copied;
		tcp_free_fastopen_req(tp);
		inet_clear_bit(DEFER_CONNECT, sk);
	}
	return err;
}

int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
{
	struct net_devmem_dmabuf_binding *binding = NULL;
	struct tcp_sock *tp = tcp_sk(sk);
	struct ubuf_info *uarg = NULL;
	struct sk_buff *skb;
	struct sockcm_cookie sockc;
	int flags, err, copied = 0;
	int mss_now = 0, size_goal, copied_syn = 0;
	int process_backlog = 0;
	int sockc_err = 0;
	int zc = 0;
	long timeo;

	flags = msg->msg_flags;

	sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags) };
	if (msg->msg_controllen) {
		sockc_err = sock_cmsg_send(sk, msg, &sockc);
		/* Don't return error until MSG_FASTOPEN has been processed;
		 * that may succeed even if the cmsg is invalid.
		 */
	}

	tcp_rate_check_app_limited(sk);  /* is sending application-limited? */

	/* Wait for a connection to finish. One exception is TCP Fast Open
	 * (passive side) where data is allowed to be sent before a connection
	 * is fully established.
	 */
	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
	    !tcp_passive_fastopen(sk)) {
		err = sk_stream_wait_connect(sk, &timeo);
		if (err != 0)
			goto do_error;
	}

	if (unlikely(tp->repair)) {
		if (tp->repair_queue == TCP_RECV_QUEUE) {
			copied = tcp_send_rcvq(sk, msg, size);
			goto out_nopush;
		}

		err = -EINVAL;
		if (tp->repair_queue == TCP_NO_QUEUE)
			goto out_err;

		/* 'common' sending to sendq */
	}

	if (sockc_err) {
		err = sockc_err;
		goto out_err;
	}

/* This should be in poll */
sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
			/* All packets are restored as if they have
			 * already been sent. skb_mstamp_ns isn't set to
			 * avoid wrong rtt estimation.
			 */
			if (tp->repair)
				TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED;
}
		/* Try to append data to the end of skb. */
		if (copy > msg_data_left(msg))
			copy = msg_data_left(msg);

		if (zc == 0) {
			bool merge = true;
			int i = skb_shinfo(skb)->nr_frags;
			struct page_frag *pfrag = sk_page_frag(sk);

			if (!sk_page_frag_refill(sk, pfrag))
				goto wait_for_space;

			if (!skb_can_coalesce(skb, i, pfrag->page,
					      pfrag->offset)) {
				if (i >= READ_ONCE(net_hotdata.sysctl_max_skb_frags)) {
					tcp_mark_push(tp, skb);
					goto new_segment;
				}
				merge = false;
			}

		err = sk_stream_wait_memory(sk, &timeo);
		if (err != 0)
			goto do_error;

mss_now = tcp_send_mss(sk, &size_goal, flags);
}
out:
	if (copied) {
		tcp_tx_timestamp(sk, &sockc);
		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
	}
out_nopush:
	/* msg->msg_ubuf is pinned by the caller so we don't take extra refs */
	if (uarg && !msg->msg_ubuf)
		net_zcopy_put(uarg);
	if (binding)
		net_devmem_dmabuf_binding_put(binding);
	return copied + copied_syn;

do_error:
	tcp_remove_empty_skb(sk);

	if (copied + copied_syn)
		goto out;
out_err:
	/* msg->msg_ubuf is pinned by the caller so we don't take extra refs */
	if (uarg && !msg->msg_ubuf)
		net_zcopy_put_abort(uarg, true);
	err = sk_stream_error(sk, flags, err);
	/* make sure we wake any epoll edge trigger waiter */
	if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
		sk->sk_write_space(sk);
		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
	}
	if (binding)
		net_devmem_dmabuf_binding_put(binding);

/*
 *	Handle reading urgent data. BSD has very simple semantics for
 *	this, no blocking and very strange errors 8)
 */

static int tcp_recv_urg(struct sock *sk, struct msghdr *msg, int len, int flags)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* No URG data to read. */
	if (sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data ||
	    tp->urg_data == TCP_URG_READ)
		return -EINVAL;	/* Yes this is right ! */

	if (sk->sk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DONE))
		return -ENOTCONN;

	if (tp->urg_data & TCP_URG_VALID) {
		int err = 0;
		char c = tp->urg_data;

		if (!(flags & MSG_PEEK))
			WRITE_ONCE(tp->urg_data, TCP_URG_READ);

		if (len > 0) {
			if (!(flags & MSG_TRUNC))
				err = memcpy_to_msg(msg, &c, 1);
			len = 1;
		} else
			msg->msg_flags |= MSG_TRUNC;

		return err ? -EFAULT : len;
	}

	if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))
		return 0;

	/* Fixed the recv(..., MSG_OOB) behaviour.  BSD docs and
	 * the available implementations agree in this case:
	 * this call should never block, independent of the
	 * blocking state of the socket.
	 * Mike <pall@rz.uni-karlsruhe.de>
	 */
	return -EAGAIN;
}

static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
{
	struct sk_buff *skb;
	int copied = 0, err = 0;

/* Clean up the receive buffer for full frames taken by the user,
 * then send an ACK if necessary.  COPIED is the number of bytes
 * tcp_recvmsg has given to the user so far, it speeds up the
 * calculation of whether or not we must ACK for the sake of
 * a window update.
 */
void __tcp_cleanup_rbuf(struct sock *sk, int copied)
{
	struct tcp_sock *tp = tcp_sk(sk);
	bool time_to_ack = false;

	if (inet_csk_ack_scheduled(sk)) {
		const struct inet_connection_sock *icsk = inet_csk(sk);

		if (/* Once-per-two-segments ACK was not sent by tcp_input.c */
		    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
		    /*
		     * If this read emptied read buffer, we send ACK, if
		     * connection is not bidirectional, user drained
		     * receive buffer and there was a small segment
		     * in queue.
		     */
		    (copied > 0 &&
		     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
		      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
		       !inet_csk_in_pingpong_mode(sk))) &&
		      !atomic_read(&sk->sk_rmem_alloc)))
			time_to_ack = true;
	}

	/* We send an ACK if we can now advertise a non-zero window
	 * which has been raised "significantly".
	 *
	 * Even if window raised up to infinity, do not send window open ACK
	 * in states, where we will not receive more. It is useless.
	 */
	if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) {
		__u32 rcv_window_now = tcp_receive_window(tp);

		/* Optimize, __tcp_select_window() is not cheap. */
		if (2*rcv_window_now <= tp->window_clamp) {
			__u32 new_window = __tcp_select_window(sk);

			/* Send ACK now, if this read freed lots of space
			 * in our buffer. Certainly, new_window is new window.
			 * We can advertise it now, if it is not less than current one.
			 * "Lots" means "at least twice" here.
			 */
			if (new_window && new_window >= 2 * rcv_window_now)
				time_to_ack = true;
		}
	}
	if (time_to_ack)
		tcp_send_ack(sk);
}
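
/*
 * Illustration only, not part of the kernel source: a minimal user-space
 * sketch of the window-update test in __tcp_cleanup_rbuf() above. The
 * demo_* names are hypothetical, and new_window is passed in rather than
 * computed, since __tcp_select_window() needs real socket state. The
 * RCV_SHUTDOWN check and the delayed-ACK branch are left out. It shows the
 * "at least twice" rule: an ACK is only worth sending when the window we
 * could advertise is non-zero and at least double the current one.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a read of `copied` bytes justifies a window-update ACK. */
static bool demo_window_update_due(int copied, uint32_t rcv_window_now,
				   uint32_t window_clamp, uint32_t new_window)
{
	if (copied <= 0)
		return false;
	/* Cheap pre-check: doubling is impossible above half the clamp. */
	if (2 * rcv_window_now > window_clamp)
		return false;
	return new_window && new_window >= 2 * rcv_window_now;
}

/* Doubling exactly (1000 -> 2000) already counts as "lots". */
static void demo_window_check(void)
{
	assert(demo_window_update_due(1000, 1000, 65535, 2000));
}
```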
	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
		offset = seq - TCP_SKB_CB(skb)->seq;
		if (unlikely(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)) {
			pr_err_once("%s: found a SYN, please report !\n", __func__);
			offset--;
		}
		if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) {
			*off = offset;
			return skb;
		}
		/* This looks weird, but this can happen if TCP collapsing
		 * split a fat GRO packet, while we released socket lock
		 * in skb_splice_bits()
		 */
		tcp_eat_recv_skb(sk, skb);
	}
	return NULL;
}
EXPORT_SYMBOL(tcp_recv_skb);
/*
 * This routine provides an alternative to tcp_recvmsg() for routines
 * that would like to handle copying from skbuffs directly in 'sendfile'
 * fashion.
 * Note:
 *	- It is assumed that the socket was locked by the caller.
 *	- The routine does not block.
 *	- At present, there is no support for reading OOB data
 *	  or for 'peeking' the socket using this routine
 *	  (although both would be easy to implement).
 */
static int __tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
			   sk_read_actor_t recv_actor, bool noack,
			   u32 *copied_seq)
{
	struct sk_buff *skb;
	struct tcp_sock *tp = tcp_sk(sk);
	u32 seq = *copied_seq;
	u32 offset;
	int copied = 0;

	if (sk->sk_state == TCP_LISTEN)
		return -ENOTCONN;
	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
		if (offset < skb->len) {
			int used;
			size_t len;

			len = skb->len - offset;
			/* Stop reading if we hit a patch of urgent data */
			if (unlikely(tp->urg_data)) {
				u32 urg_offset = tp->urg_seq - seq;

				if (urg_offset < len)
					len = urg_offset;
				if (!len)
					break;
			}
			used = recv_actor(desc, skb, offset, len);
			if (used <= 0) {
				if (!copied)
					copied = used;
				break;
			}
			if (WARN_ON_ONCE(used > len))
				used = len;
			seq += used;
			copied += used;
			offset += used;

			/* If recv_actor drops the lock (e.g. TCP splice
			 * receive) the skb pointer might be invalid when
			 * getting here: tcp_collapse might have deleted it
			 * while aggregating skbs from the socket queue.
			 */
			skb = tcp_recv_skb(sk, seq - 1, &offset);
			if (!skb)
				break;
			/* TCP coalescing might have appended data to the skb.
			 * Try to splice more frags
			 */
			if (offset + 1 != skb->len)
				continue;
		}
		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
			tcp_eat_recv_skb(sk, skb);
			++seq;
			break;
		}
		tcp_eat_recv_skb(sk, skb);
		if (!desc->count)
			break;
		WRITE_ONCE(*copied_seq, seq);
	}
	WRITE_ONCE(*copied_seq, seq);

	if (noack)
		goto out;

	tcp_rcv_space_adjust(sk);

	/* Clean up data we have read: This will do ACK frames. */
	if (copied > 0) {
		tcp_recv_skb(sk, seq, &offset);
		tcp_cleanup_rbuf(sk, copied);
	}
out:
	return copied;
}

	/* Clean up data we have read: This will do ACK frames. */
	if (left != len)
tcp_cleanup_rbuf(sk, len - left);
}
EXPORT_SYMBOL(tcp_read_done);
int tcp_peek_len(struct socket *sock)
{
	return tcp_inq(sock->sk);
}
EXPORT_IPV6_MOD(tcp_peek_len);
/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
int tcp_set_rcvlowat(struct sock *sk, int val)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int space, cap;

	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
		cap = sk->sk_rcvbuf >> 1;
	else
		cap = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]) >> 1;
	val = min(val, cap);
	WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);

	/* Check if we need to signal EPOLLIN right now */
	tcp_data_ready(sk);

	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
		return 0;

	space = tcp_space_from_win(sk, val);
	if (space > sk->sk_rcvbuf) {
		WRITE_ONCE(sk->sk_rcvbuf, space);

	/* worst case: skip to next skb. try to improve on this case below */
	zc->recv_skip_hint = skb->len - offset;

	/* Find the frag containing this offset (and how far into that frag) */
	frag = skb_advance_to_frag(skb, offset, &frag_offset);
	if (!frag)
		return;

	if (frag_offset) {
		struct skb_shared_info *info = skb_shinfo(skb);

		/* We read part of the last frag, must recvmsg() rest of skb. */
		if (frag == &info->frags[info->nr_frags - 1])
			return;

		/* Else, we must at least read the remainder in this frag. */
		partial_frag_remainder = skb_frag_size(frag) - frag_offset;
		zc->recv_skip_hint -= partial_frag_remainder;
		++frag;
	}

	/* partial_frag_remainder: If part way through a frag, must read rest.
	 * mappable_offset: Bytes till next mappable frag, *not* counting bytes
	 * in partial_frag_remainder.
	 */
	mappable_offset = find_next_mappable_frag(frag, zc->recv_skip_hint);
	zc->recv_skip_hint = mappable_offset + partial_frag_remainder;
}

static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
			      int flags, struct scm_timestamping_internal *tss,
			      int *cmsg_flags);
static int receive_fallback_to_copy(struct sock *sk,
				    struct tcp_zerocopy_receive *zc, int inq,
				    struct scm_timestamping_internal *tss)
{
	unsigned long copy_address = (unsigned long)zc->copybuf_address;
	struct msghdr msg = {};
	int err;

	zc->length = 0;
	zc->recv_skip_hint = 0;

	if (copy_address != zc->copybuf_address)
		return -EINVAL;

	if (!err) {
		unsigned long leftover_pages = pages_remaining;
		int bytes_mapped;

		/* We called zap_page_range_single, try to reinsert. */
		err = vm_insert_pages(vma, *address,
				      pending_pages,
				      &pages_remaining);
		bytes_mapped = PAGE_SIZE * (leftover_pages - pages_remaining);
		*seq += bytes_mapped;
		*address += bytes_mapped;
	}
	if (err) {
		/* Either we were unable to zap, OR we zapped, retried an
		 * insert, and still had an issue. Either way, pages_remaining
		 * is the number of pages we were unable to map, and we unroll
		 * some state we speculatively touched before.
		 */
		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;

	err = vm_insert_pages(vma, *address, pages, &pages_remaining);
	pages_mapped = pages_to_map - (unsigned int)pages_remaining;
	bytes_mapped = PAGE_SIZE * pages_mapped;
	/* Even if vm_insert_pages fails, it may have partially succeeded in
	 * mapping (some but not all of the pages).
	 */
	*seq += bytes_mapped;
	*address += bytes_mapped;

	if (likely(!err))
		return 0;

	/* Error: maybe zap and retry + rollback state for failed inserts. */
	return tcp_zerocopy_vm_insert_batch_error(vma, pages + pages_mapped,
		pages_remaining, address, length, seq, zc, total_bytes_to_map,
		err);
}
		mappable_offset = find_next_mappable_frag(frags,
							  zc->recv_skip_hint);
		if (mappable_offset) {
			zc->recv_skip_hint = mappable_offset;
			break;
		}
		page = skb_frag_page(frags);
		if (WARN_ON_ONCE(!page))
			break;

		prefetchw(page);
		pages[pages_to_map++] = page;
		length += PAGE_SIZE;
		zc->recv_skip_hint -= PAGE_SIZE;
		frags++;
		if (pages_to_map == TCP_ZEROCOPY_PAGE_BATCH_SIZE ||
		    zc->recv_skip_hint < PAGE_SIZE) {
			/* Either full batch, or we're about to go to next skb
			 * (and we cannot unroll failed ops across skbs).
			 */
			ret = tcp_zerocopy_vm_insert_batch(vma, pages,
							   pages_to_map,
							   &address, &length,
							   &seq, zc,
							   total_bytes_to_map);
			if (ret)
				goto out;
			pages_to_map = 0;
		}
	}
	if (pages_to_map) {
		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pages_to_map,
						   &address, &length, &seq,
						   zc, total_bytes_to_map);
	}
out:
	if (mmap_locked)
		mmap_read_unlock(current->mm);
	else
		vma_end_read(vma);
	/* Try to copy straggler data. */
	if (!ret)
		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);

	if (length + copylen) {
		WRITE_ONCE(tp->copied_seq, seq);
		tcp_rcv_space_adjust(sk);

		/* Clean up data we have read: This will do ACK frames. */
		tcp_recv_skb(sk, seq, &offset);
		tcp_cleanup_rbuf(sk, length + copylen);
		ret = 0;
		if (length == zc->length)
			zc->recv_skip_hint = 0;
	} else {
		if (!zc->recv_skip_hint && sock_flag(sk, SOCK_DONE))
			ret = -EIO;
	}
	zc->length = length;
	return ret;
}
#endif
/* Similar to __sock_recv_timestamp, but does not require an skb */
void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk,
			struct scm_timestamping_internal *tss)
{
	int new_tstamp = sock_flag(sk, SOCK_TSTAMP_NEW);
	u32 tsflags = READ_ONCE(sk->sk_tsflags);
	bool has_timestamping = false;
static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
{
	int i;

	/* Commit part that has been copied to user space. */
	for (i = 0; i < p->idx; i++)
		__xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
			     (__force void *)p->netmems[i], GFP_KERNEL);
	/* Rollback what has been pre-allocated and is no longer needed. */
	for (; i < p->max; i++)
		__xa_erase(&sk->sk_user_frags, p->tokens[i]);
	for (k = 0; k < max_frags; k++) {
		err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
				 XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
		if (err)
			break;
	}

	xa_unlock_bh(&sk->sk_user_frags);

	p->max = k;
	p->idx = 0;
	return k ? 0 : err;
}
/* On error, returns the -errno. On success, returns number of bytes sent to the
 * user. May not consume all of @remaining_len.
 */
static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
			      unsigned int offset, struct msghdr *msg,
			      int remaining_len)
{
	struct dmabuf_cmsg dmabuf_cmsg = { 0 };
	struct tcp_xa_pool tcp_xa_pool;
	unsigned int start;
	int i, copy, n;
	int sent = 0;
	int err = 0;
			n = copy_to_iter(skb->data + offset, copy,
					 &msg->msg_iter);
			if (n != copy) {
				err = -EFAULT;
				goto out;
			}

			offset += copy;
			remaining_len -= copy;

			/* First a dmabuf_cmsg for # bytes copied to user
			 * buffer.
			 */
			memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg));
			dmabuf_cmsg.frag_size = copy;
			err = put_cmsg_notrunc(msg, SOL_SOCKET,
					       SO_DEVMEM_LINEAR,
					       sizeof(dmabuf_cmsg),
					       &dmabuf_cmsg);
			if (err)
				goto out;

			sent += copy;

			if (remaining_len == 0)
				goto out;
		}
		/* after that, send information of dmabuf pages through a
		 * sequence of cmsg
		 */
		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
			struct net_iov *niov;
			u64 frag_offset;
			int end;

			/* !skb_frags_readable() should indicate that ALL the
			 * frags in this skb are dmabuf net_iovs. We're checking
			 * for that flag above, but also check individual frags
			 * here. If the tcp stack is not setting
			 * skb_frags_readable() correctly, we still don't want
			 * to crash here.
			 */
			if (!skb_frag_net_iov(frag)) {
				net_err_ratelimited("Found non-dmabuf skb with net_iov");
				err = -ENODEV;
				goto out;
			}
/* Will perform the exchange later */
dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
		tcp_xa_pool_commit(sk, &tcp_xa_pool);
		if (!remaining_len)
			goto out;

		/* if remaining_len is not satisfied yet, we need to go to the
		 * next frag in the frag_list to satisfy remaining_len.
		 */
		skb = skb_shinfo(skb)->frag_list ?: skb->next;

		offset = offset - start;
	} while (skb);

	if (remaining_len) {
		err = -EFAULT;
		goto out;
	}

out:
	tcp_xa_pool_commit(sk, &tcp_xa_pool);
	if (!sent)
		sent = err;

	return sent;
}
/*
 *	This routine copies from a sock struct into the user buffer.
 *
 *	Technical note: in 2.3 we work on _locked_ socket, so that
 *	tricks with *seq access order and skb->users are not required.
 *	Probably, code can be easily improved even more.
 */

static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
			      int flags, struct scm_timestamping_internal *tss,
			      int *cmsg_flags)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int last_copied_dmabuf = -1; /* uninitialized */
	int copied = 0;
	u32 peek_seq;
	u32 *seq;
	unsigned long used;
	int err;
	int target;		/* Read at least this many bytes */
	long timeo;
	struct sk_buff *skb, *last;
	u32 peek_offset = 0;
	u32 urg_hole = 0;

	err = -ENOTCONN;
	if (sk->sk_state == TCP_LISTEN)
		goto out;
		/* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
		if (unlikely(tp->urg_data) && tp->urg_seq == *seq) {
			if (copied)
				break;
			if (signal_pending(current)) {
				copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
				break;
			}
		}
		/* Next get a buffer. */

		last = skb_peek_tail(&sk->sk_receive_queue);
		skb_queue_walk(&sk->sk_receive_queue, skb) {
			last = skb;
			/* Now that we have two receive queues this
			 * shouldn't happen.
			 */
			if (WARN(before(*seq, TCP_SKB_CB(skb)->seq),
				 "TCP recvmsg seq # bug: copied %X, seq %X, rcvnxt %X, fl %X\n",
				 *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt,
				 flags))
				break;
found_ok_skb:
		/* Ok so how much can we use? */
		used = skb->len - offset;
		if (len < used)
			used = len;

		/* Do we have urgent data here? */
		if (unlikely(tp->urg_data)) {
			u32 urg_offset = tp->urg_seq - *seq;
			if (urg_offset < used) {
				if (!urg_offset) {
					if (!sock_flag(sk, SOCK_URGINLINE)) {
						WRITE_ONCE(*seq, *seq + 1);
						urg_hole++;
						offset++;
						used--;
						if (!used)
							goto skip_copy;
					}
				} else
					used = urg_offset;
			}
		}
		if (!(flags & MSG_TRUNC)) {
			if (last_copied_dmabuf != -1 &&
			    last_copied_dmabuf != !skb_frags_readable(skb))
				break;

			if (skb_frags_readable(skb)) {
				err = skb_copy_datagram_msg(skb, offset, msg,
							    used);
				if (err) {
					/* Exception. Bailout! */
					if (!copied)
						copied = -EFAULT;
					break;
				}
			} else {
				if (!(flags & MSG_SOCK_DEVMEM)) {
					/* dmabuf skbs can only be received
					 * with the MSG_SOCK_DEVMEM flag.
					 */
					if (!copied)
						copied = -EFAULT;

					break;
				}

				err = tcp_recvmsg_dmabuf(sk, skb, offset, msg,
							 used);
				if (err < 0) {
					if (!copied)
						copied = err;

					break;
				}
				used = err;
			}
		}

		last_copied_dmabuf = !skb_frags_readable(skb);

		WRITE_ONCE(*seq, *seq + used);
		copied += used;
		len -= used;
		if (flags & MSG_PEEK)
			sk_peek_offset_fwd(sk, used);
		else
			sk_peek_offset_bwd(sk, used);
		tcp_rcv_space_adjust(sk);
lock_sock(sk);
ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
release_sock(sk);
	if ((cmsg_flags || msg->msg_get_inq) && ret >= 0) {
		if (cmsg_flags & TCP_CMSG_TS)
			tcp_recv_timestamp(msg, sk, &tss);
		if (msg->msg_get_inq) {
			msg->msg_inq = tcp_inq_hint(sk);
			if (cmsg_flags & TCP_CMSG_INQ)
				put_cmsg(msg, SOL_TCP, TCP_CM_INQ,
					 sizeof(msg->msg_inq), &msg->msg_inq);
		}
	}
	return ret;
}
EXPORT_IPV6_MOD(tcp_recvmsg);
void tcp_set_state(struct sock *sk, int state)
{
	int oldstate = sk->sk_state;

	/* We defined a new enum for TCP states that are exported in BPF
	 * so as not to force the internal TCP states to be frozen. The
	 * following checks will detect if an internal state value ever
	 * differs from the BPF value. If this ever happens, then we will
	 * need to remap the internal value to the BPF value before calling
	 * tcp_call_bpf_2arg.
	 */
BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
BUILD_BUG_ON((int)BPF_TCP_BOUND_INACTIVE != (int)TCP_BOUND_INACTIVE);
BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
	/* bpf uapi header bpf.h defines an anonymous enum with values
	 * BPF_TCP_* used by bpf programs. Currently gcc built vmlinux
	 * is able to emit this enum in DWARF due to the above BUILD_BUG_ON.
	 * But clang built vmlinux does not have this enum in DWARF
	 * since clang removes the above code before generating IR/debuginfo.
	 * Let us explicitly emit the type debuginfo to ensure the
	 * above-mentioned anonymous enum in the vmlinux DWARF and hence BTF
	 * regardless of which compiler is used.
	 */
BTF_TYPE_EMIT_ENUM(BPF_TCP_ESTABLISHED);
if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
	switch (state) {
	case TCP_ESTABLISHED:
		if (oldstate != TCP_ESTABLISHED)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;
	case TCP_CLOSE_WAIT:
		if (oldstate == TCP_SYN_RECV)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;

	case TCP_CLOSE:
		if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_ESTABRESETS);
	/* Change state AFTER socket is unhashed to avoid closed
	 * socket sitting in hash tables.
	 */
	inet_sk_state_store(sk, state);
}
EXPORT_SYMBOL_GPL(tcp_set_state);
/*
 *	State processing on a close. This implements the state shift for
 *	sending our FIN frame. Note that we only send a FIN for some
 *	states. A shutdown() may have already sent the FIN, or we may be
 *	closed.
 */

static int tcp_close_state(struct sock *sk)
{
	int next = (int)new_state[sk->sk_state];
	int ns = next & TCP_STATE_MASK;

	tcp_set_state(sk, ns);

	return next & TCP_ACTION_FIN;
}
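
/* Illustration only, not kernel code: a minimal user-space sketch of the
 * lookup idea used by tcp_close_state() above, where one table entry packs
 * both the next state (low bits) and a "send FIN" action flag. All SK_* and
 * sketch_* names and the state numbering are hypothetical; only the three
 * FIN-sending transitions are taken from the comments in this file, the
 * remaining table entries are assumptions.
 */

```c
#include <stdbool.h>

/* Hypothetical, simplified state numbering for illustration only. */
enum sketch_state {
	SK_ESTABLISHED = 1,
	SK_SYN_SENT,
	SK_SYN_RECV,
	SK_FIN_WAIT1,
	SK_FIN_WAIT2,
	SK_TIME_WAIT,
	SK_CLOSE,
	SK_CLOSE_WAIT,
	SK_LAST_ACK,
	SK_LISTEN,
	SK_CLOSING
};

#define SK_STATE_MASK 0x0F	/* low nibble: next state */
#define SK_ACTION_FIN 0x10	/* flag bit: caller must send a FIN */

/* Indexed by current state; value = next state, optionally OR'ed with the flag. */
static const unsigned char sketch_new_state[16] = {
	[0]              = SK_CLOSE,
	[SK_ESTABLISHED] = SK_FIN_WAIT1 | SK_ACTION_FIN,
	[SK_SYN_SENT]    = SK_CLOSE,
	[SK_SYN_RECV]    = SK_FIN_WAIT1 | SK_ACTION_FIN,
	[SK_FIN_WAIT1]   = SK_FIN_WAIT1,
	[SK_FIN_WAIT2]   = SK_FIN_WAIT2,
	[SK_TIME_WAIT]   = SK_CLOSE,
	[SK_CLOSE]       = SK_CLOSE,
	[SK_CLOSE_WAIT]  = SK_LAST_ACK | SK_ACTION_FIN,
	[SK_LAST_ACK]    = SK_LAST_ACK,
	[SK_LISTEN]      = SK_CLOSE,
	[SK_CLOSING]     = SK_CLOSING,
};

/* Advance *state for a close; return true when a FIN has to be sent. */
bool sketch_close_state(int *state)
{
	int next = sketch_new_state[*state & SK_STATE_MASK];

	*state = next & SK_STATE_MASK;
	return (next & SK_ACTION_FIN) != 0;
}
```

/* Packing the action into the same byte keeps the transition and its side
 * effect in a single table lookup, at the cost of reserving one bit.
 */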
/*
 *	Shutdown the sending side of a connection. Much like close except
 *	that we don't receive shut down or sock_set_flag(sk, SOCK_DEAD).
 */

void tcp_shutdown(struct sock *sk, int how)
{
	/*	We need to grab some memory, and put together a FIN,
	 *	and then put it into the queue to be sent.
	 *		Tim MacKenzie(tym@dibbler.cs.monash.edu.au) 4 Dec '92.
	 */
	if (!(how & SEND_SHUTDOWN))
		return;

	/* If we've already sent a FIN, or it's a closed state, skip this. */
	if ((1 << sk->sk_state) &
	    (TCPF_ESTABLISHED | TCPF_SYN_SENT |
	     TCPF_CLOSE_WAIT)) {
		/* Clear out any half completed packets. FIN if needed. */
		if (tcp_close_state(sk))
			tcp_send_fin(sk);
	}
}
EXPORT_IPV6_MOD(tcp_shutdown);
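
/* Illustration only, not kernel code: a small user-space sketch of the
 * "(1 << state) & mask" membership test used in tcp_shutdown() above. The
 * ST_* numbering, the STF() helper and sketch_may_send_fin() are
 * hypothetical names; the set of states in the mask follows the check
 * shown above.
 */

```c
#include <stdbool.h>

/* Hypothetical state numbers; each must stay below the mask's bit width. */
enum {
	ST_ESTABLISHED = 1,
	ST_SYN_SENT,
	ST_SYN_RECV,
	ST_FIN_WAIT1,
	ST_FIN_WAIT2,
	ST_TIME_WAIT,
	ST_CLOSE,
	ST_CLOSE_WAIT
};

/* Turn a state number into its one-hot flag. */
#define STF(s) (1u << (s))

/* True when shutting down the send side from this state may require a FIN. */
bool sketch_may_send_fin(int state)
{
	const unsigned int mask = STF(ST_ESTABLISHED) | STF(ST_SYN_SENT) |
				  STF(ST_CLOSE_WAIT);

	return (STF(state) & mask) != 0;
}
```

/* One shift and one AND replace a chain of equality comparisons, which is
 * why the one-hot form is convenient for "is the state in this set" checks.
 */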
int tcp_orphan_count_sum(void)
{
	int i, total = 0;

	for_each_possible_cpu(i)
		total += per_cpu(tcp_orphan_count, i);

	if (too_many_orphans)
		net_info_ratelimited("too many orphaned sockets\n");

	if (out_of_socket_memory)
		net_info_ratelimited("out of memory -- consider tuning tcp_mem\n");

	return too_many_orphans || out_of_socket_memory;
}
void __tcp_close(struct sock *sk, long timeout)
{
	bool data_was_unread = false;
	struct sk_buff *skb;
	int state;

	WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);

	if (sk->sk_state == TCP_LISTEN) {
		tcp_set_state(sk, TCP_CLOSE);

		/* Special case. */
		inet_csk_listen_stop(sk);

		goto adjudge_to_death;
	}

	/* We need to flush the recv. buffs. We do this only on the
	 * descriptor close, not protocol-sourced closes, because the
	 * reader process may not have drained the data yet!
	 */
	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
		u32 end_seq = TCP_SKB_CB(skb)->end_seq;

		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
			end_seq--;
		if (after(end_seq, tcp_sk(sk)->copied_seq))
			data_was_unread = true;
		__kfree_skb(skb);
	}

	/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
	if (sk->sk_state == TCP_CLOSE)
		goto adjudge_to_death;
	/* As outlined in RFC 2525, section 2.17, we send a RST here because
	 * data was lost. To witness the awful effects of the old behavior of
	 * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
	 * GET in an FTP client, suspend the process, wait for the client to
	 * advertise a zero window, then kill -9 the FTP client, wheee...
	 * Note: timeout is always zero in such a case.
	 */
	if (unlikely(tcp_sk(sk)->repair)) {
		sk->sk_prot->disconnect(sk, 0);
	} else if (data_was_unread) {
		/* Unread data was tossed, zap the connection. */
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
		tcp_set_state(sk, TCP_CLOSE);
		tcp_send_active_reset(sk, sk->sk_allocation,
				      SK_RST_REASON_TCP_ABORT_ON_CLOSE);
	} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
		/* Check zero linger _after_ checking for unread data. */
		sk->sk_prot->disconnect(sk, 0);
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
	} else if (tcp_close_state(sk)) {
		/* We FIN if the application ate all the data before
		 * zapping the connection.
		 */

		/* RED-PEN. Formally speaking, we have broken TCP state
		 * machine. State transitions:
		 *
		 * TCP_ESTABLISHED -> TCP_FIN_WAIT1
		 * TCP_SYN_RECV	-> TCP_FIN_WAIT1 (it is difficult)
		 * TCP_CLOSE_WAIT -> TCP_LAST_ACK
		 *
		 * are legal only when FIN has been sent (i.e. in window),
		 * rather than queued out of window. Purists blame.
		 *
		 * F.e. "RFC state" is ESTABLISHED,
		 * if Linux state is FIN-WAIT-1, but FIN is still not sent.
		 *
		 * The visible declinations are that sometimes
		 * we enter time-wait state, when it is not required really
		 * (harmless), do not send active resets, when they are
		 * required by specs (TCP_ESTABLISHED, TCP_CLOSE_WAIT, when
		 * they look as CLOSING or LAST_ACK for Linux)
		 * Probably, I missed some more holelets.
		 * 						--ANK
		 * XXX (TFO) - To start off we don't support SYN+ACK+FIN
		 * in a single packet! (May consider it later but will
		 * probably need API support or TCP_CORK SYN-ACK until
		 * data is written and socket is closed.)
		 */
		tcp_send_fin(sk);
	}

	sk_stream_wait_close(sk, timeout);
adjudge_to_death:
state = sk->sk_state;
sock_hold(sk);
sock_orphan(sk);
local_bh_disable();
	bh_lock_sock(sk);
	/* remove backlog if any, without releasing ownership. */
	__release_sock(sk);

	this_cpu_inc(tcp_orphan_count);

	/* Have we already been destroyed by a softirq or backlog? */
	if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
		goto out;

	/*	This is a (useful) BSD violating of the RFC. There is a
	 *	problem with TCP as specified in that the other end could
	 *	keep a socket open forever with no application left this end.
	 *	We use a 1 minute timeout (about the same as BSD) then kill
	 *	our end. If they send after that then tough - BUT: long enough
	 *	that we won't make the old 4*rto = almost no time - whoops
	 *	reset mistake.
	 *
	 *	Nope, it was not mistake. It is really desired behaviour
	 *	f.e. on http servers, when such sockets are useless, but
	 *	consume significant resources. Let's do it with special
	 *	linger2	option.					--ANK
	 */
			if (tmo > TCP_TIMEWAIT_LEN) {
				tcp_reset_keepalive_timer(sk,
						tmo - TCP_TIMEWAIT_LEN);
			} else {
				tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
				goto out;
			}
		}
	}
	if (sk->sk_state != TCP_CLOSE) {
		if (tcp_check_oom(sk, 0)) {
			tcp_set_state(sk, TCP_CLOSE);
			tcp_send_active_reset(sk, GFP_ATOMIC,
					      SK_RST_REASON_TCP_ABORT_ON_MEMORY);
			__NET_INC_STATS(sock_net(sk),
					LINUX_MIB_TCPABORTONMEMORY);
		} else if (!check_net(sock_net(sk))) {
			/* Not possible to send reset; just close */
			tcp_set_state(sk, TCP_CLOSE);
		}
	}

	if (sk->sk_state == TCP_CLOSE) {
		struct request_sock *req;

		req = rcu_dereference_protected(tcp_sk(sk)->fastopen_rsk,
						lockdep_sock_is_held(sk));
		/* We could get here with a non-NULL req if the socket is
		 * aborted (e.g., closed with unread data) before 3WHS
		 * finishes.
		 */
		if (req)
			reqsk_fastopen_remove(sk, req, false);
		inet_csk_destroy_sock(sk);
	}
	/* Otherwise, socket is reprieved until protocol close. */

out:
	bh_unlock_sock(sk);
	local_bh_enable();
}
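
/* Illustration only, not kernel code: a user-space sketch of the "was data
 * left unread?" test performed while __tcp_close() flushes the receive
 * queue. It assumes the usual wraparound-safe 32-bit sequence comparison
 * (signed difference); all sketch_* names are hypothetical.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* True when sequence number a comes before b, modulo 2^32. */
bool sketch_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True when sequence number a comes after b, modulo 2^32. */
bool sketch_after(uint32_t a, uint32_t b)
{
	return sketch_before(b, a);
}

/* A FIN occupies one sequence number but carries no payload, so it is
 * subtracted before comparing the segment end with what the reader
 * has already consumed.
 */
bool sketch_has_unread(uint32_t end_seq, bool has_fin, uint32_t copied_seq)
{
	if (has_fin)
		end_seq--;
	return sketch_after(end_seq, copied_seq);
}
```

/* With copied_seq == 100, a bare FIN ending at 101 leaves nothing unread,
 * while one payload byte ending at 101 does.
 */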
void tcp_close(struct sock *sk, long timeout)
{
lock_sock(sk);
__tcp_close(sk, timeout);
	release_sock(sk);
	if (!sk->sk_net_refcnt)
		inet_csk_clear_xmit_timers_sync(sk);
sock_put(sk);
}
EXPORT_SYMBOL(tcp_close);
/* These states need RST on ABORT according to RFC793 */
		p = rb_next(p);
		/* Since we are deleting whole queue, no need to
		 * list_del(&skb->tcp_tsorted_anchor)
		 */
tcp_rtx_queue_unlink(skb, sk);
tcp_wmem_free_skb(sk, skb);
}
}
/* When set indicates to always queue non-full frames. Later the user clears
 * this option and we transmit any pending partial frames in the queue. This is
 * meant to be used alongside sendfile() to get properly filled frames when the
 * user (for example) must write out headers with a write() call first and then
 * use sendfile to send out the data parts.
 *
 * TCP_CORK can be set together with TCP_NODELAY and it is stronger than
 * TCP_NODELAY.
 */
void __tcp_sock_set_cork(struct sock *sk, bool on)
{
	struct tcp_sock *tp = tcp_sk(sk);
/* TCP_NODELAY is weaker than TCP_CORK, so that this option on corked socket is
 * remembered, but it is not activated until cork is cleared.
 *
 * However, when TCP_NODELAY is set we make an explicit push, which overrides
 * even TCP_CORK for currently queued segments.
 */
void __tcp_sock_set_nodelay(struct sock *sk, bool on)
{
	if (on) {
		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
		tcp_push_pending_frames(sk);
	} else {
		tcp_sk(sk)->nonagle &= ~TCP_NAGLE_OFF;
	}
}
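
/* Illustration only, not kernel code: a user-space model of how the cork
 * and nodelay requests can coexist as independent bits, so that a nodelay
 * request made on a corked socket is remembered. The SKETCH_* flag names
 * and values and the sketch_* helpers are hypothetical, and the cork
 * handling here is a simplifying assumption (it only flips the cork bit).
 */

```c
#include <stdbool.h>

/* Illustrative flag bits; names and values are assumptions. */
#define SKETCH_NAGLE_OFF  1u	/* nodelay requested */
#define SKETCH_NAGLE_CORK 2u	/* socket is corked */
#define SKETCH_NAGLE_PUSH 4u	/* explicit push requested */

/* Follows the nodelay branch above: enabling it also requests a push. */
unsigned int sketch_set_nodelay(unsigned int nonagle, bool on)
{
	if (on)
		return nonagle | SKETCH_NAGLE_OFF | SKETCH_NAGLE_PUSH;
	return nonagle & ~SKETCH_NAGLE_OFF;
}

/* Assumed cork handling: only the cork bit changes. */
unsigned int sketch_set_cork(unsigned int nonagle, bool on)
{
	if (on)
		return nonagle | SKETCH_NAGLE_CORK;
	return nonagle & ~SKETCH_NAGLE_CORK;
}

/* What a getsockopt()-style query would report for each option. */
bool sketch_nodelay_enabled(unsigned int nonagle)
{
	return (nonagle & SKETCH_NAGLE_OFF) != 0;
}

bool sketch_corked(unsigned int nonagle)
{
	return (nonagle & SKETCH_NAGLE_CORK) != 0;
}
```

/* Because the two options live in separate bits, clearing the cork does
 * not lose an earlier nodelay request.
 */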
int tcp_sock_set_user_timeout(struct sock *sk, int val)
{
	/* Cap the max time in ms TCP will retry or probe the window
	 * before giving up and aborting (ETIMEDOUT) a connection.
	 */
	if (val < 0)
		return -EINVAL;

	if (new_window_clamp == old_window_clamp)
		return 0;

	WRITE_ONCE(tp->window_clamp, new_window_clamp);

	/* Need to apply the reserved mem provisioning only
	 * when shrinking the window clamp.
	 */
	if (new_window_clamp < old_window_clamp) {
		__tcp_adjust_rcv_ssthresh(sk, new_window_clamp);
	} else {
		new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
		tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
	}
	return 0;
}
int tcp_sock_set_maxseg(struct sock *sk, int val)
{
	/* Values greater than interface MTU won't take effect. However
	 * at the point when this call is done we typically don't yet
	 * know which interface is going to be used
	 */
	if (val && (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW))
		return -EINVAL;

	tcp_sk(sk)->rx_opt.user_mss = val;
	return 0;
}
/*
 *	Socket option code for TCP.
 */
int do_tcp_setsockopt(struct sock *sk, int level, int optname,
		      sockptr_t optval, unsigned int optlen)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct net *net = sock_net(sk);
	int val;
	int err = 0;

	/* These are data/string values, all the others are ints */
	switch (optname) {
	case TCP_CONGESTION: {
		char name[TCP_CA_NAME_MAX];

		if (optlen < 1)
			return -EINVAL;

		val = strncpy_from_sockptr(name, optval,
					min_t(long, TCP_CA_NAME_MAX-1, optlen));
		if (val < 0)
			return -EFAULT;
		name[val] = 0;
		/* Allow a backup key as well to facilitate key rotation
		 * First key is the active one.
		 */
		if (optlen != TCP_FASTOPEN_KEY_LENGTH &&
		    optlen != TCP_FASTOPEN_KEY_BUF_LENGTH)
			return -EINVAL;

		if (copy_from_sockptr(key, optval, optlen))
			return -EFAULT;

		if (optlen == TCP_FASTOPEN_KEY_BUF_LENGTH)
			backup_key = key + TCP_FASTOPEN_KEY_LENGTH;

	if (copy_from_sockptr(&val, optval, sizeof(val)))
		return -EFAULT;
	/* Handle options that can be set without locking the socket. */
	switch (optname) {
	case TCP_SYNCNT:
		return tcp_sock_set_syncnt(sk, val);
	case TCP_USER_TIMEOUT:
		return tcp_sock_set_user_timeout(sk, val);
	case TCP_KEEPINTVL:
		return tcp_sock_set_keepintvl(sk, val);
	case TCP_KEEPCNT:
		return tcp_sock_set_keepcnt(sk, val);
	case TCP_LINGER2:
		if (val < 0)
			WRITE_ONCE(tp->linger2, -1);
		else if (val > TCP_FIN_TIMEOUT_MAX / HZ)
			WRITE_ONCE(tp->linger2, TCP_FIN_TIMEOUT_MAX);
		else
			WRITE_ONCE(tp->linger2, val * HZ);
		return 0;
	case TCP_DEFER_ACCEPT:
		/* Translate value in seconds to number of retransmits */
		WRITE_ONCE(icsk->icsk_accept_queue.rskq_defer_accept,
			   secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
					   TCP_RTO_MAX / HZ));
		return 0;
	case TCP_RTO_MAX_MS:
		if (val < MSEC_PER_SEC || val > TCP_RTO_MAX_SEC * MSEC_PER_SEC)
			return -EINVAL;
		WRITE_ONCE(inet_csk(sk)->icsk_rto_max, msecs_to_jiffies(val));
		return 0;
	case TCP_RTO_MIN_US: {
		int rto_min = usecs_to_jiffies(val);

		if (rto_min > TCP_RTO_MIN || rto_min < TCP_TIMEOUT_MIN)
			return -EINVAL;
		WRITE_ONCE(inet_csk(sk)->icsk_rto_min, rto_min);
		return 0;
	}
	case TCP_DELACK_MAX_US: {
		int delack_max = usecs_to_jiffies(val);
	case TCP_CORK:
		__tcp_sock_set_cork(sk, val);
		break;

	case TCP_KEEPIDLE:
		err = tcp_sock_set_keepidle_locked(sk, val);
		break;
	case TCP_SAVE_SYN:
		/* 0: disable, 1: enable, 2: start from ether_header */
		if (val < 0 || val > 2)
			err = -EINVAL;
		else
			tp->save_syn = val;
		break;

	case TCP_WINDOW_CLAMP:
		err = tcp_set_window_clamp(sk, val);
		break;

	case TCP_QUICKACK:
		__tcp_sock_set_quickack(sk, val);
		break;

	case TCP_AO_REPAIR:
		if (!tcp_can_repair_sock(sk)) {
			err = -EPERM;
			break;
		}
		err = tcp_ao_set_repair(sk, optval, optlen);
		break;
#ifdef CONFIG_TCP_AO
	case TCP_AO_ADD_KEY:
	case TCP_AO_DEL_KEY:
	case TCP_AO_INFO: {
		/* If this is the first TCP-AO setsockopt() on the socket,
		 * sk_state has to be LISTEN or CLOSE. Allow TCP_REPAIR
		 * in any state.
		 */
		if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
			goto ao_parse;
		if (rcu_dereference_protected(tcp_sk(sk)->ao_info,
					      lockdep_sock_is_held(sk)))
			goto ao_parse;
		if (tp->repair)
			goto ao_parse;
		err = -EISCONN;
		break;
ao_parse:
		err = tp->af_specific->ao_parse(sk, optname, optval, optlen);
		break;
	}
#endif
#ifdef CONFIG_TCP_MD5SIG
	case TCP_MD5SIG:
	case TCP_MD5SIG_EXT:
		err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
		break;
#endif
	case TCP_FASTOPEN:
		if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
		    TCPF_LISTEN))) {
			tcp_fastopen_init_key_once(net);

			fastopen_queue_tune(sk, val);
		} else {
			err = -EINVAL;
		}
		break;
	case TCP_FASTOPEN_CONNECT:
		if (val > 1 || val < 0) {
			err = -EINVAL;
		} else if (READ_ONCE(net->ipv4.sysctl_tcp_fastopen) &
			   TFO_CLIENT_ENABLE) {
			if (sk->sk_state == TCP_CLOSE)
				tp->fastopen_connect = val;
			else
				err = -EINVAL;
		} else {
			err = -EOPNOTSUPP;
		}
		break;
	case TCP_FASTOPEN_NO_COOKIE:
		if (val > 1 || val < 0)
			err = -EINVAL;
		else if (!((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)))
			err = -EINVAL;
		else
			tp->fastopen_no_cookie = val;
		break;
	case TCP_TIMESTAMP:
		if (!tp->repair) {
			err = -EPERM;
			break;
		}
		/* val is an opaque field,
		 * and low order bit contains usec_ts enable bit.
		 * It's a best effort, and we do not care if user makes an error.
		 */
		tp->tcp_usec_ts = val & 1;
		WRITE_ONCE(tp->tsoffset, val - tcp_clock_ts(tp->tcp_usec_ts));
		break;
	case TCP_REPAIR_WINDOW:
		err = tcp_repair_set_window(tp, optval, optlen);
		break;
	case TCP_NOTSENT_LOWAT:
		WRITE_ONCE(tp->notsent_lowat, val);
		sk->sk_write_space(sk);
		break;
	case TCP_INQ:
		if (val > 1 || val < 0)
			err = -EINVAL;
		else
			tp->recvmsg_inq = val;
		break;
	case TCP_TX_DELAY:
		if (val)
			tcp_enable_tx_delay();
		WRITE_ONCE(tp->tcp_tx_delay, val);
		break;
	default:
		err = -ENOPROTOOPT;
		break;
	}

	sockopt_release_sock(sk);
	return err;
}
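
/* Illustration only, not kernel code: a user-space sketch of the
 * TCP_LINGER2 value handling, where the user passes seconds, a negative
 * value is stored as -1, and large values are clamped to a maximum before
 * being stored in ticks. SKETCH_HZ and SKETCH_FIN_TIMEOUT_MAX are made-up
 * constants chosen for the example, and the sketch_* helpers are
 * hypothetical.
 */

```c
/* Assumed tick rate and upper bound, for illustration only. */
#define SKETCH_HZ 1000
#define SKETCH_FIN_TIMEOUT_MAX (120 * SKETCH_HZ)

/* Convert the user-supplied seconds into the stored tick value. */
int sketch_linger2_store(int secs)
{
	if (secs < 0)
		return -1;
	if (secs > SKETCH_FIN_TIMEOUT_MAX / SKETCH_HZ)
		return SKETCH_FIN_TIMEOUT_MAX;
	return secs * SKETCH_HZ;
}

/* Convert the stored value back to seconds for reporting. A stored 0
 * means "use the default", and a negative value is reported unchanged.
 */
int sketch_linger2_report(int stored, int default_ticks)
{
	if (stored >= 0)
		stored = (stored ? stored : default_ticks) / SKETCH_HZ;
	return stored;
}
```

/* Comparing against MAX / HZ before multiplying keeps the seconds-to-ticks
 * conversion from overflowing for large inputs.
 */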
int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
		   unsigned int optlen)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);

	if (level != SOL_TCP)
		/* Paired with WRITE_ONCE() in do_ipv6_setsockopt() and tcp_v6_connect() */
		return READ_ONCE(icsk->icsk_af_ops)->setsockopt(sk, level, optname,
								optval, optlen);
	return do_tcp_setsockopt(sk, level, optname, optval, optlen);
}
EXPORT_IPV6_MOD(tcp_setsockopt);
/* segs_in and data_segs_in can be updated from tcp_segs_in() from BH */
info->tcpi_segs_in = READ_ONCE(tp->segs_in);
info->tcpi_data_segs_in = READ_ONCE(tp->data_segs_in);
int do_tcp_getsockopt(struct sock *sk, int level,
		      int optname, sockptr_t optval, sockptr_t optlen)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct net *net = sock_net(sk);
	int val, len;

	if (copy_from_sockptr(&len, optlen, sizeof(int)))
		return -EFAULT;

	if (len < 0)
		return -EINVAL;

	len = min_t(unsigned int, len, sizeof(int));

	switch (optname) {
	case TCP_MAXSEG:
		val = tp->mss_cache;
		if (tp->rx_opt.user_mss &&
		    ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)))
			val = tp->rx_opt.user_mss;
		if (tp->repair)
			val = tp->rx_opt.mss_clamp;
		break;
	case TCP_NODELAY:
		val = !!(tp->nonagle&TCP_NAGLE_OFF);
		break;
	case TCP_CORK:
		val = !!(tp->nonagle&TCP_NAGLE_CORK);
		break;
	case TCP_KEEPIDLE:
		val = keepalive_time_when(tp) / HZ;
		break;
	case TCP_KEEPINTVL:
		val = keepalive_intvl_when(tp) / HZ;
		break;
	case TCP_KEEPCNT:
		val = keepalive_probes(tp);
		break;
	case TCP_SYNCNT:
		val = READ_ONCE(icsk->icsk_syn_retries) ? :
			READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
		break;
	case TCP_LINGER2:
		val = READ_ONCE(tp->linger2);
		if (val >= 0)
			val = (val ? : READ_ONCE(net->ipv4.sysctl_tcp_fin_timeout)) / HZ;
		break;
	case TCP_DEFER_ACCEPT:
		val = READ_ONCE(icsk->icsk_accept_queue.rskq_defer_accept);
		val = retrans_to_secs(val, TCP_TIMEOUT_INIT / HZ,
				      TCP_RTO_MAX / HZ);
		break;
	case TCP_WINDOW_CLAMP:
		val = READ_ONCE(tp->window_clamp);
		break;
	case TCP_INFO: {
		struct tcp_info info;
		if (copy_from_sockptr(&len, optlen, sizeof(int)))
			return -EFAULT;

		tcp_get_info(sk, &info);

		len = min_t(unsigned int, len, sizeof(info));
		if (copy_to_sockptr(optlen, &len, sizeof(int)))
			return -EFAULT;
		if (copy_to_sockptr(optval, &info, len))
			return -EFAULT;
		return 0;
	}
	case TCP_CC_INFO: {
		const struct tcp_congestion_ops *ca_ops;
		union tcp_cc_info info;
		size_t sz = 0;
		int attr;

		if (copy_from_sockptr(&len, optlen, sizeof(int)))
			return -EFAULT;
		if (copy_to_sockptr(optval, &opt, len))
			return -EFAULT;
		return 0;
	}
	case TCP_QUEUE_SEQ:
		if (tp->repair_queue == TCP_SEND_QUEUE)
			val = tp->write_seq;
		else if (tp->repair_queue == TCP_RECV_QUEUE)
			val = tp->rcv_nxt;
		else
			return -EINVAL;
		break;

	case TCP_USER_TIMEOUT:
		val = READ_ONCE(icsk->icsk_user_timeout);
		break;

	case TCP_FASTOPEN:
		val = READ_ONCE(icsk->icsk_accept_queue.fastopenq.max_qlen);
		break;

	case TCP_FASTOPEN_CONNECT:
		val = tp->fastopen_connect;
		break;

	case TCP_FASTOPEN_NO_COOKIE:
		val = tp->fastopen_no_cookie;
		break;

	case TCP_TX_DELAY:
		val = READ_ONCE(tp->tcp_tx_delay);
		break;

	case TCP_TIMESTAMP:
		val = tcp_clock_ts(tp->tcp_usec_ts) + READ_ONCE(tp->tsoffset);
		if (tp->tcp_usec_ts)
			val |= 1;
		else
			val &= ~1;
		break;
	case TCP_NOTSENT_LOWAT:
		val = READ_ONCE(tp->notsent_lowat);
		break;
	case TCP_INQ:
		val = tp->recvmsg_inq;
		break;
	case TCP_SAVE_SYN:
		val = tp->save_syn;
		break;
	case TCP_SAVED_SYN: {
		if (copy_from_sockptr(&len, optlen, sizeof(int)))
			return -EFAULT;

		sockopt_lock_sock(sk);
		if (tp->saved_syn) {
			if (len < tcp_saved_syn_len(tp->saved_syn)) {
				len = tcp_saved_syn_len(tp->saved_syn);
				if (copy_to_sockptr(optlen, &len, sizeof(int))) {
					sockopt_release_sock(sk);
					return -EFAULT;
				}
				sockopt_release_sock(sk);
				return -EINVAL;
			}
			len = tcp_saved_syn_len(tp->saved_syn);
			if (copy_to_sockptr(optlen, &len, sizeof(int))) {
				sockopt_release_sock(sk);
				return -EFAULT;
			}
			if (copy_to_sockptr(optval, tp->saved_syn->data, len)) {
				sockopt_release_sock(sk);
				return -EFAULT;
			}
			tcp_saved_syn_free(tp);
			sockopt_release_sock(sk);
		} else {
			sockopt_release_sock(sk);
			len = 0;
			if (copy_to_sockptr(optlen, &len, sizeof(int)))
				return -EFAULT;
		}
		return 0;
	}
#ifdef CONFIG_MMU
	case TCP_ZEROCOPY_RECEIVE: {
		struct scm_timestamping_internal tss;
		struct tcp_zerocopy_receive zc = {};
		int err;

		if (copy_from_sockptr(&len, optlen, sizeof(int)))
			return -EFAULT;
		if (len < 0 ||
		    len < offsetofend(struct tcp_zerocopy_receive, length))
			return -EINVAL;
		if (unlikely(len > sizeof(zc))) {
			err = check_zeroed_sockptr(optval, sizeof(zc),
						   len - sizeof(zc));
			if (err < 1)
				return err == 0 ? -EINVAL : err;
			len = sizeof(zc);
			if (copy_to_sockptr(optlen, &len, sizeof(int)))
				return -EFAULT;
		}
		if (copy_from_sockptr(&zc, optval, len))
			return -EFAULT;
		if (zc.reserved)
			return -EINVAL;
		if (zc.msg_flags & ~(TCP_VALID_ZC_MSG_FLAGS))
			return -EINVAL;
		sockopt_lock_sock(sk);
		err = tcp_zerocopy_receive(sk, &zc, &tss);
		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
							  &zc, &len, err);
		sockopt_release_sock(sk);
		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
			goto zerocopy_rcv_cmsg;
		switch (len) {
		case offsetofend(struct tcp_zerocopy_receive, msg_flags):
			goto zerocopy_rcv_cmsg;
		case offsetofend(struct tcp_zerocopy_receive, msg_controllen):
		case offsetofend(struct tcp_zerocopy_receive, msg_control):
		case offsetofend(struct tcp_zerocopy_receive, flags):
		case offsetofend(struct tcp_zerocopy_receive, copybuf_len):
		case offsetofend(struct tcp_zerocopy_receive, copybuf_address):
		case offsetofend(struct tcp_zerocopy_receive, err):
			goto zerocopy_rcv_sk_err;
		case offsetofend(struct tcp_zerocopy_receive, inq):
			goto zerocopy_rcv_inq;
		case offsetofend(struct tcp_zerocopy_receive, length):
		default:
			goto zerocopy_rcv_out;
		}
zerocopy_rcv_cmsg:
		if (zc.msg_flags & TCP_CMSG_TS)
			tcp_zc_finalize_rx_tstamp(sk, &zc, &tss);
		else
			zc.msg_flags = 0;
zerocopy_rcv_sk_err:
		if (!err)
			zc.err = sock_error(sk);
zerocopy_rcv_inq:
		zc.inq = tcp_inq_hint(sk);
zerocopy_rcv_out:
		if (!err && copy_to_sockptr(optval, &zc, len))
			err = -EFAULT;
		return err;
	}
#endif
	case TCP_AO_REPAIR:
		if (!tcp_can_repair_sock(sk))
			return -EPERM;
		return tcp_ao_get_repair(sk, optval, optlen);
	case TCP_AO_GET_KEYS:
	case TCP_AO_INFO: {
		int err;
return err;
	}
	case TCP_IS_MPTCP:
		val = 0;
		break;
	case TCP_RTO_MAX_MS:
		val = jiffies_to_msecs(tcp_rto_max(sk));
		break;
	case TCP_RTO_MIN_US:
		val = jiffies_to_usecs(READ_ONCE(inet_csk(sk)->icsk_rto_min));
		break;
	case TCP_DELACK_MAX_US:
		val = jiffies_to_usecs(READ_ONCE(inet_csk(sk)->icsk_delack_max));
		break;
	default:
		return -ENOPROTOOPT;
	}

	if (copy_to_sockptr(optlen, &len, sizeof(int)))
		return -EFAULT;
	if (copy_to_sockptr(optval, &val, len))
		return -EFAULT;
	return 0;
}
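
/* Illustration only, not kernel code: a user-space sketch of the common
 * path for integer options in do_tcp_getsockopt(): reject a negative
 * length, clamp the length to sizeof(int), then write back both the length
 * and that many bytes of the value. sketch_get_int_opt() is a hypothetical
 * helper and uses plain pointers instead of sockptr_t.
 */

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copy at most sizeof(int) bytes of val into optval and report the
 * number of bytes written through *optlen. Returns 0 or a negative errno.
 */
int sketch_get_int_opt(int val, void *optval, int *optlen)
{
	int len = *optlen;

	if (len < 0)
		return -EINVAL;
	if ((size_t)len > sizeof(int))
		len = (int)sizeof(int);

	*optlen = len;
	memcpy(optval, &val, (size_t)len);
	return 0;
}
```

/* A caller that passes a larger buffer gets the length shrunk to the size
 * of an int; a caller that passes a shorter one receives a truncated copy.
 */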
bool tcp_bpf_bypass_getsockopt(int level, int optname)
{
	/* TCP do_tcp_getsockopt has optimized getsockopt implementation
	 * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE.
	 */
	if (level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE)
		return true;
int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
		   int __user *optlen)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	if (level != SOL_TCP)
		/* Paired with WRITE_ONCE() in do_ipv6_setsockopt() and tcp_v6_connect() */
		return READ_ONCE(icsk->icsk_af_ops)->getsockopt(sk, level, optname,
								optval, optlen);
	return do_tcp_getsockopt(sk, level, optname, USER_SOCKPTR(optval),
				 USER_SOCKPTR(optlen));
}
EXPORT_IPV6_MOD(tcp_getsockopt);
#ifdef CONFIG_TCP_MD5SIG
int tcp_md5_sigpool_id = -1;
EXPORT_IPV6_MOD_GPL(tcp_md5_sigpool_id);
int tcp_md5_alloc_sigpool(void)
{
	size_t scratch_size;
	int ret;

	scratch_size = sizeof(union tcp_md5sum_block) + sizeof(struct tcphdr);
	ret = tcp_sigpool_alloc_ahash("md5", scratch_size);
	if (ret >= 0) {
		/* As long as any md5 sigpool was allocated, the return
		 * id would stay the same. Re-write the id only for the case
		 * when previously all MD5 keys were deleted and this call
		 * allocates the first MD5 key, which may return a different
		 * sigpool id than was used previously.
		 */
		WRITE_ONCE(tcp_md5_sigpool_id, ret); /* Avoids the compiler potentially being smart here */
		return 0;
	}
	return ret;
}
	/* We use data_race() because tcp_md5_do_add() might change
	 * key->key under us
	 */
	return data_race(crypto_ahash_update(hp->req));
}
EXPORT_IPV6_MOD(tcp_md5_hash_key);

/* Called with rcu_read_lock() */
static enum skb_drop_reason
tcp_inbound_md5_hash(const struct sock *sk, const struct sk_buff *skb,
		     const void *saddr, const void *daddr,
		     int family, int l3index, const __u8 *hash_location)
{
	/* This gets called for each TCP segment that has TCP-MD5 option.
	 * We have 3 drop cases:
	 * o No MD5 hash and one expected.
	 * o MD5 hash and we're not expecting one.
	 * o MD5 hash and it's wrong.
	 */
	const struct tcp_sock *tp = tcp_sk(sk);
	struct tcp_md5sig_key *key;
	u8 newhash[16];
	int genhash;

	/* Check the signature.
	 * To support dual stack listeners, we need to handle
	 * IPv4-mapped case.
	 */
	if (family == AF_INET)
		genhash = tcp_v4_md5_hash_skb(newhash, key, NULL, skb);
	else
		genhash = tp->af_specific->calc_md5_hash(newhash, key,
							 NULL, skb);
	if (genhash || memcmp(hash_location, newhash, 16) != 0) {
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMD5FAILURE);
		trace_tcp_hash_md5_mismatch(sk, skb);
		return SKB_DROP_REASON_TCP_MD5FAILURE;
	}
	return SKB_NOT_DROPPED_YET;
}
#else
static inline enum skb_drop_reason
tcp_inbound_md5_hash(const struct sock *sk, const struct sk_buff *skb,
		     const void *saddr, const void *daddr,
		     int family, int l3index, const __u8 *hash_location)
{
	return SKB_NOT_DROPPED_YET;
}

#endif

/* Called with rcu_read_lock() */
enum skb_drop_reason
tcp_inbound_hash(struct sock *sk, const struct request_sock *req,
		 const struct sk_buff *skb,
		 const void *saddr, const void *daddr,
		 int family, int dif, int sdif)
{
	const struct tcphdr *th = tcp_hdr(skb);
	const struct tcp_ao_hdr *aoh;
	const __u8 *md5_location;
	int l3index;

	/* Invalid option, or an auth option that appears more than once */
	if (tcp_parse_auth_options(th, &md5_location, &aoh)) {
		trace_tcp_hash_bad_header(sk, skb);
		return SKB_DROP_REASON_TCP_AUTH_HDR;
	}

	if (req) {
		if (tcp_rsk_used_ao(req) != !!aoh) {
			u8 keyid, rnext, maclen;

	/* sdif set, means packet ingressed via a device
	 * in an L3 domain and dif is set to the l3mdev
	 */
	l3index = sdif ? dif : 0;

	/* Fast path: unsigned segments */
	if (likely(!md5_location && !aoh)) {
		/* Drop if there's TCP-MD5 or TCP-AO key with any rcvid/sndid
		 * for the remote peer. On TCP-AO established connection
		 * the last key is impossible to remove, so there's
		 * always at least one current_key.
		 */
		if (tcp_ao_required(sk, saddr, family, l3index, true)) {
			trace_tcp_hash_ao_required(sk, skb);
			return SKB_DROP_REASON_TCP_AONOTFOUND;
		}
		if (unlikely(tcp_md5_do_lookup(sk, l3index, saddr, family))) {
			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMD5NOTFOUND);
			trace_tcp_hash_md5_required(sk, skb);
			return SKB_DROP_REASON_TCP_MD5NOTFOUND;
		}
		return SKB_NOT_DROPPED_YET;
	}

	if (aoh)
		return tcp_inbound_ao_hash(sk, skb, family, req, l3index, aoh);

	/* We might be called with a new socket, after
	 * inet_csk_prepare_forced_close() has been called
	 * so we can not use lockdep_sock_is_held(sk)
	 */
	req = rcu_dereference_protected(tcp_sk(sk)->fastopen_rsk, 1);

	if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
		TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);

	tcp_set_state(sk, TCP_CLOSE);
	tcp_clear_xmit_timers(sk);
	if (req)
		reqsk_fastopen_remove(sk, req, false);

	WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);

	if (!sock_flag(sk, SOCK_DEAD))
		sk->sk_state_change(sk);
	else
		inet_csk_destroy_sock(sk);
}
EXPORT_SYMBOL_GPL(tcp_done);

int tcp_abort(struct sock *sk, int err)
{
	int state = inet_sk_state_load(sk);

	if (state == TCP_NEW_SYN_RECV) {
		struct request_sock *req = inet_reqsk(sk);

	/* Size and allocate the main established and bind bucket
	 * hash tables.
	 *
	 * The methodology is similar to that of the buffer cache.
	 */
	tcp_hashinfo.ehash =
		alloc_large_system_hash("TCP established",
					sizeof(struct inet_ehash_bucket),
					thash_entries,
					17, /* one slot per 128 KB of memory */
					0,
					NULL,
					&tcp_hashinfo.ehash_mask,
					0,
					thash_entries ? 0 : 512 * 1024);
	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++)
		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);

	if (inet_ehash_locks_alloc(&tcp_hashinfo))
		panic("TCP: failed to alloc ehash_locks");
	tcp_hashinfo.bhash =
		alloc_large_system_hash("TCP bind",
					2 * sizeof(struct inet_bind_hashbucket),
					tcp_hashinfo.ehash_mask + 1,
					17, /* one slot per 128 KB of memory */
					0,
					&tcp_hashinfo.bhash_size,
					NULL,
					0,
					64 * 1024);
	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
	tcp_hashinfo.bhash2 = tcp_hashinfo.bhash + tcp_hashinfo.bhash_size;
	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
		spin_lock_init(&tcp_hashinfo.bhash2[i].lock);
		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
	}

	tcp_init_mem();
	/* Set per-socket limits to no more than 1/128 the pressure threshold */
	limit = nr_free_buffer_pages() << (PAGE_SHIFT - 7);
	max_wshare = min(4UL*1024*1024, limit);
	max_rshare = min(32UL*1024*1024, limit);
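	/* Worked example of the three assignments above. The figures are
	 * illustrative assumptions, not measurements: 4 KiB pages
	 * (PAGE_SHIFT == 12) and nr_free_buffer_pages() returning
	 * 1048576, i.e. 4 GiB of free buffer pages.
	 *
	 *   limit      = 1048576 << (12 - 7) = 33554432 bytes (32 MiB),
	 *                which is 4 GiB / 128
	 *   max_wshare = min(4 MiB, 32 MiB)  = 4 MiB
	 *   max_rshare = min(32 MiB, 32 MiB) = 32 MiB
	 *
	 * With less free memory, limit falls below the fixed caps and
	 * becomes the binding value for both shares.
	 */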